# INTEGRATING VISUAL SYSTEM MECHANISMS, COMPUTATIONAL MODELS AND ALGORITHMS/TECHNOLOGIES

EDITED BY: Hedva Spitzer, Xavier Otazu and Hagit Hel-Or

PUBLISHED IN: Frontiers in Bioengineering and Biotechnology, Frontiers in Neuroscience, Frontiers in Computational Neuroscience and Frontiers in Psychology

#### Frontiers eBook Copyright Statement

The copyright in the text of individual articles in this eBook is the property of their respective authors or their respective institutions or funders. The copyright in graphics and images within each article may be subject to copyright of other parties. In both cases this is subject to a license granted to Frontiers. The compilation of articles constituting this eBook is the property of Frontiers.

Each article within this eBook, and the eBook itself, are published under the most recent version of the Creative Commons CC-BY licence. The version current at the date of publication of this eBook is CC-BY 4.0. If the CC-BY licence is updated, the licence granted by Frontiers is automatically updated to the new version.

When exercising any right under the CC-BY licence, Frontiers must be attributed as the original publisher of the article or eBook, as applicable.

Authors have the responsibility of ensuring that any graphics or other materials which are the property of others may be included in the CC-BY licence, but this should be checked before relying on the CC-BY licence to reproduce those materials. Any copyright notices relating to those materials must be complied with.

Copyright and source acknowledgement notices may not be removed and must be displayed in any copy, derivative work or partial copy which includes the elements in question.

All copyright, and all rights therein, are protected by national and international copyright laws. The above represents a summary only. For further information please read Frontiers' Conditions for Website Use and Copyright Statement, and the applicable CC-BY licence.

ISSN 1664-8714 ISBN 978-2-88963-510-8 DOI 10.3389/978-2-88963-510-8

#### About Frontiers

Frontiers is more than just an open-access publisher of scholarly articles: it is a pioneering approach to the world of academia, radically improving the way scholarly research is managed. The grand vision of Frontiers is a world where all people have an equal opportunity to seek, share and generate knowledge. Frontiers provides immediate and permanent online open access to all its publications, but this alone is not enough to realize our grand goals.

#### Frontiers Journal Series

The Frontiers Journal Series is a multi-tier and interdisciplinary set of open-access, online journals, promising a paradigm shift from the current review, selection and dissemination processes in academic publishing. All Frontiers journals are driven by researchers for researchers; therefore, they constitute a service to the scholarly community. At the same time, the Frontiers Journal Series operates on a revolutionary invention, the tiered publishing system, initially addressing specific communities of scholars, and gradually climbing up to broader public understanding, thus serving the interests of the lay society, too.

#### Dedication to Quality

Each Frontiers article is a landmark of the highest quality, thanks to genuinely collaborative interactions between authors and review editors, who include some of the world's best academicians. Research must be certified by peers before entering a stream of knowledge that may eventually reach the public - and shape society; therefore, Frontiers only applies the most rigorous and unbiased reviews.

Frontiers revolutionizes research publishing by freely delivering the most outstanding research, evaluated with no bias from both the academic and social point of view. By applying the most advanced information technologies, Frontiers is catapulting scholarly publishing into a new generation.

#### What are Frontiers Research Topics?

Frontiers Research Topics are very popular trademarks of the Frontiers Journals Series: they are collections of at least ten articles, all centered on a particular subject. With their unique mix of varied contributions from Original Research to Review Articles, Frontiers Research Topics unify the most influential researchers, the latest key findings and historical advances in a hot research area! Find out more on how to host your own Frontiers Research Topic or contribute to one as an author by contacting the Frontiers Editorial Office: researchtopics@frontiersin.org

# INTEGRATING VISUAL SYSTEM MECHANISMS, COMPUTATIONAL MODELS AND ALGORITHMS/TECHNOLOGIES

Topic Editors:

Hedva Spitzer, Tel Aviv University, Israel; Xavier Otazu, Autonomous University of Barcelona, Spain; Hagit Hel-Or, University of Haifa, Israel

Illustration by Rony Griffit

Citation: Spitzer, H., Otazu, X., Hel-Or, H., eds. (2020). Integrating Visual System Mechanisms, Computational Models and Algorithms/Technologies. Lausanne: Frontiers Media SA. doi: 10.3389/978-2-88963-510-8

# Table of Contents

*Editorial: Integrating Visual System Mechanisms, Computational Models and Algorithms/Technologies* Hedva Spitzer, Xavier Otazu and Hagit Hel-Or

*Characterization of Spatial Frequency Channels Underlying Disparity Sensitivity by Factor Analysis of Population Data* Alexandre Reynaud and Robert F. Hess

Oscar J. Avella Gonzalez and John K. Tsotsos

Hadar Cohen-Duwek and Hedva Spitzer

Marina Martinez-Garcia, Marcelo Bertalmío and Jesús Malo

Hadar Cohen-Duwek and Hedva Spitzer

*A Cross-Recurrence Analysis of the Pupil Size Fluctuations in Steady Scotopic Conditions* Pietro Piu, Valeria Serchi, Francesca Rosini and Alessandra Rufa

*Bio-Inspired Presentation Attack Detection for Face Biometrics* Aristeidis Tsitiridis, Cristina Conde, Beatriz Gomez Ayllon and Enrique Cabello

*Scene Regularity Interacts With Individual Biases to Modulate Perceptual Stability* Qinglin Li, Andrew Isaac Meso, Nikos K. Logothetis and Georgios A. Keliris

*Reconciling Color Vision Models With Midget Ganglion Cell Receptive Fields* Sara S. Patterson, Maureen Neitz and Jay Neitz

# Editorial: Integrating Visual System Mechanisms, Computational Models and Algorithms/Technologies

Hedva Spitzer <sup>1</sup> \*, Xavier Otazu<sup>2</sup> and Hagit Hel-Or <sup>3</sup>

*<sup>1</sup> School of Electrical Engineering, Tel Aviv University, Tel Aviv, Israel, <sup>2</sup> Computer Science Department, Computer Vision Center, Autonomous University of Barcelona, Barcelona, Spain, <sup>3</sup> Department of Computer Science, University of Haifa, Haifa, Israel*

Keywords: computational models, algorithms, technologies, visual system, mechanisms

**Editorial on the Research Topic**

#### **Integrating Visual System Mechanisms, Computational Models and Algorithms/Technologies**

The Research Topic on "Integrating Visual System Mechanisms, Computational Models and Algorithms/Technologies" collects novel studies that display a strong synergy between three entities: (1) the visual system from its various angles, including physiological, psychophysical, and perceptual, (2) computational models, whether descriptive or predictive, and (3) vision-inspired algorithms and applications. The interaction between modeling and the various aspects of the visual system is expressed in the reciprocal contributions between the two. On the one hand, visual mechanisms and neuronal units provide the inspiration and basis for modeling approaches and for the computational units within them; on the other hand, modeling provides novel insights into, and a new understanding of, the mechanisms of the visual system and their associated behaviors. Furthermore, computational models, and the underlying visual mechanisms, provide a basis for developing practical algorithms for image processing and image understanding.

The articles in this Research Topic present computational models of the visual system ranging from neuronal mechanisms, through visual mechanisms, to visual perceptual behavior and visual illusions. Modeling efforts take different computational approaches from building blocks that are inspired by mechanisms of the visual system, to a more global Gestalt approach that attempts to explain a phenomenon regardless of the underlying elements using functional, statistical, or learning approaches. Other articles develop applications ranging from visual system inspired measures such as image quality and image esthetics to applications such as classification and segmentation.

Several studies in this issue present computational models of the visual system at the neuronal level, and some include feasible physiological components in the model. In Gonzalez and Tsotsos, the authors suggest a computational model of attention based on the adaptation mechanisms and selective tuning of V4 neurons, as expressed in the neurons' firing rates during attentional tasks. Different computational models are tested, coinciding with different interpretations of the attention mechanism: (a) enhancing responses due to attention or (b) suppressing irrelevant signals. The authors follow a model of the second type and are able to predict temporal profiles of the neurons' firing rates similar to those found electrophysiologically. Through their modeling, the authors show that high-level vision processes can also be explained by low-level processes, namely, that selectively tuning a model of attention can reproduce properties of neuronal firing rates related to attention. In another article, Banerjee et al., the authors propose a computational model, based on extreme value theory, for the integration of two sensory modalities, namely, the olfactory input and visual sensitivity of zebrafish. The authors show that the neural signals (pattern and rate of neuronal firing) differ in their statistical fit when the signals are uni-modal (visual) or multimodal (visual + olfaction). They further demonstrated this by developing a machine-learning-based classifier that was able to successfully distinguish between these neural signals. This study contributes to the intriguing area of interactions between different sensory modalities.

Edited and reviewed by:

*Richard D. Emes, University of Nottingham, United Kingdom*

> \*Correspondence: *Hedva Spitzer hedva@eng.tau.ac.il*

#### Specialty section:

*This article was submitted to Bioinformatics and Computational Biology, a section of the journal Frontiers in Bioengineering and Biotechnology*

Received: *21 November 2019* Accepted: *27 December 2019* Published: *22 January 2020*

#### Citation:

*Spitzer H, Otazu X and Hel-Or H (2020) Editorial: Integrating Visual System Mechanisms, Computational Models and Algorithms/Technologies. Front. Bioeng. Biotechnol. 7:483. doi: 10.3389/fbioe.2019.00483*

Two additional articles deal with the chromatic properties of the visual system as expressed in the retinal and cortical layers. In Barkan and Spitzer, a computational model is presented that suggests an explanation of the visual mechanisms underlying the compensation of chromatic aberrations. The model takes into account the spatio-chromatic properties of the color-coded cells in the retina, as well as the significance of the anatomical separation of the Konio and Parvo chromatic pathways in the visual system. Furthermore, the model predicts the enigmatic S-cone pattern phenomenon reported by Shevell and Monnier. In a review article by Patterson et al., the authors discuss the role of retinal midget ganglion cells and cortical double-opponent cells in the context of hue perception on the one hand and spatial perception on the other. The authors present hypotheses that are, in part, not in accord with those supported by some other models, including that of Barkan and Spitzer mentioned above. As usual in science, and especially in neuroscience, conflicting results are an interesting stimulus for discussion and for the comparison of opposing or differing ideas.

Another group of studies develops computational models in order to assist in understanding specific vision mechanisms. In Piu et al., the authors acquired experimental data and then performed statistical analysis on the data to obtain a representation of pupil size changes. They analyzed the oscillatory dynamics of the pupil at rest by extracting features from the cross-recurrences of these oscillators as expressed in the power spectrum. The authors state that their novel analysis approach can form an adaptable diagnostic tool for identifying alertness and/or pathological status and thus might assist in clinical assessments of pathologies associated with the autonomic nervous system. In Reynaud and Hess, the authors analyze their previously measured dataset and assess the visual disparity sensitivity of subjects across different spatial frequencies. The computational element of their study lies in the data analysis: they applied inter-correlations and factor analysis to the data and found two spatial frequency channels for disparity sensitivity, one tuned to high spatial frequencies and one tuned to low spatial frequencies. The authors suggest that this tuning of disparity channels could be important in computer vision for designing multi-scale stereo matching algorithms. In Marić and Domijan, binary attention maps are modeled using a recurrent competitive network with excitatory-inhibitory nodes. The model reproduces top-down mechanisms of attention that enhance the perceived saliency of low-level features. The model is based on an extension of previously suggested Winner Take All (WTA) choice models, and is inspired by neurological components such as dendritic non-linearities that act on the excitatory units and modulate synaptic transmission. The model integrates a large set of data on visual attention and successfully predicts several attentional effects, including the ability to integrate information across space and time to form the intersection or union of two maps that are defined by different features.

Finally, a selection of articles uses computational models to predict and explain high-level visual tasks, perceptual behavior, and visual phenomena. Some of these studies experiment with ambiguous stimuli and suggest explanations of visual system mechanisms that contribute to the stabilization of the visually perceived display content. The article by Cohen-Duwek and Spitzer models the Filling-In phenomenon and, specifically, the alternating effects in which the background of a stimulus may lead to two different types of perceived color: the original or the complementary color. The model successfully predicts both effects through a heat diffusion function that is triggered by both the chromatic edges of the stimulus and the remaining achromatic contours, in contrast to previous studies that use the edges as blockers of diffusion rather than as triggers. In another article by Cohen-Duwek and Spitzer, a computational model is presented that predicts spatial Filling-In effects that involve several chromatic edges, such as the Watercolor illusion and the Cornsweet effect. The model is based on the heat diffusion equation, where the scene gradients serve as heat sources. The model successfully predicts both the assimilative and non-assimilative watercolor effects, as well as additional Filling-In visual effects. The study thus supports the theory that a shared visual mechanism is responsible (or partly responsible) for the vast variety of "conflicting" filling-in phenomena. Two articles studied motion integration using bi-stable moving visual stimuli that can induce two different percepts (e.g., coherent and transparent). In Li et al., bi-stable moving visual stimuli composed of line segments were presented to participants, and their individual biases were modeled using a Bayesian approach, indicating a preference for one of the two possible interpretations of the scene. The authors found that observers' bias increases with increasing density and that this effect is greater in regular patterns than in irregular patterns. The authors tested a number of Bayesian models and show that a motion segregation prior best explains the interaction of density and regularity observed in the collected experimental data. The authors suggest that bias is used by observers to stabilize visual perception of the world. In the article by Liu et al., motion integration in normal observers was compared to integration by observers with anisometropic amblyopia, a neurodevelopmental disorder of the visual system. They showed that when the stimulus contrast is reduced, the control observers exhibit a change in percept patterns, but amblyopic eyes do not. Using Bayesian modeling, the authors show that contrast indeed affects motion integration. Considering this together with the modeling outcomes, the authors suggest that there is a different motion coding mechanism in the amblyopic visual system. Finally, in Yankelovich and Spitzer, Boundary Completion was modeled using a functional optimization approach in which there is no need to extract different image features. The model evaluates several possible interpretations of the input and assigns a cost to each. The interpretation with minimal cost is the model's output. The model successfully predicts real and illusory contours. Additionally, for ambiguous stimuli, the model is able to find multiple possible image interpretations, which are ranked according to the probability that they are perceived.

A different group of papers in this special issue proposes practical algorithms and applications that were inspired by elements of the Human Visual System, or that include such components. In Tsitiridis et al., the authors develop a system to detect "Presentation Attacks", in which a person's image is illegally reproduced and used to abuse a biometric system. The authors develop a biologically-inspired presentation attack detection model, based on features that mimic neurobiological processes in the human visual system. Machine learning tools are exploited to successfully predict whether incoming data is a spoofing attack or a legitimate image. In the article by Paulun et al., a new system for dynamic visual recognition is introduced that combines a bio-inspired sensor and hardware with a brain-like spiking neural network that mimics the layered structure and the retinotopic organization of the retina and visual cortex. Following training, the network showed a very high object classification accuracy. Finally, two papers in this group deal with image quality and esthetics. In Martinez-Garcia et al., the authors address the important question of biased or imbalanced datasets and their effect on quantitative modeling of the visual system. The authors show this in the specific case of layered retina-cortex models that learn to predict subjective quality ratings of images. They show that the database under-represents certain stimuli (such as cross-masking between different frequencies) and thus the model trained on this database does not generalize well. The authors show that augmenting the database with synthetic examples significantly improves the model's performance and generalization. The authors emphasize that naturalistic databases should be combined with artificial stimuli to improve model performance.

In the comprehensive review by Brachmann and Redies, the authors describe the advances achieved by the Vision Science and the Computer Vision communities in the parallel fields of experimental visual aesthetics and computational visual aesthetics. The paper highlights the similarities between the types of features exploited for these tasks by both communities and the similarities between the quantitative tools used to analyze and define these features. The review covers models and algorithms that predict ratings, style, and artist identity, as well as computational methods in the art history of paintings and photographs. The review covers methods at both the sensory (low-level, bottom-up) and cognitive (high-level) levels, including modern deep learning methods. In addition, the review summarizes results from the field of experimental aesthetics and deals with several specific image properties. The authors show that a close interaction between computational and experimental approaches is fundamental to answering difficult questions.

In this special issue, we have collected a variety of articles that look at the intriguing cycle of visual system, computational models, and applications. The studies show how computational models can explain the visual system from the neuronal level to the behavioral level, providing understanding and novel insights. In turn, the visual system provides ideas and inspiration for the computational units and driving rules of the models. The interaction cycle continues with the design of practical algorithms and applications in the field of computer vision that arise from the computational models and the ensuing understanding of the visual system. Some of the papers in this collection even succeeded in achieving algorithms that perform on par with state-of-the-art capabilities, owing to the adoption of ideas from the visual system. Other papers provide inspiration for possible future algorithms to accomplish different visual tasks.

Within this cycle of mutual contributions, we can learn some intriguing ideas and raise interesting questions.

A recurring notion is the idea of the visual system providing educated guesses on the visual scene, based on the visual input as well as on priors, and internal representations and computations. Multi-stable inputs in the 3D world, occluded and ambiguous scenes, allow several interpretations. However, these are processed by the visual system that considers the possible interpretations and produces an "educated guess" as the best explanation of the visual scene. Such a mechanism tends to lend stability and consistency to our visual world.

An interesting insight, which has been previously established, is the importance of visual illusions as a basis for research on the visual system. As several of the articles in this issue have shown, illusions serve to mirror "errors" and "biases" of the visual system, and they provide a window into the visual system's mechanics via visual perception.

Finally, we note that several of the articles introduce the notion of the aesthetics of the visual scene. Beyond offering a comprehensive review, they take a small step toward the famous philosophical-psychophysical problem as it applies to visual aesthetics, through their discussion of originality and creativity.

#### AUTHOR CONTRIBUTIONS

All authors listed have made a substantial, direct and intellectual contribution to the work, and approved it for publication.

**Conflict of Interest:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2020 Spitzer, Otazu and Hel-Or. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# Characterization of Spatial Frequency Channels Underlying Disparity Sensitivity by Factor Analysis of Population Data

Alexandre Reynaud\* and Robert F. Hess

*McGill Vision Research, Department of Ophthalmology, McGill University, Montreal, QC, Canada*

It has been suggested that at least two mechanisms mediate disparity processing, one for coarse and one for fine disparities. Here we analyze individual differences in our previously measured normative dataset on the disparity sensitivity as a function of spatial frequency of 61 observers to assess the tuning of the spatial frequency channels underlying disparity sensitivity for oblique corrugations (Reynaud et al., 2015). Inter-correlations and factor analysis of the population data revealed two spatial frequency channels for disparity sensitivity: one tuned to high spatial frequencies and one tuned to low spatial frequencies. Our results confirm that disparity is encoded by spatial frequency channels of different sensitivities tuned to different ranges of corrugation frequencies.

#### Edited by:

*Hedva Spitzer, Tel Aviv University, Israel*

#### Reviewed by:

*John E. Lewis, University of Ottawa, Canada Hagit Hel-Or, University of Haifa, Israel Ronen Segev, Ben-Gurion University of the Negev, Beersheba, Israel*

#### \*Correspondence:

*Alexandre Reynaud alexandre.reynaud@mail.mcgill.ca*

> Received: *15 February 2017* Accepted: *28 June 2017* Published: *11 July 2017*

#### Citation:

*Reynaud A and Hess RF (2017) Characterization of Spatial Frequency Channels Underlying Disparity Sensitivity by Factor Analysis of Population Data. Front. Comput. Neurosci. 11:63. doi: 10.3389/fncom.2017.00063*

Keywords: disparity sensitivity, qDSF, binocular vision, stereopsis, individual differences, factor analysis

# INTRODUCTION

The visual system utilizes the displacement or disparity in the two images seen by the two eyes to compute the depth of objects. In terms of the underlying mechanisms, Pulliam (1982) first suggested that there were two global disparity mechanisms, one tuned to low spatial frequencies involving coarse disparities and one tuned to high spatial frequencies involving fine disparities. Yang and Blake (1991) also argued for only two spatial frequency channels for disparity processing and their model was later refined by Tyler et al. (1994). Additional evidence for two spatial frequency channels subserving disparity processing comes from the work of Norcia et al. (1985); Wilcox and Allison (2009); Witz et al. (2014). However, other studies suggest a multiple channels model (Julesz and Miller, 1975; Glennerster and Parker, 1997; Serrano-Pedraza et al., 2013).

Assessing the tuning of these channels has been of great importance for mechanistic models of stereo computer vision (Marr and Poggio, 1979; Nishihara, 1984; Quam, 1987; Rohaly and Wilson, 1993). These can be used to map different scales of matching in hierarchical structures (Nishihara, 1984; Quam, 1987) with, for instance, coarse-to-fine constraints (Rohaly and Wilson, 1993). In robotic vision, these tuning properties can be used to calibrate cameras (Tsai, 1986) and vergence algorithms (Piater et al., 1999; Lonini et al., 2013).

While most studies have used masking paradigms to characterize spatial frequency channels for stereopsis (Julesz and Miller, 1975; Yang and Blake, 1991; Shioiri et al., 1994; Tyler et al., 1994; Glennerster and Parker, 1997; Prince et al., 1998; Serrano-Pedraza et al., 2013), another possibility comes from factor analysis of population data (Read et al., 2016). The individual differences are then treated as systematic and meaningful, reflecting the true variability of underlying mechanisms rather than random noise (Peterzell, 2016). Identifying the sources of variability within the population is informative about the common processing mechanisms. Spatial and temporal frequency channels can therefore be characterized by analyzing individual differences and correlations. The rationale is that the correlation in detection thresholds for pairs of stimuli should be higher for stimuli detected by the same mechanism than for stimuli detected by different mechanisms (Owsley et al., 1983; Sekuler et al., 1984; Billock and Harding, 1996). Hence, by looking at the inter-correlations between individuals' sensitivities at neighboring frequencies, one is able to determine the presence of frequency channels (Mayer et al., 1995; Billock and Harding, 1996; Peterzell and Teller, 2000; Simpson and McFadden, 2005; Rosli et al., 2009). A factor analysis of the dataset, consisting of a principal component analysis (PCA) followed by a rotation of the factors to obtain a simple structure, can then characterize the tuning curves of the channels (Simpson and McFadden, 2005). Using factor analysis of population sensitivities, Peterzell and Teller (1996, 2000) assessed the tuning of spatial frequency channels for luminance and color contrast sensitivities. Here we use similar methods to analyze individual differences in our previously measured normative dataset on disparity sensitivity as a function of spatial frequency for oblique corrugations in 61 observers (**Figure 1**; Reynaud et al., 2015) in order to assess the spatial frequency tuning of the underlying disparity channels.

#### METHODS

In this paper, we analyze the normative dataset for disparity sensitivity as a function of spatial frequency of 61 observers (25 males, 36 females, mean age 26 years, ±5.7 SD, with normal or corrected-to-normal visual acuity) that we measured previously using the quick Disparity Sensitivity Function (qDSF, Reynaud et al., 2015), a method adapted from the quick Contrast Sensitivity Function (qCSF, Lesmes et al., 2010).

The stimuli used in this dataset were stereograms composed of spatially filtered 2-D fractal noise carriers with oblique (45° or 135°) sinusoidal corrugations at 0.24, 0.33, 0.46, 0.64, 0.89, 1.23, 1.72, and 2.39 c/d. The spatial frequency of the carrier was 4 times the spatial frequency of the corrugation (see Reynaud et al., 2015). Disparity was modulated and the subjects' task was to identify the orientation of the corrugation in depth (45° or 135°) in a single-interval identification task to measure the disparity detection threshold. Stimuli were displayed on a passive wide 23″ 3D-Ready LED monitor (ViewSonic V3D231), viewed with polarized 3D glasses at 70 cm, in a dimly lit room. The measured individual disparity sensitivity functions and their average are reproduced in **Figure 1**. Analysis was performed with Matlab R2016a (The MathWorks). The hierarchical clustering analysis was performed with functions from the Statistics and Machine Learning Toolbox.

#### RESULTS

The average disparity sensitivity peaks in the high spatial frequency range, around 1.2 c/d. However, we can observe a large variability in the individual sensitivities, with some observers showing low-pass, band-pass, or high-pass profiles (**Figure 1**). Hence a factor analysis of these sensitivities might provide insight into the common mechanisms mediating them.

FIGURE 1 | Normative dataset. Disparity sensitivity as a function of spatial frequency is reported for 61 individual observers (thin color lines) and their average (thick black line). Sketches at the top illustrate the stimulus at different corrugations frequencies. Adapted with permission from Reynaud et al. (2015).

**Figure 2** shows the scatterplot matrix of inter-correlations (Peterzell, 2016) for the log-disparity sensitivity of all 61 observers. Each cell plots the log-disparity sensitivity of all observers at one frequency (indicated on the diagonal in the same row) against their sensitivity at another frequency (indicated on the diagonal in the same column). For instance, in the bottom-left cell, the log-disparity sensitivity of each observer at 0.24 c/d is plotted pairwise against that observer's log-disparity sensitivity at 2.39 c/d. The coefficient of determination R<sup>2</sup> between the two frequencies is then computed. Two regions of high inter-correlations (R<sup>2</sup> > 0.5), at low spatial frequency (green) and high spatial frequency (blue), appear along the diagonal.
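The inter-correlation matrix described above can be sketched in a few lines. This is only an illustration in Python with NumPy (the study itself used Matlab); the synthetic two-channel data, the 4/4 split of frequencies between channels, the noise level, and all function names are assumptions made for the example and are not the study's data or code.

```python
import numpy as np

# Corrugation spatial frequencies (c/d) as listed in the Methods.
FREQS = np.array([0.24, 0.33, 0.46, 0.64, 0.89, 1.23, 1.72, 2.39])


def synthetic_log_sensitivity(n_observers=61, seed=0):
    """Hypothetical data: two latent channels plus noise (NOT the study's data)."""
    rng = np.random.default_rng(seed)
    low = rng.normal(size=(n_observers, 1))    # per-observer gain, low-SF channel
    high = rng.normal(size=(n_observers, 1))   # per-observer gain, high-SF channel
    w_low = np.array([1, 1, 1, 1, 0, 0, 0, 0], dtype=float)  # assumed 4/4 split
    w_high = 1.0 - w_low
    noise = 0.3 * rng.normal(size=(n_observers, FREQS.size))
    return low * w_low + high * w_high + noise


def r_squared_matrix(log_sens):
    """Coefficient of determination between every pair of frequency columns.

    log_sens has shape (n_observers, n_frequencies); the result is a
    symmetric (n_frequencies, n_frequencies) matrix of squared Pearson r.
    """
    r = np.corrcoef(log_sens, rowvar=False)
    return r ** 2


data = synthetic_log_sensitivity()
r2 = r_squared_matrix(data)
print(np.round(r2, 2))
```

With data built this way, frequencies that share a latent channel should show high R<sup>2</sup> values and frequencies from different channels low ones, which is the block pattern along the diagonal that the analysis looks for.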

These two regions are supported by the hierarchical clustering analysis of the log-disparity sensitivity at all spatial frequencies. The pairwise distance between observations was calculated as one minus the sample linear correlation between observations, and the hierarchical cluster tree was computed with the average distance. The resulting dendrogram is shown to the right of the inter-correlation matrix, with the spatial frequencies as its leaves. We note, however, that different distance measures and different linkage procedures can result in somewhat different final clusters, some grouping the 3 lowest and the 5 highest frequencies, for instance. The two cluster branches whose linkage is less than the default 70% threshold are shown in blue and green. As with the first, qualitative approach, these two groups suggest the presence of two spatial frequency channels for disparity sensitivity, which might correspond to the coarse and fine disparity channels.
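As a rough illustration of this clustering step, the sketch below uses a distance of one minus the Pearson correlation between frequencies and merges clusters by average linkage until two remain. It is written in Python with NumPy rather than the Matlab toolbox functions used in the study, the small agglomerative routine is hand-written for the example, and the synthetic data and all names are hypothetical.

```python
import numpy as np


def synthetic_log_sensitivity(n_observers=61, n_freqs=8, seed=0):
    """Hypothetical two-channel data (NOT the study's data)."""
    rng = np.random.default_rng(seed)
    low = rng.normal(size=(n_observers, 1))
    high = rng.normal(size=(n_observers, 1))
    w_low = np.array([1, 1, 1, 1, 0, 0, 0, 0], dtype=float)  # assumed 4/4 split
    noise = 0.3 * rng.normal(size=(n_observers, n_freqs))
    return low * w_low + high * (1.0 - w_low) + noise


def correlation_distance(log_sens):
    """One minus the Pearson correlation between frequency columns."""
    return 1.0 - np.corrcoef(log_sens, rowvar=False)


def average_linkage(dist, n_clusters=2):
    """Agglomerative clustering with average linkage on a distance matrix.

    Repeatedly merges the two clusters whose mean pairwise distance is
    smallest, until n_clusters remain. Returns sorted lists of indices.
    """
    clusters = [[i] for i in range(dist.shape[0])]
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Mean of all distances between members of the two clusters.
                d = dist[np.ix_(clusters[a], clusters[b])].mean()
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        merged = sorted(clusters[a] + clusters[b])
        clusters = [c for i, c in enumerate(clusters) if i not in (a, b)]
        clusters.append(merged)
    return sorted(clusters)


data = synthetic_log_sensitivity()
clusters = average_linkage(correlation_distance(data), n_clusters=2)
print(clusters)  # indices of the frequencies grouped into each cluster
```

Swapping the distance measure or the linkage rule in `average_linkage` is a simple way to see how sensitive the grouping is to those choices, which is the caveat raised in the text.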


In order to determine the precise tuning of these channels, we performed a factor analysis on the dataset. If we decompose the full dataset with a principal component analysis (PCA), we obtain the components shown in **Figure 3A**, with a percentage of explained variance (calculated from the eigenvalues of the PCA) associated with each component reported in the scree plot **Figure 3B**.

The first component has the shape of the average sensitivity (see **Figure 1**). The first two components (blue and green) explain more than 91% of the variance, and the elbow of the scree plot occurs between the second and third components (**Figure 3B**). Because we previously identified two regions of high inter-correlations, and because this percentage of explained variance is considered sufficient to accurately describe the data (Simpson and McFadden, 2005), these two principal components were picked to describe the underlying disparity sensitivity channels. To make them interpretable, these two principal components, or factors, were then rotated using a varimax orthogonal rotation to obtain a simple structure accounting for the channel tuning curves (Kaiser, 1958; Peterzell and Teller, 2000; Simpson and McFadden, 2005; Peterzell, 2016). These factor tuning curves are reported in **Figure 3C**. The first factor peaks at the highest measured frequency, 2.4 c/d, and the second peaks around 0.65 c/d. They characterize the high and low spatial frequency channels identified by the inter-correlation analysis (respectively blue and green regions in **Figure 2**).
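The decomposition and rotation described above can be sketched in a few lines of NumPy. This is a minimal sketch under stated assumptions: the data are mean-centered but not standardized, loadings are scaled by the square root of the eigenvalues, and `varimax` uses the common SVD-based iteration. The function names and these preprocessing choices are illustrative and may differ from the original analysis.

```python
import numpy as np


def pca(data, n_components):
    """PCA of an (observers x frequencies) array via SVD of the centered data."""
    centered = data - data.mean(axis=0)
    _, s, vt = np.linalg.svd(centered, full_matrices=False)
    eigvals = s ** 2 / (data.shape[0] - 1)          # variance along each component
    explained = eigvals / eigvals.sum()             # values for a scree plot
    loadings = vt[:n_components].T * np.sqrt(eigvals[:n_components])
    return loadings, explained


def varimax(loadings, max_iter=100, tol=1e-8):
    """Orthogonal varimax rotation of a (variables x factors) loading matrix."""
    p, k = loadings.shape
    rotation = np.eye(k)
    criterion = 0.0
    for _ in range(max_iter):
        rotated = loadings @ rotation
        col_ss = (rotated ** 2).sum(axis=0)         # column sums of squares
        target = rotated ** 3 - rotated @ np.diag(col_ss) / p
        u, s, vt = np.linalg.svd(loadings.T @ target)
        rotation = u @ vt                           # closest orthogonal matrix
        new_criterion = s.sum()
        if new_criterion < criterion * (1 + tol):   # no further improvement
            break
        criterion = new_criterion
    return loadings @ rotation, rotation
```

Since the rotation is orthogonal, it redistributes variance between the two retained factors without changing how much of each frequency's variance they jointly explain.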

We wanted to test if the two channels we identified could in fact account for different classes within the population. In order to estimate the weights β of each of these factors in each individual sensitivity, we projected our dataset onto the basis defined by the two identified factors. The best linear unbiased estimator of β is obtained using the Moore-Penrose pseudo inverse X<sup>+</sup> (equation 1):

$$\boldsymbol{\beta} = \mathbf{X}^+ \mathbf{y} \tag{1}$$

where y is the matrix of all individual sensitivities, X<sup>+</sup> is the Moore-Penrose pseudo inverse of the new basis matrix X whose two columns represent the two factors, and β is a two-row matrix in which each column contains the pair of weights associated with the two factors estimated for each subject (Friston et al., 1995; Woolrich et al., 2004; Reynaud et al., 2011).

The sensitivities ŷ reconstructed solely from the linear combination of these two factors (Equation 2) are plotted in **Figure 4A**:

$$
\hat{\mathbf{y}} = \mathbf{X}\boldsymbol{\beta} \tag{2}
$$

Overall, the reconstructed sensitivities faithfully reproduce the original ones, except for the very low-pass profiles, whose peaks shift to the right.
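Equations 1 and 2 translate directly into NumPy. In this minimal sketch, `X` is assumed to be a (frequencies x 2) matrix whose columns are the two factors and `y` a (frequencies x observers) matrix of sensitivities; the function name and array layout are assumptions made for the example.

```python
import numpy as np


def project_onto_factors(X, y):
    """Estimate factor weights (Equation 1) and reconstruct the data (Equation 2).

    X : (n_frequencies, 2) basis matrix, one factor per column
    y : (n_frequencies, n_observers) individual sensitivities
    """
    beta = np.linalg.pinv(X) @ y   # Equation 1: beta = X^+ y, shape (2, n_observers)
    y_hat = X @ beta               # Equation 2: y_hat = X beta
    return beta, y_hat
```

The residual `y - y_hat` is orthogonal to both factors, so anything the two channels cannot express (such as the peak position of very low-pass profiles) is left out of the reconstruction.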

To determine whether these channels can account for different classes within the population, we report a scatterplot of the weights β<sup>1</sup> of the first factor vs. the weights β<sup>2</sup> of the second factor in **Figure 4B** for all observers. The mean weights for the first and second factor are, respectively, 1.76 and 1.48. As expected from the explained variance (**Figure 3B**), the weight of the first factor (the high-frequency channel) is greater than the weight of the second (the low-frequency channel) in 70% of the cases. The distribution of these weights appears homogeneous and no clusters are revealed. However, the weights of the first factor seem to be relatively greater than the weights of the second in the high-value range, whereas the opposite seems to hold, slightly, in the low-value range. This is further revealed by the slope of the linear regression between the log-values of the weights (0.53), which is less than 1 (dashed line). In fact, the correlation between the weights is very high (coefficient of determination R<sup>2</sup> = 0.51, p < 0.0001). Altogether, these observations suggest that the weights of the low and high spatial frequency channels co-vary: when the sensitivity is high for the low-frequency channel, it is high for the high-frequency channel too. However, the high-frequency channel contributes relatively more when the sensitivity is high and the low-frequency channel contributes relatively more when the sensitivity is low, in accordance with our previous observations (Reynaud et al., 2015).

### DISCUSSION

The qDSF method assumes the sensitivity function follows the truncated log-parabola model and hence has a bell shape with a constant part, an increase to a peak and a drop-off (Watson and Robson, 1981; Lesmes et al., 2010). We previously showed that this model can accurately represent the sensitivity function compared to non-constrained methods (Reynaud et al., 2015) and documents large differences in sensitivities within the population (see **Figure 1**). For different individuals, this function can peak at very different frequencies and can show lowpass, band-pass or high-pass profiles. The resultant variability in sensitivity across spatial frequency provides a rich dataset for inter-correlation analyses (Peterzell et al., 1995; Peterzell, 2016).

Because two regions of inter-correlations were identified among the population in **Figure 2** and because 2 components accounted for more than 91% of the variance, our data could accurately be described by just 2 channels. However, the criterion used to select the number of meaningful components in a PCA may vary. Popular selection methods such as the scree plot (Jackson, 1993) or the random average under permutation analysis will indeed determine 2 components, while some other methods will give fewer (the broken stick method gives 1 component) or more (the parallel analysis gives barely 3, and the Kaiser-Guttman criterion, which recommends eigenvalues >1, gives 3 too). Some methods, such as Bartlett's test, even recommend all 8 components, which would not reduce the dimensionality of the data (Bartlett, 1950). A complete description of these methods can be found in Peres-Neto et al. (2005).

Hence, we cannot completely rule out the possibility of a single-channel or multiple-channel hypothesis. Serrano-Pedraza and Read reported a single-channel mechanism specific to vertical corrugations (Serrano-Pedraza and Read, 2010; though see Witz et al., 2014). However, the large difference we can observe between the low-pass sensitivity profile of some observers and the band-pass profile of others would indicate that more than one channel is involved. Several studies suggested a multiple-channel mechanism (Julesz and Miller, 1975; Schumer and Ganz, 1979; Cobo-Lewis and Yeh, 1994; Glennerster and Parker, 1997; Serrano-Pedraza et al., 2013) with a broad channel tuning of ∼2–3 octaves, comparable to our observations (Schumer and Ganz, 1979; Cobo-Lewis and Yeh, 1994). It is then possible that the 2 channels we observe are part of a multiple-channel system covering a wider range of spatial frequencies, or that they overlap with intermediate channels continuously covering the spatial frequency range. Yang and Blake (1991) also observed two spatial frequency channels for disparity sensitivity using a masking paradigm. They described one channel centered around 3 c/d, which could correspond to the high spatial frequency channel we observed, and one centered around 5 c/d. However, their study and the present study didn't measure the same spatial frequency range, which might explain why they didn't identify our low spatial frequency channel and why we didn't observe their high one.

The results of the present study suggest that there are two channels (**Figure 4B**), a low frequency channel that contributes to the detection of low corrugation frequencies and a more sensitive high frequency channel that contributes to the detection of high corrugation frequencies. We didn't observe any dichotomy based on these two channels within our population (Wilcox and Allison, 2009), which confirms the observations of most other population studies (Coutant and Westheimer, 1993; Bohr and Read, 2013; Bosten et al., 2015).

The implications of the assessment of the tuning of these disparity channels could be important in computer vision to design behaviorally relevant stereo matching algorithms. For instance, it could be used to tune the different layers of multiscale algorithms (Rohaly and Wilson, 1993) or provide fine and coarse scales for algorithms processing in center and periphery, respectively, as stereopsis could be mediated by different mechanisms in central and peripheral vision (Wardle et al., 2012; Witz and Hess, 2013).

# CONCLUSION

The analysis of the inter-correlations in the disparity sensitivity as a function of spatial frequency revealed two disparity channels. With a factor analysis of the population data, we determined that the first channel is tuned to high spatial frequencies (peaks at 2.4 c/d) and the second is tuned to low spatial frequencies (peaks at 0.65 c/d). We also observed that these two channels are well correlated with each other. Our results confirm that disparity is encoded by multiple spatial frequency channels that are of different sensitivities and subserve different ranges of corrugation frequencies.

# AUTHOR CONTRIBUTIONS

AR and RH designed the research and wrote the manuscript. AR analyzed the data.

# FUNDING

This work was supported by a Natural Sciences and Engineering Research Council of Canada grant (NSERC #46528) to RH.

# ACKNOWLEDGMENTS

We thank the three reviewers for their helpful comments and suggestions.


**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2017 Reynaud and Hess. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# Computational and Experimental Approaches to Visual Aesthetics

#### Anselm Brachmann and Christoph Redies\*

Experimental Aesthetics Group, Institute of Anatomy, Jena University Hospital, School of Medicine, University of Jena, Jena, Germany

Aesthetics has been the subject of long-standing debates by philosophers and psychologists alike. In psychology, it is generally agreed that aesthetic experience results from an interaction between perception, cognition, and emotion. By experimental means, this triad has been studied in the field of experimental aesthetics, which aims to gain a better understanding of how aesthetic experience relates to fundamental principles of human visual perception and brain processes. Recently, researchers in computer vision have also gained interest in the topic, giving rise to the field of computational aesthetics. With computing hardware and methodology developing at a high pace, the modeling of perceptually relevant aspects of aesthetic stimuli has huge potential. In this review, we present an overview of recent developments in computational aesthetics and how they relate to experimental studies. In the first part, we cover topics such as the prediction of ratings, style and artist identification as well as computational methods in art history, such as the detection of influences among artists or forgeries. We also describe currently used computational algorithms, such as classifiers and deep neural networks. In the second part, we summarize results from the field of experimental aesthetics and cover several isolated image properties that are believed to have an effect on the aesthetic appeal of visual stimuli. Their relation to each other and to findings from computational aesthetics is discussed. Moreover, we compare the strategies in the two fields of research and suggest that both fields would greatly profit from a joint research effort. We hope to encourage researchers from both disciplines to work more closely together in order to understand visual aesthetics from an integrated point of view.

#### Edited by:

Xavier Otazu, Universitat Autònoma de Barcelona, Spain

#### Reviewed by:

Qing Yun Wang, Beihang University, China Jesús Malo, Universitat de València, Spain

#### \*Correspondence:

Christoph Redies christoph.redies@med.uni-jena.de

Received: 30 May 2017 Accepted: 30 October 2017 Published: 14 November 2017

#### Citation:

Brachmann A and Redies C (2017) Computational and Experimental Approaches to Visual Aesthetics. Front. Comput. Neurosci. 11:102. doi: 10.3389/fncom.2017.00102

Keywords: computational aesthetics, experimental aesthetics, visual preference, art history, artist identification, style identification, image features, statistical image properties

# 1. INTRODUCTION

Dating back more than two thousand years, aesthetics has been the subject of debates by philosophers and other scholars alike. Defined by the Oxford Dictionary as "the philosophy of the beautiful or of art," "a system of principles for the appreciation of the beautiful," and "the distinctive underlying principles of a work of art or a genre" (OED, 2017), aesthetics represents a field of interest that has attracted researchers from diverse scientific disciplines, also outside of philosophy. In 1876, the founder of experimental aesthetics, Gustav Fechner, published his seminal book entitled "Vorschule der Ästhetik" (Fechner, 1876). He believed that the aesthetic appeal of physical objects manifests itself in stimulus properties that can be measured in an objective (formalistic) way. Specifically, he attempted to show that rectangles with an aspect ratio equal to the golden ratio are more appealing to human observers than rectangles having other aspect ratios. Researchers have later raised concerns about the normative role in rectangular preferences (Green, 1995; McManus et al., 2010). Nevertheless, Fechner's scientific (objective) view of aesthetics provided the basis for the newly emerging field of empirical aesthetics. In this field, hypotheses regarding the perceived beauty of images, paintings or even every-day objects are proposed and tested experimentally for their validity. This stimulus-driven approach, called by Fechner aesthetics from below, was different from the aesthetics that was prevalent in Fechner's time and derived aesthetic principles from superordinate philosophical concepts (aesthetics from above) (Cupchik, 1986). Fechner is also credited with conceiving the field of psychophysics, which relates human perception to well-defined physical properties of stimuli. By applying this approach to aesthetics, he attempted to relate physical image properties to aesthetic perception in humans. The area of research that has taken up this idea in modern times is experimental aesthetics, a subfield of psychology.

Another discipline of natural science that studies aesthetics is neuroaesthetics, a subfield of brain research. In this field, modern imaging techniques, such as functional magnetic resonance imaging (fMRI), enable researchers to study the activation of brain regions when human observers view aesthetic stimuli (Cela-Conde et al., 2011; Chatterjee and Vartanian, 2014). This type of research has led to a better understanding of which neural networks in the human brain are involved when we have an aesthetic experience. Research in neuroaesthetics is beyond the scope of the present review.

In recent years, aesthetics has also been studied using computational methods. In the field of computer science, computational aesthetics, a subfield of computer vision, has entered the field of aesthetics. In this area, there have been a variety of different studies on the aesthetics in digital images, for example, using digital reproductions of paintings. The birth of computational aesthetics is often attributed to Birkhoff's book "Aesthetic Measure" (Birkhoff, 1933), although the book does not mention the term itself (for an overview of the evolution of the term, see Greenfield, 2005). In a very mathematical way, Birkhoff proposed a formula for an aesthetic measure M, which is a function of O, order or reward by a positive tone of feeling, and C, complexity or a feeling of effort of attention. Stating that reward should be proportional to effort, Birkhoff concludes that M = O/C best describes their relation.

A definition of computational aesthetics is given by Hoenig (2005), who describes it as "[...] the research of computational methods that can make applicable aesthetic decision in a similar fashion as humans can." To Hoenig, this definition emphasizes two major aspects: first, the use of computational methods, and second, their applicability to aesthetic decision making. More precisely, Galanter (2012) discusses how computational aesthetics is concerned with both "the creation and evaluation of art using computers." He argues that the creation of art necessarily requires evaluation and gives the example of an artist, who, while learning about aesthetics and gathering experience, evaluates art created by others. When creating artworks himself, micro-evaluations help the artist guide his own creative process. Upon finishing his creation, the artist gains new insights about his art in a final evaluation of the created piece. Given the importance of the evaluation process, we will focus on it in the present review. As pointed out by Stork (2009a), the computational analysis of paintings has several advantages compared to an analysis carried out by human experts. For example, a computational analysis can pick up very subtle relationships that may escape the attention of human observers; moreover, computational methods are objective in nature and are potentially non-exhaustive in the amount of detail analyzed (e.g., every single brushstroke in a painting).

The aim of the present review is to provide an overview of recent developments in the field of computational aesthetics and to point out its potential relevance for research in experimental aesthetics and vice versa. Our goal is to boost the awareness of researchers in experimental aesthetics for the wealth of data that computational aesthetics has generated in recent years. We would also like to inform scientists in computational aesthetics about some basic concepts and results from experimental aesthetics. Our review thus outlines a possible link between research on the objective (physical) properties of visual stimuli and experimental studies that take into account the subjective responses of humans to aesthetic stimuli, as originally proposed by Fechner. Specifically, we focus on the evaluation of visual images (photographs or digitally reproduced artworks) and the analysis of image properties. Important areas of research will be referenced and exemplary works will be presented, without striving for completeness. Topics include the prediction of ratings of photographs and paintings, the classification of images regarding their artist or style, computational methods for problems in art history, and, finally, the investigation of statistical properties of aesthetically pleasing images and artworks.

# 2. COMPUTATIONAL AESTHETICS: ALGORITHMS AND APPLICATIONS

Computational aesthetics is approached from different points of view. All articles reviewed here deal in some way with aesthetics in the form of photography and paintings and are motivated predominantly by producing applications and testing or improving algorithms. Accordingly, one of the tasks that is often pursued in computational aesthetics is to develop algorithms that predict aesthetic ratings of photographs. Such algorithms have direct applications. For example, in online photo communities (for example Flickr, Photo.net, etc.), they can be used to select photographs of high aesthetic quality and discard snapshots that users would rate low. On a more commercial side, such systems are used for retrieving and licensing high-quality photographs from the internet for their use as stock photographs. Another possible application is to install such algorithms in industrial cameras and smartphones, where they identify high-quality images in a split second. As we will show in the present article, there has been tremendous success in building such systems.

The prediction of ratings is just one possible application among many, where computers can make decisions regarding aesthetics. Computational methods have also been successfully applied to problems in art history, such as content analysis of paintings, forgery detection, or detection of a painter's influence. These applications will also be reviewed in the following sections.

# 2.1. Prediction of Ratings

One major trend in computational aesthetics is to predict ratings of image quality or aesthetic appeal. Possible applications of this technology are improved cameras, which automatically select the most appealing photos among many, optimization of advertisements for their aesthetic value, or even talent scouting in photo-sharing communities. In the early days of computational aesthetics, researchers followed the then popular practice of designing features explicitly for a given task. In order to predict the aesthetic appeal of a given image, researchers determined to what extent different photographic principles, like composition according to the rule of thirds or depth of field, were followed in images. They quantified these principles by expressing them numerically, either as binary or continuous values, called features. Features can be either local, describing only pixels or patches and their immediate neighborhood, or they can be global and describe properties of the image as a whole. Global features seem especially suitable to describe artistic photographs or artworks because concepts such as artistic composition refer to the relation between pictorial elements across the image. Another distinction can be made concerning the level of abstraction: low-level features describe basic properties, such as colors and edges, while high-level features can describe more abstract image content. The features can then be used to train a classifier on a dataset of images so that it can learn to predict ratings given by humans. This goal is achieved by mathematically describing the relation between the subjective scores and the feature set. Popular choices for classifiers are, for example, Bayes classifiers, Decision Trees, or Support Vector Machines (SVMs). This approach will be presented in more detail in section 2.1.1. In recent years, computational aesthetics has gone from designing features by hand to using generic features that have been developed for other purposes in computer vision. This development has reached a pinnacle with the development and widespread use of Deep Neural Networks. Approaches using generic features will be discussed in section 2.1.2.

#### 2.1.1. Hand-Crafted Image Features

One of the first attempts to measure aesthetics in an image was published by Tong et al. (2004), who proposed a method to distinguish between photographs taken by professional photographers and photographs taken by non-expert (home) users. They used a set of low-level features that describe blur, contrast, colorfulness and saliency, and combined it with general-purpose low-level features that capture texture, shape and energy in the frequency spectrum, by using difference-edge histograms. In total, they proposed 21 different features, which added up to 846 dimensions. After reducing the dimensionality, they reported classification results comparing Boosting, an SVM and a Bayesian classifier; the Bayesian classifier performed best.

Using another set of low-level features, Datta et al. (2006) built a classifier for distinguishing images of high aesthetic appeal from other images, as rated by the community of the popular photo-sharing website Photo.net. Overall, the authors collected 3,581 different images and split them into two classes according to their aesthetic rating by the users of the site (low and high rating). They explicitly stated that their goal was not to build the best-performing classifier, but rather to be able to draw conclusions from the best-performing features. Their choice of features was based on common intuition, rules of thumb in photography and trends that they observed for the ratings of the collected images. In total, they proposed a set of 56 different features, containing basic ones, such as colorfulness, saturation, hue, size and aspect ratio, as well as adherence to the rule of thirds. The features were selected as follows: first, the authors used a one-dimensional SVM to find the features with the most discriminative power and selected the top 30. Starting with an empty feature set, they then iteratively added those features that improved the classification the most. As a result, they found that average hue, average pixel intensity as well as a saturation-based rule of thirds measure contributed the most to the aesthetic value of an image, as rated by human observers.
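The two-stage selection procedure described above (rank features individually, then add them greedily) can be sketched as follows. This is an illustration only: a cross-validated nearest-class-mean classifier stands in for the one-dimensional SVM, and the function names, the fold scheme, and the stopping rule are assumptions rather than details reported by Datta et al. (2006).

```python
import numpy as np


def centroid_accuracy(X, y, n_folds=5):
    """Cross-validated accuracy of a nearest-class-mean classifier.

    Stand-in for an SVM (assumption); expects binary labels y in {0, 1}.
    """
    idx = np.arange(len(y))
    correct = 0
    for fold in range(n_folds):
        test = idx % n_folds == fold
        train = ~test
        mu0 = X[train & (y == 0)].mean(axis=0)
        mu1 = X[train & (y == 1)].mean(axis=0)
        d0 = ((X[test] - mu0) ** 2).sum(axis=1)
        d1 = ((X[test] - mu1) ** 2).sum(axis=1)
        correct += ((d1 < d0).astype(int) == y[test]).sum()
    return correct / len(y)


def select_features(X, y, top_k=30, max_selected=5):
    """Rank single features, keep the top_k, then add features greedily."""
    single = [centroid_accuracy(X[:, [j]], y) for j in range(X.shape[1])]
    candidates = [int(j) for j in np.argsort(single)[::-1][:top_k]]
    selected, best = [], 0.0
    while candidates and len(selected) < max_selected:
        scores = [centroid_accuracy(X[:, selected + [j]], y) for j in candidates]
        i = int(np.argmax(scores))
        if scores[i] <= best:       # stop once no candidate improves accuracy
            break
        best = scores[i]
        selected.append(candidates.pop(i))
    return selected, best
```

Because the features are added one at a time according to their marginal gain, the order of `selected` gives a rough indication of which features carry the most discriminative information, which is the kind of conclusion the authors were after.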

Ke et al. (2006) designed a system to distinguish between high-quality professional photographs and low-quality snapshots. They reference the work of Tong et al. (2004) but criticize their black-box approach, which prevents them from gaining any insight into why some photos are better than others, although the system by Tong and colleagues performed well for the task. Ke et al. (2006) therefore chose an approach similar to the one by Datta et al. (2006) and designed a set of features that capture image quality. They based their choice of the features on interviews conducted with photographers. Their feature set contained the spatial distribution of edges, color distribution, hue count and blur as well as contrast and brightness. For classification, they used a naive Bayes classifier and tested their system on images that were downloaded from a photo contest website. The blur feature turned out to be the most discriminative metric.

Luo and Tang (2008) extracted very simple features that captured lighting, simplicity, composition or color harmony, based on the subject region and the background of an image. They reported an improvement in classification over Datta et al. (2006) and Ke et al. (2006) and attributed this success to the distinction between foreground and background, whereas the previous methods computed their features on the image as a whole.

Besides focusing on low-level features as provided by Ke et al. (2006), Dhar et al. (2011) also integrated high-level attributes in their system in order to predict aesthetic value and interestingness. According to the authors, high-level attributes define characteristics of images as humans would describe them, and can be classified into compositional attributes (like the rule of thirds), content attributes (like the presence of people) and sky illumination attributes. Dhar et al. (2011) reported improved performance compared to the approach by Ke et al. (2006).

Although the general focus of aesthetic quality assessment in computational aesthetics is on the prediction of ratings of photographs, a few researchers have also proposed methods for quality assessment of paintings. Li and Chen (2009), for example, proposed a total of 40 features that capture color, brightness and compositional characteristics of a painting. Using these features, they applied a Bayes classifier as well as AdaBoost to a binary task to predict whether a painting received high or low rating scores. In their work, they provide a detailed discussion of the importance of the individual features.

What all these approaches have in common is that a combination of multiple features is used to predict aesthetic ratings. While this has proven successful for automated aesthetic decision making, there are a number of problems that preclude a deeper understanding of the role of individual features in these decisions. First, because the features are not necessarily independent of each other, it would require more sophisticated statistical methods to extract the influence of each of them. Second, the experimental conditions under which ratings are obtained in most of the above-mentioned studies are unknown, unspecified or variable (for example, with regard to the size of the stimuli on the retina, the brightness of the stimuli, contrast settings of the monitors, background illumination, sequence of stimulus presentation, etc.). Third, the users of internet platforms who provide the ratings often remain anonymous, which precludes any specification of their personal characteristics (sex, age, cultural background, etc.). All these factors might influence the results or introduce artifacts.

In experimental aesthetics, some of the features used in the above combinatorial approaches have been isolated and studied in psychological experiments under well-defined experimental conditions (for a survey of such studies, see section 3).

#### 2.1.2. Generic Image Features

Generic image features are features that are not explicitly designed for the prediction of image aesthetics, but rather for other popular research topics in computer vision, like object detection and classification, scene understanding, or image retrieval. An example of such features are the SIFT descriptors (scale-invariant feature transform; Lowe, 2004), which were originally designed for feature matching and image stitching. SIFT encodes edge orientations in gray-scale images as a vector (for more recent image descriptors, see Canclini et al., 2013).

The first study to model aesthetic ratings based on generic image features was published by Marchesotti et al. (2011). They used SIFT descriptors together with a color descriptor, motivated by the assumption that aesthetic properties, such as the presence of sharp edges or the saturation of colors, can be described implicitly by these kinds of features. The authors chose a Bag-Of-Visual-Words and a Fisher-Vector representation in order to represent prototypical patches for aesthetic and non-aesthetic photographs. As a result, they reported an improvement in classification rates for high-quality and low-quality images, compared to the methods by Datta et al. (2006) and Ke et al. (2006), who used hand-crafted features (see section 2.1.1). While hand-crafted features make it possible to quantify which feature contributes the most to an aesthetic rating, this interpretability is lost with generic features. Here, conclusions can only be drawn by comparing the images that are rated high or low by the model, because the features of the model are not deliberately designed to capture known properties of aesthetics, but rather hide their relation to them. For example, Marchesotti et al. report that all blurry and low-resolution images were rated low in their model, whereas images that displayed foreground objects with sharp edges on out-of-focus backgrounds were rated highly. Moreover, highly-rated images had a dominant color or used complementary colors in their palette; if too many colors were present, images received low scores in general. On the same dataset, Murray (2012) used a low-level contrast model that was originally developed for saliency estimation and showed that it can also be applied to predict aesthetic preferences.

In recent years, deep learning models, in particular Convolutional Neural Networks (CNNs), have started to conquer many subareas in the field of computer vision and artificial intelligence. Although the basic idea of CNNs was proposed more than three decades ago (Fukushima, 1980; Lecun and Bengio, 1995), only recently have progress in computing technologies and the availability of huge datasets for training helped to restore the interest in using CNNs for image processing (Krizhevsky et al., 2012; Simonyan and Zisserman, 2014; He et al., 2015; Huang et al., 2016). CNNs learn a hierarchy of filters, which are applied to an input image in order to extract meaningful information from the input. The training is done using backpropagation, a supervised training algorithm, in which the current output of a network is compared to a desired output. Filter parameters of the network are changed according to their contribution to the current error. When trained on a large set of images, CNNs tend to learn features that resemble Gabor-like edge detectors and color-opponent filters at lower layers of the CNNs. These features are akin to neural responses in the early mammalian visual system. On higher layers of the CNNs, features capture more abstract image content by integrating the lower-layer features (Yosinski et al., 2015). Different open-source implementations exist, which also include a variety of models that were pretrained for object or scene recognition. Their availability enables researchers to either retrain networks that already work well for recognition tasks (a process called fine-tuning), or to use features from pretrained models without any further modification.
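To make the notion of a Gabor-like edge detector concrete, the sketch below builds such a filter by hand and slides it over a grayscale image, which is the basic operation a convolutional layer performs with each of its filters. The kernel here is constructed analytically rather than learned, and the function names and parameter values are assumptions chosen for illustration.

```python
import numpy as np


def gabor_kernel(size=11, wavelength=4.0, theta=0.0, sigma=2.5):
    """Zero-mean Gabor patch: a cosine carrier under a Gaussian envelope.

    theta is the direction (in radians) along which the carrier varies;
    theta = 0 gives a filter that responds to vertical stripes or edges.
    """
    half = size // 2
    yy, xx = np.mgrid[-half:half + 1, -half:half + 1]
    xr = xx * np.cos(theta) + yy * np.sin(theta)
    envelope = np.exp(-(xx ** 2 + yy ** 2) / (2.0 * sigma ** 2))
    kernel = envelope * np.cos(2.0 * np.pi * xr / wavelength)
    return kernel - kernel.mean()


def correlate_valid(image, kernel):
    """Slide the kernel over the image (no padding) and sum the products."""
    windows = np.lib.stride_tricks.sliding_window_view(image, kernel.shape)
    return np.einsum("ijkl,kl->ij", windows, kernel)
```

Applied to an image of vertical stripes, the filter with `theta=0.0` responds strongly while the filter with `theta=np.pi / 2` responds weakly; a bank of such filters at several orientations gives a rough picture of what the first layer of a trained CNN tends to compute.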

CNNs have been applied to the task of rating image aesthetics. Lu et al. (2015) trained a two-column deep neural network simultaneously on global and local views of photographs in order to predict their aesthetic rating class (high or low). The authors motivated their architecture by the observation that the aesthetics of an image is influenced by local cues, such as sharpness, as well as global cues, which capture compositional aspects. They evaluated different cropping strategies for the local image view and report a higher accuracy in the prediction of image aesthetics than reported for previous approaches on the same dataset (Murray et al., 2012).

Dong et al. (2015) applied the AlexNet architecture presented by Krizhevsky et al. (2012), which was trained on 1.2 million images to discriminate between 1,000 different object categories. They used the features of the top convolutional layer, which are computed on the entire image, as well as on five local crops, and trained an SVM on the concatenated features. They improved upon the results by Marchesotti et al. (2011) by a margin of about 10%. Interestingly, their approach did not explicitly use features trained in the context of an aesthetic evaluation, but rather for object recognition, so that the decision whether an image was rated as highly aesthetic or not seemed to rely more on image content than on image form.
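The idea of describing an image by features of the whole image plus features of several local crops can be illustrated with a minimal sketch. It is not the implementation of Dong et al. (2015): the crop layout (four corners and the center), the function names, and the `toy_features` function, which stands in for the activations of a pretrained CNN layer, are all assumptions made for illustration.

```python
from statistics import mean, pstdev

def crop(image, top, left, size):
    """Return a size x size window of a 2-D list of pixel values."""
    return [row[left:left + size] for row in image[top:top + size]]

def five_crops(image, size):
    """Four corner crops plus one center crop (an assumed crop layout)."""
    h, w = len(image), len(image[0])
    offsets = [(0, 0), (0, w - size), (h - size, 0), (h - size, w - size),
               ((h - size) // 2, (w - size) // 2)]
    return [crop(image, t, l, size) for t, l in offsets]

def toy_features(patch):
    """Hypothetical stand-in for CNN activations: mean and spread of pixels."""
    values = [v for row in patch for v in row]
    return [mean(values), pstdev(values)]

def global_plus_local_descriptor(image, size):
    """Concatenate features of the whole image with those of five crops."""
    descriptor = toy_features(image)
    for patch in five_crops(image, size):
        descriptor.extend(toy_features(patch))
    return descriptor

# Toy 8 x 8 grayscale image with values in [0, 1].
image = [[(r * 8 + c) / 63 for c in range(8)] for r in range(8)]
descriptor = global_plus_local_descriptor(image, size=4)
print(len(descriptor))  # one global view + five crops, two features each
```

In a real pipeline, the concatenated descriptor of each image would be passed to a classifier such as an SVM; the sketch stops at the descriptor to stay self-contained.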

Denzler et al. (2016) proposed to use CNNs as a model of perception for research in aesthetics. They trained the AlexNet model (Krizhevsky et al., 2012) on different datasets to experimentally evaluate how well pre-learned features of different layers are suited to distinguish art from non-art images using an SVM classifier. They report the highest discriminatory power with a network trained on the ImageNet dataset, which outperforms a network trained solely on natural scenes.

Kao et al. (2016) proposed a multi-task learning approach, in which a CNN was trained to simultaneously assign semantic and aesthetic labels. They explored different network architectures and showed that a network trained to recognize semantic labels in addition to the aesthetic class outperforms a network trained solely to recognize the aesthetic class of an image. This finding is compatible with the role of both content and form in psychological models of aesthetic experience (see below).

Nowadays, deep neural networks have largely replaced the conventional approach of designing features deliberately in order to reflect aesthetic concepts that derive from human intuition. They outperform the conventional approach easily and have a number of additional advantages: (1) Deep neural networks learn features that are important for aesthetic evaluations automatically, provided that a dataset is big enough. (2) They can combine local image properties, such as sharpness or blur, with global properties, such as composition or color harmony. (3) They can even take into account abstract features, such as image content, without the explicit design of such features by humans. (4) Last but not least, deep neural networks are able to learn image properties that humans may not even be aware of. Such properties include unspecified compositional rules that are employed intuitively by photographers and painters (Bell, 1914; Arnheim, 1954; Redies, 2007, 2015).

While deep learning models are state-of-the-art in aesthetic image evaluation, their success comes at a cost. At present, the understanding of deep features and how they work in object or aesthetic recognition lags behind. Although there have been attempts to analyze what deep neural networks actually encode at higher layers (Yosinski et al., 2015), we are far from understanding the success of deep learning in any significant detail. For applications in aesthetic image evaluation, it may be sufficient to simply build systems that closely match human perception in deciding whether an image is considered to be beautiful. However, for researchers who want to learn more about aesthetics per se, the limitations of deep learning models are particularly obvious. With hand-crafted features, it is easy to draw conclusions about which features contribute to the aesthetic value of an image. Deep neural networks and generic features basically represent a black-box approach that lacks this kind of interpretability. Nevertheless, if we can develop tools to understand deep representations in the future, the drawback of deep learning approaches may eventually turn into an asset for understanding aesthetics. Such a more profound understanding would also require that deep learning become more readily explainable in terms of actual neural mechanisms. Although some recent studies lead in this direction (for example, see Brachmann et al., 2017), an abundance of questions remains.

# 2.2. Other Classifications of Images

Besides the prediction of visual preference, there has been another trend in computational aesthetics, which tends to be more focused on artworks than on photography. In this trend, images are not classified according to their aesthetic appeal, but with respect to the correct identification of the painter or the artistic style, an undertaking that is usually performed by art experts. From a methodological point of view, the identification of painter and style are related tasks that often go hand in hand. However, in the early days of computational aesthetics, the identification of the artist who created a given painting (Cezanne, Vermeer, Rembrandt, etc.) was more popular. More recently, there seems to be a shift to the prediction of the style (Realism, Impressionism, Cubism, etc.), as works from more and more art collections become digitized and available on the web. These open-source collections enable researchers to easily collect the huge number of images that are needed in order to train and test algorithms. Possible applications for such methods are recommender systems for online art markets or the more precise description of the stylistic singularities of particular artists.

#### 2.2.1. Artist Identification

Using a Naive Bayes Classifier, Keren (2002) computed Discrete Cosine Transform (DCT) coefficients on an image and identified the painters of art images (Rembrandt, van Gogh, Picasso, Magritte, Dali) by using a voting scheme, where each 9 × 9 block of an image is assigned the style of an artist. A majority voting for an image yielded the final result and the authors reported an accuracy of 86% for choosing the correct painter. Widjaja et al. (2003) focused on nude paintings and used color of skin in order to identify the artist. They trained an SVM on color profiles of patches extracted from images of four different painters (Rubens, Michelangelo, Ingres, and Botticelli) and reported a rate of correct identifications of 85%. Li and Wang (2004) proposed a system for artist identification based on wavelets and a Multiresolution Hidden Markov Model and tested their approach on a dataset of grayscale Chinese ink images that contained works by five different Chinese artists. Besides the classification of paintings regarding their artist, they found that their modeling approach can also be used as a measure of similarity. To recognize the artist of an image, Lombardi (2005) proposed a system that used a set of low-level features for intensity, edge information, spatial frequency information, as well as a new feature that captured color. Shen (2009) combined a set of global visual features (color, textures, shape) and local visual features (Gabor wavelets) and reported an identification accuracy of 69.7% when distinguishing 25 classical Western painters in a dataset that included Caravaggio, Rubens, Vermeer, and van Gogh. For classification, they used an RBF neural network. Khan et al. (2010) automatically predicted painters (Ingres, Matisse, Monet, Picasso, Rembrandt, Rubens, Titian and van Gogh) by using a Bag-of-Visual-Words approach. 
They computed SIFT descriptors as well as color name descriptors and trained an SVM on a dataset that consisted of 40 images each of the eight artists (320 images total). They report an accuracy of 62% for the combination of color and shape features. Condorovici et al. (2013) used a dataset of 1,896 paintings by 15 different artists (including Pollock, Rembrandt, Cezanne, and Magritte), from which they extracted low-level features like an RGB color histogram and edge information by Gabor filters. The authors experimented with eight different classifiers, among which multiclass logistic regression yielded the best results. Cetinic and Grgic (2013) extracted three types of features, namely image-intensity statistics, color-based features, and texture-based features, and used a multi-layer perceptron with one hidden layer; they reported an accuracy of 75.3% in identifying the correct one among 20 painters.
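The block-wise voting scheme described above for Keren (2002) can be sketched as follows. This is an illustration only: the tiles are assumed to be non-overlapping, and `toy_classifier` is a hypothetical stand-in for the Naive Bayes classifier on DCT coefficients, which is not reproduced here.

```python
from collections import Counter

def iter_blocks(image, block=9):
    """Yield non-overlapping block x block tiles of a 2-D list (edges dropped)."""
    h, w = len(image), len(image[0])
    for top in range(0, h - block + 1, block):
        for left in range(0, w - block + 1, block):
            yield [row[left:left + block] for row in image[top:top + block]]

def vote_for_artist(image, classify_block, block=9):
    """Label every tile, then return the most frequent label and its share."""
    votes = Counter(classify_block(tile) for tile in iter_blocks(image, block))
    label, count = votes.most_common(1)[0]
    return label, count / sum(votes.values())

def toy_classifier(tile):
    """Hypothetical per-block rule: bright tiles -> 'A', dark tiles -> 'B'."""
    values = [v for row in tile for v in row]
    return "A" if sum(values) / len(values) > 0.5 else "B"

# 18 x 27 image: six tiles, the left two columns of tiles bright, the right one dark.
image = [[1.0 if c < 18 else 0.0 for c in range(27)] for _ in range(18)]
label, share = vote_for_artist(image, toy_classifier)
print(label, round(share, 2))
```

Returning the vote share alongside the label gives a rough indication of how unanimous the block-level decisions were, which a single winning label would hide.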

Overall, it is difficult to compare the performance of the different methods for artist identification because a common database, on which results could be reported and compared to others, is lacking to date. Condorovici et al. (2013) addressed this problem by comparing different methods to their guessing baseline. However, this approach may give an advantage to researchers who select painters whose styles differ more strongly to begin with. For example, it may be harder to distinguish an impressionist painting by Claude Monet from one by Paul Cezanne than to distinguish an abstract drip painting by Jackson Pollock from a surrealist painting by René Magritte.

In summary, the most popular choices for features that are used for the classifiers include a measure to capture texture or spatial frequency, edge histograms for shape detection and histograms for color analysis; all these features are low-level and do not describe image content.

More recently, classification studies in other areas of research no longer rely on one classifier, but report results for a set of different classifiers that are studied in parallel. A popular choice for this type of analysis is the Weka data mining software (Hall et al., 2009).

#### 2.2.2. Style Prediction

To predict art styles in various sets of artworks, different approaches have been used. Gunsel et al. (2005) trained an SVM classifier in order to discriminate among five painting styles (Classicism, Impressionism, Cubism, Expressionism, and Surrealism) as well as between twelve different painters. They proposed a system that computes a 6-dimensional vector of low-level features, including brightness and gradient information of an image as well as statistics of the gray-level histogram. This system allows a user to query for similar paintings of unknown style. For painter and art movement classification, the authors report a high accuracy with a low number of false-positive results. A different approach was taken by Jiang et al. (2006), who designed a way to retrieve traditional Chinese paintings and then classify them into one of two styles, Gongbi (traditional Chinese realistic painting) or Xieyi (freehand style). For this task, they used low-level features, which captured color, texture and edges. With a classifier that combined a decision tree and SVMs, they obtained accuracies that are suitable for practical purposes.

Wallraven et al. (2009) asked participants to group images from 11 different art periods (e.g., Gothic, Renaissance, Classicism, Surrealism and Postmodern Art) and different artists into self-selected categories. The resulting categories of artworks corresponded well with the canonical art periods. The authors then computed several low-level features of the images (e.g., raw pixel values, color histograms, frequency, or a GIST descriptor; Oliva and Torralba, 2006) and tested how well the features described the clustering into different art periods. The authors found a low correlation between their set of low-level features and the grouping into art periods and concluded that humans rely more on higher-level properties. Siddiquie et al. (2009) used multiple kernel learning in their approach and chose texture, histograms of gradient orientations (HOGs), color, and saliency as their features to discriminate between seven different styles (Abstract Expressionism, Baroque, Cubism, Graffiti, Impressionism and Renaissance). Zujovic et al. (2009) chose five different genres (Abstract Expressionism, Cubism, Impressionism, Pop Art, and Realism). As features, they used steerable filters as well as edge information extracted by a Canny edge detector. For color, they calculated HSV histograms and used their bins as features. The classification was done with several different classifiers and the authors reported a best overall accuracy of 69.1% for the AdaBoost classifier. Shamir et al. (2010) classified paintings of nine artists of different genres (Impressionism, Surrealism and Abstract Expressionism) and reached an accuracy of 91.0% in style classification by using a set of features that contained frequency statistics, edge information and color information. Čuljak et al. (2011) focused on texture and color features, stating that such features are closely related to the way humans perceive artworks. As genres, they chose Realism, Impressionism, Cubism, Fauvism, Pointillism and Naïve Art. They tested a range of classifiers and reported best results for an SVM, reaching 60.2% accuracy. Ivanova et al. (2012) used various MPEG-7 descriptors in order to distinguish different art styles. In their experiment, they noted that color features were better suited than texture features for distinguishing between art styles and artists. Condorovici et al. (2015) reported that the key to a better accuracy in style discrimination is to let features be inspired by human perception. Accordingly, they used luminance and features that detected shape, texture, edges and color. A total of eight genres was selected for style classification in their study. Like other authors, they tested a set of classifiers and reached best results with an SVM, outperforming their predecessors.
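Several of the studies above use color histograms as features. A minimal sketch of how HSV histogram bins can be turned into a feature vector is given below. It is not the code of Zujovic et al. (2009): the number of bins, the per-channel normalization, and the function name `hsv_histogram` are assumptions made for illustration.

```python
import colorsys

def hsv_histogram(pixels, bins=8):
    """Concatenate normalized hue, saturation and value histograms.

    pixels: non-empty sequence of (r, g, b) tuples with components in [0, 1].
    """
    hists = [[0] * bins for _ in range(3)]
    n = 0
    for r, g, b in pixels:
        for channel, value in enumerate(colorsys.rgb_to_hsv(r, g, b)):
            index = min(int(value * bins), bins - 1)  # value 1.0 goes to last bin
            hists[channel][index] += 1
        n += 1
    # Each channel histogram sums to 1 after dividing by the pixel count.
    return [count / n for hist in hists for count in hist]

# Pure red, green and blue plus a mid-gray pixel.
pixels = [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0), (0.5, 0.5, 0.5)]
features = hsv_histogram(pixels, bins=4)
print(features[:4])  # hue bins
```

Because the histogram discards pixel positions, two paintings with the same palette but different composition yield the same vector, which is one reason such features are usually combined with edge or texture descriptors.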

While all articles mentioned above used low-level features, which capture formal aspects of paintings, results from Arora and Elgammal (2012) first indicated that semantic features are also important for style classification. The authors compared different features and reported the best results for an SVM trained on classeme feature vectors (Torresani et al., 2010), which represent an image as combined classification scores for many weak classifiers that were trained on low-level descriptors.

Beginning with the work of Krizhevsky et al. (2012) and due to the renewed interest in deep neural networks, these models have also been applied to style prediction. Karayev et al. (2013) used a relatively large dataset of 100K images together with color features, GIST descriptors, saliency, and meta-class features (Bergamo and Torresani, 2012) for image content, as well as DeCAF features (Donahue et al., 2014), which are activations of higher layers of CNNs that encode image content rather than image form. They additionally trained a classifier for content features on the categories of animals, vehicles, indoor objects and people. For 25 different painting styles, they reached a mean accuracy of 47.3% with all features in combination. In addition to painting style, they also reported results for photographic styles in their article. One of their main conclusions is that style is highly dependent on content. Another approach that also relied on DeCAF features can be found in Bar et al. (2014). These authors reported that a combination of DeCAF features and PiCoDes features (Bergamo et al., 2011), a binary descriptor that incorporates several low-level descriptors, shows the best performance in style recognition.

Saleh and Elgammal (2015) used the object labels that were produced by the networks proposed by Krizhevsky et al. (2012) as a feature to discriminate the artist, the style and the genre of roughly 80K paintings. They concluded that classemes (Torresani et al., 2010) are the best way to represent artist-, genre-, and style-specific properties for discrimination. Tan et al. (2016) conducted several experiments regarding painting style, genre, and artist discrimination and used the architecture proposed by Krizhevsky et al. (2012). They fine-tuned a model that was trained on the ImageNet (Deng et al., 2009) dataset for object recognition, trained a model from scratch, and also tested SVM classifiers on deep features. Interestingly, the fine-tuned model yielded the best results in all tasks and even outperformed the model that was trained from scratch.

Painter and style prediction go hand in hand. In the early days, hand-crafted features that captured the same type of image properties were equally suitable for both tasks. With more and more image data becoming available for training, style prediction can now be trained and tested on exceedingly large sets of images and collections of style categories can be expanded with ease. For painter identification, this is not necessarily the case because, for most artists, only a relatively limited number of paintings are available for training deep networks. As another complicating factor, many artists changed their style during their lifetime. For example, several abstract artists started their career with realistic paintings (for example, Wassily Kandinsky, Piet Mondrian, and Jackson Pollock). As a result, training deep neural networks for painter identification will likely remain more difficult than for style prediction.

For style prediction, the availability of huge collections of digitized artworks will open new possibilities for researchers who will use machine learning methods in the future. For example, popular and widely used datasets of paintings, such as the databases of the Google Art Project and WikiArt (formerly WikiPaintings), contain several thousand annotated artworks.

As outlined for rating prediction (section 2.1), deep features are becoming more and more popular for style prediction and increasingly replace hand-crafted features because they are also capable of representing semantic information. For example, Chiaroscuro style paintings often depict indoor scenes and people, while Impressionist paintings frequently display landscapes. Therefore, deep features do well on style prediction and prove to be more powerful than low-level features that focus on image form only. On the other hand, as with the prediction of ratings, interpretability is not as high as it has been with purposely designed features.

Although the vast area of computer-generated artistic images is beyond the scope of the present review, we would like to point out that deep models have boosted recent developments in this area that harbor a large potential for understanding aesthetics. Gatys et al. (2016) proposed an algorithm that can transfer the style of any image to another by matching the statistics of the Gram matrix of lower-layer features, as well as image content that is represented at higher layers. They demonstrated that arbitrary images can be redrawn in the style of famous paintings by Van Gogh or Picasso. More recent generative models (Generative Adversarial Networks [GANs]; Goodfellow et al., 2014) are even capable of matching the style of entire collections of artworks, as shown by Zhu et al. (2017), who used collections of paintings by Monet, Cezanne and Van Gogh to redraw landscape photographs to match the respective painter's style. While GANs are advanced methods that originate in machine learning, other methods like the approach by Malo and Simoncelli (2015) focus more on using physiologically plausible architectures to generate images with similar textures. This latter approach is likely to have more explanatory power because it makes use of mathematical tools that are more directly related to findings from vision science.
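To make the notion of Gram-matrix statistics more concrete, the following minimal sketch computes the Gram matrix of a small set of feature maps and compares two such matrices. It is an illustration under simplifying assumptions, not the algorithm of Gatys et al. (2016): the feature maps are hand-written toy lists rather than CNN activations, no normalization constants are applied, and the names `gram_matrix` and `style_distance` are hypothetical.

```python
def gram_matrix(feature_maps):
    """Gram matrix G[i][j] = sum over positions k of F[i][k] * F[j][k].

    feature_maps: list of channels, each a flat list of activations.
    """
    return [[sum(a * b for a, b in zip(fi, fj)) for fj in feature_maps]
            for fi in feature_maps]

def style_distance(maps_a, maps_b):
    """Mean squared difference between the entries of two Gram matrices."""
    ga, gb = gram_matrix(maps_a), gram_matrix(maps_b)
    n = len(ga)
    return sum((ga[i][j] - gb[i][j]) ** 2
               for i in range(n) for j in range(n)) / n ** 2

# Two channels observed at three spatial positions.
maps = [[1.0, 2.0, 3.0], [0.0, 1.0, 0.0]]
# The same activations with the spatial positions rearranged.
shuffled = [[3.0, 1.0, 2.0], [0.0, 0.0, 1.0]]

print(gram_matrix(maps))
print(style_distance(maps, shuffled))
```

The second example has the same Gram matrix as the first although its activations sit at different positions. The Gram matrix captures which features co-occur but not where they occur, which is why it is suited to describing texture-like style rather than spatial layout.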

# 2.3. Other Applications

In the previous sections, we described computational methods to predict ratings and to discriminate between paintings by different artists and art styles. Most of these methods rely on the perceptual distinctness of different types of artworks. However, art has also been studied from other perspectives. In the present section, we review computational methods that can help to answer questions relevant to art history as well as art forgery detection. Some of these methods aim to discriminate rather subtle differences between artworks that may not even be apparent to the human eye.

For a review on earlier methods, see Stork (2009a). A more recent overview is given in Spratt and Elgammal (2014), who list different applications and publications of computational methods for art analysis, including semantic annotation of artworks, ordering of paintings by creation date, or the detection of similarities in paintings and artists in order to reveal mutual influences between artists.

#### 2.3.1. Art History

Among the methods that address art historical questions, we can discern two areas of interest. First, some researchers have developed computational methods to study artistic technique. Second, the influence of a painter on the style of other artists has been studied.

Criminisi et al. (2002) developed methods for investigating the perspective and the reconstruction of the 3-dimensional space from realistic paintings. This information can help art historians to answer spatial questions, for example, about the height of people or objects that are depicted in paintings. In another study, Criminisi and Stork (2004) analyzed inaccuracies in the perspective cues in a painting by Jan van Eyck and demonstrated that it is unlikely that the painter used optical aids like mirrors during the creation of the painting "Portrait of Arnolfini and his wife." Stork and Johnson (2006) applied a technique that was originally designed for the detection of tampering in photographs in order to localize light sources in paintings. They presented such an analysis for Georges de La Tour's painting "Christ in the carpenter's studio." Based on their findings, they rebutted the claim that the light source of the depicted scene lies outside the painting, which could have been an indication of the use of optical aids as well. Papaodysseus et al. (2006) investigated the use of stencils in late Bronze Age wall paintings by applying a Hough Transform (a method for finding instances of mathematically defined shapes in images), and identified a set of stencils that were likely used during the creation of the wall paintings. Kim et al. (2014) proposed statistical measures to quantify the usage of individual colors, their variety in a painting, and the roughness of the brightness of a painting, and reported significant differences for different art periods. Berezhnoy et al. (2005) studied color and texture features in paintings by van Gogh. They confirmed that the painter increasingly made use of opponent colors later in his lifetime. Later, Berezhnoy et al. (2009) proposed a method for aiding art experts in automatically extracting the orientations of brushstrokes in a painting.

The study of a painter's influence on other artists, which can be investigated by detecting similarities between images, is a popular topic of research in computational aesthetics. Bressan et al. (2008) used SIFT features and local color statistics to compute similarities between images based on a Fisher Kernel representation of the images. Shamir and Tarakhovsky (2012) used a set of 4,027 features that represented many different aspects of visual appearance (e.g., shape, texture, color) and computed a phylogeny, which shows distinct clusters for classic artists like Vermeer or Rembrandt and for modern artists like Jackson Pollock, Mark Rothko, or Wassily Kandinsky. Wang and Takatsuka (2012) extracted color and composition features, which allowed them to classify Renaissance, Impressionist and Postimpressionist paintings. Furthermore, they applied hierarchical clustering in order to identify relationships among artists and demonstrated that they can detect influences of preceding art periods on Picasso's works. Abe et al. (2013) proposed a framework for determining artistic influences based on the semantics of images. By using classeme features to compute distances between images (Torresani et al., 2010), they succeeded in identifying novel cases where one artist influenced another, which had not been considered by art historians before. Elgammal and Saleh (2015) approached the problem of assessing creativity in terms of the originality of an artwork and represented influences and originality as a graph. Relying on classemes for subject matter and GIST features for compositional aspects, they computed a creativity score for each painting in comparison to contemporary artworks.

#### 2.3.2. Forgery Detection

Another example where computational methods can help art historians is the detection of forgeries, which is a problem closely related to artist identification. In artist identification, the works of an artist are identified among many others that usually possess rather different characteristics, which are often obvious even to laymen. However, when detecting forgeries, any differences may no longer be as easy to spot, so that the task may be difficult even for art experts. Both approaches aim at identifying unique features of an artist, but an algorithm that works well for artist identification may not work as well for authentication and vice versa.

For example, Lyu et al. (2004) performed a wavelet decomposition of eight works attributed to the Renaissance painter Pieter Bruegel the Elder and five imitations of his work. From the wavelet statistics, they extracted a feature vector for subimages of each image and performed authentication by measuring distances between these high-dimensional points. They found that imitations of Bruegel's works differ significantly from authentic paintings. In another application of their technique, they solved the problem of "many hands." Here, art historians are interested in how many different painters contributed to one particular painting. Using their method, they were able to identify at least four different painters for face depictions in an image attributed to Pietro Perugino, a notion that is shared by art historians. Polatkan et al. (2009) introduced a new dataset of images that included originals and purposely copied paintings. Using the parameters of a Hidden Markov Model trained on wavelet coefficients, they succeeded in discriminating the copies from the originals. Li et al. (2012) studied the brushstrokes of paintings by Vincent van Gogh and used them for comparison with contemporaries and forgeries, as well as for dating different periods of van Gogh's work. Johnson et al. (2008) summarize different approaches by three research groups for discriminating between 82 original van Gogh paintings, 6 non-original works, and 13 paintings of questionable authorship. All approaches are based on a wavelet decomposition of the images.

The work of American painter Jackson Pollock has received particular interest from the scientific community. Taylor et al. (1999) performed a fractal analysis of the artist's drip paintings and found that the fractal dimension, computed using a box-counting approach, increased over the artist's lifetime. The authors suggested that this method could be used for authenticating or dating individual works by the artist. Taylor's approach was criticized by Jones-Smith and Mathur (2006), who showed that they could easily generate images that had the same fractal properties albeit not being similar to Pollock's paintings in their aesthetic value. Stork (2009b) later defended Taylor and colleagues and argued that, while one feature in isolation may not be sufficient for the analysis, a combination of multiple fractal measures can provide useful information. Shamir (2015) used a set of features from biological image analysis (Shamir et al., 2008) and reported an accuracy of 93.0% in discriminating between original and non-original drip paintings.
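The box-counting approach mentioned above estimates a fractal dimension from how the number of occupied grid cells grows as the cells shrink. The following minimal sketch illustrates the principle on two shapes with known dimensions. It is not the procedure of Taylor et al. (1999): the box sizes, the least-squares fit, and the function names are assumptions made for illustration, and a real analysis would start from a binarized scan of a painting.

```python
import math

def count_boxes(points, size):
    """Number of size x size grid cells that contain at least one point."""
    return len({(x // size, y // size) for x, y in points})

def box_counting_dimension(points, sizes):
    """Least-squares slope of log N(s) against log(1/s)."""
    xs = [math.log(1.0 / s) for s in sizes]
    ys = [math.log(count_boxes(points, s)) for s in sizes]
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return slope / sum((x - mx) ** 2 for x in xs)

# Integer pixel coordinates of two test patterns on a 64 x 64 grid.
side = 64
filled_square = [(x, y) for x in range(side) for y in range(side)]
diagonal_line = [(i, i) for i in range(side)]
sizes = [1, 2, 4, 8, 16]

dim_square = box_counting_dimension(filled_square, sizes)
dim_line = box_counting_dimension(diagonal_line, sizes)
print(round(dim_square, 3), round(dim_line, 3))
```

The filled square and the straight line recover dimensions close to 2 and 1, respectively; patterns with fractal structure yield non-integer values in between, and the estimate depends on the range of box sizes chosen.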

Hughes et al. (2010) applied a sparse coding scheme in order to compare authentic Bruegel paintings with works by imitators. They demonstrated that their technique can be used to discriminate between authentic and non-authentic Bruegel drawings. Olshausen and DeWeese (2010) suggested that the methods of detecting forgeries brought forward by Hughes et al. (2010) could be useful not only in learning styles of particular artists but also for using these statistics to generate novel images. Montagner et al. (2016) proposed a system for forgery detection of paintings by the Portuguese painter Amadeo Souza-Cardoso. In their approach, they combined a brushstroke analysis using SIFT features on RGB images and an analysis of the pigments in the painting by hyperspectral imaging. Using a dataset of 12 images, among which one was not painted by the artist, they successfully determined the authenticity of the original paintings.

In summary, computational methods can provide support for art historians who study individual paintings or artists. Computational methods have aided art historians in multiple ways, for example by enabling them to detect the use of practical aids like stencils or projectors in the creation of an artwork. Furthermore, telling forgeries from originals as well as the dating of an artist's work can be improved with the help of algorithmic approaches. Another application is the exploration of hitherto unknown influences between artists.

# 3. EXPERIMENTAL AESTHETICS: INVESTIGATION OF SPECIFIC IMAGE PROPERTIES

In experimental aesthetics, researchers are not primarily interested in reaching automatic decisions that mimic human aesthetic judgments. Rather, the goal is to find out on what grounds aesthetic judgments are made by human observers and what their biological basis and evolutionary purpose might be. In other words, applications are not the focus of research, but rather a better understanding of aesthetic experience (Berlyne, 1974; Cela-Conde et al., 2011; Chatterjee and Vartanian, 2014; Shimamura, 2014). Before proceeding to concrete examples, we will briefly review some key concepts in experimental aesthetic research.

## 3.1. Basic Concepts in Experimental Aesthetics

It is generally agreed that aesthetic experience is a highly complex phenomenon and involves at least three key domains (perception, cognition and emotion), which are realized at multiple levels of human social organization (universal, cultural and individual) (Jacobsen, 2006; Marković, 2012; Chatterjee and Vartanian, 2014; Redies, 2015).

To a large extent, perception represents bottom-up processing of visual information. Perceptual mechanisms are thought to be universal among humans and are likely to have their origin in the evolution of the human visual system. Whereas it is self-evident that any information associated with a visual stimulus must be processed by the visual system in order to be perceived, it is still a matter of debate whether there are specific mechanisms that mediate the perception of aesthetic (or beautiful) stimuli at lower or mid-levels of visual processing.

On the one hand, it has been demonstrated that visually pleasing images are associated with specific image features that can be measured by objective means. Because artworks of different styles, cultures and artists differ in their content, these common image properties reflect formal characteristics of images (significant form; Bell, 1914). Possibly, these stimulus properties elicit a particular state of neural activity in the visual system (resonance; Taylor et al., 2005; Redies et al., 2007b) or induce the activation of a specific (beauty-responsive) neural mechanism in receptive individuals (Redies, 2015). This specific activation can be thought of as the correlate of visual preference or, more specifically, of the perception of beauty in images.

On the other hand, it has been argued by some modern philosophers, art critics, psychologists and neuroscientists that any visual stimulus can elicit an aesthetic experience, as long as it is presented in an appropriate cultural context. Followers of this cognitive hypothesis often reject the notion that there are objective and universal stimulus properties that characterize aesthetic stimuli. Instead, they emphasize the role of the art-historical context of artworks, the intentions of the artists, conceptual issues, the expertise of the beholder, the status of the artwork and other culturally determined factors (Danto, 1981; Leder et al., 2004; Zeki, 2013; Gopnik, 2014). These factors are, by definition, not universal and do not persist over time, because cultural conditions change perpetually; they reflect cognitive (predominantly top-down) mechanisms in the human brain and relate more to the content and context of artworks than to their form. However, perceptual (sensory) and cognitive factors are not mutually exclusive in aesthetic appreciation; several researchers have included combinations of both types of factors in their models of aesthetic experience (for example, see Jacobsen, 2006; Locher et al., 2007; Marković, 2012; Chatterjee and Vartanian, 2014; Kozbelt and Kaufman, 2014; Shimamura, 2014; Redies, 2015).

Individual experiences also play an important role in aesthetic experience, both in terms of short-term adaptation to the beauty of visual stimuli and in long-term processes, such as familiarization and the acquisition of knowledge about art. Interestingly, interindividual differences have been found even in the preference for basic stimulus properties, such as stimulus complexity (Bies et al., 2016a; Güçlütürk et al., 2016; Lyssenko et al., 2016; Spehar et al., 2016), color (Mallon et al., 2014; Palmer et al., 2016), or the preference for the aspect ratio of rectangles (McManus et al., 2010). Last but not least, the emotions of the beholder also play an important role in aesthetic appreciation (Leder et al., 2004, 2014; Silvia, 2005, 2014).

Against this background of concepts in experimental aesthetics, it is clear that the identification of objective image properties in computational aesthetics can provide an important basis for the understanding of aesthetic perception. Indeed, the notion that aesthetic stimuli are endowed with objectively measurable properties that can be universally recognized and are preferred by humans across cultures seems implicit in many studies in computational aesthetics. However, the knowledge about other factors that depend on the cultural context of individual artworks, on the intentions of the artists and on the cognitive and emotional state of the beholder should make us cautious when confronted with claims that particular image properties are universally preferred across individuals, groups of people or cultures.

A major research topic of experimental aesthetics is the investigation of the specific properties of artworks. This research allows us to gain insight into how aesthetic perception is linked to human vision and contributes to our knowledge on how we perceive the world (Graham and Redies, 2010). In the field of experimental aesthetics, researchers have studied a wide variety of aesthetic experiences, ranging from deeply moving emotions elicited when viewing famous artworks in a prestigious museum, to aesthetic ratings of artworks in a laboratory setting, and to visual preferences for simple artificial patterns displayed on a computer screen. This wide range of aesthetic experiences brings up two issues. First, beyond statistical image properties, cultural, social and psychological factors play an important role in aesthetic experience. Undoubtedly, these factors interact with image properties that characterize artworks. Second, the role of specific image properties may depend on the type (or the intensity) of the aesthetic experience studied. For example, if an image property plays a role in aesthetic preference of simple, computer-generated patterns in a laboratory experiment, the same property may not necessarily influence the aesthetic appreciation of high-quality artworks in a museum (or the classification of photographs in a computational study). With these caveats in mind, we will describe several image properties that have been associated with aesthetic experience in the following sections. Again, we do not strive for completeness, but rather review selected examples that seem particularly instructive, with a focus on artworks and photographs.

## 3.2. Luminance and Color Statistics

The distributions of luminance, color and contrast belong to the low-level image properties that can affect the preference ratings of photographs. For example, Graham and Field (2008) showed that luminance statistics differ between artworks and natural scenes, as do their optical properties. By manipulating luminance statistics in a variety of natural images, including artistic photographs of landscapes, Graham et al. (2016) found that humans prefer images of low skewness (i.e., the third standardized moment) of their luminance distribution, with roughly equal proportions of light and dark in the images. Indeed, artworks tend to have lower-skew luminance histograms than photographs of real scenes across cultures and time periods (Graham and Field, 2007). The authors argue that artists use a non-linear compression to obtain low skewness in their paintings because images with this property can be more efficiently processed by the visual system.
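To make the skewness measure concrete, the following minimal sketch computes the skewness of a flat list of luminance values. The function name `luminance_skewness` and the sample pixel lists are hypothetical and chosen only for illustration; the cited studies may have used a different estimator or a different luminance scale.

```python
def luminance_skewness(pixels):
    """Skewness (third standardized moment) of a flat list of luminance values."""
    n = len(pixels)
    mean = sum(pixels) / n
    # Second and third central moments of the luminance distribution.
    m2 = sum((p - mean) ** 2 for p in pixels) / n
    m3 = sum((p - mean) ** 3 for p in pixels) / n
    if m2 == 0:
        return 0.0  # a perfectly uniform image has no asymmetry
    return m3 / m2 ** 1.5


# Hypothetical 8-bit luminance samples (illustrative values, not real data).
balanced = [10, 60, 128, 196, 246]      # symmetric around mid-gray
mostly_dark = [5, 10, 12, 15, 20, 250]  # dark image with one bright highlight

print(luminance_skewness(balanced))     # zero: light and dark are balanced
print(luminance_skewness(mostly_dark))  # positive: long tail toward bright values
```

A value near zero corresponds to the roughly equal proportions of light and dark described above, whereas a dark image with a few highlights yields a clearly positive value.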

Color is a feature that has been frequently used in classifiers in the field of computational aesthetics (see section 2.1.1). Although it is clear that color contributes much to the aesthetics of visual art, there have been relatively few studies on color in experimental aesthetics. For example, by manipulating color statistics of Renaissance paintings, Pinto et al. (2006) studied lighting conditions that viewers consider optimal; they found that human observers generally prefer illumination conditions that yield increased chromatic diversity. Palmer and Schloss (2010) studied human aesthetic preferences for color, using simple visual stimuli. In their ecological valence theory, they suggest that color preferences arise from the affective responses to color-associated objects. In other words, people like colors that are associated with objects they like. To what extent these results generalize to artworks remains unclear. Mallon et al. (2014) observed that participants preferred specific combinations of color measures in abstract artworks and that this aesthetic preference is subject to short-term visual adaptation.

In the field of computational aesthetics, Leykin and Cutzu (2003) compared the occurrence of color and luminance intensity edges in paintings and photographs of real scenes. Their results indicated that, in paintings, there are significantly more color-only edges than in photographs of real scenes. Moreover, color edges and intensity edges tend to coincide less frequently in paintings than in photographs of real scenes. Cutzu et al. (2005) built a classifier that combined color, edge and texture properties and distinguished artworks and photographs with 90% accuracy.

Aragón et al. (2008) studied the distribution of luminance in Vincent van Gogh's "Starry Night" and other paintings by the artist. Interestingly, the distribution of luminance fluctuations in some of these images resembled the mathematical distribution of fluid turbulence, as described by the Russian mathematician Andrei Kolmogorov. The authors speculated that the painter might have unwittingly introduced this property in order to produce a special feeling of unease and motion.

## 3.3. Complexity

Complexity relates to the subjective impression of how many pictorial elements are contained in a visual stimulus. This property has been studied extensively, both in computational aesthetics and in psychological experiments. Complexity has been captured by a multitude of statistical measures, such as the number of visual elements in an image (Birkhoff, 1933), the fractal dimension (Mureika, 2005; Taylor et al., 2011), GIF compression (Forsythe et al., 2011), overall luminance gradient strength (Braun et al., 2013), or edge density (Redies et al., 2017).

In his seminal work on aesthetics, Berlyne (1974) suggested that images with an intermediate degree of complexity are preferred by humans over images of low or high complexity. His interpretation of the inverted U-shaped relation between beauty and complexity was that preference and interest increase steadily with visual complexity until a maximal level of affective appraisal is reached. With a further increase in complexity, appraisal decreases again because of decreasing preference. Others have argued that humans prefer an intermediate visual complexity because our ancestors lived in a savanna-type landscape of similar complexity (for a review, see Forsythe et al., 2011). The relationship between liking and stimulus complexity is subject to considerable interindividual variability, at least for artificial images (Jacobsen and Höfel, 2002). By automatically clustering the participants, Güçlütürk et al. (2016) reported that, for one group of participants, liking decreased as stimuli became more complex, while another group exhibited the opposite pattern of preference (i.e., higher liking for more complex stimuli). Bies et al. (2016a) obtained similar results by investigating preference ratings for exact (mathematical) fractal patterns. They also reported that their measure of complexity (fractal dimension) interacted with symmetry and recursion of their stimuli.

Rigau et al. (2008) took Birkhoff's aforementioned idea of aesthetics being a trade-off between order and complexity, and proposed different global measures based on principles from information theory and Kolmogorov complexity. The authors applied these measures to nine paintings by van Gogh, Seurat, and Mondrian.
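Kolmogorov complexity itself is not computable, so compression-based measures are generally understood as practical upper-bound proxies. The sketch below illustrates the idea with the general-purpose `zlib` compressor from the Python standard library; this is a stand-in chosen for brevity and is neither the GIF-based measure of Forsythe et al. (2011) nor one of the measures proposed by Rigau et al. (2008). The function name `compression_ratio` and the two synthetic pixel patches are hypothetical.

```python
import random
import zlib


def compression_ratio(pixels):
    """Compressed size divided by raw size for a sequence of 8-bit pixel values."""
    raw = bytes(pixels)
    return len(zlib.compress(raw, 9)) / len(raw)


rng = random.Random(0)  # fixed seed so the illustration is repeatable
flat = [128] * 4096                                # uniform gray patch
noise = [rng.randrange(256) for _ in range(4096)]  # unstructured noise

# The uniform patch compresses to a small fraction of its raw size,
# whereas the noise patch is essentially incompressible.
print(compression_ratio(flat), compression_ratio(noise))
```

Under this proxy, images that compress poorly are treated as more complex. Note that pure noise scores highest, which is one reason why compression-based complexity and perceived complexity need not coincide.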

## 3.4. Symmetry, Balance and The Rule of Thirds

Symmetry is a well-established property that plays a prominent role in the perception of many natural and artificial patterns. Symmetry can be perceived at a glance and can affect visual detection, attention, eye movements and physiological arousal (Locher and Nodine, 1989). Not surprisingly, several studies have demonstrated that symmetry is involved also in aesthetic perception. A particularly well-known example is the perception of attractiveness of human faces (Grammer and Thornhill, 1994). In simple geometrical (graphic) and ornamental patterns, symmetry was shown to have a high correlation with aesthetic judgements (Jacobsen and Höfel, 2002; Westphal-Fitch et al., 2013; Rampone et al., 2016; al Rifaie et al., 2017). However, the role of symmetry in photography and artworks seems less clear. The visitor to any art museum will readily realize that simple types of geometrical symmetry (reflectional, translational or rotational) are not general principles of composition in traditional visual art, although symmetry can attract attention if present in a painting (Locher and Nodine, 1989). Accordingly, studies that link symmetry to the aesthetic appreciation of artworks are infrequent (Osborne, 1986). It has therefore been suggested that the link between symmetry and attractiveness/beauty is domain-specific (Little, 2014).

The century-old concept of pictorial balance is related to symmetry, but on a more complex level. Unlike symmetry, it is considered to be an important and universal factor that contributes to the aesthetic appreciation of most types of images, including abstract visual patterns, photographs and artworks (McManus et al., 1985; Gershoni and Hochstein, 2011; Jahanian et al., 2015). According to Arnheim's Gestalt theory of visual balance (Arnheim, 1954), an image is balanced if the center of the displayed attractions is placed on any of the major axes of the image (vertical, horizontal and diagonal). There are different ways to measure balance. For example, in their study on Arnheim's theory, McManus et al. (2011a) used a physicalist approach and measured the center-of-mass of the luminance values in images. They considered an image more balanced if the center-of-mass was closer to the geometrical center of an image. Overall, the authors did not find evidence to support Arnheim's theory when they compared art photographs to photographs that were randomly taken, or when they studied simple geometrical figures. Jahanian et al. (2015) took another approach and modeled pictorial balance in terms of the visual weight of several low-level visual features that are used to calculate visual saliency. In a large set of 120,000 images that were rated highly, the saliency-based image hotspots aligned with Arnheim's axes, thus confirming his theory. A similar result was obtained in a study on photographic cropping. The details of photographs that were preferred during cropping showed a more balanced saliency distribution than the details that were avoided during cropping (Abeln et al., 2016); no such difference was observed for luminance-based balance (McManus et al., 2011b). Some of the computer algorithms that predict ratings of photographs and artworks (see section 2.1.1) incorporate measures of pictorial balance in their calculations (for example, see Ke et al., 2006; Li and Chen, 2009).
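The physicalist, luminance-based notion of balance can be illustrated with a short sketch. The function below treats pixel luminance as mass and reports how far the center of mass lies from the geometrical center of the image. The name `luminance_balance_offset` and the normalization by the half-diagonal are assumptions made here for illustration and are not taken from McManus et al. (2011a).

```python
import math


def luminance_balance_offset(image):
    """Distance of the luminance center of mass from the geometrical center,
    normalized by the half-diagonal (0 means perfectly centered)."""
    height, width = len(image), len(image[0])
    total = sum(sum(row) for row in image)
    if total == 0:
        return 0.0  # an all-black image has no defined center of mass
    # Luminance-weighted mean of the column and row indices.
    cx = sum(x * v for row in image for x, v in enumerate(row)) / total
    cy = sum(y * v for y, row in enumerate(image) for v in row) / total
    gx, gy = (width - 1) / 2, (height - 1) / 2
    half_diagonal = math.hypot(gx, gy)
    if half_diagonal == 0:
        return 0.0  # a single-pixel image is trivially centered
    return math.hypot(cx - gx, cy - gy) / half_diagonal


# Hypothetical 3x3 luminance patches.
centered = [[0, 0, 0], [0, 9, 0], [0, 0, 0]]  # mass at the center
corner = [[9, 0, 0], [0, 0, 0], [0, 0, 0]]    # mass in the top-left corner

print(luminance_balance_offset(centered), luminance_balance_offset(corner))
```

The same function could be applied to a saliency map instead of a luminance image, which is closer in spirit to the saliency-based approaches mentioned above.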

The rule of thirds, which is a principle of composition avidly followed in photography, seems to contradict the notion that the major axes of an image play a significant role in balance; it stipulates that salient compositional elements are to be placed close to one of the third lines of the image in order for images to be aesthetically pleasing. The rule of thirds has been used in many computational methods to predict ratings of photographs and artworks (for example, see Datta et al., 2006; Luo and Tang, 2008; Li and Chen, 2009). However, experimental studies did not confirm the significance of this rule in high-quality photographs (Amirshahi et al., 2014a) or "selfie" photographs (Bruno et al., 2014).
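A rule-of-thirds feature can be expressed very compactly. The following sketch measures how far a salient point lies from the nearest third line, relative to the image size. The function `thirds_line_offset` and its normalization are hypothetical; the cited computational methods each define their own variant, typically based on a detected subject region rather than a single point.

```python
def thirds_line_offset(x, y, width, height):
    """Normalized distance from point (x, y) to the nearest third line (0 = on a line)."""
    # Horizontal distance to the nearer vertical third line, as a fraction of width.
    dx = min(abs(x - width / 3), abs(x - 2 * width / 3)) / width
    # Vertical distance to the nearer horizontal third line, as a fraction of height.
    dy = min(abs(y - height / 3), abs(y - 2 * height / 3)) / height
    return min(dx, dy)


# A point on the left vertical third line of a 300 x 300 image.
print(thirds_line_offset(100, 10, 300, 300))
# The image center, which lies midway between the third lines.
print(thirds_line_offset(150, 150, 300, 300))
```

Note that the image center, which is favored by center-of-mass notions of balance, is the interior point that scores worst on such a feature. This makes the tension between the two principles explicit.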

## 3.5. Fourier Spectral Properties

Graham and Field (2007) and Redies et al. (2007b) compared the Fourier spectral properties of natural scenes and images of Western artworks. They found that both types of stimuli share a scale-invariant amplitude (or power) frequency spectrum and both have a similar slope in log-log plots. Similar results were obtained for artworks of East Asian provenance (Graham and Field, 2008) and for other visual stimuli that were created to please the human eye, such as cartoons, comics and mangas (Koch et al., 2010). In contrast, several types of non-art images, such as photographs of simple objects and plants, do not possess this property (Redies et al., 2007b). Notably, photographs of faces have steeper slopes in log-log plots than portraits drawn by artists (Redies et al., 2007a). Mather (2014) compared the spectral slopes of 31 artworks with those of closely matching photographs. He found that artists compress the spectral slopes of their works to a relatively narrow range compared to the slopes of the photographs and proposed that the artist's visual system plays a central role in adjusting the spectral slope of artworks. Human observers tend to prefer artificial, random-phase patterns with Fourier properties similar to natural scenes (Menzel et al., 2015), but exhibit significant interindividual differences in this preference (Spehar et al., 2016). Moreover, the visual preference for these synthetic noise images correlated well with the discrimination sensitivity of the observers for different amplitude spectra of the images (Spehar et al., 2016).

Interestingly, the amplitude spectrum of many uncomfortable visual stimuli contains excessive energy at medium spatial frequencies and thereby deviates from the linear spectral properties of natural scenes and images of artworks that are perceived as pleasant (Fernandez and Wilkins, 2008; O'Hare and Hibbard, 2011). The Fourier spectral slope of images correlates with measures of image complexity (Table S1 in Redies et al., 2017), in particular with the fractal dimension (Bies et al., 2016b). A shallower slope indicates more power in the high-frequency part of the spectrum; consequently, the images show more fine detail and thus higher complexity.
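The slope discussed above can be written out explicitly. As a sketch, assume that the radially averaged amplitude spectrum $A(f)$ follows a power law in spatial frequency $f$; the symbols $A$, $P$, $f$, $\alpha$ and $c$ are notation introduced here for illustration and are not taken from the cited studies:

$$
A(f) \propto f^{-\alpha}
\;\Longleftrightarrow\;
\log A(f) = -\alpha \log f + c,
\qquad
P(f) = A(f)^2 \propto f^{-2\alpha}
$$

In a log-log plot, such a spectrum is a straight line with slope $-\alpha$ for amplitude and $-2\alpha$ for power, which is what scale invariance refers to. For two frequencies $f_{\text{low}} < f_{\text{high}}$, the ratio $P(f_{\text{high}})/P(f_{\text{low}}) = (f_{\text{high}}/f_{\text{low}})^{-2\alpha}$ increases as $\alpha$ decreases, so a shallower slope leaves relatively more power at high spatial frequencies. Values of $\alpha$ close to 1 (a power slope close to $-2$) are commonly reported for natural scenes.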

Schweinhart and Essock (2013) analyzed the Fourier spectral properties in landscape paintings that were produced by a group of local artists, and compared them to photographs of the scenes that the artists had painted. They asked whether the well-known oblique effect can be observed in paintings. The oblique effect refers to the fact that, in our natural environment, cardinal (horizontal and vertical) edge orientations are more prominent than oblique orientations. In the Fourier domain, this difference translates into stronger amplitudes for cardinal vs. oblique orientations. In the natural environment, this effect is observed only for the lowest spatial frequencies but not for high spatial frequencies. However, the artists implemented the oblique effect also at high spatial frequencies, thus overregulating this image property in their works.

## 3.6. Fractals and Self-similarity

The work of the abstract expressionist artist Jackson Pollock (1912–1956) has received particular interest from the scientific community. Taylor performed a fractal analysis of the artist's drip paintings using a box-counting approach and found that Pollock's paintings are not chaotic but possess a fractal structure (Taylor, 2002). This surprising finding prompted a series of investigations of human responses to fractals, which are not only prevalent in nature but can also be found in geometric and mathematical patterns produced by humans. The studies included behavioral investigations, studies of physiological responses, eye tracking and brain imaging studies (Taylor et al., 2011; Taylor and Spehar, 2016). Converging evidence from these studies indicates that both natural and artificial fractals of mid-range complexity (as measured by the fractal dimension) elicit favorable physiological responses and are thus preferred by human observers (see also section 3.3). Fractals have even been shown to reduce stress levels in the observers (Taylor, 2006) and it has been suggested that the beneficial effect of fractal patterns can enhance architecture and our urban environment (Joye, 2007). However, as already observed by Aks and Sprott in their seminal study on chaotic visual patterns (Aks and Sprott, 1996), there are large interindividual differences in human responses to fractals and their complexity (see section 3.3). Interestingly, Pollock created fractal structure in his artworks long before fractal geometry was described and studied in detail in the 1970s (Mandelbrot and Pignoni, 1983); he must have followed this principle intuitively and without explicit cognitive control. As noted by Alvarez-Ramirez et al. (2008), the finding that Pollock's drip paintings possess fractal structure is closely related to their scale-invariant spectral properties (see section 3.5).
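The box-counting approach mentioned above can be sketched in a few lines. The pattern is covered with grids of boxes of decreasing side length s, the number N(s) of boxes that contain part of the pattern is counted, and the dimension is estimated as the slope of log N(s) against log(1/s). The function name `box_count_dimension`, the choice of box sizes and the two test patterns are assumptions made for illustration; this is not the procedure used by Taylor (2002), which involves additional steps such as separating paint layers.

```python
import math


def box_count_dimension(grid, box_sizes=(1, 2, 4, 8, 16)):
    """Estimate the box-counting dimension of a non-empty square binary grid."""
    size = len(grid)
    log_inv_s, log_n = [], []
    for s in box_sizes:
        # Count the boxes of side s that contain at least one occupied cell.
        occupied = 0
        for by in range(0, size, s):
            for bx in range(0, size, s):
                if any(grid[y][x]
                       for y in range(by, min(by + s, size))
                       for x in range(bx, min(bx + s, size))):
                    occupied += 1
        if occupied == 0:
            raise ValueError("grid contains no occupied cells")
        log_inv_s.append(math.log(1 / s))
        log_n.append(math.log(occupied))
    # Least-squares slope of log N(s) against log(1/s).
    n = len(log_inv_s)
    mx, my = sum(log_inv_s) / n, sum(log_n) / n
    num = sum((x - mx) * (y - my) for x, y in zip(log_inv_s, log_n))
    den = sum((x - mx) ** 2 for x in log_inv_s)
    return num / den


N = 64
filled = [[1] * N for _ in range(N)]                                   # a solid area
diagonal = [[1 if x == y else 0 for x in range(N)] for y in range(N)]  # a thin line

print(box_count_dimension(filled), box_count_dimension(diagonal))
```

The solid area and the thin line serve as sanity checks with dimensions of 2 and 1, respectively; fractal patterns fall between these two values.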

The fractal-like structure of artworks was studied also by Amirshahi et al. (2012) who derived a measure for self-similarity in images, based on a Pyramid Histogram of Oriented Gradients (PHOG) representation of images (Bosch et al., 2007). In this approach, images are self-similar if the Histograms of Oriented Gradients (HOGs) of parts of an image resemble the HOG of the entire image. Redies et al. (2012) applied this measure to different image categories, ranging from natural scenes to man-made stimuli and artworks, including a large and diverse set of traditional paintings of Western provenance (Amirshahi et al., 2014b). For artworks and most natural patterns, Redies and colleagues reported an intermediate to high self-similarity, whereas other patterns, such as images of simple objects, faces or buildings, were less self-similar.
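The idea of comparing the HOG of image parts with the HOG of the whole image can be illustrated with a deliberately simplified sketch. It uses a single pyramid level (four quadrants), central-difference gradients and histogram intersection as the similarity score. All names (`hog`, `self_similarity`), the number of bins and the test patterns are assumptions made here; the published PHOG-based measure uses several pyramid levels and its own implementation details.

```python
import math


def hog(image, y0, y1, x0, x1, bins=8):
    """Normalized histogram of gradient orientations inside a sub-window."""
    hist = [0.0] * bins
    for y in range(max(y0, 1), min(y1, len(image) - 1)):
        for x in range(max(x0, 1), min(x1, len(image[0]) - 1)):
            gx = image[y][x + 1] - image[y][x - 1]
            gy = image[y + 1][x] - image[y - 1][x]
            mag = math.hypot(gx, gy)
            if mag > 0:
                angle = math.atan2(gy, gx) % math.pi  # unsigned orientation
                hist[min(int(angle / math.pi * bins), bins - 1)] += mag
    total = sum(hist)
    return [h / total for h in hist] if total else hist


def self_similarity(image, bins=8):
    """Mean histogram intersection between each quadrant's HOG and the whole-image HOG."""
    h, w = len(image), len(image[0])
    whole = hog(image, 0, h, 0, w, bins)
    quadrants = [(0, h // 2, 0, w // 2), (0, h // 2, w // 2, w),
                 (h // 2, h, 0, w // 2), (h // 2, h, w // 2, w)]
    scores = []
    for y0, y1, x0, x1 in quadrants:
        part = hog(image, y0, y1, x0, x1, bins)
        scores.append(sum(min(a, b) for a, b in zip(whole, part)))
    return sum(scores) / len(scores)


S = 16
# Vertical stripes everywhere: every part looks like the whole.
stripes = [[(x // 2) % 2 for x in range(S)] for y in range(S)]
# Vertical stripes on the left, horizontal stripes on the right.
mixed = [[(x // 2) % 2 if x < S // 2 else (y // 2) % 2 for x in range(S)]
         for y in range(S)]

print(self_similarity(stripes), self_similarity(mixed))
```

The homogeneous stripe pattern reaches the maximum score, whereas the pattern whose halves differ in orientation scores markedly lower, because no quadrant reproduces the orientation mix of the whole image.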

Both lines of evidence suggest that traditional artworks share specific stimulus properties with our natural environment. Our visual system has adapted to these properties in evolution so that it can process them with a sparse (efficient) code in order to save computational and metabolic resources (Simoncelli and Olshausen, 2001). It has therefore been suggested that artworks are created so that they can be processed efficiently/sparsely by the human visual system (Redies, 2007; Renoult et al., 2016). The concept of sparse coding is familiar also to researchers in computer vision (Mairal et al., 2014). Akin to the efficient coding hypothesis is the idea that artworks can be processed fluently and therefore evoke a pleasant feeling in human observers (Reber et al., 2004). The fluency concept has its origin in the field of psychology; the underlying neuronal mechanism and possible coding strategies in the human brain remain unspecified to date.

## 3.7. Regularities in the Orientation of Luminance Gradients, Edges, and Lines

In a study on large subsets of traditional Western artworks, histograms of oriented gradients (HOGs; see section 3.6) were found to possess a surprising regularity (Redies et al., 2012; Braun et al., 2013): Artworks possess a relatively uniform spectrum of luminance gradient (edge) orientations. This result implies that all edge orientations in the artworks tend to be similarly prominent. In other words, anisotropy of edge orientations is low in artworks. Other types of images with low anisotropy can be found in nature (for example, large vista scenes and images of plants, lichen growth patterns, branches and clouds; Redies et al., 2012). Anisotropy is larger in images of simple objects, including faces, and other man-made patterns, such as advertisements, building facades and urban scenes, due to the relative prominence of single or a few orientations. For example, horizontal and vertical orientations predominate in images of building facades.
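One simple way to quantify anisotropy is the spread of the bin values of an orientation histogram. The sketch below uses the standard deviation of the normalized bins; this particular index, the function name `orientation_anisotropy` and the two example histograms are assumptions made for illustration, and the exact definition in the cited studies may differ.

```python
import math


def orientation_anisotropy(hist):
    """Standard deviation of the normalized orientation histogram (0 = isotropic)."""
    total = sum(hist)
    if total == 0:
        raise ValueError("histogram is empty")
    p = [h / total for h in hist]
    mean = 1 / len(p)  # value of every bin in a perfectly uniform histogram
    return math.sqrt(sum((v - mean) ** 2 for v in p) / len(p))


# Hypothetical 8-bin histograms of summed gradient strength per orientation.
uniform = [5, 5, 5, 5, 5, 5, 5, 5]   # all orientations equally prominent
facade = [20, 1, 1, 1, 20, 1, 1, 1]  # horizontal and vertical orientations dominate

print(orientation_anisotropy(uniform), orientation_anisotropy(facade))
```

A histogram in which all orientations are equally prominent yields zero, while a facade-like histogram dominated by two cardinal orientations yields a clearly larger value.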

The finding of low anisotropy of edge orientations in artworks was recently confirmed and extended by Redies et al. (2017), who studied edge orientations in different categories of images, including traditional artworks of different cultural provenance (Western, Islamic and East Asian). They showed that the art images possess a more uniform histogram of edge orientations across cultures than many non-art types of images, in particular, photographs of man-made objects and scenes. This result mirrors the low anisotropy found in artworks (see above). In addition, by pairwise comparison of edge orientations across each image, Redies and colleagues found that edge orientations are independent of each other across art images, except for edge pairs at short distances, which tend to be collinear. In other words, the edge orientation at one position of an image does not allow predicting the orientations of distant edges at other positions in the same image. Similar statistical regularities of edge orientations are observed in some natural images, such as lichen growth patterns. This property is independent of cultural provenance, artistic genre or technique, or image content of the artworks studied. The authors speculated that this regularity might relate to the notion of "good composition" (Arnheim, 1954) or "visual rightness" (Locher et al., 1999), which has been advanced for traditional artworks.

Another regularity with respect to the perception of contours is that smoothly curved lines and objects are generally preferred over sharply angular ones (Gómez-Puerto et al., 2015). Interestingly, humans share this preference not only across cultures but also with great apes (Munar et al., 2015). As a possible explanation, Bar and Neta (2006) proposed that sharp transitions in contour convey a sense of threat to the observer and are therefore disliked. However, Bertamini et al. (2016) questioned this notion and provided experimental evidence that humans prefer curvature due to its intrinsic characteristics and not because they reject the threat potential of angular contours.

### 4. CONCLUSION AND OUTLOOK

In recent years, computer vision has successfully contributed computational methods to the evaluation of photographs and digitally reproduced artworks. In the present work, we discussed recent progress in this field, which has become known as computational aesthetics. Specifically, we reviewed methods that were developed to predict the aesthetic rating of photographs and artworks by computational approaches. For artworks, we provided an overview on applications of computational algorithms to artist identification, style prediction, art historical questions, and forgery detection.

In general, researchers in the computer vision community tend to measure success by comparing different methods regarding their accuracy of classification or prediction. When using the same database, systems can easily be compared and finding the best working approach is straightforward. However, with recent advances in technology, algorithms and larger datasets, the best-performing classifiers have become black boxes and their discrimination boundaries are no longer obvious. From an application point of view, this is not necessarily a limitation. For example, such systems can be readily deployed in image processing pipelines to identify images of high vs. low aesthetic value. While early methods were restricted to the formal aspects of a scene, more advanced methods, like Deep Neural Networks, can take into account the content of images as well. It was shown that the inclusion of content results in major improvements, because different stylistic elements come along with different content matter. For example, bright colors are usually more pronounced in pleasant images that depict fresh fruits than in gloomy images of street scenes at night. Such combinatorial information can improve classification results.

Lately, computational methods have gained increasing popularity also in the field of experimental aesthetics, an area of research that has a long tradition as a branch of psychology and, more recently, of neuroscience. In experimental aesthetics, the focus is not on improving algorithms for rating prediction systems or identifying artists or artistic styles, but rather on gaining a better understanding of what specific stimulus properties induce human observers to reach judgements on beauty and to have an aesthetic experience. For example, as discussed in section 3, converging evidence suggests that some global image properties that also characterize natural scenes can be found in large subsets of traditional artworks.

With recent developments in Deep Learning, it has become harder to share knowledge between computational aesthetics and experimental aesthetics. In the early days, insights from the active field of experimental aesthetics provided a wealth of knowledge, also for computational aesthetics. This knowledge resulted in the development of computational algorithms based on handcrafted features, which were known (or suspected) to contribute to the aesthetic appeal of an image. During this time, empirical aesthetics also profited greatly from the computational methods because, for the first time, very large datasets of images could be analyzed, rather than the small number of images that are usually tested in psychological experiments with human observers. However, with Deep Learning, it has become harder for empirical aesthetics to catch up with the computational approaches. Deep Learning models basically represent black boxes, which prevent insight into what features they learn and how they use them to evaluate the aesthetic quality of images, which is the main motivation for empirical aesthetics. In future work, it will therefore be essential to gain a better understanding and interpretability of the decision boundaries that the computational models draw, in order to identify concrete properties of human aesthetic preference. Moreover, recent generative models from computer vision (Gatys et al., 2016) are capable of producing synthetic images that match the style of famous painters, and are no longer discriminative only. This generative approach may provide researchers with well-controlled stimuli for testing human observers in experimental aesthetics.

In conclusion, much can be learned if the two areas of aesthetic research can be recombined, taking advantage of the methodological advances in computational aesthetics and the identification of perceptual mechanisms in experimental aesthetics. As an example, we recently investigated the variability of CNN feature responses to traditional artworks and non-art images and found that the two categories of images can be separated by a classifier that is based on only two variance values (Brachmann et al., 2017). However, results for some styles of (post-)modern and contemporary art clearly deviated from traditional art. The investigation of differences between art styles may therefore be of particular interest in the future, not only in computational aesthetics but also in experimental aesthetics. Moreover, in view of the interindividual differences in aesthetic preferences (see section 3.1), cultural diversity will be an important issue in future research.

# AUTHOR CONTRIBUTIONS

AB and CR conceived this review, carried out the literature search and wrote the manuscript.

### FUNDING

This work was supported by funds from the Institute of Anatomy, Jena University Hospital.

# REFERENCES


Birkhoff, G. D. (1933). Aesthetic Measure. Cambridge: Harvard University Press.

Bosch, A., Zisserman, A., and Munoz, X. (2007). "Representing shape with a spatial pyramid kernel," in Proceedings of the 6th ACM International Conference on Image and Video Retrieval (New York, NY: ACM), 401–408.


brushstroke extraction. IEEE Trans. Patt. Anal. Mach. Intell. 34, 1159–1176. doi: 10.1109/TPAMI.2011.203


Luo, Y., and Tang, X. (2008). "Photo and video quality evaluation: focusing on the subject," in European Conference on Computer Vision (Berlin; Heidelberg: Springer), 386–399.


IEEE International Conference on the Image Processing (ICIP) (Phoenix, AZ: IEEE), 3703–3707.


**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2017 Brachmann and Redies. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# **Neuronal Mechanism for Compensation of Longitudinal Chromatic Aberration-Derived Algorithm**

#### *Yuval Barkan<sup>1</sup> and Hedva Spitzer <sup>2</sup> \**

*<sup>1</sup>Biomedical Engineering Department, Faculty of Engineering, Tel Aviv University, Tel Aviv, Israel, <sup>2</sup> Electrical Engineering School, Faculty of Engineering, Tel-Aviv University, Tel-Aviv, Israel*

The human visual system faces many challenges, among them the need to overcome the imperfections of its optics, which degrade the retinal image. One of the most dominant limitations is longitudinal chromatic aberration (LCA), which causes short wavelengths (blue light) to be focused in front of the retina with consequent blurring of the retinal chromatic image. The perceived visual appearance, however, does not display such chromatic distortions. The intriguing question, therefore, is how the perceived visual appearance of a sharp and clear chromatic image is achieved despite the imperfections of the ocular optics. To address this issue, we propose a neural mechanism and computational model, based on the unique properties of the *S*-cone pathway. The model suggests that the visual system overcomes LCA through two known properties of the *S* channel: (1) omitting the contribution of the *S* channel from the high-spatial resolution pathway (utilizing only the *L* and *M* channels); and (2) having large and coextensive receptive fields that correspond to the small bistratified cells. Here, we use computational simulations of our model on real images to show how integrating these two basic principles can provide a significant compensation for LCA. Further support for the proposed neuronal mechanism is given by the ability of the model to predict an enigmatic visual phenomenon of large color shifts as part of the assimilation effect.

#### *Edited by:*

*Hagit Hel-Or, University of Haifa, Israel*

#### *Reviewed by:*

*Inyoung Kim, Virginia Tech, United States Hauke Busch, University of Lübeck, Germany*

> *\*Correspondence: Hedva Spitzer hedva@eng.tau.ac.il*

#### *Specialty section:*

*This article was submitted to Bioinformatics and Computational Biology, a section of the journal Frontiers in Bioengineering and Biotechnology*

*Received: 07 October 2017 Accepted: 23 January 2018 Published: 23 February 2018*

#### *Citation:*

*Barkan Y and Spitzer H (2018) Neuronal Mechanism for Compensation of Longitudinal Chromatic Aberration-Derived Algorithm. Front. Bioeng. Biotechnol. 6:12. doi: 10.3389/fbioe.2018.00012*

**Keywords: aberration, chromatic adaptation, compensatory mechanisms, computer model, visual perception**

# **INTRODUCTION**

The human eye is affected by the imperfections of its optics, which degrade the quality of the retinal image and ultimately impose limits on vision. These imperfections have both spatial and chromatic implications. One of the most dominant chromatic implications is the phenomenon of longitudinal chromatic aberration (LCA). LCA is a significant and dominant attribute of the visual system and has been studied and measured extensively (e.g., Bedford and Wyszecki, 1957; Charman and Jennings, 1976).

Longitudinal chromatic aberration is induced by the dependence of the refractive power of the lens on wavelength. As can be seen in **Figure 1**, the ocular refractive power is higher for shorter wavelengths (Bedford and Wyszecki, 1957). The accommodation mechanism of human eyes can determine the focus for each wavelength, but it is impossible to bring all of the wavelengths to focus simultaneously (Wandell, 1995). The phenomenon of LCA has been measured extensively, both by psychophysical methods (Wald and Griffin, 1947; Ivanoff, 1953; Bedford and Wyszecki, 1957; Jenkins, 1963; Howarth and Bradley, 1986) and by retinoscopy methods (Charman and Jennings, 1976; Rynders et al., 1998). These studies showed that LCA has a refractive power of about two diopters (*D*) across the visible spectrum (**Figure 1**).

**FIGURE 1** | Comparison of refractive power (chromatic shift) reported by several studies. Note that the chromatic shift is much larger for the short wavelengths (blue photoreceptor) than for the long wavelengths (red photoreceptor). All the data are adjusted vertically to have a zero value at the reference wavelength of 589 nm, giving the longitudinal chromatic aberration a refractive power of about two diopters. This image has been taken with permission from The Optical Society (Chen et al., 2003).

An alternative method of representing the chromatic aberration is through the modulation transfer function (MTF), which describes the sensitivity as a function of the spatial frequency and the wavelength. Due to the LCA, the MTF of the *S*-cone (blue) channel has a lower frequency cutoff (by a factor of 3–5) than the MTF of the *M*/*L* cone channels (red–green) (Shevell, 2003).

An additional factor that limits the visual acuity of the *S* pathway is the low density of the *S* photoreceptors in the retinal mosaic. It is plausible that this low density has evolved in the visual system, in order not to have more sensors than the optical MTF can utilize. The MTF thus would be limited by both the LCA and photoreceptor density which, as mentioned above, are not independent factors. Calkins (2001) showed that the *S*-cone density can be a consequence of efficient Nyquist sampling: "*. . .*the eye's optics together with what may be called 'typical' viewing conditions effectively limit any evolutionary pressure to pack S cones into the photoreceptor mosaic with a Nyquist rate greater than about 7–8 cycles deg<sup>−1</sup>." If we approximate the *S* mosaic as triangular for ease of calculation, this sampling rate would correspond to an upper limit of foveal density in the human retina of 2,000–2,500 S cones mm<sup>−2</sup>. Various anatomical measurements of the distribution of *S* cones in the human retina, both direct and indirect, converge to a similar estimate: *S* cones peak in density at about 2,000 cells mm<sup>−2</sup>, just outside the central fovea, representing 5–10% of the cone population (Curcio et al., 1991).
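The order of magnitude of this density estimate can be checked with a short calculation. The sketch below is illustrative only: it assumes the standard relations for a triangular lattice (Nyquist limit 1/(√3·s) and density 2/(√3·s²) for centre-to-centre spacing s) and an assumed retinal magnification of about 0.29 mm per degree, a value that is not given in the text. The function name is hypothetical.

```python
import math

# Assumed retinal magnification for the human eye (mm of retina per degree
# of visual angle); this value is NOT stated in the text.
MM_PER_DEG = 0.29


def triangular_density_from_nyquist(nyquist_cpd, mm_per_deg=MM_PER_DEG):
    """Cone density (cells per mm^2) of a triangular mosaic with a given Nyquist limit.

    For a triangular lattice with centre-to-centre spacing s, the Nyquist
    limit is 1 / (sqrt(3) * s) and the density is 2 / (sqrt(3) * s**2).
    """
    spacing_deg = 1.0 / (math.sqrt(3.0) * nyquist_cpd)   # spacing in degrees
    spacing_mm = spacing_deg * mm_per_deg                 # spacing on the retina
    return 2.0 / (math.sqrt(3.0) * spacing_mm ** 2)


density_at_7 = triangular_density_from_nyquist(7.0)
density_at_8 = triangular_density_from_nyquist(8.0)
print(round(density_at_7), round(density_at_8))
```

Under these assumptions, Nyquist rates of 7–8 cycles per degree give densities of a few thousand cones per square millimetre, the same order as the range quoted above. The exact figures depend on the magnification factor chosen.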

The consequence of the LCA is that the retinal image will be focused only for the "green" wavelengths, and for the most part will be out of focus for the bluish wavelengths. The consequent image would be expected to have colored borders ("fringes"), similar to those seen with a cheap lens (Valberg, 2005). Although it is not possible to remove these chromatic defects from a lens, an efficient optical system should be designed to minimize the distortion caused by the LCA. For example, it is possible to correct chromatic aberration through a combination of two or more lenses, in such a way that the aberration of each lens compensates for the aberration of the other lens (achromatic lens). In the human visual system, this solution is impractical since we are continuously changing the focal distance.

A recent proposal suggests that Müller glial cells may play a role in reducing the chromatic aberration due to the fact that peripheral light at larger tilt angles will be rejected more readily (Labin and Ribak, 2010). Another suggestion is that the short-wavelength absorbing pigments of the ocular media may have a function in limiting the chromatic aberration (Walls, 1963; Nussbaum et al., 1981). However, spectral filtering in the ocular media has a relatively small effect on the MTF (Shevell, 2003) and none of these optical features (Walls, 1963; Labin and Ribak, 2010) is sufficient to explain the lack of perceived distortion at sharp achromatic edges.

It is therefore intriguing to understand how, notwithstanding the imperfections of the ocular optics, including the LCA, the perceived visual appearance is still a sharp and clear image. Since the optical system of the eye apparently cannot account for the correction, it is reasonable to suppose that the neuronal system acts to reduce the distortion (Shevell, 2003; Valberg, 2005). It should be appreciated that a non-optical system, such as the neuronal mechanism, cannot fully compensate for the optical limitations, since some of the physical information is lost. (This is exhibited by the limited MTF.)

Several studies have indeed suggested that there must be neural compensation for the eye's aberrations. Although no specific mechanism has been described (Hay et al., 1963; Artal et al., 2004), a number of compensatory options have been suggested, most of which are related to the McCollough effect (ME) (Hay et al., 1963; Broerse et al., 1999; Grossberg et al., 2002). The ME is a long-term after-effect that can last from hours up to 3 months (Jones and Holding, 1975).

The rationale for associating the ME with the LCA phenomenon derives mainly from its long-lasting temporal property, and its relation to chromatic edges (McCollough, 1965). The proposed compensatory models are composed of oriented receptive fields (RFs) (multiplexed simple cells) consisting of both chromatic- and achromatic-separated subunits (Broerse et al., 1999; Grossberg et al., 2002). The elimination of the chromatic distortion is then explained by invoking a learning mechanism that inhibits the appearance of chromatic edges adjacent to achromatic edges.

These models have been supported by experiments that demonstrate that there is a long-term adaptation to chromatic aberration caused by a wedge prism. It has been demonstrated that dispersion of light passing through a wedge prism produces bluish and yellowish fringes on achromatic edges. These perceived fringes disappear when the prisms are worn for a long period of time (about 2 days) (Hay et al., 1963). This adaptation of the visual system supports the existence of a long-term corrective neural compensation mechanism.

These models can account for neuronal compensation only when the refractive power of the chromatic aberration is constant. However, the refractive power of the LCA constantly changes due to the pupil size (which is determined by the amount of light and the accommodation of the eye). The temporal scale of pupil size change is within the range of 200–500 ms, which is faster by orders of magnitude than the neuronal adaptation mechanisms described above (which can last hours to months). Consequently, there is a need for an additional mechanism that compensates for chromatic aberration and is less dependent on the momentary magnitude of the chromatic aberration.

This means that a neural mechanism that compensates for the general LCA phenomenon still remains to be discovered. If such a neural mechanism exists, it is expected that it will not only compensate for the LCA phenomenon but will also predict the visual phenomena generated by the compensatory neuronal mechanism.

In this paper, we propose a plausible computational model of the retina that can compensate for LCA. The model is based on well-known retinal color-coding RFs and does not require a learning process. The validity of the suggested model is supported by its ability to predict related visual phenomena.

# **MODEL**

The model computes the perceived color in accordance with the response of retinal color-coding ganglion cells (Daw, 2012). This calculation involves two main stages. The first stage evaluates the response of ganglion cells of type I (*L*/*M* and *M*/*L*, on center cells) and type II (*S*/*LM*, on coextensive cells). This stage includes the calculation of the RF response of each color-coding cell that also exhibits a remote adaptation mechanism. In addition, this stage also includes two separate pathways related to the luminance and chromatic information of the two cell types. The second stage of the model proposes a novel transformation of the ganglion cell response into a perceived image by using an inverse function. The source code for the model simulation is available at https://github.com/yubarkan/LCAcompensation/.

# **Response of the Opponent RF**

The retinal ganglion cells receive their input from the cones through several chemical and electrical processing layers (Shevell, 2003). The retinal ganglion cells then perform an adaptation of the first order. The adaptation of the first order is modeled here through adaptation of the cell inputs, rather than adaptation of the RF subregions (Spitzer and Semo, 2002; Spitzer and Barkan, 2005). We therefore define the adapted ganglion cell input signals as follows:

$$L\_{\rm pr\\_adapted} = \frac{L\_{\rm photo-r}}{L\_{\rm photo-r} + \sigma\_{\rm L}\left(L\_{\rm photo-r}, L\_{\rm remote}\right)},$$

$$M\_{\rm pr\\_adapted} = \frac{M\_{\rm photo-r}}{M\_{\rm photo-r} + \sigma\_{\rm M}\left(M\_{\rm photo-r}, M\_{\rm remote}\right)},$$

$$S\_{\rm pr\\_adapted} = \frac{S\_{\rm photo-r}}{S\_{\rm photo-r} + \sigma\_{\rm S}\left(S\_{\rm photo-r}, S\_{\rm remote}\right)},\tag{1}$$

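where *L*<sub>pr\_adapted</sub>, *M*<sub>pr\_adapted</sub>, and *S*<sub>pr\_adapted</sub> (abbreviated below as *L*<sub>adapted</sub>, *M*<sub>adapted</sub>, and *S*<sub>adapted</sub>) are the adapted inputs from the cones, and σ<sub>*L*</sub>, σ<sub>*M*</sub>, and σ<sub>*S*</sub> are the adaptation signals, which combine local and remote components and are defined as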

$$\begin{aligned} \sigma\_L &= a \cdot L\_{\text{photo-r}} + b + c \cdot L\_{\text{remote}}, \\ \sigma\_M &= a \cdot M\_{\text{photo-r}} + b + c \cdot M\_{\text{remote}}, \\ \sigma\_S &= a \cdot S\_{\text{photo-r}} + b + c \cdot S\_{\text{remote}}, \end{aligned} \tag{2}$$

where the remote signals are defined as

$$L\_{\text{remote}}(x, y) = \iint L\_{\text{photo-r}}(x', y') \cdot f\_{\text{remote}}(x - x', y - y') \cdot dx' \cdot dy',$$

$$M\_{\text{remote}}(x, y) = \iint M\_{\text{photo-r}}(x', y') \cdot f\_{\text{remote}}(x - x', y - y') \cdot dx' \cdot dy',$$

$$S\_{\text{remote}}(x, y) = \iint S\_{\text{photo-r}}(x', y') \cdot f\_{\text{remote}}(x - x', y - y') \cdot dx' \cdot dy'.\tag{3}$$

The "remote" area is composed of an annulus-like shape around the entire RF region (Spitzer and Barkan, 2005). Its weight function (*f*<sub>remote</sub>) is modeled as a decaying exponential over the remote area as follows:

$$f\_{\text{remote}}(x, y) = \frac{1}{\pi \cdot \rho\_{\text{remote}}} \exp\left(-\frac{x^2 + y^2}{\rho\_{\text{remote}}}\right); \quad x, y \in \text{remote\\_area}.\tag{4}$$

The spatial response profile of the two subregions of the retinal ganglion RF, "center" and "surround," is expressed by the known difference-of-Gaussians (DOG). It should be noted that the calculation of the DOG is performed on the adapted inputs.

The "center" signals of the two spectral regions, *L*<sub>cen</sub> and *M*<sub>cen</sub>, are defined as integrals of the adapted inputs (*L*<sub>adapted</sub>, *M*<sub>adapted</sub>; Eq. 1) over the center subregion, with a Gaussian decaying spatial weight function (*f*<sub>c</sub>):

$$L\_{\rm cen}(x, y) = \iint\limits\_{\rm cen-area} L\_{\rm pr\\_adapted}(x', y') \cdot f\_{\rm c}(x - x', y - y') \cdot dx' \cdot dy',$$

$$M\_{\rm cen}(x, y) = \iint\limits\_{\rm cen-area} M\_{\rm pr\\_adapted}(x', y') \cdot f\_{\rm c}(x - x', y - y') \cdot dx' \cdot dy', \tag{5}$$

where *L*<sub>cen</sub>(*x*, *y*) represents the response of the center subregion that is centered at location (*x*, *y*), and *f*<sub>c</sub> is defined as

$$f\_{\rm c}(x, y) = \frac{1}{\pi \cdot \rho\_{\text{cen}}} \exp\left(-\frac{x^2 + y^2}{\rho\_{\text{cen}}}\right); \quad x, y \in \text{center\\_area}, \tag{6}$$

where ρ<sub>cen</sub> represents the radius of the center region of the RF. The "surround" signals are defined in the same manner (with a spatial weight function three times larger than that of the "center"):

$$L\_{\text{sur}}(x, y) = \iint\limits\_{\text{sur-area}} L\_{\text{pr\\_adapted}}(x', y') \cdot f\_{\text{s}}(x - x', y - y') \cdot dx' \cdot dy',$$

$$M\_{\text{sur}}(x, y) = \iint\limits\_{\text{sur-area}} M\_{\text{pr\\_adapted}}(x', y') \cdot f\_{\text{s}}(x - x', y - y') \cdot dx' \cdot dy',\tag{7}$$

where *f*<sub>s</sub> is defined as a decaying Gaussian over the surround region:

$$f\_{\rm s}(x, y) = \frac{1}{\pi \cdot \rho\_{\rm sur}} \exp\left(-\frac{x^2 + y^2}{\rho\_{\rm sur}}\right); \quad x, y \in \text{surround\\_area}. \tag{8}$$

The total weights of *f*<sub>c</sub> and *f*<sub>s</sub> are each 1.

The response of the cells is expressed by the subtraction of the center and surround-adapted responses as follows:

$$L^{+}M^{-}(x, y) = L\_{\text{cen}}(x, y) - M\_{\text{sur}}(x, y),$$

$$M^{+}L^{-}(x, y) = M\_{\text{cen}}(x, y) - L\_{\text{sur}}(x, y).\tag{9}$$
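The center-minus-surround computation of Eqs. 5–9 can be illustrated with a minimal one-dimensional sketch. It is not the released simulation code. It assumes that *L*<sub>sur</sub> and *M*<sub>sur</sub> pool the *L* and *M* adapted inputs, respectively, as Eq. 9 requires for a chromatically opponent cell. The center decay constant of 1 pixel, the kernel radius, the edge clamping, and all function names are assumptions made for the example.

```python
import math


def gaussian_kernel(rho, radius):
    """1-D kernel proportional to exp(-x^2 / rho), normalised to unit total weight."""
    weights = [math.exp(-(x * x) / rho) for x in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def blur(signal, kernel):
    """Weighted sum over an RF subregion; borders are handled by clamping indices."""
    radius = len(kernel) // 2
    n = len(signal)
    out = []
    for i in range(n):
        acc = 0.0
        for k, w in enumerate(kernel):
            j = min(max(i + k - radius, 0), n - 1)
            acc += w * signal[j]
        out.append(acc)
    return out


def type1_responses(l_adapted, m_adapted, rho_cen=1.0, rho_sur=3.0, radius=9):
    """Return (L+M-, M+L-): centre signal minus the opposite-cone surround signal."""
    f_c = gaussian_kernel(rho_cen, radius)   # assumed centre decay constant
    f_s = gaussian_kernel(rho_sur, radius)   # three times larger than the centre
    l_cen, m_cen = blur(l_adapted, f_c), blur(m_adapted, f_c)
    l_sur, m_sur = blur(l_adapted, f_s), blur(m_adapted, f_s)
    lm = [c - s for c, s in zip(l_cen, m_sur)]
    ml = [c - s for c, s in zip(m_cen, l_sur)]
    return lm, ml


# Uniform achromatic field: both opponent responses vanish.
flat = [0.4] * 40
lm_flat, ml_flat = type1_responses(flat, flat)

# A patch with extra L excitation: L+M- is excited and M+L- is suppressed there.
l_patch = [0.4] * 15 + [0.6] * 10 + [0.4] * 15
lm_patch, ml_patch = type1_responses(l_patch, flat)
```

Because both weight functions have unit total weight, a uniform field with equal *L* and *M* inputs produces no response, which is why the opponent signals carry only spatial and chromatic differences.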

The *S*/*LM* retinal color-coding cell is known as the small bistratified ganglion cell. The RF of this cell is known in the literature to be coextensive (type II), i.e., it has mainly chromatic opponency rather than spatial opponency (Hubel and Wiesel, 1968; de Monasterio, 1978; Derrington et al., 1984). Accordingly, the response of the *S*-cone opponent is modeled here as a type-II RF. The *S*/*LM* signal was therefore modeled through integration of the chromatic difference (*S*/*LM*) over the whole RF of this cell type:

$$\begin{split} S^{+}LM^{-}(x, y) = \iint &\left[ S\_{\text{adapted}}(x', y') - \frac{L\_{\text{adapted}}(x', y') + M\_{\text{adapted}}(x', y')}{2} \right] \\ &\cdot f\_{\text{s\\_center}}(x - x', y - y') \cdot dx' \cdot dy'. \end{split} \tag{10}$$

The spatial weight function of the RF, *f*<sub>s\_center</sub>, is defined as in Eq. 7.

#### **Transformation to Image**

The purpose of this stage is to model how the visual system transforms the RF responses to a perceived image. We suggest that in order to eliminate the effect of the blurred *S*/*LM* channel, the visual system has to very precisely exclude this channel from the processing of the high-spatial resolution channel. This suggestion is in accordance with the consensus in the literature and with accumulated evidence indicating that the chromatic information that includes the *S*/*LM* information is processed through a unique pathway, i.e., the koniocellular pathway (Hendry and Reid, 2000). Additional support for our proposal is derived from the observation that the *L* and *M* data that code high-spatial resolution information are processed independently through the parvocellular pathway (Livingstone and Hubel, 1988; Van Essen and Gallant, 1994; Hendry and Reid, 2000; Sincich and Horton, 2005).

In order to perform a transformation from the opponent signals [*L*<sup>+</sup>*M*<sup>−</sup>, *M*<sup>+</sup>*L*<sup>−</sup>, and *S*<sup>+</sup>(*L* + *M*)<sup>−</sup>] to perceived triplet *LMS* values, we propose a functional minimization framework. We require that the perceived values satisfy the following equations:

$$L^{+}M^{-} = L\_{\text{per}} - M\_{\text{surround\\_per}},$$

$$M^{+}L^{-} = M\_{\text{per}} - L\_{\text{surround\\_per}}.\tag{11}$$

*L*surround\_per and *M*surround\_per are defined in Eq. 7, but here they are related to the perceived domain rather than adapted input signals. We define the following error function:

$$\begin{aligned} E(L\_{\text{per}}, M\_{\text{per}}) = &\left[L\_{\text{per}} - \left(L^{+}M^{-} + M\_{\text{surround\\_per}}\right)\right]^2 \\ &+ \left[M\_{\text{per}} - \left(M^{+}L^{-} + L\_{\text{surround\\_per}}\right)\right]^2. \end{aligned} \tag{12}$$

This function is the squared error between the estimates of *L*<sub>per</sub> and *M*<sub>per</sub> and the values that satisfy Eq. 11. This error function can be minimized by various methods. For simplicity, we show the implementation of the gradient descent method as follows (Snyman, 2005):

$$\frac{\partial L\_{\text{per}}}{\partial t} = -\frac{\partial E(L\_{\text{per}}, M\_{\text{per}})}{\partial L\_{\text{per}}},$$

$$\frac{\partial M\_{\text{per}}}{\partial t} = -\frac{\partial E(L\_{\text{per}}, M\_{\text{per}})}{\partial M\_{\text{per}}}.\tag{13}$$

Thus, we obtain the following iterative equations:

$$\begin{aligned} L^{i}\_{\rm per} = L^{i-1}\_{\rm per} - dt \cdot \Big[ &2 \cdot \left( L^{i-1}\_{\rm per} - L^{+}M^{-} - M^{i-1}\_{\rm surround\\_per} \right) \\ &- 2 \cdot f\_{\rm s}(0,0) \cdot \left( M^{i-1}\_{\rm per} - M^{+}L^{-} - L^{i-1}\_{\rm surround\\_per} \right) \Big], \\ M^{i}\_{\rm per} = M^{i-1}\_{\rm per} - dt \cdot \Big[ &2 \cdot \left( M^{i-1}\_{\rm per} - M^{+}L^{-} - L^{i-1}\_{\rm surround\\_per} \right) \\ &- 2 \cdot f\_{\rm s}(0,0) \cdot \left( L^{i-1}\_{\rm per} - L^{+}M^{-} - M^{i-1}\_{\rm surround\\_per} \right) \Big]. \end{aligned} \tag{14}$$

This iteration process provides the perceived *L* and *M* values, independently of the *S*/*LM* channel (see the rationale above).
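The inverse transformation can be sketched in one dimension as a plain gradient-descent loop on the error of Eq. 12. This is a minimal illustration, not the released implementation. The update signs follow directly from Eqs. 12 and 13, with *f*<sub>s</sub>(0,0) entering because each pixel contributes to its own surround. The periodic borders, the step size `dt`, the number of steps, the point-like center (as in Eq. 11), and all names are assumptions.

```python
import math


def gaussian_kernel(rho, radius):
    """1-D kernel proportional to exp(-x^2 / rho), normalised to unit total weight."""
    weights = [math.exp(-(x * x) / rho) for x in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def blur_periodic(signal, kernel):
    """Surround signal with wrap-around borders (an assumption of this sketch)."""
    radius = len(kernel) // 2
    n = len(signal)
    return [sum(w * signal[(i + k - radius) % n] for k, w in enumerate(kernel))
            for i in range(n)]


def squared_error(l_per, m_per, lm, ml, f_s):
    """Eq. 12 summed over all pixels."""
    l_sur, m_sur = blur_periodic(l_per, f_s), blur_periodic(m_per, f_s)
    return sum((l - r1 - ms) ** 2 + (m - r2 - ls) ** 2
               for l, m, r1, r2, ls, ms in zip(l_per, m_per, lm, ml, l_sur, m_sur))


def reconstruct(lm, ml, l_init, m_init, f_s, dt=0.1, steps=300):
    """Gradient descent on Eq. 12, starting from an achromatic guess."""
    g0 = f_s[len(f_s) // 2]   # f_s(0, 0): weight of a pixel in its own surround
    l_per, m_per = list(l_init), list(m_init)
    for _ in range(steps):
        l_sur, m_sur = blur_periodic(l_per, f_s), blur_periodic(m_per, f_s)
        res_l = [l - r - ms for l, r, ms in zip(l_per, lm, m_sur)]
        res_m = [m - r - ls for m, r, ls in zip(m_per, ml, l_sur)]
        l_per = [l - dt * (2 * a - 2 * g0 * b) for l, a, b in zip(l_per, res_l, res_m)]
        m_per = [m - dt * (2 * b - 2 * g0 * a) for m, a, b in zip(m_per, res_l, res_m)]
    return l_per, m_per


# Synthetic check: build opponent responses from known values, then invert them.
f_s = gaussian_kernel(3.0, 9)
l_true = [0.3] * 20 + [0.7] * 20
m_true = [0.5] * 40
lm = [l - ms for l, ms in zip(l_true, blur_periodic(m_true, f_s))]
ml = [m - ls for m, ls in zip(m_true, blur_periodic(l_true, f_s))]
start = [(l + m) / 2 for l, m in zip(l_true, m_true)]   # achromatic initial guess
l_rec, m_rec = reconstruct(lm, ml, start, start, f_s)
```

In this synthetic case, the achromatic starting point lets the loop recover the original *L* and *M* values. A spatially uniform offset added equally to both channels would leave Eq. 11 unchanged, so the choice of initial values matters.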

The perceived *S*-channel value (*S*per) is calculated after evaluating the *L* and *M* perceived values (Eq. 14) by using the following equation:

$$S\_{\text{per}} = S^{+}(L + M)^{-} + (L\_{\text{per}} + M\_{\text{per}}) / 2. \tag{15}$$

According to our model, the *S*per contributes to the perceived color and not to the perceived luminance. Thus, the perceived brightness is expressed solely by the *L* and *M* values.
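The effect of combining the coextensive *S*/*LM* receptive field (Eq. 10) with Eq. 15 can be illustrated on a one-dimensional achromatic edge whose *S* component alone is blurred. The sketch below is only an illustration under stated assumptions: the amount of *S* blur, the RF size, the edge clamping, and the use of the sharp inputs in place of the perceived *L* and *M* values are all choices made for the example, and the names are hypothetical.

```python
import math


def gaussian_kernel(rho, radius):
    """1-D kernel proportional to exp(-x^2 / rho), normalised to unit total weight."""
    weights = [math.exp(-(x * x) / rho) for x in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def blur(signal, kernel):
    """Weighted sum with clamped borders."""
    radius = len(kernel) // 2
    n = len(signal)
    return [sum(w * signal[min(max(i + k - radius, 0), n - 1)]
                for k, w in enumerate(kernel))
            for i in range(n)]


# Achromatic step edge: L and M stay sharp, S is defocused (stand-in for LCA).
edge = [0.2] * 30 + [0.8] * 30
l_in, m_in = edge, edge
s_in = blur(edge, gaussian_kernel(8.0, 12))      # assumed amount of S blur

# Eq. 10: the chromatic difference is pooled over one coextensive RF.
f_rf = gaussian_kernel(20.0, 20)                 # assumed large S/LM receptive field
diff = [s - (l + m) / 2 for s, l, m in zip(s_in, l_in, m_in)]
s_lm = blur(diff, f_rf)

# Eq. 15, using the sharp L and M inputs in place of the perceived values.
s_per = [r + (l + m) / 2 for r, l, m in zip(s_lm, l_in, m_in)]

fringe_before = max(abs(d) for d in diff)                       # retinal-image fringe
fringe_after = max(abs(s - e) for s, e in zip(s_per, edge))     # fringe after the model
print(fringe_before, fringe_after)
```

The bluish and yellowish errors on the two sides of the edge have opposite signs, so pooling them within one coextensive RF largely cancels them, while the sharp edge itself is restored from the *L* and *M* values.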

#### **METHODS**

In this section, we describe the different tools and parameters used in the model simulation. The same sets of parameters were used for all the simulated images that are presented in Section "Results."

#### **Modeling Human Optics**

In order to evaluate the ability of our model to compensate for chromatic aberration, it is necessary to simulate the effect of human optics on test images. We have used the Image System Engineering Toolbox for Biology (ISETBIO),<sup>1</sup> which provides a unique ability to simulate human optics in a real scene.

<sup>1</sup> https://github.com/isetbio/.

For this purpose, we have used a high-resolution, high-dynamic-range, multispectral image (HDRS) taken from the ISET High-Dynamic Range Multispectral Scene Database, available from Image Evaluation Tools.<sup>2</sup> ISETBIO also includes the WavefrontOptics code developed by David Brainard, Heidi Hofer, and Brian Wandell. Their code implements methods to model human eyes by taking adaptive optics data from wave-front sensors and calculating the optical blur as a function of the wavelength. The toolbox relies on data collected by Thibos et al. We chose a blackbody illuminant at 6,500 K and used WavefrontOptics to simulate the retinal image produced by human optics. **Figure 2** is produced by this method.

# **Response of the Opponent RF**

In the first stage of the model, the adapted signals are calculated (Eqs. 1–4). The remote area was simulated as an annulus with a diameter of 35 pixels. The adaptation parameters were chosen as follows: *a* = 1, *c* = 1, representing equal strength for the local and remote adaptations (Eq. 2). The parameter *b*, which determines the strength of adaptation (Dahari and Spitzer, 1996; Spitzer and Barkan, 2005), was taken as *b* = 3.
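With these parameter values, the adaptation stage for a single cone signal reduces to a few lines. The sketch below reads Eq. 1 as a gain control of the form input / (input + σ), with σ given by Eq. 2. That reading, the function name, and the example input values are assumptions made for illustration.

```python
def adapt(photo, remote, a=1.0, b=3.0, c=1.0):
    """Adapted cone signal: photo / (photo + sigma), with sigma = a*photo + b + c*remote."""
    sigma = a * photo + b + c * remote
    return photo / (photo + sigma)


# The same local input under a dim and under a bright remote surround.
in_dim_context = adapt(0.5, 0.1)
in_bright_context = adapt(0.5, 0.9)
print(in_dim_context, in_bright_context)
```

The same local input yields a smaller adapted signal when the remote region is brighter, which is the contextual gain control that the remote term is meant to provide.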

The surround signals (Eq. 7) were calculated with *f*<sub>s</sub> (Eq. 8) having a decay constant (ρ) of 3 pixels. The response of the RFs was obtained by subtracting the center and surround-adapted responses (Eq. 9).

### **Transformation to Image (Inverse Function)**

The purpose of this section is to perform a transformation from the RF responses to a perceived image. The transformation was performed using the Jacobi iterative method (Eq. 14). The iteration process was initiated (*i* = 0) by assuming achromatic stimuli. Specifically, all channels were initiated with the following values:

$$L\_{\mathrm{per}}^0 = M\_{\mathrm{per}}^0 = S\_{\mathrm{per}}^0 = \frac{L\_{\mathrm{adapted}} + M\_{\mathrm{adapted}}}{2}.$$

The iterative process converges to the predicted perceived image, while the color "fills-in" the stimulus.

# **RESULTS**

The ability of the model to reduce the effect of LCA was tested on both artificial and natural images. Retinal images were simulated by using the ISETBIO toolbox, which takes into account the properties of the human optical system (see Methods). The LCA effect is very prominent when zooming into areas of luminance or chromatic edges (**Figure 2**).

**Figure 2** demonstrates the model's performance on an artificial achromatic grid (**Figure 2A**) composed of equal energy squares. The image that is cast on the retina was calculated using ISETBIO (**Figure 2B**). It can be seen that this image (which simulates the eye's optics, including the LCA) has major chromatic distortions adjacent to the borders (**Figures 2B,D**). The distortion appears "yellowish" (lack of blue) on the bright side of the border and "bluish" on the darker side. **Figures 2C,E** present the effect of the model, which simulates the retinal response and its perceived image. **Figures 2B–E** show that the model succeeds in significantly reducing the chromatic-border distortion.

**Figure 3** plots the chromatic contrast, defined as the ratio between the value of the blue and yellow channels [B/(R + G)], across the *x*-axis of **Figures 2B,C**. This chromatic contrast represents the chromatic deviation from neutral hue (achromatic region). An achromatic region is characterized by a contrast value of 1, while the higher and lower values represent deviations toward bluish and yellowish chroma, respectively.
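A per-pixel version of this measure can be sketched as follows. The normalization is an assumption: the yellow channel is taken here as the mean of R and G so that an achromatic pixel yields exactly 1, as described above. The function name is hypothetical, and pixels with R + G = 0 are not handled.

```python
def chromatic_contrast(r, g, b):
    """Blue-to-yellow ratio, with yellow taken as the mean of R and G (assumption)."""
    yellow = (r + g) / 2.0
    return b / yellow


neutral = chromatic_contrast(0.5, 0.5, 0.5)     # achromatic pixel
bluish = chromatic_contrast(0.4, 0.4, 0.6)      # excess blue
yellowish = chromatic_contrast(0.6, 0.6, 0.4)   # lack of blue
print(neutral, bluish, yellowish)
```

Values above 1 then indicate a bluish deviation and values below 1 a yellowish one, matching the reading of the curves in **Figure 3**.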

The blue curve plots the chromatic contrast across the cast image (**Figure 2**). The fringes of the plot are indicated by the large negative and positive spikes next to the borders (*x* = 90). The results given by our model (red line) show a significant reduction of the spike magnitude, indicating a significant reduction of the chromatic fringes. The deviation from white is also significantly diminished. It should be noted that there is some constant hue generated mainly on the "black" squares, which is a side effect of the ISETBIO simulation, rather than an ideal achromatic appearance (contrast value of 1).

<sup>2</sup> http://www.imageval.com/public/Products/ISET-SceneDatabase.html.

We also tested the model's ability to compensate for LCA on real images (**Figure 4A**), taken from the ISETBIO HDRS library. The optics of the eye was simulated using ISETBIO (**Figure 4B**; see Methods). The results show that the model succeeds in correcting the chromatic distortions around borders (**Figure 4C**). The correction is prominent in the distorted color of the puppy's eye and in the distorted green–white pattern behind the dog (**Figures 4D–F**). Although the model significantly reduces the distortion caused by LCA, it can also cause some minor chromatic artifacts.

The neuronal mechanism that we propose as capable of correcting for chromatic aberration is bound by the limitations of the spatial frequency of the *S*/*LM* channel (Eq. 10; see Model). In other words, a crucial part of the model suggests that the *S*/*LM* channel is processed through a spatial low-pass filter. If such a mechanism actually exists, we would predict that it would lead to visual phenomena that are prominent for stimuli with high frequencies of blue/yellow chromaticity. We would expect to see these phenomena as a blue–yellow assimilation effect, at high-spatial frequencies or among adjacent chromatic regions with sharp edges. These characteristics correspond closely to a recent outstanding chromatic illusion, termed "Chromatic induction from *S*-cone patterns" and described by Monnier and Shevell (2004) (**Figure 5**).

This illusion describes the perception of a narrow ring of a specific chromaticity whose perceived color differs completely depending on the chromaticity of an adjacent ring (**Figure 5**). Psychophysical methods of analysis indicate that the chromatic shift is not directly dependent on the absolute blue channel intensity (*S*) of the adjacent rings but rather on the relative amount of "blue" and "yellow" intensities (*S*/*LM*) in the adjacent rings (Shevell and Monnier, 2006).

We also tested our model on *S*-cone pattern stimuli, which have been reported by Monnier and Shevell (2004) to demonstrate prominent chromatic induction. The results (**Figure 5**) show that our model succeeds in predicting the trend of the perceived chromaticity shift toward the chromaticity of the adjacent ring (**Figure 5D**). The predicted chromatic shifts, between the two test chromaticities (the orange and pink rings) in terms of chromatic contrast [*S*/(*L* + *M*)], are about 0.31. This shift agrees with the perceived colors as measured psychophysically by Shevell and Monnier.

#### **DISCUSSION**

This manuscript describes a neuronal mechanism and a computational model, based on retinal chromatic RFs and visual pathways, that compensate for LCA. The model can significantly reduce the chromatic distortion in both artificial and natural images (**Figures 2** and **3**). The proposal is supported by the observation that an artifact of chromatic assimilation, which is a predicted consequence of the model, corresponds to a well-known chromatic assimilation phenomenon described previously (Shevell and Monnier, 2005).

The model is based on the specific spatial and chromatic structure of the blue–yellow channel [*S*/(*L* + *M*)] RFs, which are spatially coextensive ("type-II") and correspond to the small bistratified cells (SBCs) (see Model; Hubel and Wiesel, 1968; de Monasterio, 1978; Derrington et al., 1984; Tailby et al., 2008; Crook et al., 2009; Martin and Lee, 2014). These type-II RFs are incorporated into a retinal adaptation model (Spitzer and Barkan, 2005), and the RF responses are then subjected to an inverse function that mediates a transformation to perceived values. This transformation enables the model to be evaluated in the image domain, rather than merely on the basis of the RF responses.


There has been some dispute in the literature regarding the spatially coextensive nature of the SBC. The coextensive nature of the SBC has been described by many electrophysiological researchers (Hubel and Wiesel, 1968; de Monasterio, 1978; Derrington et al., 1984). A recent experiment reported that the SBC RF may not be spatially coextensive (Field et al., 2007). However, these results have been criticized because the data in Field et al. (2007) were collected in the far retinal periphery (30–75° eccentricity), whereas the broader reports of the RF were recorded within the central 20° (Hubel and Wiesel, 1968; de Monasterio, 1978; Derrington et al., 1984). Crook et al. (2009) found that the *S*-ON and *LM*-OFF responses were spatially coextensive, or nearly so. Furthermore, this trend of results was supported by a large number of previous papers, including recent reports and a review (Tailby et al., 2008; Crook et al., 2009; Martin and Lee, 2014).

A logical conclusion may be that the development of the visual system has been strongly influenced by the natural visual scenery. Most of the sun's spectral energy on earth is yellowish (550 nm) (Figure 1.2.1 in Wyszecki and Stiles, 1982), giving fewer chromatic edges in natural scenes than achromatic edges, and with a predominance of red–green chromatic edges over blue–yellow (Hansen and Gegenfurtner, 2009). The peak of the spectral luminance efficiency of the visual system (Wyszecki and Stiles, 1982) is similar to the peak of the sun's spectral energy, with the ocular lens tuned for optimal focus at the same wavelength. The chromatic aberration occurs in the short wavelengths, where there is both less solar irradiance and fewer chromatic edges in natural images. It therefore appears that the ocular lens is designed to provide the optimal performance at the prominent natural wavelength (~550 nm) while allowing the aberration at shorter wavelengths, which are less significant both for spatial and luminance information.

Although the ocular lens is tuned to the most "important wavelengths," it still suffers from the consequences of the chromatic aberration. It is plausible that the neural system compensates for some of these optical imperfections (Wandell, 1995). We propose that the visual mechanism utilizes the absence of sharp blue–yellow edges to diminish the effect of chromatic distortions. In the model, this is replicated by the following mechanisms, whose existence is supported by psychophysical and neurophysiological findings.

Luminance and high-spatial resolution chromatic information, under photopic light conditions, is obtained mainly from the *L* and *M* channels, which suffer less from LCA. This idea is supported by psychophysical evidence showing that the contribution of the *S* cone to luminance perception is negligible or null (Eisner and MacLeod, 1980; Wyszecki and Stiles, 1982). This knowledge has also been applied in the definition of the classical CIE color space where, for example, the *V*(λ)*s* describing the spectral luminance efficiency (i.e., perceived brightness vs. wavelength) come mainly from greenish and red light (Wyszecki and Stiles, 1982). As a result, brightness is calculated from the perceived *L* and *M* values with almost no input from the *S* channel (Eq. 14), while the calculation of the chromaticity takes the contribution of the *S* value into account as well as the contribution of the other chromatic channels (Eq. 15).

The opponent RF structure of the *S* channels (SBCs) is both spatially coextensive and chromatically complementary (Dacey, 1996; Rodieck, 1998; Eq. 10). Such an RF blurs the blue–yellow information, so that their chromatic mixture yields an achromatic color. In addition, the spatio-chromatic structure [of *S*/(*L* + *M*) RF] yields a null response to achromatic edges, also in the presence of LCA affecting the *S* channel. In this way, the unique spatio-chromatic property minimizes the chromatic distortion (see Results; **Figure 2**).

In order to maintain the compensatory advantage at the retinal stage, which separates high-spatial frequency information from low-spatial frequency chromatic information, the system has to further process these two channels separately. There are physiological findings, which show that the SBC RF (with B/Y chromatic structure) indeed feeds a distinct chromatic pathway, i.e., the koniocellular pathway (Hendry and Reid, 2000). The origin of the koniocellular pathway lies in the SBC in the retina, and the pathway is then relayed by the koniocellular layer in the LGN to the cytochrome-oxidase blobs in V1. Several studies have reported that information on color *per se* and information on form are separated (Livingstone and Hubel, 1988; Van Essen and Gallant, 1994; Sincich and Horton, 2005). The information on form is derived solely from the parvocellular pathway [which lacks the *S*/(*L* + *M*) information]. The information on color, however, comes from both the koniocellular and parvocellular pathways. The parvocellular pathway sends inputs from layer 4*c*β to the blobs in layer 2/3, area V1. The two separate pathways (color and form) do have different anatomical inputs in the V2 area. Here, the thin stripes that code the color information are fed both from the konio and parvo pathways, whereas the pale stripes, which code the form information, are fed only by the parvo pathway. The "form" pathway is therefore not affected by the deficiencies of the *S*/(*L* + *M*) pathway. Both pathways project to area V4 and additional higher visual areas.

Previous studies that proposed neuronal mechanisms to compensate for chromatic aberration (Hay et al., 1963; Broerse et al., 1999; Grossberg et al., 2002; Vladusich and Broerse, 2002) related these mechanisms to long-term after-effects, such as the ME—a long-term orientation-contingent color after-effect (McCollough, 1965). Vladusich and Broerse (2002) proposed a learning neuronal model that inhibits the fringes at luminance boundaries (caused by chromatic aberrations). Grossberg et al. (2002) proposed a learning mechanism whose primary function is to adaptively align the representations of the boundaries and surfaces, which are shifted due to the process of binocular fusion. Their mechanism was able to predict the ME. Since the ME has been previously suggested as the compensation mechanism for chromatic aberration, the model presented by Grossberg et al. (2002) was also regarded as a compensation model for LCA.

In our opinion, there are two main arguments against the idea that ME models can completely explain neuronal compensation to LCA. The first limitation of the above models (Broerse et al., 1999; Grossberg et al., 2002; Vladusich and Broerse, 2002) is that they assume that the magnitude of LCA effect depends solely on the magnitude of the luminance edge. However, the LCA effect also depends on additional optical factors, such as the pupil aperture (DeValois and DeValois, 1991), whose size changes dynamically in response to the level of ambient illumination and accommodation. Such learning mechanisms, therefore, would be expected to yield chromatic artifacts when the pupil aperture size changes and would therefore require continuous adaptation of the learning mechanism. The learning models described above may therefore be more applicable to transverse chromatic aberration (TCA), which does not depend on the pupil size. Thus, there could be two different and complementary mechanisms for the two types of aberrations, i.e., TCA and LCA.

An additional limitation of previous models (Broerse et al., 1999; Grossberg et al., 2002; Vladusich and Broerse, 2002) is their assumption that the LCA is triggered only by achromatic boundaries. In fact, chromatic aberration (and specifically the LCA) also occurs at iso-luminance chromatic boundaries, where there are no achromatic boundaries (**Figure 1**). Consequently, the above models fail to explain how the visual system processes chromatic fringes at non-achromatic borders.

The two types of mechanisms, the currently proposed retinal model and the above learning mechanisms, can be synergetic in the visual system. The retinal mechanism performs an early-stage correction that eliminates most of the LCA effects, regardless of the degree of illumination and eye accommodation. The cortical learning mechanism (Watanabe et al., 1992; Broerse et al., 1999; Grossberg et al., 2002; Vladusich and Broerse, 2002; Grossberg, 2003) performs long-term adaptation that can adapt to specific ocular changes (such as lens defects that can be caused by aging or physical damage, etc.).

Although several studies have examined the improvement of visual acuity through optical correction of LCA (Campbell and Gubisch, 1967; Yoon and Williams, 2002; Artal et al., 2010), none found more than a minor improvement, if any, in contrast sensitivity. One may argue that these results suggest that LCA is not a real problem of the optical system, since correcting it does not create any significant improvement. However, in our opinion this would be an erroneous conclusion, since the whole visual pathway is already optimized to contend with the optical limitations. Therefore, correction of the optical limitations is not able to improve the situation further and it is necessary to invoke neuronal processing (including photoreceptor accommodation, RF structure and size, the different neuronal processing pathways, etc.).

Furthermore, LCA is expected to be manifested not only adjacent to achromatic edges but also in many other spatial and chromatic configurations. For example, one would also expect LCA at iso-luminance chromatic edges and non-oriented edges (such as textures or dots on a uniform background). In such configurations, the visual image is clear, despite the fact that the "leakage" of short-wavelength colors is still expected to influence the chromatic appearance, and the postulated models are unable to provide compensation.

The strength of a computational model can be enhanced by showing its ability to predict additional phenomena. Evidence for the competence of our model comes from its ability to predict the enigmatic visual phenomenon of the large chromatic shifts by *S*-cone pattern (Shevell and Monnier, 2005; **Figure 5**).

Shevell and Monnier (2006) and Cao and Shevell (2005) suggested that the large color shifts are mediated by a spatially antagonist *S*+/*S*− cortical RF. The "*S*" term referred to the *S*-cone response normalized by the luminance. Cells with this type of response, while not found in the retina, have been identified in some neurons in V1 and V2 visual areas (Conway, 2001). Significantly, our model is based on retinal RFs (rather than cortical) (Hubel and Wiesel, 1968; de Monasterio, 1978; Derrington et al., 1984).

In addition, Shevell et al. also showed that the effect is more prominent with high-spatial frequency of the rings. We assume that this was the incentive to include spatially antagonist RFs in their qualitative model. We suggest, however, that an additional mechanism is recruited for low-frequency stimuli, i.e., simultaneous contrast mechanism (see Model, adaptation of the first order). Such a mechanism could originate from a retinal source (Spitzer and Barkan, 2005). This suggestion should be supported by additional experimental data, which should determine whether the effect originates from retinal vs. cortical mechanisms, as suggested previously (Cao and Shevell, 2005; Shevell and Monnier, 2006).

In summary, in this manuscript, we propose a model which explains how the visual system compensates for LCA. This compensatory mechanism can also explain additional visual phenomena, such as the large chromatic shifts by *S*-cone pattern, for which the underlying mechanism is still unknown. In addition, this mechanism can explain the necessity for two separate chromatic visual pathways, i.e., koniocellular and parvocellular pathways.

#### **AUTHOR CONTRIBUTIONS**

This is original research done by YB under the supervision of, and in partnership with, HS.

#### **REFERENCES**

Walls, G. L. (1963). *The Vertebrate Eye and Its Adaptive Radiation*. New York: Hafner Pub. Co.

Wandell, B. A. (1995). *Foundations of Vision*. Sunderland, MA: Sinauer Associates.

**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

*Copyright © 2018 Barkan and Spitzer. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.*

# Short and Long-Term Attentional Firing Rates Can Be Explained by ST-Neuron Dynamics

Oscar J. Avella Gonzalez<sup>1,2</sup>\* and John K. Tsotsos<sup>1,2</sup>

<sup>1</sup> Department of Electrical Engineering and Computer Science, York University, Toronto, ON, Canada, <sup>2</sup> Laboratory for Active and Attentive Vision, Centre for Vision Research, York University, Toronto, ON, Canada

Attention modulates neural selectivity and optimizes the allocation of cortical resources during visual tasks. A large number of experimental studies in primates and humans provide ample evidence. As an underlying principle of visual attention, some theoretical models suggested the existence of a gain element that enhances contrast of the attended stimuli. In contrast, the Selective Tuning model of attention (ST) proposes an attentional mechanism based on suppression of irrelevant signals. In this paper, we present an updated characterization of the ST-neuron proposed by the Selective Tuning model, and suggest that the inclusion of adaptation currents (Ih) to ST-neurons may explain the temporal profiles of the firing rates recorded in single V4 cells during attentional tasks. Furthermore, using the model we show that the interaction between stimulus-selectivity of a neuron and attention shapes the profile of the firing rate, and is enough to explain its fast modulation and other discontinuities observed, when the neuron responds to a sudden switch of stimulus, or when one stimulus is added to another during a visual task.

#### Edited by:

Xavier Otazu, Universitat Autònoma de Barcelona, Spain

#### Reviewed by:

Keith Schneider, University of Delaware, United States

Jihyun Yeonan-Kim, San Jose State University, United States

\*Correspondence: Oscar J. Avella Gonzalez oscarjavella@gmail.com

#### Specialty section:

This article was submitted to Perception Science, a section of the journal Frontiers in Neuroscience

Received: 11 August 2017 Accepted: 15 February 2018 Published: 02 March 2018

#### Citation:

Avella Gonzalez OJ and Tsotsos JK (2018) Short and Long-Term Attentional Firing Rates Can Be Explained by ST-Neuron Dynamics. Front. Neurosci. 12:123. doi: 10.3389/fnins.2018.00123

Keywords: visual attention, single cell, ST-neuron, firing rate, neural selectivity

# INTRODUCTION

Attention can be widely defined as "the selective prioritization of the neural representations that are most relevant to one's current behavioral goal" (Buschman and Kastner, 2015). Since James' pioneering work (James, 1891), research on attention has aimed to discover a precise and systematic description of how the brain is able to manage its limited resources for performing complex cognitive and behavioral tasks. Visual attention, as one component of attention, has received significant interest (Itti et al., 2005; Carrasco, 2011; Posner, 2011), leading to the proposal of detailed descriptions of aspects like bottom-up attention (Itti and Koch, 2001; Rutishauser et al., 2004; Itti, 2005) and top-down control (Corbetta and Shulman, 2002; Oliva et al., 2003; Buschman and Miller, 2007; Bressler et al., 2008), signal integration (Corbetta et al., 1991; Rao et al., 1997; Eagleman and Sejnowski, 2000), or focus of attention (Koch and Ullman, 1987; Desimone and Duncan, 1995; Tsotsos et al., 1995).

Mathematical models are widely used to make insightful predictions about neural communication, and brain dynamics in general (Hodgkin and Huxley, 1952; Destexhe et al., 1998; Kandel et al., 2000; Dayan and Abbott, 2001; Shriki et al., 2003; Izhikevich, 2004). Concerning visual attention, a number of relevant models have been proposed to study particular aspects of the way single neurons and circuits process incoming information during visual tasks (Tsotsos, 1990; Niebur and Koch, 1994; Reynolds et al., 1999; Deco and Lee, 2002; Reynolds and Heeger, 2009). One of these aspects, treated by different studies and currently drawing special interest, is the mechanism neurons use during attentional tasks to accurately encode, classify and prioritize dissimilar information using only their firing rates. For instance, in the biased competition model by Reynolds et al. (1999), stimuli compete for a cortical representation, and the average firing rate (response) of a neural population depends on the interaction between the selectivity of the cells for one particular type of stimulus or feature, and the modulation induced by attention. The feature similarity model of Martinez-Trujillo and Treue proposes that attention enhances neural selectivity (Martinez-Trujillo and Treue, 2004), thus causing neurons to increase their firing rate. The idea aligns well with the normalization model (Lee and Maunsell, 2009; Reynolds and Heeger, 2009), in which such enhancement relates to the contrast between the attended stimulus and the surrounding background perceived by a neural population. Other models also explore the relation between the detailed anatomy of the neurons and the response to the attentive signal. The Feedback model, for example, acknowledges attention as a top-down process that operates via cortical feedback, and represents it using a gain factor that modulates the activity of impinging connections to a given neuron (Spratling and Johnson, 2004). It also takes into consideration physiological properties such as the roles of the basal (feedforward) and apical (feedback) connections, and how adding those elements makes it possible to resemble the response of pyramidal cells during attentional tasks (Spratling and Johnson, 2004). In the Selective Tuning model (ST) (Tsotsos, 1990, 2011; Rothenstein and Tsotsos, 2014), attention is also embodied as a top-down signal; but in contrast to other models, its selection mechanism fully relies on suppression of the irrelevant inputs to each neuron instead of the enhancement of their activity (Tsotsos, 1990, 2011), as supported by strong experimental evidence (Cutzu and Tsotsos, 2003; Loach et al., 2005; Hopf et al., 2006; Bartsch et al., 2017).

Adaptation mechanisms are well known for their facilitating role in detecting weak signals by means of stochastic resonance (Wiesenfeld and Moss, 1995) or through the enhancement of sub-threshold oscillations (Dorval and White, 2005). In a previous modeling study, Rothenstein and Tsotsos (2014) found that incorporating adaptation mechanisms improved the overall performance of the ST-neuron during a simple attentional task: adaptation counterbalanced the rapid saturation of the firing rate caused by the presentation of a highly affine stimulus, while resembling the shape of the firing profiles recorded in V4 visual cells (Kosai et al., 2014) (Figures 2, 3 therein). As a follow-up to that study, in the present paper we perform a detailed characterization of the ST-neuron firing pattern with and without adaptation currents (Ih) (Pape, 1996). Next, following the design by Reynolds et al. (1999), we implement a simple circuit to explore various scenarios in which adaptation currents play a role in reshaping the firing profile of the neuron, either by fine-tuning it or by increasing the sensitivity of the cell to the attentional signal.

The contribution of adaptation currents to the cell's dynamics is further highlighted by simulating a set of experiments that uncovers the interplay between neural selectivity and attention as a twofold effect. First, it creates a transitory and a stationary scenario in the firing response of the recorded cell; second, it induces the transition between the firing patterns evoked by two competing stimuli in a task-dependent fashion. We also compare the results of our simulations against experimental findings, and show how incorporating Ih into the ST-model leads the response to closely resemble the transient and long-lasting effects observed in experimental data.

#### METHODS

Our model consists of four essential elements: the ST-neuron model, the circuit's design and connectivity, the neural selectivity, and the selection mechanism of attention.

#### The ST-Neuron

The Selective Tuning model of attention (ST) relies on the ST-neuron as its building block (Tsotsos, 1990, 2011; Tsotsos et al., 1995). The ST-neuron is responsible for the integration and propagation of signals across the visual hierarchy; it both implements attentional selection and displays modulations resulting from top-down attentional signals. As a rate-based model, the response is quantified by the temporal evolution of the firing rate (FR) according to Equation (1):

$$\frac{dFR}{dt} = \frac{1}{\tau} \cdot (-FR + S\,(P))\tag{1}$$

In this expression, P is the synaptic input and $S(P) = \frac{M P^{\xi}}{\sigma^{\xi} + P^{\xi}}$ is the Naka-Rushton sigmoid function, whose value depends on the maximum firing rate M, the semi-saturation constant σ, i.e., the particular value of the input for which $S(\sigma) = \frac{1}{2}M$, and the constant factor ξ that determines the slope of S(P), i.e., how quickly it saturates. Aiming to resemble the time evolution of the firing rate FR, the response of the cells was restricted to the interval [0,1] by setting M = 1, and the semi-saturation constant σ = σ<sub>0</sub>, with σ<sub>0</sub> = 0.25·M. The latter was chosen in order to prevent P from growing too fast and to avoid stepwise behavior of the activation function. The factor ξ = 3 is a heuristic parameter whose value for neurons in the visual cortex was previously reported by Wilson (1999). With this choice of values for all parameters we ensure that for P = 1, $S(P) = \frac{M}{(0.25 \cdot M)^{\xi} + 1} \approx 0.98$; i.e., the reachable ceiling of the rate is not significantly attenuated irrespective of M (see **Figure 2A**). This represents a normalized and ideal scenario in which all impinging connections to a neuron are excitatory. Finally, τ represents the time constant of the activation and was set to τ = 10 ms, thus satisfying the kinetics of GABAergic receptors such as GABA<sub>A</sub>, and matching the average duration of the post-inhibition refractory period (Whittington et al., 2000; van Aerde et al., 2009).
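The two numerical properties quoted above (half-maximum at P = σ, and a ceiling of about 0.98 at P = 1) can be checked with a minimal Python sketch of the Naka-Rushton function. The function name `naka_rushton` and the clamping of non-positive inputs to zero are illustrative assumptions, not part of the original MATLAB implementation.

```python
def naka_rushton(p, m=1.0, sigma=0.25, xi=3.0):
    """Naka-Rushton sigmoid S(P) = M * P**xi / (sigma**xi + P**xi).

    Defaults follow the text: M = 1, sigma_0 = 0.25 * M, xi = 3.
    Returning 0 for non-positive input is an assumption of this sketch.
    """
    if p <= 0.0:
        return 0.0
    return m * p**xi / (sigma**xi + p**xi)


# Input equal to the semi-saturation constant gives half of M.
half_response = naka_rushton(0.25)
# Input P = 1 gives M / (0.25**3 + 1), i.e. close to the ceiling M.
ceiling_response = naka_rushton(1.0)
print(half_response, ceiling_response)
```

With the default parameters, the second value equals 1 / 1.015625, which is the ≈0.98 ceiling quoted in the text.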

Similar to Rothenstein and Tsotsos (2014), we considered the effect of adaptation currents Ih on the ST-neuron, and incorporated them in the dynamic equation as additive factors that modulate the magnitude of the semi-saturation constant σ. The new σ(t) is then re-computed at every time-step using Equation (2) as follows:

$$
\sigma(t) = \sigma_0 + f_{\text{slow}} \cdot H_{\text{slow}}(t) + f_{\text{fast}} \cdot H_{\text{fast}}(t) \tag{2}
$$

where σ<sub>0</sub> is the original parameter. Adaptation currents consist of two different components, H<sub>slow</sub> and H<sub>fast</sub>, each evolving within a particular time-scale, coupled to the value of the firing rate FR, and whose time course is scaled by the characteristic time constant τ<sub>x</sub>, with x being either fast or slow. In turn, f<sub>slow</sub> and f<sub>fast</sub> are the values of the amplitude for each contribution. The temporal evolution of the two components is given by Equation (3):

$$
\begin{aligned}
\frac{dH_{\text{fast}}(t)}{dt} &= \frac{1}{\tau_{\text{fast}}} \cdot \left(-H_{\text{fast}}(t) + FR(t)\right) \\
\frac{dH_{\text{slow}}(t)}{dt} &= \frac{1}{\tau_{\text{slow}}} \cdot \left(-H_{\text{slow}}(t) + FR(t)\right)
\end{aligned} \tag{3}
$$

Equations (1–3) are independently updated for each neuron at every time step (Δt = 2 ms) using a customized fourth-order Runge-Kutta algorithm implemented in MATLAB 2016a (The MathWorks, Inc.). The original details of the implementation can be found in Wilson (1999).
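To make the update scheme concrete, the following Python sketch integrates Equations (1–3) for a single neuron with a classical fourth-order Runge-Kutta step. It is not the authors' MATLAB code. The values τ = 10 ms, Δt = 2 ms, σ<sub>0</sub> = 0.25, M = 1, ξ = 3, the 300 ms duration and the stimulus removal at 250 ms are taken from the text; the adaptation parameters `tau_fast`, `tau_slow`, `f_fast` and `f_slow` are not reported in this section and are hypothetical placeholders, as is the choice of a unit-amplitude input held constant within each step.

```python
# Values marked "hypothetical" are placeholders, not taken from the paper.
PARAMS = {
    "tau": 10.0,       # ms, activation time constant (from the text)
    "sigma0": 0.25,    # semi-saturation constant sigma_0 (from the text)
    "tau_fast": 50.0,  # ms, hypothetical
    "tau_slow": 500.0, # ms, hypothetical
    "f_fast": 0.3,     # hypothetical amplitude of the fast component
    "f_slow": 0.3,     # hypothetical amplitude of the slow component
}


def naka_rushton(p, sigma, m=1.0, xi=3.0):
    """S(P) = M * P**xi / (sigma**xi + P**xi); negative input is clamped to 0."""
    p = max(p, 0.0)
    return m * p**xi / (sigma**xi + p**xi)


def derivatives(state, p, prm):
    """Right-hand sides of Equations (1) and (3) for state (FR, H_fast, H_slow)."""
    fr, h_fast, h_slow = state
    # Equation (2): adaptation raises the semi-saturation constant.
    sigma = prm["sigma0"] + prm["f_slow"] * h_slow + prm["f_fast"] * h_fast
    d_fr = (-fr + naka_rushton(p, sigma)) / prm["tau"]
    d_h_fast = (-h_fast + fr) / prm["tau_fast"]
    d_h_slow = (-h_slow + fr) / prm["tau_slow"]
    return (d_fr, d_h_fast, d_h_slow)


def rk4_step(state, p, dt, prm):
    """One classical Runge-Kutta 4 step with the input p held constant."""
    def shifted(k, scale):
        return tuple(s + scale * ki for s, ki in zip(state, k))

    k1 = derivatives(state, p, prm)
    k2 = derivatives(shifted(k1, dt / 2.0), p, prm)
    k3 = derivatives(shifted(k2, dt / 2.0), p, prm)
    k4 = derivatives(shifted(k3, dt), p, prm)
    return tuple(
        s + dt / 6.0 * (a + 2.0 * b + 2.0 * c + d)
        for s, a, b, c, d in zip(state, k1, k2, k3, k4)
    )


def simulate(t_end=300.0, dt=2.0, stim_off=250.0, prm=PARAMS):
    """Return the firing-rate trace for a unit input switched off at stim_off (ms)."""
    state = (0.0, 0.0, 0.0)
    trace = [state[0]]
    for i in range(int(round(t_end / dt))):
        p = 1.0 if i * dt < stim_off else 0.0
        state = rk4_step(state, p, dt, prm)
        trace.append(state[0])
    return trace


fr_trace = simulate()
```

Under these placeholder values the trace rises quickly, sags while the two adaptation components build up, and decays after the stimulus is removed; the exact shape depends on the hypothetical parameters and should not be read as a reproduction of the published figures.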

#### Circuit Design and Connectivity

Following the original design by Reynolds et al. (1999), our circuit represents a three-tier structure in which the response of the top-most unit quantifies the model's performance. The time course of this response was computed when two stimuli, each of which could be located either within or outside the cell's receptive field (RF), competed for representation (see **Figure 1C**). The bottom layer, represented by two colored upward arrows, contains the input representation. The intermediate layer consists of two units, each accounting for the average response of an individual population (black ellipses) of ST-neurons tuned to the stimulus directly below it. This level represents the activation of the populations in the V1-V2 cortices. In turn, the neuron located at the top was defined as the main neuron (top circle). This unit represents a V4 cell, whose complex receptive field is able to process whole object representations.

Inputs at the bottom are represented by particular combinations of excitatory and inhibitory connection weights projected to the intermediate layer. Each intermediate population receives excitatory (red continuous arrows) and inhibitory connections (green dotted arrows) from the input, and projects them to the top. The top unit receives both types of feed-forward inputs from the intermediate layer. **Figure 1B** shows a simplified version of the circuit in which a single stimulus is presented and processed. Connection weights were defined in the interval [−1, 1], with the convention that w is inhibitory if −1 ≤ w < 0, and excitatory if 0 ≤ w ≤ 1. Consequently, any potential changes to the stimulus properties should be reflected as changes in the combination of connection weights representing it. During the time course of each simulation the set of excitatory and inhibitory connection weights from the intermediate layer onto the target (top) neuron remained fixed. Consistent with our assumptions, the representation of a given stimulus consisted of setting only the excitatory and inhibitory connection weights from the bottom to the intermediate layer. All other parameters were fixed within and across simulations, unless otherwise stated.

#### Neural Selectivity

Neural selectivity is the mechanism by which a neuron raises its firing rate when a stimulus has a certain feature matching its tuning curve. Thus, a preferred stimulus is one for which the neural selectivity is high. In order to incorporate selectivity into the circuit, and given that neurons were connected through inhibitory and excitatory inputs with particular connection weights, we assumed for a preferred stimulus an excitatory (E) connection weight w<sub>E</sub> belonging to the interval 0.75 < w<sub>E</sub> ≤ 1, and consequently an inhibitory (I) weight w<sub>I</sub> = 1 − w<sub>E</sub>, belonging to 0 ≤ w<sub>I</sub> ≤ 0.25. In the case of a stimulus for which the cell selectivity is low, the inhibitory weight approached w<sub>I</sub> = 1 and the excitatory weight w<sub>E</sub> = 0. For the sake of convenience, and bearing in mind that for the current normalized case the sum of weighted E and I inputs satisfies ∑ |w<sub>I</sub>| · I + |w<sub>E</sub>| · E = 1, any stimuli with 0.7 ≤ w<sub>E</sub> ≤ 0.75 were considered as of neutral selectivity. Stimuli with 0.75 ≤ w<sub>E</sub> ≤ 1 were defined as preferred (or having high selectivity), and stimuli with 0.5 < w<sub>E</sub> < 0.7 were defined as non-preferred (or having low selectivity).
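The weight ranges above amount to a simple lookup, sketched below in Python. The function name `classify_selectivity` and the label strings are illustrative. Because the text lists w<sub>E</sub> = 0.75 in both the neutral and the preferred range, this sketch assigns the boundary to the neutral class as an assumption, and it labels w<sub>E</sub> ≤ 0.5 as "unclassified" since the text gives no category for that range.

```python
def classify_selectivity(w_e):
    """Return (label, w_i) for an excitatory weight w_E in [0, 1].

    The inhibitory weight follows w_I = 1 - w_E, as in the text.
    Boundary handling at w_E = 0.75 is an assumption of this sketch.
    """
    if not 0.0 <= w_e <= 1.0:
        raise ValueError("w_E must lie in [0, 1]")
    w_i = 1.0 - w_e
    if w_e > 0.75:
        label = "preferred"
    elif w_e >= 0.7:
        label = "neutral"
    elif w_e > 0.5:
        label = "non-preferred"
    else:
        label = "unclassified"  # range not categorized in the text
    return label, w_i


for weight in (0.9, 0.7, 0.6):
    print(weight, classify_selectivity(weight))
```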

#### ST's Top-Down Attentional Signal

The attentional signal was implemented in consonance with the ST model, by creating a top-down branch-and-bound selection mechanism that picked the targets and suppressed the neural representation of the distractors, as described in Tsotsos (2011). The amplitude of the signal belonged to the range [0, 1] and was computed as the absolute difference between the magnitudes of the activation of the intermediate units; the resulting factor was used to multiply the weights of the unit associated with the irrelevant input. This process has been fully described several times previously, most recently in Tsotsos (2011), and thus will not be repeated here.

### RESULTS

## Characterizing the ST Neuron Dynamics

In order to extend previous findings, we first characterized the time course of the neuron in relation to basic parameters, and then by modeling the response of the neuron after incorporating adaptation mechanisms, we evaluated their effect on the cell's firing dynamics during a set of simulated visual tasks.

In the absence of adaptation mechanisms, the activation of the ST-model neuron is determined by the two parameters σ and τ (see Equation 1 in section Methods). Although the Naka-Rushton function was first introduced in order to account for the adaptive saturation of photoreceptors to particular illumination conditions, its role in shaping the response of the ST-neuron was not previously addressed.

Frontiers in Neuroscience | www.frontiersin.org

4 March 2018 | Volume 12 | Article 123

As a two-step exercise, we first fixed the value of τ and varied σ, and then reversed the roles, fixing σ while varying τ. In the first case, we assumed M = 1.0 and σ = k·M, for k = 0, 0.25, 0.5, 0.75, and 1.0, obtaining the response curves shown in **Figure 2A**. Their shape followed a sigmoid pattern with an amplitude of saturation (maxFR) that depended on the choice of σ, counterbalanced by P, and scaled by M (red curve in **Figure 1A**). Our simulations show that for every σ, the FR-profile saturated within the initial 50 ms. For larger σ, increasing k led to monotonic decrements of the saturation rate's magnitude (maxFR) (**Figures 2A,B**). The analytical relation was well described by the expression maxFR = −0.54 · σ<sup>2</sup> + 0.0076 · σ, with a resulting norm of residuals nr = 0.024696. This result suggests that in the limiting condition σ → 0, the smaller the value of σ the closer maxFR is to M.

By fixing σ and varying τ within a biologically plausible range (τ = 0.0, 5.0, 10.0, 15.0, and 20.0 ms), rather than variations in maxFR we observed significant effects on the time required by the sub-saturation period (rising phase) to reach maxFR (see **Figures 2C,D**). Although the model's output behaved reasonably for τ ≅ 10–20 ms, we followed experimental observations from previous studies (Jensen et al., 2005) in choosing τ = 10 ms, which on the one hand accounts for an acceptable duration of the sub-saturation period of around 20 ms, and on the other coincides with the reported time constant of GABAergic synapses such as GABA<sub>A</sub>, aligning also with the idea that "... tonic inhibition in single neurons increases the firing threshold and reduces the membrane time constant ..." (Hutt, 2012). For τ shorter than 10 ms, unrealistically fast saturation of the rate occurred, while for τ much larger than 20 ms, sub-saturation intervals were extremely long. In general, the response of the model is consistent with experimental findings (Kandel et al., 2000), displaying a relation between the time required for the firing rate to saturate, i.e., the sub-saturation period sSP, and τ, given by the analytical expression sSP = 130 · τ<sup>2</sup> + 6.6 · τ + 0.022, with a norm of residuals n = 0.00775. Although the results for smaller τ's might reflect the action of other mechanisms, those do not necessarily represent the dynamics in the visual cortex (Cavelier et al., 2005).

A general result extracted from this simple analysis shows that far from interfering with one another, σ and τ control and modulate different parameters of the cell's activation, and their joint action reliably accounts for the efficacy of individual neurons to tune their firing to particular feature(s) of the synaptic representation of a certain stimulus.
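The separation of roles between σ and τ can be read directly from Equation (1). As a worked sketch, assuming a constant input P, a fixed σ (no adaptation currents) and an initial rate FR(0) = 0 (the initial condition is an assumption, not stated in the text), the equation is linear in FR and has the closed-form solution:

$$
FR(t) = S(P)\left(1 - e^{-t/\tau}\right), \qquad \lim_{t \to \infty} FR(t) = S(P) = \frac{M P^{\xi}}{\sigma^{\xi} + P^{\xi}}
$$

Under these assumptions, σ enters only through the plateau S(P), whereas τ only sets how fast the plateau is approached: the rate reaches 1 − e<sup>−2</sup> ≈ 86% of S(P) at t = 2τ (20 ms for τ = 10 ms) and about 95% at t = 3τ.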

## Effects of the Adaptation Currents (Ih) on the Firing Rate of a Single Cell

An overall comparison between the FR-profile of the neuron without Ih and with Ih is depicted in **Figure 1A**. The stimulus onset occurred at t = 0 and the removal at t = 250 ms. Note the unaffected rising phase of the FR-profile in the with-Ih scenario (blue trace) and the appreciable changes occurring during the post-saturation phase of the with-Ih case compared to the non-Ih case (red trace). As in Rothenstein and Tsotsos (2014), Ih currents are represented by the linear combination of a slow (H<sub>s</sub>) and a fast (H<sub>f</sub>) component, whose time courses are depicted in **Figure 2E** by the blue and purple traces, respectively. The modulation imposed on the constant σ (yellow trace on top) shows a periodic signal that slowly rises from σ<sub>0</sub> to its maximum within ∼130 ms, and exponentially decays within a comparable interval (∼120 ms). As previously mentioned, the FR's rising phase remains unaffected and the overall effect is constrained to its post-saturation phase in a two-step process (see **Figures 2E,F**). In the first step, during a transitory interval (∼50 ms), the firing rate is driven by the activation of the fast component of Ih, H<sub>fast</sub>, leading the FR-profile to rapidly decay to ∼70–80% of its maximum (maxFR). In the second, once H<sub>fast</sub> has reached its maximum, the slow activation of H<sub>slow</sub> takes over and reduces the speed of the FR decay, leading to a pseudo-plateau in the FR-profile in which, in the absence of any further changes in the stimulus, the FR remains constant.

FIGURE 2 | Temporal evolution of the ST-neuron's firing rate. (A) For a constant input, the amplitude of the firing rate has a transitory pre-saturation period which is independent of the half-saturation constant σ. After this point, however, increasing σ caused the saturation rate of the cell to fall, to a lesser or greater extent depending on its magnitude, and to reach smaller maxFRs. (B) Analytical expression of the relation between the variables depicted in (A): firing rate and σ are related through a quadratic function for which small values of σ near 0 rapidly make maxFR ≅ M. (C) A similar relation rules the effect of τ on the time required by the firing rate to saturate when σ was kept fixed. The simulation shows strong modulation before the 100 ms point of each simulation. Although maxFR remained unchanged, the duration of the sub-saturation period increased proportionally to τ following the trend plotted in (D). (E) Temporal pattern of the fast (H<sub>f</sub>) and slow (H<sub>s</sub>) components of the Ih-current. The combined effect of the two components modulates the firing rate by adding temporal dependence to σ (see Equation 2 in section Methods), whose dynamics is represented by the top trace in (E). (F) Response of the top cell, showing the effect on the FR-profile of the synaptic inputs and the activation of Ih. Here the values of σ are identical to (A). Note in the latter the decaying post-saturation profile and the generation of bumps before reaching the stationary firing regime.

# Response of ST-Neuron (With Ih) to Stimuli With Different Selectivity

To run this set of experiments we initially assumed attention not to be directed to the stimuli; thus the time course of the FR-profile only depended on the neuron's selectivity to a given stimulus. We simulated various (uniquely defined) types of inputs with selectivity being accounted for by the relative contribution of the inhibitory and excitatory connections.

In each experiment a given pair of stimuli was shown as input to the circuit of **Figure 1C** (for details see section Methods). To maintain consistency with psychophysical studies, we refer to the first stimulus as the reference; its onset occurred at t = 0 ms and its removal at t = t′, with t′ > 0. The removal coincided with the onset of the second stimulus, denoted the probe, which remained active until the end of the simulation. The time t = t′ was designated as the switching time. In addition, the processing of each stimulus activated only one of the intermediate populations. The total duration of the simulation, 300 ms, was considered long enough to allow input-related information to propagate from the bottom to the top neuron (target).
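The exchange protocol can be written down as two on/off time courses. The sketch below is only an illustration of the timing described above; the function name `exchange_protocol`, the 1 ms sampling step, and the use of binary (0/1) stimulus indicators are assumptions, not details taken from Methods.

```python
def exchange_protocol(duration_ms, t_switch_ms, dt_ms=1.0):
    """Return (times, reference, probe) for a stimulus-exchange trial.

    The reference is on from t = 0 until the switching time t';
    the probe is on from t' until the end of the simulation.
    """
    n = int(round(duration_ms / dt_ms))
    times = [k * dt_ms for k in range(n)]
    reference = [1.0 if t < t_switch_ms else 0.0 for t in times]
    probe = [1.0 if t >= t_switch_ms else 0.0 for t in times]
    return times, reference, probe


# Example: 300 ms trial with the switch at t' = 100 ms.
times, reference, probe = exchange_protocol(300, 100)
```

At every sample exactly one of the two stimuli is active, which is what distinguishes this exchange condition from a condition in which both stimuli are on together.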

**Figure 3** shows the time course of the FR-profile of the top neuron when initially driven by the reference. The rising phase remained unaltered irrespective of how early t′ occurred, while the post-saturation period was significantly affected in two ways. First, a latency appeared, caused by the decay of the initial FR; second, a sudden rebound appeared, with maxFR depending on the probe alone. During the latency, and as an effect of switching inputs, the FR-profile became unstable, leading to a transient drop-and-catch phase characterized by a discontinuous change of concavity and followed by a fast regain of firing. Once the FR surpassed maxFR due to the cell being engaged to the probe, the profile decayed following the dynamics described in the previous section, with a pseudo-stationary state ruled by the slow component of Ih. In every experiment a neutral reference, i.e., one with excitatory synaptic weight WE−ref = 0.7 (blue continuous trace), systematically preceded the probe, which had identical (WE−<sup>p</sup> = 0.7), larger (WE−<sup>p</sup> = 0.75, 0.80) or smaller selectivity (WE−<sup>p</sup> = 0.65, 0.60, 0.55) than the reference. While the probes with larger selectivity led to steeper jumps in the firing rate and bumps characterized by large maxFRs, stimuli with lower selectivity led to an even faster decay of the FR. The stationary response always equaled the stationary response evoked by the probe in the absence of other inputs. Note that probes with identical selectivity to the reference did not align with the expected smooth profile evoked by the reference. A possible explanation is that the original tuning (i.e., the combination of weights) of the intermediate unit processing the probe was different from that of the reference and in consequence led to small bumps in the model (see purple traces in **Figure 3**). Assessing the plausibility of this effect needs further study and is left as an open research point.

In general, the distortion in the reference's FR-profile was easier to recognize for probes presented shortly after the reference's onset, i.e., for t′ less than 200 ms. This result was consistent irrespective of the probe's selectivity (compare the shapes of the profiles in **Figures 3A–C** against those in **Figures 3D–F**). Note that in the case of a late t′, the transitory state did not interfere with the original time course of the FR-profile, but took place once the cell was close to the FR-profile's plateau, which could be interpreted as the replication of the original activity, but now due to the probe and with a different base rate.

Concerning the latency, our results show that for probes with less selectivity than the reference, the firing dropped and slowly recovered, producing a smooth trough in the FR-profile. The depth and width of the trough depended on the relative difference in selectivity between the two stimuli, the trough being wider for less preferred probes. For probes with larger selectivity than the reference, the width of the trough was negligible, and the FR-profile discontinuously lost and regained firing after the stimuli were switched. In general, the particular shape and steepness of the bumps depended on the relative selectivity of the reference and probe, and once the transition occurred the rate slowly tended to stabilize around the stationary state evoked by the probe.

## Adding the Probe to the Reference Modulates FR-Profile but Induces No Latencies

As a second scenario, instead of switching stimuli at t′, we modeled a condition in which the probe was added to the reference, and computed the time course of the top neuron's FR-profile (**Figure 4**). We ran the experiment for different probe selectivities and onset times t′, with WE−<sup>p</sup> = 0.55, 0.60, 0.65, 0.70, 0.75, 0.80 (recalling that WI−<sup>x</sup> = 1 − WE−<sup>x</sup> with x = ref or p), using a neutral reference (i.e., WE−ref = 0.7) presented at t = 0. The FR-profiles in **Figure 4** show that, in contrast to the previous case (see **Figure 3**) and in the absence of attention, adding the probe at t = t′ produced no decaying latencies. Furthermore, probes with larger selectivity than the reference induced almost instantaneous rebounding bumps, but in this case the amplitude of maxFR for the two stimuli never reached that of the reference alone, while less preferred probes led to a sudden drop followed by a less frequent but sustained and regular firing of the cell. Without exception, for all probes the value of maxFR was fixed across each of the diagrams showing t′ = 50, 100, 150 ms. In contrast, for t′ > 150 ms, i.e., t′ = 200, 250, and 300 ms, the amplitude of maxFR for more preferred probes equaled that of the reference alone, while for the less preferred probes it got closer to zero for late t′, followed by a smooth recovery with low but sustained firing.
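The set of conditions in this addition experiment can be enumerated directly from the values quoted above. The following sketch only builds that grid; the dictionary keys and the function names are hypothetical, and the complementary rule WI = 1 − WE is the one stated in the text.

```python
from itertools import product

W_E_REF = 0.70                                   # neutral reference
PROBE_W_E = (0.55, 0.60, 0.65, 0.70, 0.75, 0.80)  # probe excitatory weights
ONSETS_MS = (50, 100, 150, 200, 250, 300)         # probe addition times t'


def inhibitory_weight(w_e):
    """Inhibitory weight paired with an excitatory weight: WI = 1 - WE."""
    return 1.0 - w_e


def addition_conditions():
    """One entry per (onset time, probe weight) pair of the experiment."""
    return [
        {
            "t_onset_ms": t_onset,
            "w_e_ref": W_E_REF,
            "w_i_ref": inhibitory_weight(W_E_REF),
            "w_e_probe": w_e,
            "w_i_probe": inhibitory_weight(w_e),
        }
        for t_onset, w_e in product(ONSETS_MS, PROBE_W_E)
    ]


conditions = addition_conditions()
```

Six onset times crossed with six probe weights give 36 conditions, one per trace in the six panels of the corresponding figure.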

In all cases the transient phases were followed by a recovery leading to a stationary rate. Since the sharp rebounding/dropping effect was a direct result of the presence of Ih and of the cell modulating its selectivity due to the probe being added, we hypothesize that as a result of trial and error such a change of concavity (inflection point in the first time derivative) may be utilized as a suitable selection cue to predict the stimulus' category. In particular, the computation of the instantaneous (not the average) derivative satisfies that requirement, and only demands local adaptation of the cell's firing.
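One way to read the proposed cue is as a sign change in the local curvature of the firing rate, estimated from three consecutive samples. The sketch below is an assumption-laden illustration of that reading: the function name `concavity_changes`, the second finite difference as curvature estimate, and the tolerance `tol` are all hypothetical choices, not part of the model.

```python
def concavity_changes(fr, dt_ms, tol=1e-9):
    """Return the times (ms) at which the curvature of `fr` changes sign.

    Curvature is estimated locally with the second finite difference,
    so each decision uses only three neighbouring samples.
    """
    flips = []
    previous_sign = 0
    for k in range(1, len(fr) - 1):
        curvature = (fr[k + 1] - 2.0 * fr[k] + fr[k - 1]) / dt_ms ** 2
        if abs(curvature) < tol:
            continue  # treat near-zero curvature as "no information"
        sign = 1 if curvature > 0 else -1
        if previous_sign != 0 and sign != previous_sign:
            flips.append(k * dt_ms)
        previous_sign = sign
    return flips


# Synthetic trace with a single inflection at t = 100 ms.
trace = [((t - 100) / 100.0) ** 3 for t in range(201)]
print(concavity_changes(trace, 1.0))
```

Because the estimate is purely local, it can be updated sample by sample as the rate evolves, which is the property the paragraph emphasizes when it contrasts the instantaneous with the average derivative.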

FIGURE 3 | Stimulus exchange leads to strong discontinuities and transients in the FR-profile. Experiments were run simulating a fixed interval of 600 ms, and exchanging the stimulus at (A) t′ = 50 ms, (B) t′ = 100 ms, (C) t′ = 150 ms, (D) t′ = 200 ms, (E) t′ = 250 ms, and (F) t′ = 300 ms. Colored traces indicate the probe's selectivity, characterized by the excitatory weight WE−p (refer to labels in Methods for details). Switching from a neutral reference (WE−ref = 0.7) to a probe with larger or smaller selectivity created unstable surges of firing, followed by a stationary state. Note that in the case of a late t′, the transitory state did not interfere with the original time course of the FR-profile, but took place after the cell's recovery and near the FR-profile's plateau.

FIGURE 4 | Adding a probe to the reference destabilized the firing rate and induced transients in it. The reference stimulus was presented at t = 0 ms (WE−ref = 0.7), and different probes were added at (A) t′ = 50 ms, (B) t′ = 100 ms, (C) t′ = 150 ms, (D) t′ = 200 ms, (E) t′ = 250 ms, and (F) t′ = 300 ms. Similar to the exchange experiment, transient bumps/troughs indicated sharp variations in the FR-profile. However, for probes with larger selectivity than the reference, the shape and amplitude ratios between the principal and secondary peaks depended on the probe's addition time t′ (see secondary bumps in A–F). For probes with less selectivity, the transients exhibited variable concavities and lengths, and thus led to cell responses with significantly reduced and more unstable firing rates (e.g., purple traces).

# Comparing Selectivity Results in the Model With Experimental Findings

In a previous work on visual selection and color perception, Fallah et al. (2007) measured the response of single neurons to a set of stimuli falling within their RF. Cells were located in the V4 extrastriate visual cortex of primates, and were tuned to a particular hue. The animal was first exposed to a stimulus at t = 0, and at t = t′ a second stimulus with a different hue structure was added. The recordings show a reshaping of the FR-profile in proportion to the relative match between the hue of the stimuli and the selectivity of the cell, producing FR patterns (depicted in **Figure 5A**) close to those shown in **Figure 4**. Ferrera et al. (1994) reported similar in vivo dynamics while recording from cells in areas 7a, MT and V4. Even though in both studies the outcome of the experiments clearly reflects correlations between the cell's response and feature-related information of the stimulus, the responsible mechanism was not characterized.

In order to explore the plausibility of the ST-cell dynamics with Ih in explaining those results, we implemented a high-level simulation of the experiment of Fallah et al. (2007) using the circuit from **Figure 1C**. The neutral reference (WE−ref = 0.7) was presented at t = 0 and a probe with larger or smaller selectivity was added at t′ = 300 ms. As a first confirmation of the model's efficacy, we observed that, when starting with a neutral reference, the addition of a more preferred probe (WE−<sup>p</sup> = 0.80) induced a sharp increase in the FR and a bump with characteristics similar to the effects described in the previous section for probes with selectivity larger than the reference (compare blue traces in **Figures 5A,B**). In turn, a less preferred probe (WE−<sup>p</sup> = 0.6) led to a drop and stabilization of the FR-profile (see red traces in **Figures 5A,B**).

In spite of the qualitative similarities between simulations and experiment, once the second stimulus is added the experiment shows a brief period of non-responsiveness prior to a sharp modulation of firing. The model underestimates this period, although this is not necessarily a flaw.

Since the biological problem suggests that, for a particular combination of inputs, the neuron's activation remains close to the resting state, the cell may react in one of two ways. It may raise its firing whenever the threshold is reached, which generates a silent period of insensitivity; or it may become hyperpolarized and consequently reduce its firing, which does not demand a threshold crossing and therefore requires no insensitive period. We believe this aspect needs further analysis: to account for the result, experiments using a broader range of selectivities need to be considered in a future study, together with further computational exploration.

### Effects of Attention on the FR-Profile

The most interesting aspect of the ST-characterization concerns its behavior during attentional tasks. In this section we examine the extent to which attention could or could not modulate the dynamics of the cell's selectivity.

As proposed by the Selective Tuning model (Tsotsos, 1990, 2011; Tsotsos et al., 1995), allocating/engaging attention in the model corresponds to the activation of the selection mechanism. This mechanism was represented by a top-down control signal responsible for suppressing information associated with irrelevant stimuli, while leaving unaffected the connections between the cells that processed information related to the attended stimulus, in a task-dependent manner. We quantified the suppressive signal by computing the absolute difference between the weighted inputs impinging on the top neuron, and used it to multiply the weight of the inputs from the unattended stimulus (see section Methods). This approach has proven to be fast and accurate at disambiguating stimuli since, rather than adding up the weighted contributions of all incoming signals, it allows single neurons to efficiently filter them and focus on the relevant ones. This idea is supported by a key observation by Martinez-Trujillo and colleagues (Martinez-Trujillo and Treue, 2004; Khayat et al., 2010), according to which attention modulates the input to a given neuron instead of its direct response.
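The gating rule described above can be sketched in a few lines. This is a minimal illustration under stated assumptions: the function names are hypothetical, the two inputs are treated as scalars, and the unattended condition is modeled as a plain weighted sum. The exact formulation is the one given in Methods, which is not reproduced here.

```python
def gate_unattended(w_attended, x_attended, w_unattended, x_unattended):
    """Weight of the unattended input after attention is engaged.

    The suppressive signal is the absolute difference between the two
    weighted inputs; it multiplies the weight of the unattended input.
    """
    suppressive_signal = abs(w_attended * x_attended - w_unattended * x_unattended)
    return w_unattended * suppressive_signal


def top_neuron_input(w_attended, x_attended, w_unattended, x_unattended,
                     attention_on):
    """Total weighted input reaching the top neuron (assumed plain sum)."""
    if attention_on:
        w_unattended = gate_unattended(w_attended, x_attended,
                                       w_unattended, x_unattended)
    return w_attended * x_attended + w_unattended * x_unattended


# Attended stimulus with weight 0.7, unattended stimulus with weight 0.6.
without_attention = top_neuron_input(0.7, 1.0, 0.6, 1.0, attention_on=False)
with_attention = top_neuron_input(0.7, 1.0, 0.6, 1.0, attention_on=True)
```

In this example the unattended weight shrinks from 0.6 to about 0.06, so the input to the top neuron moves close to what the attended stimulus would deliver alone.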

Using the circuit in **Figure 1C**, we studied the response of the top neuron when the reference and the probe were presented in isolation and simultaneously. In addition, to track possible variations in the stationary state, the attentional signal remained active until the end of the simulated period.

In agreement with real experiments, and regardless of the selectivity associated with each stimulus, when two stimuli of different selectivity were presented to the top neuron, the average behavior of the FR-profile fell in between those evoked by each stimulus in isolation; see **Figures 6A,D**. However, when the stimuli were presented simultaneously, a late engagement of attention to one of them modulated the cell's FR and forced it to adjust to the magnitude evoked by the attended stimulus regardless of its selectivity, consistent with the theory (Martinez-Trujillo and Treue, 2004). The behavior is shown in **Figures 6B,C**, where the neutral reference (WE−ref = 0.7) and the probe with less selectivity (WE−<sup>p</sup> = 0.6) were both located inside the classical receptive field of the top neuron and simultaneously presented at t = 0 ms. When attention was allocated at t′ = 50, 100, 200, 400, and 600 ms, the FR rose or dropped according to which stimulus was attended. Similar effects were obtained when the selectivity of the probe (WE−<sup>p</sup> = 0.8) was larger than that of the reference, as shown in **Figures 6E,F**.

Irrespective of which stimulus was considered reference or probe, engaging attention to the one with larger selectivity led the FR-profile to generate larger bumps (maxFR) than those observed for the attention-away condition (dashed traces in **Figures 6C,E**), and an FR with magnitude similar to that evoked by the stimulus with larger selectivity in isolation. On the other hand, engaging attention to the stimulus with less selectivity produced FR-profiles characterized by troughs initiated at t′. In **Figure 6B** the transient was deeper than in the traces of **Figure 6F**, although in both cases the stationary response of the FR-profile coincided with that of the stimulus with less selectivity for the attention-away condition.

# Comparing the Effect of Attention in the ST-Neuron With Experimental Recordings

**Figures 7A,B** correspond to the simulated conditions in which attention was either engaged to the reference with less selectivity (**Figure 7A**) or not allocated at all (**Figure 7B**). Interestingly, the resulting FR-profile in the first case shows a masking effect of attention: in spite of the probe having larger selectivity than the reference, the FR gets only modestly disrupted, remaining locked to the FR-profile of the reference. This contrasts with the effect observed for the attention-away condition, in which the selectivity led the cell to rapidly increase the FR and adjust the FR-profile, matching that evoked by the probe alone, in this case with larger selectivity.

neurons of primates, adapted from Fallah et al. (2007). The vertical dotted line indicates stimulus appearance, and the continuous black lines the period over which modulation of the response was computed. The red trace indicates the population response for the "preferred" (P) stimulus alone followed by the addition of the non-preferred (NP); while the blue trace indicates the non-preferred alone, followed by the addition of the preferred. (B) Simulated experiment. Both traces represent the response of the neuron when a neutral stimulus was presented followed by the addition of the preferred stimulus (red trace), or the non-preferred one (blue trace). The dashed line indicates the time at which the probe addition occurred and the continuous line the time of the transient's peak.

FIGURE 6 | Engaging attention modulates the transients, and modifies the amplitude of the stationary response. Reference and probe stimuli were simultaneously presented and attended as indicated for each trace (see labels). The stimuli were presented at t = 0 ms and attention was engaged at t′ = 50, 100, 200, 400, 600 ms after stimulus presentation. (A,D) Show the reference and probe stimuli presented in isolation (blue and red traces) and simultaneously (yellow traces). In the first scenario, the reference is stronger and in the second the probe is stronger (higher selectivity). As expected, the cell's selectivity mechanism produced firing rates with well differentiated maxFRs, each proportional to the respective selectivity of the stimulus. In addition, simultaneous reference and probe presentation led to FR-profiles with intermediate amplitudes. Experiments were run for attention oriented to the probe (B,E), and attention oriented to the reference (C,F). Besides the characteristic transients, directing attention to the probe shifted the tail of the FR baselines to the profile produced by the probe alone. Attending the reference produced a similar effect on the FR baseline, shifting in this case the tails of the response toward the FR-profile of the reference alone. Those long-term rate responses were consistent, irrespective of the relative selectivity between the reference and the probe.

In an experimental study, Luck et al. (1997) measured single-cell responses of neurons located in V4 associated with the appearance of a particular target. Stimuli were defined as effective or ineffective on a selectivity basis. In their protocol, a series of trials consisted of presenting, sequentially or simultaneously, pairs of simple stimuli characterized by color and orientation, which could be both inside the cell's receptive field, or one inside and the other outside it, and attention was deployed to one of the two regions. For further details please refer to Luck et al. (1997). Comparing our results with those experimental recordings (**Figures 7A,C**, respectively), the simulation shows good agreement, not only in the shape but also in the time course of the FR-profile. In contrast to the condition observed in those figures, **Figure 7B** shows that in the absence of attention (attention-away condition) there is no masking at all of the scene, and any probe stimulus with larger selectivity than the reference will draw the largest part of the cell response when both stimuli are located inside the RF. As in the experiment, **Figure 7A** shows the response of the top neuron after the reference and probe are presented simultaneously at t = 0 and the attentional mechanism is deployed at t = t′. Both simulation (**Figure 7A**) and experiment (**Figure 7C**) are characterized by a small modulatory dent in the cell's FR-profile while a less selective reference is attended. The match between model and experiment suggests that, from the model's perspective, Ih makes the neuron highly sensitive to the effects of attention on selectivity (recall that in the absence of Ih the cell reached saturation and the FR could not be modulated; see red trace in **Figure 1A**). From the biological perspective, the model suggests that attention and selection compete for resources when stimuli with low selectivity are attended. However, as will be discussed later, the results in **Figure 8** show that collaborative enhancement is also possible.

# Attention Competes Against or Reinforces Neural-Selectivity

In our final experimental design, we ran simulations in which the reference was presented at t = 0 and the probe was presented and attended at t' = 50, 100, 200, 400, and 600 ms. Probes had either larger or smaller selectivity than the reference. In the attention away condition, a probe with less selectivity than the reference produced a decaying FR-profile characterized by shallow troughs and durations of the transient close to 150 ms, followed by a slow recovery of the FR in the direction of the stationary state (**Figure 8D**). In the same condition, probes with larger selectivity than the reference created rebounding firing rates with increasing amplitude, especially for late stimulus onset t′ .

Running the same set of experiments while attention was allocated to the probe at t′, simultaneously with the probe's presentation, shows that attention has an ambiguous effect, depending on whether the transient or the stationary dynamics of the cell's response is analyzed. As a reference, **Figures 8A,B** show the FR-profile of the ST-model neuron in the attention-away condition. All traces show that, consistent with previous studies (Martinez-Trujillo and Treue, 2004) and based on its selectivity, the cell has a larger maxFR for a more preferred stimulus and vice versa, while when the pair is active the response always falls in between the FR-profiles of the other two.

The effect of the selection mechanism of attention seemed to have a transitory component characterized by reinforcement of the cell's selectivity, while in the long term its behavior turned competitive. Although the affirmation may look contradictory, a careful check of **Figures 8B,C** shows that although the depth of the trough is larger for the attend-to-probe scenario, suggesting a steeper reduction of the FR (inhibition's reinforcement), the cell's response to the same onsets of the probe (indicated by traces of the same color in both figures) also corresponds to shorter widths (duration) of the trough in the attend-to-probe condition. In turn, when the FR was restored, the FR-profile matched that of the probe alone, in contrast to the attention-away condition (**Figure 8B**), in which the stationary state matched the FR-profile of the pair.

Interestingly, when a probe with larger selectivity than the reference was presented, the neuron responded in the opposite way. A comparison between individual colored traces in **Figures 8E,F** shows that, due to its large selectivity, a bump in the FR-profile occurred almost immediately after the probe's onset in the attention-away scenario, and that its magnitude increased non-linearly with the delay t′ between the onset of the reference and the probe (see bumps in **Figure 8E**). In the stationary state the sole effect of selectivity led the cell's FR-profile to match the response evoked by the pair.

In contrast, when the attentional mechanism was turned on while presenting the probe, the reduction in firing took the form of a deep and short trough characterizing the transitory response, with a duration of around 20 ms, similar for all t′, and a depth near 20% of the maximum FR, except for t′ = 50 ms (close to 30%).
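Trough duration and depth of this kind can be read off a simulated trace with a simple measurement routine. The sketch below is one possible operationalization, not the procedure used in the study: the function name `trough_metrics`, the choice of the rate at t′ as baseline, and the normalization of depth by the maximum of the whole trace are all assumptions.

```python
def trough_metrics(times_ms, fr, onset_ms):
    """Return (relative_depth, duration_ms) of the first trough after onset.

    Baseline is the rate at the onset sample; the trough lasts while the
    rate stays below that baseline. Depth is expressed as a fraction of
    the maximum rate in the trace.
    """
    i0 = next(k for k, t in enumerate(times_ms) if t >= onset_ms)
    baseline = fr[i0]
    start = next((k for k in range(i0, len(fr)) if fr[k] < baseline), None)
    if start is None:
        return 0.0, 0.0  # the rate never dropped below baseline
    end = next((k for k in range(start, len(fr)) if fr[k] >= baseline), len(fr))
    depth = (baseline - min(fr[start:end])) / max(fr)
    duration = (end - start) * (times_ms[1] - times_ms[0])
    return depth, duration


# Synthetic trace: flat rate with a 20 ms dip to 80% starting after t' = 100 ms.
times = [float(k) for k in range(300)]
rate = [1.0] * 300
for k in range(101, 121):
    rate[k] = 0.8
depth, duration = trough_metrics(times, rate, onset_ms=100.0)
```

For the synthetic trace the routine reports a depth of about 0.2 and a duration of 20 ms, the order of magnitude quoted in the paragraph above.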

This period, which we called "latency," preceded a bump in the FR-profile whose peak FR was similar for most t′, and in general larger than the maxFR of the cell obtained when the pair was active, as shown in **Figure 8F**. Consistent with the case of the troughs, the peak of the bump for t′ = 50 ms was also slightly larger than for any other t′, suggesting that a short delay between the probe's onset and the activation of the attentional mechanism eases the processing of the stimulus of interest. Regarding the stationary response, we found that the FR became locked to the response obtained when the probe was presented alone, in contrast to the attention-away scenario, in which the FR was locked to the FR-profile of the pair (see **Figures 8E,F**). It is important to note that in all simulations we implemented the selection mechanism of attention proposed by the ST model, which is based on inhibition of non-relevant inputs. In an earlier work by Busse et al. (2008), shifting attention from a cue located outside or inside an MT cell's receptive field to a probe in the opposite region was preceded by a drop in the firing rate of the cell. The authors claimed that the "short-latency decrease of responses" was caused by an interruption of endogenous attention, due to focusing on a stimulus that delayed the expected response toward the target.

outside the still attended spot, the incoming stimulus barely altered the instantaneous firing of the cell, and only produced a negligible bump on the FR-profile at the exchange time. Condition (B) represents an identical setup to the one described in (A), but in this case attention was not engaged to any of the stimuli. The absence of attention forced the firing of the cell to get immediately locked to the incoming input. The traces in (C) were adapted from Luck et al. (1997). They show that adding a probe with high selectivity to the receptive field of the top cell, while attending the reference also in the same receptive field, barely affects the ongoing response evoked by the reference.

### DISCUSSION

By restricting our analysis to the case in which attention switches from the outside to the stimulus inside the receptive field (red trace in **Figure 9A**), similar to the experiment of Busse et al. (2008), our findings show a two-step process: first a drastic drop in the FR, and second, a steep recovery of firing that precedes a bump. This validates our observation that, when a cell is initially active due to a cue with a certain selectivity, attention leads to a brief interruption in the single cell's FR, represented by short and deep troughs in the FR-profile, regardless of the selectivity of a second stimulus, and then to a recovery of the FR following a time course whose shape (**Figure 9A**) is closely resembled by the model, as depicted by the red traces in **Figures 9B,C**. In our simulations the circuit in **Figure 1C** was initially exposed to a neutral reference (WE−ref = 0.7), and at t = t′ a probe with more or less selectivity was added to the cell's receptive field and attended. The model predicts a deeper trough for the preferred probe (WE−<sup>p</sup> = 0.8) (**Figure 9B**) than for a non-preferred probe (WE−<sup>p</sup> = 0.6) (**Figure 9C**), with both latencies having similar duration. However, additional experiments are required for a solid validation of this point. The study also suggests that the intention of switching attention generates a similar effect (black trace in **Figure 9A**), but because there is no optimal way to simulate the intention of switching attention in the model, we represented that condition by letting the reference remain during the whole simulation (see black traces in **Figures 9B,C**).

Attention is responsible for modulating the amount of input received by a neuron from the stimulus in its RF. In order to quantify the nature and magnitude of this modulatory effect, earlier studies (Pestilli et al., 2007) have reported a significant correlation between attention and the dynamics of the threshold and contrast sensitivity of single neurons, supporting some of their claims with the results of computational studies such as the biased competition model (Reynolds et al., 1999) and the multiplicative response gain model, which endow attention with an enhancement role in single-neuron activity (McAdams and Maunsell, 1999; Williford and Maunsell, 2006). In a theoretical study, Ladenbauer et al. (2014) presented a description of the effects of adaptation mechanisms on the single cell's firing rate, highlighting a major influence on the gain of firing and threshold modulation, which agrees with the idea that external inhibitory synaptic inputs are relevant modulators of the input-output curve of single neurons.

A second intriguing element concerns the generation of transients (bumps and troughs) in the firing rate of single cells (Martinez-Trujillo and Treue, 2004; Fallah et al., 2007; Busse et al., 2008) when a rapid stimulus switch takes place during attentional tasks, and the proposal that this particular response is due to suppression of irrelevant stimuli, as previously posed by Lennert and Martinez-Trujillo (2011). In an earlier paper, Tsotsos (1990) first predicted such behavior, suggesting that inhibition of distractors allows the target neuron to restore its firing rate to the level evoked by the attended stimulus in isolation.

FIGURE 8 | Presenting and attending a probe modulates the cell's selectivity effect. The reference was presented at t = 0 and the probe presented and attended as indicated for each color trace at t′ = 50, 100, 200, 400, 600 ms (see labels). (A,D) Show the firing rate for the stimuli presented in isolation (reference, blue; probe, red), and simultaneously (yellow trace). The three curves are also shown as dashed lines in (B,C,E,F). When the probe was added to the reference (both located inside the cell's receptive field), it increased or reduced the cell's firing according to the cell's selectivity for the probe. In the attention-away scenarios (B,E) the sole effect of selectivity, characterized by transients and baseline shifts, was observed. When attention was engaged to the probe at t′, as shown in (C,F), it induced large transients with sharp changes of concavity, whose magnitude significantly depended on the cell's selectivity to the probe, relative to the response in the attention-away scenarios. In addition, the magnitude of maxFR in the rebounding conditions was on average 30% larger, with slower decay times and tails shifting toward the FR-profile of the probe alone for more preferred probes, and toward the curve of the reference alone for probes with less selectivity, while in the attention-away scenario the stationary response converged toward the profile evoked by both stimuli presented simultaneously.

In this study we presented a revisited version of the ST neuron model, and characterized the effect on the firing rate of incorporating adaptation currents (Ih) into its dynamic equation, quantifying the neuron's response when submitted to various simulated experiments. We also strengthen the results of Rothenstein and Tsotsos (2014) describing the capabilities of the ST-neuron in reproducing experimental FR-profiles observed in simple attentional tasks, by separating the effects related to the cell's selectivity when Ih currents were active, from those related to attention. To our knowledge, this is the first time that adaptation current mechanisms are combined with an inhibition based model of the top-down attentive signal, to study the response of neurons in the visual cortex during attentive states.

With regard to the ST-neuron characterization, we found that, in the absence of further mechanisms, the time course of the firing rate was driven by the balance between the constant σ of the Naka-Rushton term and the characteristic decay time of the inhibitory inputs. In turn, the modulation provided by Ih (depicted in **Figures 1A**, **2F**) determined the existence of two regions in the FR-profile: the first quantifying the variability of the initial FR activation, and the second the post-saturation effect. Using a circuit similar to the one originally proposed by Reynolds, we simulated the activation of V4 neurons, showing that selectivity creates a strong differentiation between patterns of response (FR-profiles), each possessing a unique maxFR (peak FR) and a stationary rate correlated with the relevance of the input for the neuron. Importantly, the obtained FR-profile could be linked to different features of the stimulus or even to the whole stimulus (as in the case of V4 neurons), and is not represented only by variations in the contrast or firing threshold.

The biologically plausible ST-neuron proved successful at reproducing different experimental scenarios by modulating only the relation between the input weights representing each stimulus. Our simulation of the Fallah et al. (2007) experiment highlights the modulatory effect of Ih in reshaping the FR when responding to stimuli with significant differences in selectivity in the absence of attention. Although the model predicts changes in the transitory state of the FR, further experiments are required to verify the prediction.

The Ih dynamics also proved relevant in more complex scenarios that included activation of the attentional signal. As described in Results, we showed that by incorporating the selection mechanism of attention proposed by Tsotsos (1990), the FR-profile resembled the response of real V4 neurons, and that, using Reynolds' design (**Figure 1C**), no enhancement is necessary to account for the time course of the firing rate when stimuli with different levels of neural selectivity are presented in isolation or simultaneously, as seen in **Figure 7A**. Furthermore, our simulations show that by including the activation of the attentional mechanism, the FR was able to differentially represent possible conditions for the onset of attention, or its shift, in a non-redundant way for different experimental designs, regardless of how similar the stimuli are. In this scenario we show that the interplay between selectivity and attention (**Figures 6A,B**) is crucial in defining the dynamics of the FR when two stimuli suddenly switch with each other, affecting both the transitory and the stationary phases of the FR-profile. We predict the existence of a dual role played by attention, in which it can enhance or compete against selectivity during the transitory stage, and do the opposite during the stationary stage, depending on how preferred each stimulus is for the neuron. The plausibility of our results is strongly backed up by the significant resemblance obtained by simulating the Luck et al. (1997) and Busse et al. (2008) experiments, in which the change of selectivity together with the deployment of attention in the first (**Figure 7C**), and the shift of the focus of attention in the second (**Figure 9**), are well accounted for by the significant changes in both phases of the FR-profile. Overall, the behavior of the ST-model reflects the context-based competitive or enhancing effect of the cross-talk between attention and selectivity.

Our results agree with the claim posed by the ST-model (Tsotsos, 1990; Tsotsos and Rothenstein, 2011) that suppressing irrelevant activity in the surround of the attentional focus forces the cell to adapt its firing and match the rate evoked by the attended stimulus in isolation: when a stimulus was attended, the FR-profile of the neuron in all simulations depended on its selectivity to that stimulus, regardless of stimulus context. As a result, the response produced by all stimuli within the receptive field was larger in the unattended scenario than when one of them was attended, due to the presence of distractors with high selectivity in the surround.

Since a significant amount of the information was encoded by the transient (latency), we hypothesize that this period, with an average duration in the range of 20–30 ms, during which the firing rate suddenly drops and rises, could be required for the cells to reaccommodate to the confluent and ongoing bottom-up effect of selectivity and the top-down signal of attention. However, future work will require experiments in single cells and populations to test the functioning principles of the latency periods and to characterize their time courses. Based on our hypotheses, it will also be necessary to check whether the interplay between attention and selectivity is enough to fully disambiguate stimuli with complex combinations of features within a single visual scene.


#### AUTHOR CONTRIBUTIONS

This research work was carried out in collaboration between all authors. OA and JT defined the research theme. OA and JT designed methods and simulations, OA analyzed the data, OA and JT interpreted the results and wrote the paper. OA and JT discussed analyses, interpretation, and data presentation. All authors have contributed to, seen and approved the manuscript.

#### FUNDING

This research was performed in the frame of the STAR project funded by the Air Force Office of Scientific Research; Grant no. FA9550-14-1-0393.

#### ACKNOWLEDGMENTS

We want to express our gratitude to Professor Julio-Cesar Martinez-Trujillo and his research group at Western University in London, Canada, for valuable discussions.


**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2018 Avella Gonzalez and Tsotsos. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# A Neurodynamic Model of Feature-Based Spatial Selection

#### Mateja Marić and Dražen Domijan\*

Department of Psychology, Faculty of Humanities and Social Sciences, University of Rijeka, Rijeka, Croatia

Huang and Pashler (2007) suggested that feature-based attention creates a special form of spatial representation, which is termed a Boolean map. It partitions the visual scene into two distinct and complementary regions: selected and not selected. Here, we developed a model of a recurrent competitive network that is capable of state-dependent computation. It selects multiple winning locations based on a joint top-down cue. We augmented a model of the WTA circuit that is based on linear-threshold units with two computational elements: dendritic non-linearity that acts on the excitatory units and activity-dependent modulation of synaptic transmission between excitatory and inhibitory units. Computer simulations showed that the proposed model could create a Boolean map in response to a featured cue and elaborate it using the logical operations of intersection and union. In addition, it was shown that in the absence of top-down guidance, the model is sensitive to bottom-up cues such as saliency and abrupt visual onset.

#### Edited by:

Hedva Spitzer, Tel Aviv University, Israel

#### Reviewed by:

Xavier Otazu, Universitat Autònoma de Barcelona, Spain Marius Usher, Tel Aviv University, Israel David Golomb, Ben-Gurion University of the Negev, Israel

> \*Correspondence: Dražen Domijan mmaric2@ffos.hr; ddomijan@ffri.hr

#### Specialty section:

This article was submitted to Perception Science, a section of the journal Frontiers in Psychology

Received: 17 May 2017 Accepted: 13 March 2018 Published: 28 March 2018

#### Citation:

Marić M and Domijan D (2018) A Neurodynamic Model of Feature-Based Spatial Selection. Front. Psychol. 9:417. doi: 10.3389/fpsyg.2018.00417

Keywords: boolean map, feature-based attention, lateral inhibition, neural network, winner-take-all

## INTRODUCTION

In the literature on visual attention, significant progress has been made in characterizing the principles of selection. Visual attention can be allocated flexibly to a circumscribed region of space, the whole object or feature dimensions such as color and orientation (Nobre and Kastner, 2014). Indeed, early work suggested that a restricted circular region of space is a representational format of attentional selection. Posner (1980) proposed that attention operates like a spotlight that highlights a single circular region of space with a fixed radius. All locations that fall inside the spotlight are selected, and everything outside is left out. An extension of this proposal, which is called the zoom-lens model, suggests that the spotlight of attention can change its radius depending on the spatial resolution that one wants to achieve (Eriksen and St. James, 1986). If high resolution is required, the spotlight can be narrowed to capture details in the selected region, whereas the radius of the spotlight can be widened when a lower resolution is sufficient.

Other studies point to an object as a unit of selection. Duncan (1984) showed that it is easier to report two attributes if they appear on the same object, relative to the scenario in which each attribute appears on a different object. This finding implies that the object is selected as a whole and has been replicated many times using different stimuli and behavioral paradigms (Scholl, 2001). This effect cannot be explained by spatial attention because objects were spatially superimposed, that is, they shared the same locations. More recently, it was shown that attention can also be allocated to a visual feature such as color or direction of motion independent of spatial location (Saenz et al., 2002, 2003). Single-unit recordings have shown that feature-based attention is accompanied by the global location-independent modulation of neural response in a range of areas in the visual cortex. Attentional modulation was described as a multiplicative gain change that increases responses of neurons that are selective to attended feature values and decreases responses of neurons that are tuned to unattended feature values (Treue and Martinez-Trujillo, 1999; Martinez-Trujillo and Treue, 2004).

Object-based attention, however, is not necessarily detached from spatial representation. There is behavioral and neurophysiological evidence that object-based attention involves selection of all spatial locations that are occupied by the same object. Specifically, it was suggested that attention selects a grouped array of locations (O'Grady and Müller, 2000). In other words, attention spreads from one spatial location along the shape of the object and highlights all locations that belong to the object (Richard et al., 2008; Vatterott and Vecera, 2015). Neurophysiological studies showed that object-based selection is indeed achieved by the spreading of the enhanced firing rate along the shape of the object (Roelfsema, 2006; Roelfsema and de Lange, 2016).

In a similar way, feature-based attention might involve the selection of all locations that are occupied by the same feature value, as shown by Huang and Pashler (2007). They proposed that attention is limited because it may access only one feature value (e.g., red) per dimension (e.g., color) at any given moment. However, the accessed feature value is bound to space in parallel, without capacity limits. Feature-based attention is allocated in space via the formation of a binary or Boolean map. When a conscious decision is made to attend to a specific feature value, the Boolean map indicates all spatial locations that are occupied by the chosen feature value because they are labeled by a positive value (e.g., 1), while all other locations are labeled with zero. In each selection process, selected locations need not be contiguous in space, but they must share the same feature value. After a Boolean map is formed, it is possible to operate on its output by applying the set operations of intersection and union. Recent work suggests that a spatial representation, such as a Boolean map, might mediate perceptual grouping by similarity (Huang, 2015; Yu and Franconeri, 2015). Moreover, the idea has been recently applied successfully in the computer vision literature on developing algorithms for saliency detection (Zhang and Sclaroff, 2016; Qi et al., 2017).

**Figure 1** illustrates a Boolean map that is formed in response to three different stimulus configurations and sequential application of two top-down feature cues. **Figure 1A** shows a simple stimulus that consists of red and green squares. An observer might attempt to isolate only red or only green items. To do so, a top-down cue should be supplied to the feature map that encodes the desired feature value. For example, when attention is directed to the red color, the top-down cue highlights all locations that are occupied by red squares. The Boolean map picks up on this feature cue and forms a spatial representation in which cued locations are labeled with 1 (white) and non-cued locations are labeled with 0 (black). In terms of a neural network, these labels correspond to the active (excited) and inactive (inhibited) states of the corresponding nodes in the network (Boolean Map – 1). Later, the observer might wish to switch to green color (Boolean Map – 2). Again, in a response to a new feature cue, the Boolean map now shows all locations that are occupied by green squares.

**Figure 1B** shows a typical stimulus that is used in visual search experiments. It consists of red and green horizontal and vertical bars. The task is to find a red horizontal bar. This is an example of a conjunction search task in which two feature dimensions should be combined to find the target object. According to Huang and Pashler (2007), the conjunction task is solved in two steps. In the first step, a Boolean map is formed by top-down cueing of red items, irrespective of their orientations. In the second step, only horizontal items are cued. However, since red items have already been selected, the second Boolean map will correspond to the intersection of red and horizontal items. There is only one item that satisfies these selection criteria: the target. In this way, visual search is substantially faster compared to the strategy of sequentially visiting each item by moving the attentional spotlight across the visual field. It is also possible to reverse the order of the applied feature cues. In the first step, horizontal items might be cued, and the intersection is formed by highlighting red items in the second step. Importantly, there is behavioral evidence that observers indeed implement such a subset selection strategy in conjunction search tasks (Egeth et al., 1984; Kaptein et al., 1995). Moreover, Huang and Pashler (2012) showed that the same strategy is used in the perception of spatial structure in a stimulus that is composed of multiple items that differ in several dimensions.

**Figure 1C** illustrates an example of the union of two Boolean maps. As in the previous example, the observer starts by cueing red items and creating a Boolean map that consists of a representation of their locations. In the second step, the observer wishes to combine red with horizontal items. Therefore, in the second step, one should cue horizontal items but simultaneously maintain locations of the remaining items in memory. The resulting new Boolean map now represents the locations of all red and all horizontal items that were found in the image. Computing with Boolean maps might not be restricted to only two steps, as **Figure 1** suggests. It is possible to incorporate more feature dimensions, such as motion, texture, or size, that can also be engaged in creating Boolean maps that are more complicated.
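The set operations described above can be illustrated with a minimal sketch. It is not part of the model; the four-item display, the `ITEMS` list, and the helper names are hypothetical and serve only to show how intersection and union act on binary location maps.

```python
# Hypothetical display: one (color, orientation) item per location.
ITEMS = [
    ("red", "horizontal"),
    ("red", "vertical"),
    ("green", "horizontal"),
    ("green", "vertical"),
]


def boolean_map(items, dimension, value):
    """Label each location 1 if it carries the cued feature value, else 0."""
    index = {"color": 0, "orientation": 1}[dimension]
    return [1 if item[index] == value else 0 for item in items]


def intersection(map_a, map_b):
    """Keep only locations selected in both maps."""
    return [a & b for a, b in zip(map_a, map_b)]


def union(map_a, map_b):
    """Keep locations selected in either map."""
    return [a | b for a, b in zip(map_a, map_b)]


red_map = boolean_map(ITEMS, "color", "red")
horizontal_map = boolean_map(ITEMS, "orientation", "horizontal")
target_map = intersection(red_map, horizontal_map)   # conjunction search
combined_map = union(red_map, horizontal_map)        # all red or horizontal
```

In this toy display, the intersection isolates the single red horizontal item, whereas the union marks every location that is red or horizontal.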

Feature-based spatial selection, as illustrated by the Boolean map, provides a strong constraint on the computational models of visual attention because it requires simultaneous selection of arbitrarily many locations based on an arbitrary criterion that is set by the observer. Computational models of attention often rely on a winner-take-all (WTA) network to select a single, most salient location from the input image (Itti and Koch, 2000, 2001). The WTA network consists of an array of excitatory nodes that are connected reciprocally with inhibitory interneurons. This anatomical arrangement creates lateral inhibition among excitatory nodes that leads to the selection of a single node that receives maximal input and the suppression of all other nodes, which receive non-maximal input. However, when faced with an input where multiple (potentially many) nodes share the same maximal input level, the typical WTA network tends to suppress all winning nodes due to strong mutual inhibition among them instead of selecting them together. For example, Usher and Cohen (1999) showed that, under the conditions of strong recurrent excitation and weak lateral inhibition, the WTA network reaches a steady state with multiple active winners. Importantly, activation of the winning nodes decreases linearly toward zero as their quantity increases. In other words, this network design suffers from a capacity limitation. This is a useful property in modeling short-term memory and frontal lobe function (Haarmann and Usher, 2001) but it is inadequate for understanding how the Boolean map might arise in a large retinotopic map, as exemplified by **Figure 1**.

Another problem is that the dynamics of the WTA network are not sensitive to transient changes in the input amplitude. Due to strong self-excitation and the resulting persistent activity, the WTA network settles into one of its memory states (fixed points). Importantly, each memory state is independent of later inputs. If self-excitation is weakened, the network will become sensitive to input. However, at the same time, it will lose its ability to form a memory state and will behave like a feedforward network (Rutishauser and Douglas, 2009). One way to solve this problem is to apply an external reset signal to the network before a new input is processed (Grossberg, 1980; Kaski and Kohonen, 1994; Itti and Koch, 2000, 2001). However, this is not sufficient in the context of feature-based attention. An intersection or union operation between two Boolean maps requires that the currently active memory state (formed after the first feature cue) be updated by taking into account new input (the second feature cue). Therefore, the dynamics of the WTA network should allow uninterrupted transition between memory states that are governed by external inputs. In other words, the WTA network should be capable of state-dependent computation (Rutishauser and Douglas, 2009).

To summarize, a WTA network that is capable of computing with Boolean maps should simultaneously satisfy two computational constraints:

1. It should be able to select together all locations that share a common feature value. This should be achieved without degrading the representation of the winners.

2. It should exhibit state-dependent computation, in which new inputs are combined with the current memory state to produce a new resultant state (e.g., intersection or union).

Here, we have developed a new WTA network that satisfies these constraints and provides the neural implementation of the Boolean map theory of attention (Huang and Pashler, 2007).

## MODEL DESCRIPTION

The aim of the current work is to provide an explanation of how a Boolean map may be formed in a recurrent competitive network that can implement feature-based winner-take-all (F-WTA) selection. To this end, we have extended the previously proposed network model based on the linear-threshold units (Hahnloser, 1998; Hahnloser et al., 2003; Rutishauser and Douglas, 2009). Concretely, the model circuit is presented in **Figure 2**. It consists of a single inhibitory unit, which is reciprocally connected to a group of excitatory units. In addition to these basic elements, we introduce two processing components into the WTA circuit to expand its computational power. The first is a dendritic nonlinearity, which prevents excessive excitation that arises from self-recurrent and nearest-neighbor collaterals. We modeled the dendritic tree as a separate electrical compartment with its own non-linear output that is supplied to the node's body (Häusser and Mel, 2003; London and Häusser, 2005; Branco and Häusser, 2010; Mel, 2016). The second is modulation of synaptic transmission by retrograde inhibitory signaling (Tao and Poo, 2001; Alger, 2002; Zilberter et al., 2005; Regehr et al., 2009). This is a form of presynaptic inhibition, where postsynaptic cells release a neurotransmitter that binds to the receptors that are located on the presynaptic terminals. Retrograde signaling creates a feedback loop that dynamically regulates the amount of transmitter that is released from the presynaptic terminals. Here, we have hypothesized that such interactions occur in recurrent pathways from the excitatory nodes to the inhibitory interneuron and back from the interneuron to the excitatory nodes. In the excitatory-to-inhibitory pathway, retrograde signaling enables the inhibitory interneuron to compute the maximum instead of the sum of its inputs. 
Computation of the maximum arises from the limitation that the activity of the inhibitory interneuron cannot grow beyond the maximal input that it receives from the excitatory nodes. Furthermore, retrograde signaling in the inhibitory-to-excitatory pathway enables the excitatory nodes that receive maximal input to protect themselves from the common inhibition. In this way, the network can select all excitatory nodes with maximal input, irrespective of their quantity or arrangement in visual space.
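The maximum computation can be made concrete with a small sketch that integrates the interneuron equation, Equation (2), while holding the excitatory activities fixed. This is only an illustration under stated assumptions: freezing the excitatory activities, the forward-Euler scheme, the step size, and the example activity vectors are all choices made here. When only the n maximal nodes exceed the presynaptic threshold, setting the time derivative to zero gives y = nβ<sub>2</sub>(x<sub>max</sub> − T<sub>x</sub>)/(1 + nβ<sub>2</sub>), a value derived here rather than taken from the source, which stays below x<sub>max</sub> for any n.

```python
def interneuron_steady_state(x, beta2=10.0, t_x=0.1, tau_y=2.0,
                             dt=0.01, steps=5000):
    """Integrate Equation (2) with frozen excitatory activities x.

    Forward Euler is an assumption of this sketch; the step size is kept
    small so that the update remains stable for a handful of nodes.
    """
    y = 0.0
    for _ in range(steps):
        # Each presynaptic terminal transmits only if x_i exceeds y + T_x.
        drive = beta2 * sum(max(x_i - y - t_x, 0.0) for x_i in x)
        y += dt / tau_y * (-y + max(drive, 0.0))
    return y


# Hypothetical activity patterns.
y_single = interneuron_steady_state([0.2, 1.0, 0.5])       # one maximal node
y_many = interneuron_steady_state([1.0, 1.0, 1.0, 1.0])    # four maximal nodes
```

In both cases the interneuron settles slightly below the largest excitatory activity rather than near the sum of the activities, which is the sense in which it approximates a maximum.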

At first sight, it might appear strange to propose that an excitatory unit can inhibit its input by releasing a neurotransmitter that binds to the presynaptic terminal. However, several signaling molecules have been identified to support such interactions, including endogenous cannabinoids (Alger, 2002). Moreover, Zilberter (2000) found that glutamate is released from dendrites of pyramidal neurons in the rat neocortex and suppresses the inhibition that impinges on them. In addition, similar action has been found for GABA (Zilberter et al., 1999), which suggests that conventional neurotransmitters can engage in retrograde signaling.

To situate the proposed F-WTA circuit in a larger neural architecture that describes the cortical computations that underlie top-down attentional control, we have adopted the model that was proposed by Hamker (2004). He showed how attentional selection of a target arises from the recurrent interactions within a distributed network that consists of model cortical area V4, the inferotemporal cortex (IT), the posterior parietal cortex (PPC), and the frontal eye fields (FEF). **Figure 3** illustrates part of these interactions that are involved in feature-based attentional guidance. Top-down signals that provide feature cues originate in the IT, which contains a spatially invariant representation of relevant visual features. The IT sends feature-specific feedback projections to the V4, where topographically organized feature maps for each feature value are located. For simplicity, we consider only maps for two colors (red and green), and two orientations (vertical and horizontal). We do not explicitly model IT and V4 dynamics. Rather, they serve here as a tentative explanation of how input to the F-WTA network arises within the ventral visual pathway. Also, we omitted the contribution of the FEF and its spatial reentry signals to the V4 activity.

We hypothesize that the feature-based WTA network resides in the PPC, where it receives summed input over all feature maps from the V4. Top-down guidance is implemented by a temporary increase in activity in one of the V4 feature maps. For example, when the decision is made to attend to the red color, the IT representation of red color sends feedback signals to the Red Map in the V4. Top-down signals to the feature map are modeled as a multiplicative gain of neural activity, which is consistent with neurophysiological findings (Treue and Martinez-Trujillo, 1999; Martinez-Trujillo and Treue, 2004; Maunsell and Treue, 2006).

The following neural network equations represent the quantitative description of the model. Each unit is defined by its instantaneous firing rate (Dayan and Abbott, 2000). The time evolution of the activity of excitatory node x at position i in the recurrent map is given by the following differential equation:

$$\tau_{x}\frac{dx_{i}}{dt} + x_{i} = \left[I_{i}\left(t\right) + \alpha f\left(x_{i} + x_{i+1} + x_{i-1}\right) - \beta_{1} g\left(y - x_{i} - T_{y}\right)\right]^{+}.\tag{1}$$

The time evolution of the activity of inhibitory interneuron y is given by

$$
\tau_{y} \frac{dy}{dt} + y = \left[ \beta_{2} \sum_{i} g\left( x_{i} - y - T_{x} \right) \right]^{+}. \tag{2}
$$

Parameters τ<sub>x</sub> and τ<sub>y</sub> are integration time constants for excitatory and inhibitory nodes, respectively. We assume that inequality τ<sub>x</sub> > τ<sub>y</sub> holds, which accords with the observation in electrophysiological measurements that inhibitory cells exhibit faster dynamics than excitatory cells (McCormick et al., 1985). The second term on the left-hand side of Equations (1) and (2) describes the passive decay that drives the unit's activity to the resting state in the absence of external input. Firing rate activation function [u]<sup>+</sup> is a non-saturating rectification nonlinearity, which is defined by

$$\left[u\right]^{+} = \max\left(u, 0\right). \tag{3}$$

Following Hamker (2004), we assume that feedforward input I<sub>i</sub> at time t to the excitatory node x<sub>i</sub> in the F-WTA network is given by the sum over activity in all V4 feature maps I<sub>i</sub><sup>(m)</sup>,

$$I_{i}(t) = \sum_{m} I_{i}^{(m)} G^{(m)}(t). \tag{4}$$

In Equation (4), m denotes available feature maps with m ∈ {red, green} in the simulation that is reported in section Simulation of the Formation of a Single Boolean Map and m ∈ {red, green, horizontal, vertical} in the simulation that is reported in section Simulation of the Intersection and Union of Two Boolean Maps. Parameter G<sup>(m)</sup> refers to the feature-specific, global multiplicative gain that all units I<sub>i</sub><sup>(m)</sup> within the same feature map m receive via top-down projections. As shown in **Figure 2**, these projections arrive from the feature representation in the IT. Multiplicative gating is generally consistent with previous models that describe the effect of feature-based attention on the responses of neurons in the early visual cortex (Boynton, 2005, 2009). Equation (4) ensures that the F-WTA network is not particularly sensitive to any feature value. Rather, it signals the behavioral relevance of locations in a spatial map. Here, the relevance can be set according to differences in the bottom-up input I<sub>i</sub><sup>(m)</sup> that arise from competitive interactions in the early visual cortex. Alternatively, relevance can be signaled by the top-down feature cues G<sup>(m)</sup> that change the gain of all locations that are occupied by the same feature value.
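As a brief illustration of Equation (4), the following sketch sums gain-modulated feature maps into a single input vector. The function name, the four-location maps, and the gain values (2 for the cued map, 1 otherwise) are hypothetical and chosen only for illustration.

```python
def feedforward_input(feature_maps, gains):
    """Equation (4): sum over feature maps of bottom-up activity times gain.

    feature_maps: dict mapping a feature name to a list of activities I_i^(m)
    gains:        dict mapping the same feature names to gains G^(m)
    """
    n_locations = len(next(iter(feature_maps.values())))
    return [
        sum(gains[m] * feature_maps[m][i] for m in feature_maps)
        for i in range(n_locations)
    ]


# Hypothetical display with alternating red and green items.
maps = {"red": [1.0, 0.0, 1.0, 0.0], "green": [0.0, 1.0, 0.0, 1.0]}

# Top-down cue for red: its gain is raised, green is left at baseline.
cued_input = feedforward_input(maps, {"red": 2.0, "green": 1.0})
```

With these values, the locations that contain red items receive twice the input of the green locations, although the F-WTA network itself is given no information about which feature produced the difference.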

Dendritic output f(u) is described by the sigmoid response function

$$f\left(u\right) = \frac{S_{d}}{1 + e^{-\lambda\left(u - T_{d}\right)}}\tag{5}$$

where λ and T<sub>d</sub> control the shape of the sigmoid function and S<sub>d</sub> is its upper asymptotic value. We set λ to a high value to achieve a steep rise of the dendritic activity immediately after its input crosses the dendritic threshold, which is denoted as T<sub>d</sub>. Such strong non-linearity is justified by experimental data, which show all-or-none behavior in real dendrites (Wei et al., 2001; Polsky et al., 2004). In Equation (1), parameter α controls the strength of the impact that the dendritic compartment exerts on the soma.
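To show how steep the dendritic response in Equation (5) becomes for a large λ, the sketch below evaluates it around the threshold using the parameter values reported later in this section (S<sub>d</sub> = 1, λ = 100, T<sub>d</sub> = 0.1). The function name and the overflow guard are additions made for this illustration.

```python
import math


def dendritic_output(u, s_d=1.0, lam=100.0, t_d=0.1):
    """Equation (5): sigmoid dendritic response with threshold t_d."""
    exponent = -lam * (u - t_d)
    if exponent > 700.0:
        # exp() would overflow for inputs far below threshold; output is ~0.
        return 0.0
    return s_d / (1.0 + math.exp(exponent))


below = dendritic_output(0.0)   # well below threshold: close to 0
at = dendritic_output(0.1)      # at threshold: half of s_d
above = dendritic_output(0.2)   # slightly above threshold: close to s_d
```

An input change of only 0.2 around the threshold therefore moves the dendritic output across almost its full range, which approximates the all-or-none behavior mentioned above.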

Self-recurrent x<sub>i</sub> and nearest-neighbor collaterals x<sub>i−1</sub> and x<sub>i+1</sub> arrive on the dendrite of the excitatory node, which is consistent with the anatomical observation that most recurrent excitatory connections are made on the dendrites of the excitatory cells (Spruston, 2008). Nodes at the edge of the network receive excitation only from a single available neighbor. That is, node x<sub>1</sub> receives excitation only from x<sub>2</sub>, and x<sub>N</sub> receives excitation only from x<sub>N−1</sub>. Nearest-neighbor excitatory interactions enable feature cues to spread activity enhancement automatically to all connected locations that contain a given feature value. This is not essential for the simulation of Boolean maps but we included it in our model because recurrent connections among nearby neurons are a prominent feature of the synaptic organization of the cortex (Douglas and Martin, 2004). Also, we wanted to show that the proposed model is capable of simulating object-based attention (Roelfsema, 2006; Roelfsema and de Lange, 2016). Moreover, Wannig et al. (2011) found direct evidence for activity spreading among neurons that encode the same feature value in the primary visual cortex.

The output of the presynaptic interactions g(u) is defined by the rectification non-linearity of the form

$$g(u) = [u]^{+} = \max\left(u, 0\right). \tag{6}$$

In Equation (1), the term −g(y − x<sub>i</sub> − T<sub>y</sub>) describes the output of the presynaptic terminal that delivers inhibition from interneuron y to excitatory node x<sub>i</sub> (**Figure 4A**). However, we did not explicitly model the dynamics of retrograde signaling. We assumed that the release of the retrograde transmitter occurs simultaneously with the activation of the postsynaptic node and that it is proportional to its firing rate. Therefore, it is represented by the term −x<sub>i</sub>.

Function g(u) ensures that the presynaptic terminal will release the inhibitory transmitter only when the electrical signal from node y exceeds the inhibitory retrograde signal −x<sub>i</sub> and the threshold for presynaptic activation, which is denoted as T<sub>y</sub>. In other words, node x<sub>i</sub> will be inhibited only if y > x<sub>i</sub> + T<sub>y</sub>. If this is not the case, node x<sub>i</sub> will effectively isolate itself from the inhibitory influence of node y. This is always the case for the winning node because x(t) > y(t) for t > 0. Moreover, this result extends to all other nodes whose input magnitude is sufficiently close to the maximal input. The strength of the inhibition is determined by parameter β<sub>1</sub>. In a similar vein, in Equation (2), the term g(x<sub>i</sub> − y − T<sub>x</sub>) describes the action of the retrograde signal that is released from inhibitory interneuron y on the presynaptic terminal that delivers excitation from node x<sub>i</sub> (**Figure 4B**). Here, parameter T<sub>x</sub> describes the threshold for the activation of the presynaptic terminal of the excitatory node and β<sub>2</sub> determines the strength of the excitation.
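The gating condition can be checked numerically with a minimal sketch of the inhibitory term β<sub>1</sub>g(y − x<sub>i</sub> − T<sub>y</sub>) from Equation (1). The function name and the example activity levels are hypothetical; β<sub>1</sub> = 1 and T<sub>y</sub> = 0.1 follow the parameter values reported in this section.

```python
def received_inhibition(y, x_i, beta1=1.0, t_y=0.1):
    """Inhibition delivered to node x_i: beta1 * g(y - x_i - T_y)."""
    return beta1 * max(y - x_i - t_y, 0.0)


y = 2.0  # hypothetical interneuron activity

winner = received_inhibition(y, 2.5)      # x_i above y: terminal is silent
near_max = received_inhibition(y, 1.95)   # within T_y of y: still protected
loser = received_inhibition(y, 1.0)       # far below y: inhibited
```

Only the third node, whose activity lies more than T<sub>y</sub> below the interneuron activity, receives inhibition; the first two nodes are shielded by their own retrograde signal.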

We have proposed a model of a one-dimensional network, although it attempts to simulate phenomena that occur in 2-D, as illustrated by **Figure 1**. We have chosen to work with the 1-D version of the network simply because we want to focus on the analysis of its temporal dynamics and its ability to combine information over time. Without loss of generality, the computer simulations that are reported in section Computer Simulations should be considered as a cross-section of a 2-D network.

For simplicity, the thresholds that control the activation of the excitatory and inhibitory nodes are all set to zero and are omitted from the model description. Parameters were set as follows: τ<sub>x</sub> = 5; τ<sub>y</sub> = 2; α = 1; β<sub>1</sub> = 1; β<sub>2</sub> = 10; S<sub>d</sub> = 1; λ = 100; T<sub>d</sub> = 0.1; T<sub>x</sub> = 0.1; and T<sub>y</sub> = 0.1. Parameters were chosen in a way to simultaneously achieve intersection and union. Systematic variations of the parameters α, β<sub>1</sub> and β<sub>2</sub> showed that intersection is observed when 1 ≤ (α, β<sub>1</sub>) ≤ 5. In contrast, union is observed when 0.8 ≤ (α, β<sub>1</sub>) ≤ 1. Parameter β<sub>2</sub> can be set to any value above the default without changing the results.
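The following is a minimal sketch of how Equations (1) and (2) might be integrated numerically with the parameter values listed above. It is not the authors' simulation code: the forward-Euler scheme, the step size, the zero initial state, the five-node input pattern, and all function names are assumptions made for illustration. Missing neighbors at the network edges are treated as contributing zero, in line with the description of the edge nodes.

```python
import math


def rectify(u):
    """Equations (3) and (6): [u]^+ = max(u, 0)."""
    return max(u, 0.0)


def dendrite(u, s_d=1.0, lam=100.0, t_d=0.1):
    """Equation (5), with a guard against exp() overflow."""
    exponent = -lam * (u - t_d)
    return 0.0 if exponent > 700.0 else s_d / (1.0 + math.exp(exponent))


def simulate_fwta(inputs, steps=10000, dt=0.01, tau_x=5.0, tau_y=2.0,
                  alpha=1.0, beta1=1.0, beta2=10.0, t_x=0.1, t_y=0.1):
    """Forward-Euler integration of Equations (1) and (2), constant input."""
    n = len(inputs)
    x = [0.0] * n   # excitatory nodes
    y = 0.0         # inhibitory interneuron
    for _ in range(steps):
        new_x = []
        for i in range(n):
            left = x[i - 1] if i > 0 else 0.0
            right = x[i + 1] if i < n - 1 else 0.0
            drive = (inputs[i]
                     + alpha * dendrite(x[i] + left + right)
                     - beta1 * rectify(y - x[i] - t_y))
            new_x.append(x[i] + dt / tau_x * (-x[i] + rectify(drive)))
        excitation = beta2 * sum(rectify(x_i - y - t_x) for x_i in x)
        y = y + dt / tau_y * (-y + rectify(excitation))
        x = new_x
    return x, y


# Hypothetical input: baseline activity of 1, with cued locations doubled.
final_x, final_y = simulate_fwta([1.0, 2.0, 1.0, 2.0, 1.0])
```

Under these assumptions, the two cued nodes settle near I<sub>i</sub> + αS<sub>d</sub> = 3, while the remaining nodes decay toward zero, so that both non-adjacent winners remain active together. The small step size is chosen because the interneuron equation becomes stiff when β<sub>2</sub> multiplies the summed input from several nodes.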

## MODEL EXTENSIONS

The network that is defined by Equations (1) and (2) is chosen in a way that achieves the desired behavior with the minimal number of computational elements. This simplicity heuristic is important for understanding model properties without adding extra neuroscientific complexity (Ashby and Hélie, 2011). However, at the same time, this approach sacrifices anatomical and biophysical plausibility of the proposed model. In this section, we present several extensions and generalizations of the basic model that bring it closer to satisfying the neurobiological constraints.

#### Inhibitory Pool

The model has just one inhibitory interneuron for computational convenience, which is not realistic. It is known that excitatory neurons outnumber inhibitory neurons by a factor of four in the cortex (Braitenberg and Schüz, 1991). However, it is possible to design an F-WTA network with a pool of inhibitory interneurons and the appropriate ratio between excitatory and inhibitory nodes that achieves the same behavior as the original model. An extended F-WTA network is presented in **Figure 5A**. Here, each inhibitory interneuron receives input from a subset of the excitatory nodes. We depicted each excitatory subset as a vertical arrangement of four nodes that do not overlap in their projections to the inhibitory pool. Therefore, each excitatory node projects to just one inhibitory node. Naturally, this does not need to be the case. It is possible that each excitatory node projects to more than one node without compromising the network output. Importantly, all inhibitory interneurons are mutually connected. In addition, each inhibitory interneuron projects its output to all excitatory nodes (denoted by thick blue arrow). As in the original model, we assume that all inhibitory and excitatory nodes are endowed with the capability of retrograde signaling on their synaptic contacts.

Within the pool of inhibitory nodes, retrograde signaling enables computation of the MAX function, as in the original model. To see this, consider the inhibitory node that receives maximal input. Due to the retrograde signaling, it will reach a steady state that corresponds to the computation of the MAX function over input from its excitatory subset. Moreover, it will not receive inhibition from the other members of the pool. All other inhibitory nodes, which receive less excitatory support, will be silenced because their retrograde signaling is not sufficiently strong to prevent lateral inhibition from the winning node. However, if there are multiple inhibitory nodes with the same level of activity, they will remain active together. Finally, the winning nodes send inhibition to all excitatory subsets. Since excitatory nodes also engage in retrograde signaling, the nodes that receive maximal input will block inhibition and remain active. Therefore, the network output will look much like that of the original model because the MAX computation on the inhibitory nodes makes the number of simultaneously active inhibitory nodes irrelevant.

#### Localized Inhibition

An important shortcoming of the previous model is that it assumes that inhibitory projections extend across the whole network of excitatory units. This is clearly not the case in real neural networks, where the spatial spread of inhibition is limited. To account for this property, we have constructed a more elaborate version of the basic model, which is shown in **Figure 5B**. It contains a new pool z<sub>j</sub> of excitatory nodes with long-range projections. The z<sub>j</sub> nodes receive input from the subset of the x<sub>i</sub> nodes. Additionally, each z<sub>j</sub> node sends its projection to at least one y<sub>j</sub> node from the pool of inhibitory nodes. The number of z nodes must equal the number of inhibitory nodes y<sub>j</sub> so that they can be indexed by the same subscript j. Again, we assume that the z<sub>j</sub> nodes are equipped with the ability of retrograde signaling on their synapses. Therefore, they also compute the MAX function over all their inputs, including feedforward input from the corresponding subset of x<sub>i</sub> nodes and recurrent input from other z<sub>j</sub> nodes. In this design, the maximum level of activity that is sensed by the x<sub>i</sub> nodes in one part of the network is easily propagated via z<sub>j</sub> nodes to all other parts of the network. Furthermore, z<sub>j</sub> nodes transfer this activity to inhibitory nodes. Therefore, each inhibitory node will eventually receive the maximal level of activity and apply it to the subset of x<sub>i</sub> nodes to which it is connected. In this design, it is not necessary for inhibitory nodes to interact with one another. The excitatory nodes x<sub>i</sub> that receive maximal input will block inhibition by their retrograde signaling and remain active in the same manner as described in the previous section. In this way, the proposed circuit achieves the same result as the original model.
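The propagation of the locally computed maximum can be sketched in a few lines of Python. This is only a steady-state abstraction, not the network dynamics: it assumes, purely for illustration, that each z<sub>j</sub> node exchanges activity with its immediate neighbors in the z pool, and the function name `propagate_max` is hypothetical.

```python
def propagate_max(local_max, n_rounds):
    """Repeatedly let each z_j take the MAX over its feedforward input
    (the maximum of its own x_i subset) and the activity of neighboring z nodes.

    Assumption: nearest-neighbor recurrent connectivity among the z nodes.
    """
    z = list(local_max)
    for _ in range(n_rounds):
        z = [
            max(z[max(j - 1, 0):j + 2] + [local_max[j]])
            for j in range(len(z))
        ]
    return z


# Hypothetical local maxima sensed by five excitatory subsets.
local_max = [0.3, 0.2, 1.0, 0.4, 0.1]
print(propagate_max(local_max, 1))
print(propagate_max(local_max, 4))
```

After enough rounds every z<sub>j</sub> holds the global maximum, which is the value that each inhibitory node would then apply to its own excitatory subset.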

FIGURE 5 | (A) … inhibitory nodes, which are denoted as y<sub>j</sub>. Each y<sub>j</sub> receives input from a subset of excitatory nodes. Inhibitory nodes compete with one another and the winning node encodes the maximum of its input. It delivers inhibition to all excitatory nodes in the same way as the single inhibitory node y in the basic circuit. (B) Circuit with an additional set of excitatory nodes z<sub>j</sub> with long-range horizontal projections. These nodes propagate the locally computed maximum level of activity to all parts of the network. Therefore, the whole set of z<sub>j</sub> converges to a global maximum. Furthermore, they contact inhibitory nodes y<sub>j</sub> that deliver inhibition to a subset of excitatory nodes x<sub>i</sub>.

#### Output Functions

The model employs threshold-linear output functions for the soma and the logistic sigmoid function for dendrites. This is inconsistent with the observation that somatic output also saturates and is also often modeled by the sigmoid function. However, in normal circumstances, neurons operate in a linear mode that is far from their saturation level (Rutishauser and Douglas, 2009). To provide a more systematic approach to the output functions that are used in the model, we introduce a piecewise-linear approximation to the sigmoid function s<sub>q</sub>(u) of the form

$$s_q(u) = \begin{cases} 0 & \text{if } u \le 0 \\ u & \text{if } 0 < u < S_q \\ S_q & \text{if } u \ge S_q \end{cases} \tag{7}$$

where S<sub>q</sub> denotes the upper saturation point, which can be set differently for different computational units q ∈ {c, d, p}, which correspond to the somatic, dendritic, and presynaptic terminal outputs, respectively. With the output function s<sub>q</sub>(u) applied to all computational elements of a single node, the model equations, namely, Equations (1) and (2), can be restated as

$$\tau_x \frac{dx_i}{dt} + x_i = s_c\left[ I_i(t) + \alpha s_d\left(x_i + x_{i+1} + x_{i-1} - T_d\right) - \beta_1 s_p\left(y - x_i - T_y\right) \right] \tag{8}$$

and

$$\tau_y \frac{dy}{dt} + y = s_c\left[ \beta_2 \sum_i s_p\left(x_i - y - T_x\right) \right]. \tag{9}$$

An important constraint of the model that is defined by Equations (8) and (9) is that the saturation point for the dendritic output, S<sub>d</sub>, should be chosen to be smaller than S<sub>c</sub>, which is the saturation point of the somatic output. In this way, feedforward input I<sub>i</sub> can be combined with the dendritic output without causing saturation at the output of the node. In contrast, if dendrites are allowed to saturate at the same activity level as the node, the dendritic output will overshadow the feedforward input. Consequently, the network will lose its sensitivity to the input changes. This is undesirable with respect to the requirements that are imposed by the sequential formation of the multiple Boolean maps. Therefore, the choice between the linear or the sigmoid output function for the node is not important if the dendritic output is restricted to a smaller interval relative to the output of the node itself.
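The role of the constraint S<sub>d</sub> < S<sub>c</sub> can be illustrated with a small Python sketch of the piecewise-linear output function in Equation (7). The inhibitory term is left out for brevity, and the helper names and the numerical values of S<sub>c</sub> are assumptions that do not come from the model description.

```python
def s(u, S):
    """Piecewise-linear output function of Equation (7) with saturation point S."""
    return min(max(u, 0.0), S)


def somatic_output(I, dendritic_drive, S_c, S_d, alpha=1.0):
    """Somatic output s_c[I + alpha * s_d(drive)] without the inhibitory term."""
    return s(I + alpha * s(dendritic_drive, S_d), S_c)


# Dendrite fully driven (hypothetical drive of 2.0), two different inputs.
for S_c in (3.0, 1.0):
    weak = somatic_output(0.5, 2.0, S_c=S_c, S_d=1.0)
    strong = somatic_output(1.0, 2.0, S_c=S_c, S_d=1.0)
    print(S_c, weak, strong)
```

With S<sub>c</sub> = 3 the two inputs remain distinguishable at the output, whereas with S<sub>c</sub> = S<sub>d</sub> = 1 both outputs saturate at the same value and the difference between the inputs is lost.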

## LINEAR STABILITY ANALYSIS

#### Fixed Points

The fixed point is found iteratively, starting from the set of nodes that receive maximal input, x<sub>M</sub>. We assume that the winning nodes and the inhibitory interneuron are activated above their thresholds, so we set [u]<sup>+</sup> = u. Next, we observe that the winning nodes do not receive inhibition from the interneuron y since x<sub>M</sub>(t) > y(t) for t > 0. This holds because the activity of the inhibitory node is bounded above by x<sub>M</sub> − T<sub>x</sub>, where T<sub>x</sub> is a positive constant. Then, retrograde signaling ensures that g(y − x<sub>M</sub> − T<sub>y</sub>) = 0 for all times t. Consequently, nodes receiving maximal input are driven solely by excitatory terms. Since the recurrent excitation is bounded above by its asymptotic value S<sub>d</sub>, the dendritic output function f(u) in Equation (1) is replaced with S<sub>d</sub>. This yields the following approximation to the steady state of the winning nodes:

$$
x_M \approx I_M + \alpha S_d. \tag{10}
$$

After the x<sub>M</sub> nodes have settled, inhibitory interneuron y also reaches its steady state because its activity is driven primarily by the input from x<sub>M</sub>. As the activity of y grows, the terms g(x<sub>i</sub> − y − T<sub>x</sub>) in Equation (2) vanish for all nodes x<sub>i</sub> that do not receive maximal input, that is, for i ∉ M. In contrast, the presynaptic terminals of x<sub>M</sub> are above the threshold for their activation just before y reaches equilibrium, that is, g(x<sub>M</sub> − y − T<sub>x</sub>) > 0. Therefore, the output function of the presynaptic terminal g(u) can be replaced by u. Then, Equation (2) is solved as

$$y = \frac{\beta_2 k \left(x_M - T_x\right)}{\beta_2 k + 1}, \tag{11}$$

where k is the number of x<sub>M</sub> nodes. When β<sub>2</sub> is chosen to be sufficiently large, and/or there are many nodes with maximal input x<sub>M</sub>, then

$$
y \to x_M - T_x. \tag{12}
$$

Continuity of the function defined by Equation (2) implies that y cannot grow above x<sub>M</sub> − T<sub>x</sub>, that is, y(t) > x<sub>M</sub>(t) − T<sub>x</sub> cannot hold at any time t unless y(t<sub>0</sub>) = x<sub>M</sub>(t<sub>0</sub>) − T<sub>x</sub> at some earlier time t<sub>0</sub> < t. However, the equality y(t<sub>0</sub>) = x<sub>M</sub>(t<sub>0</sub>) − T<sub>x</sub> implies that dy/dt ≤ 0 at time t<sub>0</sub> because g(x<sub>M</sub>(t<sub>0</sub>) − y(t<sub>0</sub>) − T<sub>x</sub>) = 0, which leaves only the passive decay term in Equation (2). In other words, node y loses all its excitatory drive when it reaches x<sub>M</sub> − T<sub>x</sub>. This is true irrespective of the number k of x<sub>M</sub> nodes. Thus, node y computes the maximum over its input.
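The approach of the steady state in Equation (11) to the bound in Equation (12) can be checked numerically. The following Python sketch uses the default values β<sub>2</sub> = 10 and T<sub>x</sub> = 0.1 from the model description; the function name `y_steady` and the chosen winner activity x<sub>M</sub> = 2 are illustrative assumptions.

```python
def y_steady(x_M, k, beta_2=10.0, T_x=0.1):
    """Steady state of the inhibitory interneuron according to Equation (11)."""
    return beta_2 * k * (x_M - T_x) / (beta_2 * k + 1.0)


x_M = 2.0          # hypothetical steady-state activity of the winning nodes
bound = x_M - 0.1  # limit given by Equation (12)
for k in (1, 10, 100):
    print(k, y_steady(x_M, k), bound)
```

The steady state stays below x<sub>M</sub> − T<sub>x</sub> for every k and moves closer to it as the number of winning nodes (or β<sub>2</sub>) increases.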

The x<sub>M</sub> nodes, together with the inhibitory node, create a quenching threshold (QT) for the network, which is defined by

$$QT = y - T_y = x_M - T_x - T_y. \tag{13}$$

Grossberg (1973) introduced the concept of the quenching threshold to describe the property of contrast enhancement in recurrent competitive networks. Nodes whose activity is above QT are enhanced and stored in the memory state, while all nodes whose activity is below QT are suppressed and removed from the memory representation. In the same manner, the remaining excitatory nodes converge to one of two states, depending on whether they exceed QT or not:

$$x_{i \notin M} \approx \begin{cases} I_i + \alpha S_d & \text{if } x_i \ge QT \\ 0 & \text{if } x_i < QT. \end{cases} \tag{14}$$

QT and its relationship with the activity of the winning and non-winning nodes and the inhibitory interneuron is illustrated in **Figure 6**. According to Equations (10), (11), and (14), the fixed point linearly combines input and recurrent excitation. As maximal input increases or decreases, the fixed point will move up or down and track these changes. Moreover, the input may cease, and the winning nodes will settle into the activity level that is provided by the recurrent excitation alone, which is expressed as αS<sub>d</sub>. In other words, the network remembers which nodes won last. The same is true in the case where the winner is determined by transient cues that are applied sequentially on a sustained input. This is the protocol that is used in the computer simulations that are reported in section Computer Simulations.
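Equations (10), (12), (13), and (14) can be combined into a short Python sketch that predicts which nodes survive for a given input pattern. The sketch makes two simplifying assumptions that are not part of the analysis above: it uses the limit y = x<sub>M</sub> − T<sub>x</sub> instead of Equation (11), and it tests the candidate activity I<sub>i</sub> + αS<sub>d</sub> against QT. The function name and the example inputs are hypothetical.

```python
def predicted_steady_state(inputs, alpha=1.0, S_d=1.0, T_x=0.1, T_y=0.1):
    """Approximate fixed point of the network for a list of feedforward inputs."""
    x_M = max(inputs) + alpha * S_d   # Equation (10)
    y = x_M - T_x                     # limit in Equation (12), assumed to be reached
    QT = y - T_y                      # Equation (13)
    x = []
    for I in inputs:
        candidate = I + alpha * S_d   # activity of a node that escapes inhibition
        x.append(candidate if candidate >= QT else 0.0)  # Equation (14)
    return x, y, QT


x, y, QT = predicted_steady_state([1.0, 0.9, 0.5, 1.0])
print(x, y, QT)
```

In this example the node with input 0.9 lies above QT and is spared together with the two maximal nodes, while the node with input 0.5 is silenced.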

FIGURE 6 | Relationship among the steady state of the winning node x<sub>1</sub>, inhibitory node y, and all other excitatory nodes in the network, x<sub>2</sub> … x<sub>n</sub>. The activity of the winning node is given by the sum of its feedforward input I<sub>1</sub> and the output of its dendrite mediating self- and nearest neighbor excitation, which is expressed as αS<sub>d</sub>. Inhibitory node y approximately converges to x<sub>1</sub> − T<sub>x</sub>. It sets the quenching threshold (QT) that separates excitatory nodes into two sets. Nodes x<sub>2</sub> … x<sub>n</sub> are spared from inhibition if their activity is above the QT (dashed line); otherwise, they are silenced to zero (solid line). QT equals y − T<sub>y</sub> (or x<sub>1</sub> − T<sub>x</sub> − T<sub>y</sub>) because the activity of the inhibitory node must exceed the threshold on its presynaptic terminals that contact the excitatory nodes.

#### Linearization Near Fixed Points

To simplify the stability analysis, we consider an F-WTA network with two excitatory nodes and one inhibitory node: [x<sub>1</sub>, x<sub>2</sub>, y]. This system has three fixed points: x<sub>1</sub> is the only winner, x<sub>2</sub> is the only winner, and both excitatory nodes are winners. The fixed point to which the network converges depends on the relationship between inputs I<sub>1</sub> and I<sub>2</sub>.

Local stability of the fixed point is estimated from the eigenvalues of the Jacobian matrix, which is the matrix of partial derivatives of the system of equations. If the real parts of all eigenvalues of the Jacobian are negative, the fixed point will be asymptotically stable (Rutishauser and Douglas, 2009). However, before we can compute the Jacobian matrix, we note that a linear-threshold function is continuous, but not differentiable. To sidestep this problem, we follow the approach that was described by Rutishauser et al. (2011) of inserting dummy terms that correspond to the derivative. That is, we need three separate kinds of dummy terms: c<sub>i</sub> and p<sub>xi</sub>, which correspond to the somatic and presynaptic output functions of excitatory node i, and a set of p<sub>yi</sub> dummy terms that describe the presynaptic output function of inhibitory node y. The dummy terms are defined as

$$c_i = p_{xi} = p_{yi} = \frac{d}{du} [u_i(t)]^+ = \begin{cases} 0 & \text{if } u_i(t) \le 0 \\ 1 & \text{if } u_i(t) > 0. \end{cases} \tag{15}$$

Based on the above definition of the dummy terms, we have constructed the Jacobian matrix of the system that consists of Equations (1) and (2):

$$J = \begin{bmatrix} \tau_x^{-1}\left(c_1\left(\alpha D_1 f + \beta_1 p_{y1}\right) - 1\right) & \tau_x^{-1} c_1 \alpha D_2 f & -\tau_x^{-1} c_1 \beta_1 p_{y1} \\ \tau_x^{-1} c_2 \alpha D_1 f & \tau_x^{-1}\left(c_2\left(\alpha D_2 f + \beta_1 p_{y2}\right) - 1\right) & -\tau_x^{-1} c_2 \beta_1 p_{y2} \\ \tau_y^{-1} \beta_2 p_{x1} & \tau_y^{-1} \beta_2 p_{x2} & -\tau_y^{-1}\left(\beta_2\left(p_{x1} + p_{x2}\right) + 1\right) \end{bmatrix} \tag{16}$$

where D<sub>1</sub>f and D<sub>2</sub>f denote the partial derivatives of the sigmoid function f with respect to x<sub>1</sub> and x<sub>2</sub>. Now, we examine the Jacobian matrix at the three fixed points that are mentioned above. If x<sub>1</sub> is the only winner, then c<sub>1</sub> = 1. However, D<sub>1</sub>f ≈ 0 and D<sub>2</sub>f ≈ 0 because the recurrent excitation of the winning node approaches its asymptotic value, which is S<sub>d</sub>. In addition, p<sub>y1</sub> = 0 because the winning node blocks inhibition from node y, as discussed above. Node x<sub>2</sub> is inhibited below its somatic threshold, that is, c<sub>2</sub> = 0. Presynaptic signaling by inhibitory node y blocks excitation from x<sub>1</sub>, and x<sub>2</sub> is inactive, so p<sub>x1</sub> = p<sub>x2</sub> = 0. Consequently, the Jacobian matrix at the fixed point reduces to a diagonal matrix of the form

$$J_{W1} = J_{W2} = J_{W12} = \begin{bmatrix} -\tau_x^{-1} & 0 & 0 \\ 0 & -\tau_x^{-1} & 0 \\ 0 & 0 & -\tau_y^{-1} \end{bmatrix}. \tag{17}$$

All eigenvalues of J<sub>W1</sub> are negative, and the fixed point is asymptotically stable. In the case when x<sub>2</sub> is the sole winner, the same arguments are applied to set the dummy terms, thereby leading to the same diagonal matrix J<sub>W2</sub> as shown in Equation (17). Moreover, if both excitatory nodes are winners, then c<sub>1</sub> = c<sub>2</sub> = 1, D<sub>1</sub>f = D<sub>2</sub>f ≈ 0 and p<sub>x1</sub> = p<sub>x2</sub> = 0. Again, the Jacobian matrix J<sub>W12</sub> is diagonal. Thus, all three fixed points are asymptotically stable.

The same analysis can be generalized to a network of arbitrary size and arbitrarily many fixed points. Retrograde signaling and dendritic saturation will ensure that the Jacobian matrix of any size will be diagonal and that the network dynamics will be independent of the network parameters, namely, α, β1, and β2. Local stability analysis suggests that the system behaves much like a feedforward network that is driven by the input. However, an important difference is that the F-WTA network has memory states like the recurrent network (Usher and Cohen, 1999; Rutishauser and Douglas, 2009).
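The evaluation of the Jacobian at a fixed point can be sketched in plain Python. The entries below are obtained by differentiating the two-node system directly with the dummy terms inserted, under the assumption that Equations (1) and (2) share the structure of Equations (8) and (9); the function name and argument layout are hypothetical.

```python
def jacobian(c, p_y, p_x, D_f, tau_x=5.0, tau_y=2.0,
             alpha=1.0, beta_1=1.0, beta_2=10.0):
    """3x3 Jacobian of [x1, x2, y] for given dummy terms (each 0 or 1)
    and dendritic derivatives D_f = (D1f, D2f)."""
    c1, c2 = c
    py1, py2 = p_y
    px1, px2 = p_x
    D1f, D2f = D_f
    return [
        [(c1 * (alpha * D1f + beta_1 * py1) - 1) / tau_x,
         c1 * alpha * D2f / tau_x,
         -c1 * beta_1 * py1 / tau_x],
        [c2 * alpha * D1f / tau_x,
         (c2 * (alpha * D2f + beta_1 * py2) - 1) / tau_x,
         -c2 * beta_1 * py2 / tau_x],
        [beta_2 * px1 / tau_y,
         beta_2 * px2 / tau_y,
         -(beta_2 * (px1 + px2) + 1) / tau_y],
    ]


# Fixed point with x1 as the only winner: c1 = 1, c2 = 0, all other terms zero.
J = jacobian(c=(1, 0), p_y=(0, 0), p_x=(0, 0), D_f=(0.0, 0.0))
eigenvalues = [J[i][i] for i in range(3)]  # valid because J is diagonal here
print(eigenvalues)
```

Because the matrix is diagonal at this fixed point, its eigenvalues are simply the diagonal entries, all of which are negative for positive time constants.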

## COMPUTER SIMULATIONS

We performed a set of computer simulations to illustrate the model behavior. We employed a vector of 200 excitatory units and one inhibitory unit. Differential Equations (1) and (2) were solved numerically using MATLAB's ode15s solver. The simulations were run for 250 time steps. In subsequent figures, we followed the convention that activity of the node at position i as a function of time is depicted by a shade of gray, with white representing the maximal value and black representing zero.
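For readers who wish to experiment with the dynamics, the following Python sketch integrates a very small network. It is not the simulation code that was used here: it replaces MATLAB's ode15s with a fixed-step forward Euler scheme, uses the piecewise-linear output functions of Equations (7) to (9), starts from zero activity, and sets the saturation points S<sub>c</sub> and S<sub>p</sub> to hypothetical values. The function names are likewise illustrative.

```python
def clip(u, S):
    """Piecewise-linear output function of Equation (7)."""
    return min(max(u, 0.0), S)


def simulate(inputs, t_end=250.0, dt=0.01, tau_x=5.0, tau_y=2.0,
             alpha=1.0, beta_1=1.0, beta_2=10.0,
             S_c=10.0, S_d=1.0, S_p=10.0,   # S_c and S_p are assumed values
             T_d=0.1, T_x=0.1, T_y=0.1):
    """Forward Euler integration of Equations (8) and (9) with constant input."""
    n = len(inputs)
    x = [0.0] * n
    y = 0.0
    for _ in range(int(round(t_end / dt))):
        dx = []
        for i in range(n):
            left = x[i - 1] if i > 0 else 0.0       # no neighbor beyond the edge
            right = x[i + 1] if i < n - 1 else 0.0
            drive = (inputs[i]
                     + alpha * clip(x[i] + left + right - T_d, S_d)
                     - beta_1 * clip(y - x[i] - T_y, S_p))
            dx.append((-x[i] + clip(drive, S_c)) / tau_x)
        excitation = beta_2 * sum(clip(xi - y - T_x, S_p) for xi in x)
        dy = (-y + clip(excitation, S_c)) / tau_y
        x = [xi + dt * dxi for xi, dxi in zip(x, dx)]
        y += dt * dy
    return x, y


x, y = simulate([1.0, 0.5])
print(x, y)
```

With inputs of 1.0 and 0.5, the first node should settle near I<sub>M</sub> + αS<sub>d</sub> = 2, the second node should be silenced, and y should approach the value given by Equation (11) for k = 1.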

#### Simulation of the Formation of a Single Boolean Map

First, we demonstrate how a Boolean map arises in the F-WTA network in response to the presentation of the color cue, as illustrated by **Figure 1A**. In **Figure 7A**, we recreate a similar stimulus condition in the 1-D map. The input consists of red and green items of equal sizes, which are intermixed in space on a black background. Input magnitude I was set to 1 in both maps and to 0.2 in the empty space around items to represent spontaneous activity in the absence of visual stimulation. Initially, the top-down or attentional gain is set to G<sup>m</sup> = 1 in both feature maps m ∈ {red, green}. At t = 50, the red color is attended, which is reflected in the input to the network by increasing the gain for all nodes in the Red map (G<sup>red</sup> = 2) and simultaneously reducing the gain in the Green map by the same factor (G<sup>green</sup> = 1/G<sup>red</sup> = 1/2). Top-down gain is also applied to the empty space between items, which is consistent with the finding that feature-based attention spreads across the whole visual field (Saenz et al., 2002, 2003; Serences and Boynton, 2007). The duration of the top-down cue is 50 simulated time steps. For simplicity, top-down signals are suddenly switched on and off without exponential decay. At t = 150, the green color is cued in the same way.
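The cueing protocol can be written down as a small Python sketch of the gain-modulated input to one feature map. The function names, the half-open cue interval, and the keyword arguments are assumptions made for illustration; only the gain values and cue times come from the description above.

```python
def gain(feature, t, cued_feature="red", cue_on=50.0, cue_off=100.0,
         G_attended=2.0, reduce_unattended=True):
    """Top-down gain of a feature map at time t (cue switched on and off abruptly)."""
    if not (cue_on <= t < cue_off):   # assumed half-open interval [cue_on, cue_off)
        return 1.0
    if feature == cued_feature:
        return G_attended
    return 1.0 / G_attended if reduce_unattended else 1.0


def map_input(base, feature, t, **cue):
    """Gain-modulated input: base is 1 at item locations and 0.2 in empty space."""
    return gain(feature, t, **cue) * base


for t in (10, 60, 120):
    print(t, map_input(1.0, "red", t), map_input(1.0, "green", t),
          map_input(0.2, "red", t))
```

Setting `reduce_unattended=False` corresponds to the variant in which the gain of the non-attended map is left at 1.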

At the beginning of the simulation, before the top-down signals are applied, the F-WTA network simply selects all presented items together, irrespective of their color. Next, when the red color is cued by applying top-down signals to the corresponding feature map, the network responds to the new input by selectively increasing and sustaining the activity of nodes that encode locations of red items in the input and suppressing locations that encode green items. That is, the network creates a Boolean map by highlighting the spatial pattern that is associated with the red color. Furthermore, due to a self-excitation, the network maintains locations of the cued feature value in working memory after the top-down signals cease to influence the feature map. When the observer decides to switch attention to another feature value, the network can select the locations of the new feature value and suppress the locations that are associated with the previously cued value without requiring an external reset. Namely, the network is sensitive to input changes even though it also exhibits activity persistence.

Importantly, the activity level at selected locations is invariant with respect to the number of active nodes. At the beginning of the simulation, the number of active nodes was four times larger than after the cue was delivered. However, the active nodes remained at the same activity level as they were at the beginning of the simulation. This is a consequence of retrograde inhibitory signaling in recurrent pathways. It prevents unbounded growth of inhibition due to the dynamic regulation of its strength. To illustrate this point further, we ran another simulation with items that are almost double in size (**Figure 7B**). Even though the total size of the cued items is increased, the activity of the cued nodes converges to the same level as before. In this simulation, we also checked that the network successfully operates even if we remove gain reduction from the non-attended feature map.


Next, we determined the minimal feature gain that must be applied on the input to produce the desired behavior. When the gain modulation is applied simultaneously on attended feature map G<sup>A</sup> and on unattended feature map G<sup>NA</sup> (where G<sup>NA</sup> = 1/G<sup>A</sup>), we found that G<sup>A</sup> ≥ 1.7 is sufficient for creating a Boolean map and switching to another one. In contrast, when the gain modulation is not applied on the unattended feature map, as shown in **Figure 7B**, the feature gain in the attended map should be set to G<sup>A</sup> ≥ 2 to achieve the same behavior.

**Figure 8** illustrates that the F-WTA network can support space- and object-based attention alongside feature-based attention. When the spatial cue is applied to a single location in one of the feature maps, the network responds by selecting only this location. Neighboring nodes are not selected even though they are reciprocally connected to the cued node. The reason is that they receive weaker input relative to the cued node. Furthermore, recurrent excitation that arrives from the cued node is bound by the dendritic non-linearity. Thus, it is not sufficiently strong to keep them active. Interestingly, when the spatial cue is removed, the network activity starts to propagate from the cued node toward the boundary of the whole item. In this case, the network selects not just the cued location, but all locations that are connected to it. Therefore, the F-WTA network exhibits object-based selection, which is consistent with neurophysiological studies that show spreading of enhanced activity along the shape of the object (Roelfsema, 2006). This property arises because the removal of the cue equalizes the input magnitude along the object, which allows activity enhancement to propagate via local lateral connections.

In addition, this simulation shows that spatial attention can be easily oriented toward a new location in a single jump without the need for attentional pointers that move attention across the map (Hahnloser et al., 1999).

#### Simulation of the Intersection and Union of Two Boolean Maps

**Figure 9** illustrates that the model can sequentially combine two Boolean maps when the network is cued by top-down signals from two separate feature dimensions. In this simulation, we have employed a visual input that consists of red and green horizontal and red and green vertical bars, like those that are illustrated in **Figure 1B**. First, the F-WTA network is cued to select red bars, irrespective of their orientation. In the second step, it is cued to select horizontal bars, irrespective of their color. However, green vertical bars are already suppressed and the top-down signal that is supplied to them is not sufficient to override the inhibition that arises from red vertical bars. The net result is the selection of a subset of red horizontal bars. In other words, the network activity converges to an intersection between a set of red bars and a set of horizontal bars, thereby resulting in the selection of red horizontal bars.

Next, we examined how the network achieves the union of two Boolean maps (**Figure 10**). Here, we assumed that the input consists of two non-overlapping components: colored squares that activate color maps but do not activate orientation maps, and achromatic horizontal and vertical bars that activate orientation maps but do not activate color maps, as shown in **Figure 1C**. Red-colored items occupy locations between 1 and 100 and oriented bars occupy locations between 101 and 200. This closely resembles the stimulus that is used by Huang and Pashler (2007) to demonstrate the union of color and texture. Taken together, the data show that the union of two Boolean maps is possible only when two top-down cues overlap in time or when the second cue closely follows the withdrawal of the first cue. In **Figure 10**, the cue for the red map is applied in the interval [50, 100] and the cue for the horizontal map is applied in the interval [110, 160]. In this case, the F-WTA network converges to the union of red and horizontal items. However, when top-down cues do not overlap, as shown in **Figure 11**, the second cue overrides the network activity that remains from the first cue. We suggest that this property partly explains why the union is difficult to achieve, as observed by Huang and Pashler (2007).

In addition, we examine the boundary conditions on the choice of the feature gain parameter. We parametrically vary the feature gain in steps of 0.1 starting from G = 2 and moving below and above to determine when the ability to form the intersection or union breaks down. When the gain modulation is applied simultaneously on attended (G<sup>A</sup>) and unattended (G<sup>NA</sup>) feature maps, we find that G<sup>A</sup> should be chosen from the interval [1.5, 2.1] to achieve the intersection between two maps. When G<sup>A</sup> < 1.5, the network fails to segregate cued from non-cued locations in the first step. In contrast, when G<sup>A</sup> > 2.1, the network successfully segregates cued from non-cued locations in the first step. However, the gain is too high, so all horizontal items are selected together in the second step. That is, the representation of red horizontal items is merged with the representation of green horizontal items. When G<sup>NA</sup> = 1 throughout the simulation, G<sup>A</sup> should be chosen from the interval [1.8, 2.0] to achieve intersection.

With respect to the union of two maps, the feature gain G<sup>A</sup> should be chosen from the interval [1.4, 2.0] when G<sup>NA</sup> = 1/G<sup>A</sup> and from the interval [1.6, 2.0] when G<sup>NA</sup> = 1. When G<sup>A</sup> is chosen below the suggested intervals, feature gain is too weak, and the second cue will not be able to raise the activity level of the nodes that represent horizontal items above the quenching threshold. Therefore, the network ends up with the Boolean map of red items that is formed in the first step. When G<sup>A</sup> is chosen above the suggested interval, the network switches from the representation of the red items in the first step to the representation of the horizontal items in the second step. In this case, the feature cue is too high, and the activity of the nodes that represent horizontal items simply overrides the activity of the nodes that represent the red items. These constraints are derived from the situation in which the two top-down cues overlap in time. As shown above, temporal lag of the second cue relative to the first cue also destroys the ability of the network to form the union of two Boolean maps.

#### Simulation of Bottom-Up Spatial Selection

Finally, we have shown that when there is no top-down guidance, the network selects the most-salient locations based on the bottom-up salience that is computed within feature maps (**Figure 12**). We did not explicitly model competition among maps, but it is reasonable to assume that in a scene with many multi-featured objects, their input magnitudes (i.e., saliencies) will be different. Therefore, we arbitrarily assigned different input magnitudes to different items. As shown in **Figure 12A**, the F-WTA network selects the most salient object if the difference in input magnitude between the two most active nodes is sufficiently large. However, when this difference is small, as shown in **Figure 12B**, the F-WTA model chooses two most salient items together. Furthermore, in both examples, the network activity retains the input amplitude of the winning item (or items), thereby illustrating the ability to compute the function maximum (Yu et al., 2002).

The precision of saliency detection depends on the threshold for the activation of synaptic receptors on the inhibitory interneuron. In all reported simulations, it was set to T<sub>y</sub> = 0.1. If smaller values were chosen, the network would improve in terms of precision and be able to separate the two objects that are presented in **Figure 12B**. However, this comes at the price of losing the ability to form a union of two Boolean maps. Therefore, there is a trade-off between the precision of saliency detection and the ability to form Boolean maps.

An important aspect of stimulus-driven attentional control is attentional capture by peripheral cues. Behavioral studies have shown that the abrupt onset of a new object in a visual scene can automatically capture attention even if it is irrelevant for the current goal (Theeuwes, 2010). **Figure 13** illustrates the sensitivity of the F-WTA network to abrupt visual onset. To simulate this effect, we have made the additional assumption that the network receives input not only from a sustained channel that is comprised of feature maps in V4 but also from a transient channel that responds vigorously only to changes in input (Kulikowski and Tolhurst, 1973; Legge, 1978). Thus, when the abrupt onset is accompanied by a strong transient signal that exceeds the activity level of the currently attended item, the F-WTA network temporarily switches activity toward the location of the onset (**Figure 13A**). Here, the input at the locations that are occupied by the winning item in the center of the map was set to I<sub>W</sub> = 2. Input to all other items was set to I<sub>i</sub> = 1. Finally, the transient input that appears on the sides of the map was set to I<sub>T</sub> = 4. It is sufficient to set I<sub>T</sub> ≥ I<sub>W</sub> + 0.8 to achieve sensitivity to abrupt onsets. Moreover, the same relation holds even if we choose a larger value for I<sub>W</sub>.

Next, when abrupt onset produces only weak transient signals (I<sub>T</sub> = 2) that do not satisfy the inequality that is stated above (I<sub>W</sub> = 2), the activity in the F-WTA network resists abrupt onset and stays on the previously attended item (**Figure 13B**). This observation is consistent with behavioral findings that abrupt onset can be ignored (Theeuwes, 2010), perhaps by attenuating the response of the transient channel. Another possibility is that the top-down gain for the attended location can be increased so that it exceeds the activity of the transient channel. In this case, intense focus on the current object prevents attentional capture, which is consistent with the psychological concept of the attentional window (Belopolsky and Theeuwes, 2010).
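The two onset conditions can be compared against the reported inequality with a one-line Python check. The function name is hypothetical, and the check encodes only the sufficient condition I<sub>T</sub> ≥ I<sub>W</sub> + 0.8 that is stated for the simulated parameter set; it is not a simulation of the network.

```python
def satisfies_capture_condition(I_T, I_W, margin=0.8):
    """True if the transient input meets the reported sufficient condition
    I_T >= I_W + margin for capturing activity from the current winner."""
    return I_T >= I_W + margin


print(satisfies_capture_condition(I_T=4.0, I_W=2.0))  # strong transient
print(satisfies_capture_condition(I_T=2.0, I_W=2.0))  # weak transient
```

The strong transient meets the condition, in line with the capture shown in Figure 13A, whereas the weak transient does not, in line with Figure 13B.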

## DISCUSSION

We have proposed a new model of the WTA network that can simultaneously select multiple spatial locations based on a shared feature value. We named the model the feature-based WTA (F-WTA) network because the unit of selection is not a point in space or object, but rather an abstract feature value that is set by the top-down signals. We have demonstrated how the F-WTA network implements the central proposal of the Boolean theory of visual attention that there exists a spatial map that divides the visual space into two mutually exclusive sets. One set represents all locations that are occupied by the chosen feature value. The other set contains all other locations, which are not of interest. The Boolean map controls spatial selection and access to consciousness (Huang and Pashler, 2007). Moreover, we have shown that the network successfully integrates information across space and time to form the intersection or union of two maps that are defined by different feature cues. Previous models of the WTA network are not capable of such integration because they require that the current winner be externally inhibited to allow attentional focus to move from one location to another (Kaski and Kohonen, 1994; Itti and Koch, 2000, 2001). Another possibility to move activity across locations in the network is to introduce dynamic thresholds that simulate habituation or fatigue in individual neurons. In this case, the current winner loses its competitive advantage due to the rise of its threshold. This allows non-winners to gain access to working memory (Horn and Usher, 1990). However, neither approach is suitable for forming the intersection or union of a set of previous winners and a set of later winners.

Another important property of the F-WTA network that sets it apart from previous models of WTA behavior is the ability to select and store arbitrarily many locations in the memory. This is achieved by inhibitory retrograde signaling, which effectively isolates winning nodes from mutual inhibition. First, the amount of inhibition in the network is significantly reduced because the inhibitory interneuron computes the maximum instead of the sum of the recurrent input that it receives from the excitatory nodes. Second, the winning excitatory nodes release their retrograde signals and block inhibition from the interneuron. Consequently, arbitrarily many winners can participate in representing the selected locations without degrading their activation. In other words, there is no capacity limit on the number of objects that can be simultaneously selected. This is consistent with recent behavioral findings that suggest that our ability to select multiple objects is not fixed. Rather, spatial attention should be considered a fundamentally continuous resource without a strict capacity limit (Davis et al., 2000, 2001; Alvarez and Franconeri, 2007; Liverence and Franconeri, 2015; Scimeca and Franconeri, 2015).
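The first part of this capacity argument can be illustrated with a small numerical sketch. It is not the F-WTA network itself: it only compares the inhibition that an interneuron would deliver if it summed the recurrent inputs from a set of equally active winners with the inhibition it would deliver if it took their maximum. The function names and the activity value of 1.0 are illustrative assumptions.

```python
def inhibition_sum(winner_activities):
    """Inhibition from an interneuron that sums its recurrent inputs."""
    return sum(winner_activities)


def inhibition_max(winner_activities):
    """Inhibition from an interneuron that takes the maximum of its inputs."""
    return max(winner_activities)


if __name__ == "__main__":
    for n_winners in (1, 4, 16, 64):
        # Hypothetical winners, all with the same activity level.
        activities = [1.0] * n_winners
        print(n_winners, inhibition_sum(activities), inhibition_max(activities))
```

Under the sum rule the inhibition grows in proportion to the number of winners, whereas under the maximum rule it stays at the level of a single winner regardless of how many locations are selected.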

In addition, the network is sensitive to the sudden appearance of a new object in the scene, which suggests that it can also be guided by bottom-up feature cues (Theeuwes, 2013). We hypothesize that the network receives strong input from the transient channel. Such input overrides the network's current memory state, thereby making it sensitive to abrupt onsets. Moreover, the transient channel can be activated by any type of change in the spatiotemporal energy of the input, and not just by the sudden appearance (or disappearance) of objects. For example, it will be activated by a sudden change in the direction of motion (Farid, 2002). When the network simultaneously receives transient input from different locations, they all will be selected together. In this way, the network achieves temporal grouping of synchronous transient input. That is, the network can discover spatial structures that are defined purely by temporal cues (Lee and Blake, 1999; Rideaux et al., 2016).

#### Biophysical Considerations

As noted above, the model of the F-WTA network rests upon three key computational elements: the dendrite as an independent computational unit, retrograde signaling on synaptic contacts, and computing the maximum over inputs. Here, we review supporting neuroscientific evidence that suggests that all three biophysical mechanisms are plausible candidates for computation in real neural networks.

There is a growing body of evidence that the excitatory pyramidal cell should not be viewed as a single electrical compartment. Rather, it consists of multiple independent synaptic integration zones arranged in a two-layer hierarchy (Häusser and Mel, 2003; London and Häusser, 2005; Branco and Häusser, 2010; Mel, 2016). Using a detailed biophysical model of the pyramidal neuron, Poirazi et al. (2003) showed that its output is well approximated by a two-layer neural network. In the first layer of the network, dendrites independently integrate their synaptic input and produce sigmoidal output. In the second layer, the dendritic output is summed at the soma to produce the neuron's firing rate. Importantly, the somatic and dendritic output functions need not be the same (Jadi et al., 2014). For example, Behabadi and Mel (2014) showed that the soma of the model neuron generates nearly linear output, while the dendritic output is sigmoid. In our model, the dendrite conveys recurrent excitation to the node. Due to the dendritic non-linearity, there is no risk of unbounded activity growth in the node. Furthermore, the dendritic output is summed with the external input at the soma of the node. By using a linear output function at the soma, we have ensured that the F-WTA network remains sensitive to input fluctuations.
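The two-layer arrangement described above can be sketched as follows. This is a minimal illustration under stated assumptions rather than the model's actual equations: the logistic form of the dendritic non-linearity and the `threshold` and `slope` values are hypothetical choices made for this example.

```python
import math


def dendritic_output(drive, threshold=0.5, slope=0.1):
    """Bounded sigmoidal dendritic non-linearity (assumed logistic form)."""
    z = (drive - threshold) / slope
    # Numerically stable logistic function for large positive or negative z.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def node_output(dendritic_drives, external_input):
    """Linear soma: bounded dendritic outputs are summed with the external input."""
    return sum(dendritic_output(d) for d in dendritic_drives) + external_input
```

Because each dendritic term saturates at 1, the recurrent contribution cannot grow without bound, while the linear somatic stage passes any change in the external input through unchanged.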

Synaptic transmission can be dynamically regulated in an activity-dependent manner, as shown by the existence of depolarization-induced suppression of inhibition (DSI) (Pitler and Alger, 1992) and depolarization-induced suppression of excitation (DSE) (Kreitzer and Regehr, 2001). DSI (DSE) refers to the reduction in inhibitory (excitatory) post-synaptic potentials following depolarization of the postsynaptic cell. These processes have been observed in various brain regions, including the cerebellum, hippocampus, and neocortex. A retrograde messenger that is released from the postsynaptic cell due to its depolarization mediates DSI and DSE. After release, the retrograde messenger binds to the receptors at the presynaptic axon terminals and suppresses the release of the transmitter. Based on these properties, Regehr et al. (2009) suggested that a possible physiological function of DSI and DSE is to provide negative feedback that reduces the impact of the synaptic input on the ongoing neural activity.

The model behavior rests upon the assumption that the inhibitory interneuron computes the maximum instead of the sum of its inputs. There is some direct physiological evidence that real cortical neurons indeed compute the MAX function. For example, Sato (1989) examined responses of neurons in the primate inferior temporal cortex to the presentation of one or two bars in their receptive field. He concluded that the responses to two bars that were presented simultaneously were well described by the maximum of the responses to each separately. In a similar vein, Gawne and Martin (2002) recorded the activity of neurons in primate V4 and found that their firing rate in response to the combination of stimuli is best described by the maximum function over the firing rates that are evoked by each stimulus alone. Furthermore, Lampl et al. (2004) directly measured membrane potentials in the complex cells of the cat primary visual cortex and found evidence for the MAX-like behavior in response to the pair of optimal bars.

Indirectly, the importance of the MAX-like operation in cortical information processing can be appreciated by considering the many computational models of visual functions that have employed it in simulating rich and complex datasets. For example, Riesenhuber and Poggio (1999) employed hierarchical computation of the MAX function in a model of invariant object recognition. Spratling (2010, 2011) used it in simulating a large range of classical and non-classical receptive field properties of V1 neurons. Moreover, Tsui et al. (2010) used MAX-like input integration to explain diverse properties of MT neurons and Hamker (2004) used it in his model of top-down guidance of spatial attention. Furthermore, Kouh and Poggio (2008) developed a canonical cortical circuit that is capable of many non-linear operations, including computation of the MAX function. Here, we have shown that a single inhibitory node that is endowed with retrograde signaling can compute the maximum.

Based on the proposed model, we have derived two testable predictions. The cortical network that is involved in spatial selection will contain inhibitory interneurons that can compute the MAX function. Moreover, both the excitatory and inhibitory neurons in this network will be endowed with the anatomical structures that support retrograde signaling (presynaptic receptors and postsynaptic transmitter release sites).

#### Comparison With Other WTA Network Models

Several models of biophysical mechanisms have been proposed for implementing WTA behavior in a neural network, including linear-threshold units (Hahnloser, 1998; Rutishauser and Douglas, 2009), non-linear shunting units (Grossberg, 1973; Fukai and Tanaka, 1997), and oscillatory units (Wang, 1999; Borisyuk and Kazanovich, 2004).

A simple model of a competitive network that is based on linear-threshold units has been extensively studied. Stability analysis revealed that this network requires fine-tuning of the connectivity to achieve stable dynamics that can perform cognitively relevant computations, such as choice behavior (Hahnloser, 1998; Hahnloser et al., 2003; Rutishauser et al., 2015). Recently, Binas et al. (2014) showed that a biophysically plausible learning mechanism could tune the network connections in a way that keeps the network dynamics in the stable regime. Here, we have shown how dendritic and synaptic non-linearities ensure that the network dynamics near fixed points depends only on the time constants of the nodes and not on the parameters that control recurrent excitation and lateral inhibition. Therefore, a precise balance between excitation and inhibition is not necessary for achieving a stable memory state. Moreover, the network is sensitive to the input and can iteratively combine the current memory state with new input to form the intersection or union of them.

An important problem for WTA networks that are based on the linear-threshold or sigmoid output functions is that they lack a mechanism for controlling inhibition between the winning nodes. Therefore, they have limited capacity to represent multiple winners. Usher and Cohen (1999) showed that their activation decreases up to the point of complete inactivation as the number of winning nodes increases. This is due to the increased amount of mutual inhibition. The problem cannot be solved simply by reducing the strength of the lateral inhibition because it is not known in advance how many locations will be cued. On the other hand, feature-based spatial selection requires that the network be able to adjust automatically the amount of inhibition to accommodate the selection of a very small or very large number of winners.

Grossberg (1973) proposed a recurrent competitive map model that was based on shunting non-linear interaction between the synaptic input and the membrane potential. The output of the model depends on the exact form of the signal function that is used to convert membrane potential into the firing rate. When the signal function is chosen to grow faster than linear, the network exhibits WTA behavior. By contrast, when the signal function is sigmoid, the network can select multiple winners if they have similar activity levels. The most important property of this model is the existence of the quenching threshold. All nodes whose activity is above QT are enhanced and all nodes whose activity is below QT are suppressed. This behavior is similar to the operation of the F-WTA network that was proposed here. However, an important difference is that in the shunting model, QT is fixed and dependent on the parameters of the network. In contrast, the feature-based WTA network exhibits dynamic QT that depends on the input to the network and not on its parameters. In this way, the F-WTA network rescales its sensitivity to the input fluctuations.

More recently, a version of the recurrent competitive map was applied in modeling object-based attention (Fazl et al., 2009). It was shown that sustained network activity in the model PPC encompasses the whole object as an attentional shroud around it. Such spatial representation of a single object supports view-invariant object recognition within a larger neural architecture, namely, ARTSCAN. In an extension of the model, Foley et al. (2012) proposed two separate competitive networks that account for distinct properties of object- and space-based attention. A network with strong inhibition is limited to the selection of a single object. The other network utilizes weaker inhibition to support multifocal spatial selection. To increase the capacity of this network to represent multiple objects, Foley et al. (2012) suggested that the amount of lateral inhibition could be controlled externally. As the number of objects that should be selected together increases, the lateral inhibition should become weaker to counteract the effect of the larger number of nodes that participate in the competition. In contrast, the F-WTA network does not require such external adjustments of the strength of the lateral inhibition to accommodate the selection of arbitrarily many objects of arbitrary size. Moreover, in the F-WTA network, object-based and multifocal spatial attention coexist within the same circuit. Whether the network exhibits object-based spatial selection depends on the type of cue that is presented to the network and not on its parameters.

Wang (1999) proposed a model of object-based attention that relies on the phase synchronization and desynchronization among oscillatory units. At each location of the recurrent map, there is a pair of excitatory and inhibitory units with distinct temporal dynamics that creates a relaxation oscillator. Excitatory units are also mutually connected with their nearest neighbors and with a global inhibitor. The network is initialized with random phase differences between oscillators at different network locations. The activity of the global inhibitor further enforces phase separation among excitatory units. However, local excitatory interactions among nearest neighbors oppose global inhibition and result in phase synchronization that spreads among nodes that encode the same object. The net result of these interactions is temporal segmentation and selection of one active object representation at a time in a multi-object input image. Importantly, the network can switch its activity from one object representation to another. However, this transition is generated internally by the oscillator dynamics. It is not possible to drive the object selection by external cues such as top-down gain control or bottom-up cues such as abrupt onsets. Moreover, it is not possible to enforce simultaneous selection of more than one object by a joint feature value because the global inhibitor will desynchronize all nodes that encode non-connected items. Therefore, it is not clear how synchronous oscillations could support feature-based attentional selection. Taken together, it is still an open issue whether they are relevant for perception and cognition (Ray and Maunsell, 2015).

#### Limitations

The proposed model of spatial selection successfully simulates the formation of the Boolean map and its elaboration by the set operations of intersection and union but does not fully implement all aspects of the theory that was proposed by Huang and Pashler (2007). Specifically, it does not explain why attention is limited to only one feature value per dimension or how the observer sequentially chooses one feature value after another or combines feature dimensions into intersections or unions of Boolean maps. It is likely that this severe limitation arises from some form of the WTA network. However, accounting for this constraint requires a more elaborate model of the interactions among the spatially invariant representations of the feature values in the IT cortex and the interactions between the IT and the prefrontal cortex, where decisions and plans are made.

In all simulations that are reported here, we kept items segregated in space. This was not the case in the stimuli that were used by Huang and Pashler (2007), who employed a matrix of colored squares that were connected to one another. We segregated the items because activity spreading can occur among adjacent nodes even if they encode different feature values. Activity spreading is observed after top-down signals stop favoring one feature value over the other. In this case, all feature maps contribute equally to the input of the F-WTA network and the network is no longer able to discriminate between selected and unselected feature values. One way to solve this issue is to assume that the top-down signals are constantly present during the whole trial. In this way, the activity magnitude on the cued locations is kept above that on the non-cued locations. Therefore, non-cued locations are treated as background noise and suppressed, despite their proximity to the cued locations. Another possibility is to impose boundary signals that act upon recurrent collaterals of the nodes in the F-WTA network in a way that is similar to how activity spreading is stopped in the network models of brightness perception (Grossberg and Todorović, 1988), visual segmentation (Domijan, 2004), and figure-ground organization (Domijan and Šetić, 2008).

Finally, input to the network does not follow the distance-dependent activity profile that is usually observed in the visual cortex. However, this is not a critical issue for the model's performance because the precision of selection depends on the thresholds for presynaptic terminal activation, namely, Tx and Ty. If they are set to very small values, the network will tend to select the centers of the objects when the input pattern is convolved with a Gaussian filter. In contrast, if they are set to larger values, the network will be able to select extended parts of the objects and possibly even the whole objects. In the same way, the model achieves resistance to the input noise. As thresholds are set to larger values, the network can tolerate a larger amount of noise. However, this comes at a cost of less-precise selection, as demonstrated by the simulation that is shown in **Figure 12**.

#### CONCLUSIONS

We have demonstrated how the feature-based WTA network achieves spatial selection of all locations that are occupied by the same feature value without suffering from capacity limitations. The network responds to the top-down cue by storing in memory the spatial pattern that corresponds to the cued feature value, while non-cued feature values are suppressed. In this way, we have shown how the Boolean map is formed. In addition, we have shown that it is possible to create more complex spatial representations that involve the intersection or the union of two or more Boolean maps. In this way, the F-WTA network goes beyond the capabilities of previous models of the competitive neural network, which cannot integrate information across space and time. Our work suggests that dendritic non-linearity and retrograde signaling are biophysically plausible mechanisms that are essential for the model's success.

#### AUTHOR CONTRIBUTIONS

DD designed the study and wrote the manuscript. MM performed computer simulations and wrote the manuscript.

#### FUNDING

This research was supported by the Croatian Science Foundation Research Grant HRZZ-IP-11-2013-4139 and the University of Rijeka Grant 13.04.1.3.11.


**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

The reviewer, MU, and handling editor declared their shared affiliation.

Copyright © 2018 Marić and Domijan. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# Globally Normal Bistable Motion Perception of Anisometropic Amblyopes May Profit From an Unusual Coding Mechanism

Jiachen Liu<sup>1</sup>, Yifeng Zhou<sup>1,2</sup> and Tzvetomir Tzvetanov<sup>1,3</sup>\*

*<sup>1</sup> Hefei National Laboratory for Physical Sciences at Microscale, School of Life Science, University of Science and Technology of China, Hefei, China, <sup>2</sup> State Key Laboratory of Brain and Cognitive Science, Institute of Biophysics, Chinese Academy of Science, Beijing, China, <sup>3</sup> Anhui Province Key Laboratory of Affective Computing and Advanced Intelligent Machine and School of Computer and Information, Hefei University of Technology, Hefei, China*

Anisometropic amblyopia is a neurodevelopmental disorder of the visual system. There is evidence that the neural deficits spread across visual areas, from the primary cortex up to higher brain areas, including motion coding structures such as MT. Here, we used bistable plaid motion to investigate changes in the underlying mechanisms of motion integration and segmentation and, thus, help us to unravel in more detail deficits in the amblyopic visual motion system. Our results showed that (1) amblyopes globally exhibited normal bistable perception in all viewing conditions compared to the control group and (2) decreased contrast led to a stronger increase in percept switches and decreased percept durations in the control group, while the amblyopic group exhibited no such changes. There were few differences in outcomes dependent upon the use of the weak eye, the strong eye, or both eyes for viewing the stimuli, but this was a general effect present across all subjects, not specific to the amblyopic group. To understand the role of noise and adaptation in such cases of bistable perception, we analyzed predictions from a model and found that contrast does indeed affect percept switches and durations as observed in the control group, in line with the hypothesis that lower stimulus contrast enhances internal noise effects. The combination of experimental and computational results presented here suggests a different motion coding mechanism in the amblyopic visual system, with relatively little effect of stimulus contrast on amblyopes' bistable motion perception.

#### Edited by:

*Hedva Spitzer, Tel Aviv University, Israel*

#### Reviewed by:

*Richard J. A. Van Wezel, University of Twente, Netherlands*

*Benjamin Thompson, University of Waterloo, Canada*

#### \*Correspondence:

*Tzvetomir Tzvetanov, tzvetan@hfut.edu.cn*

#### Specialty section:

*This article was submitted to Perception Science, a section of the journal Frontiers in Neuroscience*

Received: *09 December 2017* Accepted: *22 May 2018* Published: *07 June 2018*

#### Citation:

*Liu J, Zhou Y and Tzvetanov T (2018) Globally Normal Bistable Motion Perception of Anisometropic Amblyopes May Profit From an Unusual Coding Mechanism. Front. Neurosci. 12:391. doi: 10.3389/fnins.2018.00391*

Keywords: plaid motion, anisometropic amblyopia, motion coding mechanism, bistable percept, model prediction

# INTRODUCTION

Amblyopia is a neurodevelopmental disorder of the visual system. The condition is caused by an imbalance in visual input during cortex development, mostly in infancy (Wong, 2012; Hess and Thompson, 2015). Anisometropic amblyopia is typically due to the presence of a chronic blur. These conditions result in a weakening or suppression of the input from the amblyopic eye, and, thus, this input is processed abnormally within the visual cortex (Hubel and Wiesel, 1965, 1970; Kiorpes and McKee, 1999; Hess and Thompson, 2015). Such abnormal processing causes amblyopes to see differently from neurotypical subjects in visual perception tasks; for example, amblyopes may exhibit a reduction in contrast sensitivity, stereoacuity (3D, depth perception), or visual acuity (Bradley and Freeman, 1981; Levi et al., 2011). In contrast, suprathreshold contrast perception seems equivalent between both eyes of amblyopes (Hess and Bradley, 1980), while prolonged observation of static gratings leads amblyopes to report illusory static or dynamic patterns in the stimulus (Sireteanu et al., 2008; Thiel and Iftime, 2016).

In addition to the above basic visual features, other aspects of spatial and temporal processing are also affected by amblyopia in early visual cortices (Barnes et al., 2001; Bonhomme et al., 2006; Hess et al., 2010; Li et al., 2011). Increasing evidence has demonstrated that amblyopia is also associated with abnormal function of the MT/MST areas, which are highly motion-sensitive and related to local and global motion integration (Britten et al., 1992; Born and Bradley, 2005; Majaj et al., 2007). There is strong neurophysiological evidence to suggest that motion integration and segregation processing involve area MT (Newsome and Parés, 1988; Salzman et al., 1990). In addition, psychophysical studies have shown abnormal global motion perception in amblyopia, even after adjusting for the deficits in contrast sensitivity. These results strongly suggest that the motion-sensitive areas MT/MST are affected by this disorder (Ellemberg et al., 2002; Constantinescu et al., 2005; Simmers et al., 2006; Aaen-Stockdale et al., 2007; Thompson et al., 2008; Ho and Giaschi, 2009; El-Shamayleh et al., 2010), and a recent neuroimaging study found evidence of abnormal cortical processing of pattern motion in amblyopia (Thompson et al., 2012).

In psychophysical research, plaid motion is a particular stimulus used to investigate the underlying neural mechanisms of motion integration and segregation (Adelson and Movshon, 1982). Plaid stimuli are typically constructed from two drifting gratings within a circular aperture. The drifting directions of both gratings are different. When the two gratings have similar temporal and spatial properties, the stimulus will produce an initial percept of a single patterned surface drifting in a "global" direction, which is a unique combination of both component directions. With prolonged observation of the pattern, a perceptual switching phenomenon occurs; the plaid motion can be seen either as "coherent motion" (a single object moving rigidly) or as "transparent motion" (two independent gratings sliding over each other), dubbed bistable motion perception. Because of the advances in the theoretical understanding of bistable perception, we considered that plaid motion would be a particularly useful probe for investigating the mechanisms of motion segmentation and integration and help us to unravel in more detail the deficits in the amblyopic visual motion system.

The various observations of bistable perception have inspired models of multistability, which mainly focus on bistable rivalry (Lago-Fernández and Deco, 2002; Laing and Chow, 2002; Moreno-Bote et al., 2007). In such models, the random alternation of percepts is influenced by the competition between two neuronal populations via reciprocal inhibition, noise levels in the neural inputs and some sort of adaptation, e.g., spike frequency adaptation and/or synaptic depression. Such models are extendable to tristable percepts, of which plaid motion perception is argued to be an example (Huguet et al., 2014). In all of these models, the exact number of percept switches together with the durations of the two major types of percepts are very sensitive to internal variables, especially internal noise. Thus, any changes in internal variables differentially affect all measurable variables.

This manuscript first describes the results of three experiments performed to compare bistable motion perception in anisometropic amblyopes (AMB) and neurotypical observers (NTE). Experiment 1 was mainly performed as an exploratory study to search for plausible differences between AMB and NTE in plaid motion perception. This experiment led to the hypothesis of differential effects associated with stimulus strength between AMB and NTE, which was tested in Experiment 2. Experiment 3 was a control test of the main finding of contrast effects. In the last part, with the help of simulations, we analyzed the predictions of one model (Moreno-Bote et al., 2007) in order to compare them with the experimental results and thus to propose putative changes in the mechanisms of motion coding in the amblyopic visual system.

# METHODS

### Observers

A total of 32 observers participated in the experiments, including 17 normal-sighted subjects (five women and 12 men, including two authors; age range: 20–42) and 15 anisometropic amblyopes (one woman and 14 men; age range: 23–27). A portion of the observers in these two groups participated in Experiments 1, 2, and 3. The exact number of subjects within a given experiment is stated in the corresponding section. All amblyopes had anisometropic amblyopia; amblyope #10 had bilateral amblyopia. For that person, the eye with the best visual acuity (strong eye) was treated as the fellow eye in all analyses. Detailed ophthalmologic characteristics of these observers, including amblyopia type and optical correction, were obtained during normal university medical examinations at the department of ophthalmology in the hospital of USTC. The amblyopic group was defined according to the Preferred Practice Pattern (PPP) of the American Academy of Ophthalmology (Wallace et al., 2018), with the anisometropic type defined as an interocular difference above 1.5 in dioptre sphere and/or above 1.0 in dioptre of cylinder in observers who cannot fuse the retinal images well binocularly. Nonamblyopes had normal or corrected-to-normal eyesight, while amblyopes wore their best refractive corrections. All observers provided informed consent and received a fee of 60 CNY/hour for participating in the experiments. The experiments were approved by the ethics committee of the School of Life Science of USTC and followed the tenets of the Declaration of Helsinki for experiments with human subjects. **Table 1** presents the eye characteristics of the amblyopes.

### Apparatus

Stimuli were presented on an ASUS VG248 monitor with a 1,920 × 1,080-pixel resolution at a frame rate of 120 Hz. Observers were comfortably seated 100 cm in front of the screen in a dark room, with their chin and forehead resting on a chinrest. When the eye signal was available, binocular or monocular eye movements (randomly) were monitored and recorded for a portion of the observers (13 amblyopes/10 normal observers) with an EyeLink 1000 eye recording setup sampled at 500 Hz to confirm correct eye fixation at the stimulus location.


*Obs, observer; Amb, anisometropic amblyope; M, male; F, female; RE, right eye; LE, left eye; anis, anisometropic; DS, dioptre sphere; DC, dioptre of cylinder; Ø, plano; SA, stereo acuity; VA, visual acuity; MAR, minimum angle of resolution.*

#### Stimuli

The stimulus comprised two rectangular-wave gratings presented through a circular aperture 7.7° in diameter on a middle-gray background of RGB 126. Gratings moved at 3°/s (defined in the direction normal to their orientation) in directions 90° apart (angle α hereafter), with a spatial frequency of 3 c/d and duty cycle of 50%. The mean direction of motion of both gratings was either vertical upward or horizontal leftward, thus making the coherent pattern perceived as moving upwards or leftwards, respectively. Grating contrast was defined in RGB units, and two contrasts of 30% (high) and 5% (low) values were possible, with both gratings having the same contrast. A pink fixation point was added in the middle of the circular aperture to help subjects locate the stimulus center and minimize optokinetic nystagmus (Huguet et al., 2014), and subjects were instructed to fixate this point throughout the stimulus presentation.
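As a worked illustration of these stimulus parameters, the following sketch computes two derived quantities that the text does not state explicitly: the temporal frequency of each grating and the speed of the rigid plaid pattern. The pattern speed assumes the standard intersection-of-constraints construction for two gratings with equal normal speeds (Adelson and Movshon, 1982); the function names are hypothetical.

```python
import math


def pattern_speed(component_speed, alpha_deg):
    """Speed of the rigid plaid under the intersection-of-constraints rule,
    for two gratings with equal normal speeds and directions alpha_deg apart."""
    return component_speed / math.cos(math.radians(alpha_deg / 2.0))


def temporal_frequency(component_speed, spatial_frequency):
    """Temporal frequency (Hz) of one grating drifting at component_speed
    (deg/s) with the given spatial frequency (cycles/deg)."""
    return component_speed * spatial_frequency


if __name__ == "__main__":
    print(pattern_speed(3.0, 90.0))       # deg/s, along the mean direction
    print(temporal_frequency(3.0, 3.0))   # Hz
```

With a component speed of 3°/s and α = 90°, the pattern speed equals 3√2 °/s along the mean direction, and each grating drifts at 9 Hz.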

#### Experimental Procedure

Subjects were first familiarized with the stimuli and procedure. They reported each percept change by pressing one of two keyboard keys, with each key indicating that they perceived either coherent motion or transparent motion. They were instructed to passively report the percepts, without trying to influence them. Each observer was exposed to both global coherent directions (upward and leftward) to avoid motion direction adaptation, one (Experiment 1) or two (Experiment 2) contrast levels (for Experiment 1, 30% contrast; for Experiment 2, 30 and 5% contrast), and three eye conditions (binocular, left, right eye monocular), corresponding to a total of 6 or 12 different stimulus configurations. Presentation time was 120 s for each stimulus, and observers were tested on each configuration one time. The order of presentation was random. Because the first percept is known to always be coherent in normal-sighted observers (Hupé and Rubin, 2003), and amblyopes may experience grating misperceptions/illusions (Hess et al., 1978; Hess and Bradley, 1980; Thompson et al., 2008; Thiel and Iftime, 2016), each observer was debriefed at the end of each 120 s trial about their first percept (coherent or not) and overall visibility of the pattern. All participants reported that they could clearly see the stimuli, a single moving plaid stimulus and two grating surfaces sliding over each other, in all conditions, even at the lowest contrast used in this study. No amblyopes reported differences between amblyopic eye (AE) and fellow eye perception of the moving gratings, apart from the switch rate/duration differences. The dominant eye of each subject was assessed with the hole-in-card experiment. Stereo acuity was assessed with the Titmus Stereopsis Test. Visual acuity was measured using a standard wall-mounted Tumbling E chart, from a distance of 5 metres, and defined as the score associated with a correct judgment rate of 75% at the minimum angle of resolution.
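The two dependent measures derived from this procedure, the number of percept switches and the percept durations, can be extracted from the key-press record along the following lines. This is a minimal sketch, not the authors' analysis code: the event format, the percept labels, and the choice to let the last percept run until the end of the 120 s trial are all assumptions.

```python
def summarize_trial(events, trial_duration=120.0):
    """Count percept switches and mean percept durations for one trial.

    events: list of (time_s, percept) key presses sorted by time, where
    percept is "coherent" or "transparent" (hypothetical labels).
    """
    # Collapse repeated presses of the same key into a single percept onset.
    onsets = []
    for time_s, percept in events:
        if not onsets or onsets[-1][1] != percept:
            onsets.append((time_s, percept))

    # Each percept lasts until the next onset, or until the trial ends.
    durations = {"coherent": [], "transparent": []}
    for i, (start, percept) in enumerate(onsets):
        end = onsets[i + 1][0] if i + 1 < len(onsets) else trial_duration
        durations[percept].append(end - start)

    n_switches = max(len(onsets) - 1, 0)
    mean_durations = {
        p: (sum(d) / len(d) if d else None) for p, d in durations.items()
    }
    return n_switches, mean_durations
```

Note that the final percept is truncated by the end of the trial, so an analysis that cares about unbiased duration estimates may prefer to exclude it.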

#### Model Simulation and Numerical Procedures

We implemented the tristable model of motion coherence/transparency proposed by Huguet et al. (2014). This is a firing rate-based model with three pools of neuronal populations that encode three different percepts: coherence (C), transparent with the leftward moving grating on top (T<sub>L</sub>), and transparent with the rightward moving grating on top (T<sub>R</sub>). The equations describing the dynamics of the three populations are:

$$\begin{aligned} \tau\frac{dr_{C}}{dt} &= -r_{C} + \mathcal{S}(-\beta_{1}r_{T_{R}} - \beta_{1}r_{T_{L}} - a_{C} + I_{C} + n_{C})\\ \tau\frac{dr_{T_{R}}}{dt} &= -r_{T_{R}} + \mathcal{S}(-\beta_{1}r_{C} - \beta_{2}r_{T_{L}} - a_{T_{R}} + I_{T_{R}} + n_{T_{R}})\\ \tau\frac{dr_{T_{L}}}{dt} &= -r_{T_{L}} + \mathcal{S}(-\beta_{1}r_{C} - \beta_{2}r_{T_{R}} - a_{T_{L}} + I_{T_{L}} + n_{T_{L}}) \end{aligned}\tag{1}$$

with a<sub>i</sub>, I<sub>i</sub>, and n<sub>i</sub> representing adaptation, external input, and noise for each population, respectively. The time constant was τ = 10 ms. β<sub>1</sub> is the cross-inhibition strength between populations C and T (including T<sub>R</sub> and T<sub>L</sub>), while β<sub>2</sub> is the inhibition strength between T<sub>R</sub> and T<sub>L</sub>. The external input strengths are denoted I<sub>C</sub> and I<sub>T</sub> = I<sub>TR</sub> = I<sub>TL</sub>.

The function S is a sigmoidal input-output (transducer) function:

$$\mathcal{S}(x) = \frac{1}{1 + e^{-(x-\theta)/k}}\tag{2}$$

with threshold θ = 0.2 and k = 0.1.

Firing-rate adaptation was implemented through the terms a<sub>C</sub>, a<sub>TR</sub>, and a<sub>TL</sub>, which all followed the same time evolution:

$$
\tau_a \frac{da_i}{dt} = -a_i + \gamma r_i \tag{3}
$$

with τ<sub>a</sub> = 2,500 ms, and a maximum strength of γ = 0.25 for all populations.

Noise input is modeled with an Ornstein-Uhlenbeck process as:

$$\frac{dn_i}{dt} = -\frac{n_i}{\tau_s} + \sigma \sqrt{\frac{2}{\tau_s}}\,\xi(t) \tag{4}$$

with τ<sub>s</sub> = 200 ms and σ = 0.08, where ξ(t) is a white-noise process with zero mean, unit standard deviation, and no temporal correlations.
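As an illustration only, Equations 1 to 4 can be integrated numerically with a simple Euler-Maruyama scheme. The sketch below is not the authors' implementation: the function names, the 1 ms step, the zero initial conditions, and the logistic (base e) form of S are assumptions, and the values of β<sub>1</sub>, β<sub>2</sub>, I<sub>C</sub>, and I<sub>T</sub> are taken from the representative simulation values reported later in the Results.

```python
import math
import random

def sigmoid(x, theta=0.2, k=0.1):
    # Logistic transducer of Eq. 2 (base e is assumed here).
    return 1.0 / (1.0 + math.exp(-(x - theta) / k))

def simulate(duration_ms=120_000, dt=1.0, beta1=0.9, beta2=0.7,
             gamma=0.25, sigma=0.08, i_c=1.0, i_t=0.9, seed=0):
    """Euler-Maruyama integration of Eqs. 1-4; returns a list of (r_C, r_TR, r_TL)."""
    rng = random.Random(seed)
    tau, tau_a, tau_s = 10.0, 2500.0, 200.0  # ms, time constants given in the text
    names = ("C", "TR", "TL")
    inputs = {"C": i_c, "TR": i_t, "TL": i_t}
    # w[(i, j)] is the inhibition weight from population j onto population i.
    w = {("C", "TR"): beta1, ("C", "TL"): beta1,
         ("TR", "C"): beta1, ("TR", "TL"): beta2,
         ("TL", "C"): beta1, ("TL", "TR"): beta2}
    r = {n: 0.0 for n in names}      # firing rates (assumed to start at zero)
    a = {n: 0.0 for n in names}      # adaptation variables
    noise = {n: 0.0 for n in names}  # Ornstein-Uhlenbeck noise terms
    noise_scale = sigma * math.sqrt(2.0 / tau_s) * math.sqrt(dt)
    trace = []
    for _ in range(int(duration_ms / dt)):
        new_r = {}
        for i in names:
            inhib = sum(w[(i, j)] * r[j] for j in names if j != i)
            drive = -inhib - a[i] + inputs[i] + noise[i]
            new_r[i] = r[i] + dt / tau * (-r[i] + sigmoid(drive))
        for i in names:
            # Adaptation and noise are advanced using the rates of the current step.
            a[i] += dt / tau_a * (-a[i] + gamma * r[i])
            noise[i] += -noise[i] / tau_s * dt + noise_scale * rng.gauss(0.0, 1.0)
        r = new_r
        trace.append((r["C"], r["TR"], r["TL"]))
    return trace
```

Calling `simulate(seed=1)` produces one 120 s run sampled every millisecond; repeating the call with different seeds gives the independent repetitions needed to estimate means and variability.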

In this model (Huguet et al., 2014), we adjusted the cross-inhibition strengths β<sub>1</sub> and β<sub>2</sub>, the external inputs I<sub>C</sub> and I<sub>T</sub> (I<sub>TR</sub> and I<sub>TL</sub> were set equal), the noise strength σ, and the adaptation strength γ to reproduce our behavioral results, with the other parameters remaining unchanged. The time window of simulations was set to 120 s, corresponding to the length of one block of measurement in the psychophysical experiment, and repeated simulations were performed to obtain the mean and variability of the variables analyzed in the experiments.

Since we focused on the bistable condition, we report only transparent and coherent states, treating T<sub>R</sub> and T<sub>L</sub> together as the transparent percept. A coherent percept was defined when r<sub>C</sub> was simultaneously higher than r<sub>TR</sub> and r<sub>TL</sub>; otherwise, the percept was defined as transparent. For each 120 s of simulation, we computed the number of switches and the durations of coherent and transparent states.
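The classification rule above can be sketched as follows, given rate traces sampled at a fixed step. The function name and the trace format, a list of (r_C, r_TR, r_TL) tuples, are hypothetical, and dropping the final unfinished run is an assumption made here for the simulated data.

```python
def percept_stats(trace, dt_ms=1.0):
    """Label each sample 'C' or 'T', then count switches and collect durations (s)."""
    # Coherent only when r_C exceeds both transparent populations at once.
    labels = ["C" if r_c > r_tr and r_c > r_tl else "T"
              for r_c, r_tr, r_tl in trace]
    durations = {"C": [], "T": []}
    switches = 0
    run_start = 0
    for k in range(1, len(labels)):
        if labels[k] != labels[k - 1]:
            switches += 1
            durations[labels[k - 1]].append((k - run_start) * dt_ms / 1000.0)
            run_start = k
    # The run still in progress at the end of the trace is not recorded.
    return switches, durations
```

Because T<sub>R</sub> and T<sub>L</sub> are pooled, a change from one transparent state to the other does not count as a switch in this sketch.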

#### Data Analysis

For each 120-s trial, the number of percept changes was computed from the first report of a transparent percept to the end of the trial, as in work by Hupé and Rubin (2003). The dominance durations were measured between successive presses of the two keys. The duration of the last interrupted percept was not computed. The first percept was coherent in all trials (as reported in the debriefing), but in some conditions, a few subjects did not first press the "coherent" percept key, due to their knowledge of this appearance. Dominance durations were log10-transformed (Moreno-Bote et al., 2010).
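A minimal sketch of this key-press analysis is given below. The input format (time-ordered `(time_s, key)` pairs with keys `'C'` and `'T'`) and the function name are hypothetical. Two further assumptions are made: repeated presses of the same key are collapsed, and the first transparent report is itself counted as the first switch.

```python
import math

def dominance_durations(presses):
    """presses: time-ordered list of (time_s, key), with key in {'C', 'T'}."""
    # Keep only presses that change the reported percept (assumption).
    changes = []
    for t, key in presses:
        if not changes or changes[-1][1] != key:
            changes.append((t, key))
    # Durations run between successive key changes; the last, interrupted
    # percept has no closing press and is therefore never recorded.
    log_durations = {"C": [], "T": []}
    for (t0, key), (t1, _) in zip(changes, changes[1:]):
        log_durations[key].append(math.log10(t1 - t0))
    # Switches are counted from the first 'T' report to the end of the trial.
    first_t = next((i for i, (_, k) in enumerate(changes) if k == "T"), None)
    n_switches = 0 if first_t is None else len(changes) - first_t
    return n_switches, log_durations
```

Counting from the first transparent report makes the switch count independent of whether the subject pressed the "coherent" key at stimulus onset.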

Each dependent variable was analyzed with a within-between analysis of variance, and all statistical levels used Geisser-Greenhouse epsilon-hat-adjusted values where appropriate. In the first analysis, the dependent variable was the number of keypresses for each condition, which allowed for the comparison of the frequencies of perception switches in different conditions and observers (amblyopes/normal observers). This analysis included the data from all subjects. In the second analysis, the dependent variable was the mean duration of the percept, with an additional within-subject factor in the ANOVA corresponding to coherent and transparent conditions. In this analysis, observers who did not report perceptual switches in at least one condition were not included because the corresponding variable was missing. This phenomenon only appeared in 3 out of 15 anisometropic amblyopes (2 in Experiment 1 and 2 in Experiment 2) and 1 out of 17 NTE subjects (in Experiment 1), and it was mostly present for horizontal motion directions. We also calculated the mean value and standard deviation for each condition across all normal subjects and found that 1 of the 11 subjects in Experiment 2 had percept durations that deviated by more than 2 SD from the between-subjects mean of the condition in 8 out of 24 conditions. In contrast, the other subjects had such deviations in a maximum of 2 conditions. For this reason, we also removed this subject's data from the analysis of percept durations.
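The outlier screening described above can be illustrated with a short sketch. The data layout (one list of per-condition values per subject) and the function name are hypothetical. The use of the sample standard deviation and of a two-sided criterion are also assumptions, since the text does not specify them.

```python
import statistics

def count_deviant_conditions(data, n_sd=2.0):
    """data: {subject: [value per condition]}.

    Returns {subject: number of conditions in which that subject lies more
    than n_sd standard deviations from the between-subjects mean}.
    """
    subjects = list(data)
    n_cond = len(next(iter(data.values())))
    counts = {s: 0 for s in subjects}
    for c in range(n_cond):
        column = [data[s][c] for s in subjects]
        mean = statistics.mean(column)
        sd = statistics.stdev(column)  # sample SD (n - 1 denominator)
        for s in subjects:
            if abs(data[s][c] - mean) > n_sd * sd:
                counts[s] += 1
    return counts
```

A subject whose count is far above that of the other subjects would then be a candidate for exclusion.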

# RESULTS

#### Experiment 1

In the first experimental test, we measured the performance of each subject in three eye conditions (binocular, monocular with strong eye, and monocular with weak eye) with only a strong contrast of the gratings (30%) and global moving directions upwards and leftwards. We focused on the number of perceptual switches and mean duration of each percept type. Twenty subjects participated in this experiment; 10 of them were anisometropic amblyopes (AMB), and the remaining were neurotypical subjects (including two authors) that had no known visual deficits (NTE). During the experiment, all amblyopic subjects reported that they did not feel any difference between the fellow eye or binocular condition when using the amblyopic eye to watch the stimulus.

#### Frequency of Perceptual Switches

**Figure 1** illustrates the number of key-presses in each viewing condition for the two groups. There was a significant difference between the two moving directions [F(1, 18) = 15.865, p = 0.001] showing that, globally, the number of perceptual switches for the vertical motion directions were higher than for the horizontal directions. Eye viewing conditions also showed significant differences in perceptual switches [F(1.987, 35.758) = 5.836, p = 0.006], with the post-hoc Bonferroni test revealing a difference between the binocular and weak eye conditions [F(1, 18) = 10.860, p = 0.004]. Statistical analysis showed that there was no difference between the two groups of subjects [F(1, 18) = 1.061, p = 0.317], nor a significant interaction between the observer groups and the other factors (see **Table 2** for full ANOVA results).

#### Duration of the Two Percept Types in Different Conditions

**Figure 2** summarizes the results of the duration of the percepts. Statistical analysis showed that there was no difference between the two groups of subjects [F(1, 15) = 0.559, p = 0.466], indicating that the mean perceived duration of each percept type was similar in normal and amblyopic people. A significant difference was found in the durations of each percept type [F(1, 15) = 10.925, p = 0.005], with the duration of the coherent percept being longer than that of the transparent percept, independent of the subject group (see **Figure 2A**). We also found significant differences in motion direction [F(1, 15) = 22.272, p < 0.001], with the mean log10-transformed duration for the horizontal direction being longer than that for the vertical direction (mean of horizontal = 0.673, mean of vertical = 0.570), and a significant interaction between direction and group [F(1, 15) = 10.062, p = 0.006; see **Figure 2B**]. This last interaction was due to the much longer percept duration for the horizontal motion directions than for the vertical ones in AMB, while NTE exhibited similar values for both directions. There was also an interaction between eye condition and direction [F(1.927, 28.906) = 3.927, p = 0.031; **Figure 2C**]. For the horizontal direction, the means of the log10-transformed durations for each eye condition were similar, but they were distinct when the global motion direction was vertical. This difference may indicate that there are different strategies to address different motion directions. Additionally, the weak eye showed a relatively stable log10-transformed duration across the change in direction. Post-hoc Bonferroni-adjusted comparisons showed a difference between the weak eye and binocular condition in its interaction with direction [F(1, 15) = 8.787, p = 0.01]. No significant differences were found in other factors (see **Table 3** for complete ANOVA results).

Group (AMB/NTE). Error bars indicate between-subject SEM.

TABLE 2 | ANOVA results on Presses Number of Experiment 1.

#### Experiment 2

From the above Experiment 1 results, we observed that there were few differences between amblyopes and non-amblyopes in their perception of a bistable plaid motion stimulus. This outcome was unexpected because, based on previous reports of stronger noise in the amblyopic motion system (Simmers et al., 2006) and possibly a very different visual motion coding system in amblyopes (Thompson et al., 2012), we expected that motion rivalry, due to its keen sensitivity to internal noise and inhibition strength (Huguet et al., 2014), would result in strong systematic differences between the two observer types. Given the non-significant differences, we realized that our experimental design might have missed the effects because of the relatively high contrast of the gratings. Thus, if the activation of the motion system was so high that the signal-to-noise ratio (SNR) was relatively large, then any internal noise differences might have gone unnoticed. Therefore, we performed a second experiment that was identical to the first in all aspects except that one more factor was added, the contrast of the stimuli, with two levels, high (30%) and low (5%) contrast. By decreasing the contrast, we expected that the SNR would also decrease, and differences between the groups would be observed, with a prediction that there would be a main effect of lower contrast in which the low-contrast condition would be associated with more perceptual switches in amblyopes when compared to the high-contrast condition.

FIGURE 2 | The mean log10-transformed percept durations showing (A) a main effect of percept type, (B) main effect of motion direction, and (C) an interaction between eye condition and direction. Mean of the log10-transformed percept durations expressed in seconds. C, coherent; T, transparent; NTE, Neurotypical/Normal; AMB, Amblyopes; Vrt, vertical; Hzt, horizontal; BE, binocular condition; AE/nDE, weak eye of subjects; FE/DE, fellow/dominant eye of subjects. Error bars indicate between-subject SEM.

TABLE 3 | ANOVA results on Mean of log-10 Durations of Experiment 1.

Twenty-one subjects participated in this experiment, with 10 anisometropic amblyopes (AMB; 5 of them also participated in Experiment 1), and the remaining were neurotypical subjects (NTE; 4 of them participated in Experiment 1).

#### Frequency of Perceptual Switches

Here, we still used the number of key-presses to represent the frequency of perceptual switches. Analysis included data from all 21 subjects (10 AMB and 11 NTE). **Figure 3** shows the main significant effects and interaction of how the press number increased with lower contrast and that the frequency of percept switches was globally lower in the weak eye condition than in the other conditions. There was no significant difference in the performance of normal and amblyopic subjects [F(1, 19) = 0.287, p = 0.598]. However, there was a significant difference in contrast [F(1, 19) = 5.575, p = 0.029], direction [F(1, 19) = 5.697, p = 0.028], and eye condition [F(1.904, 36.171) = 4.446, p = 0.020]. The number of presses increased with the decrease in contrast, potentially due to an increase in internal noise or, equivalently, a decrease in the signal-to-noise ratio. Upon examination of the effects of the global direction of motion, both groups had higher percept switches when stimuli were moving upward (as in Experiment 1). Post-hoc comparisons (Bonferroni-corrected) for eye conditions showed a difference between the binocular and weak eye conditions [F(1, 19) = 6.426, p = 0.02] and a difference between the weak and strong eye conditions [F(1, 19) = 5.472, p = 0.03].

An interaction between contrast and eye condition was also found in this case [F(1.904, 30.537) = 5.492, p = 0.013]. However, no other interactions were significant (see **Table 4** for complete ANOVA results).

#### Duration of Two Percept Types in Different Conditions

Here, we analyzed the duration of both percept types (i.e., coherent and transparent) for different contrast, eye, and moving direction conditions and whether there were differences between neurotypical subjects and anisometropic amblyopes; 2/10 AMB were not included because of at least one condition with no percept switch, and 1/11 NTE was excluded as an outlier (see section Methods).

**Figure 4A** illustrates the durations of both direction and eye conditions for subject groups and stimulus contrast conditions. Statistical analysis showed that there were no differences between the two groups of subjects [F(1, 16) = 0.298, p = 0.593], indicating that globally, percept durations were similar in normal and amblyopic people. Significant differences were found across contrast conditions [F(1, 16) = 5.173, p = 0.037] and percept type [F(1, 16) = 19.241, p = 0.0005; **Figures 4B,C**]. Lower contrasts globally decreased percept duration, paralleling the increase in number of switches. The duration in the coherent percept was always longer than that in the transparent percept regardless of subject group (**Figure 4C**).

FIGURE 3 | Main significant effects and interaction in Experiment 2 for number of percept switches. (A) Interaction plot for contrast and eye conditions. DE (dominant eye) and FE (fellow eye) corresponded to the strong eye for normal and amblyope observers, respectively; nDE (non-dominant eye) and AE (amblyopic eye) corresponded to the weak eye for normal and amblyope observers, respectively. BE was the binocular condition. (B) Main effect of global motion direction. Mean number of switches was higher in the vertical condition (Vrt) than in the horizontal condition (Hzt). Error bars indicate between-subjects SEM.

TABLE 4 | ANOVA results on Presses Number of Experiment 2.


ANOVA also showed significant interactions between subject groups and contrast condition [group vs. contrast, F(1, 16) = 9.326, p = 0.008; **Figure 4B**]. In NTE, percept duration decreased with a decrease in contrast, while amblyopes had no clear variation. This effect suggested that amblyopes seem to have a different motion processing mechanism from NTE. Another interaction showed a significant effect of the contrast and eye condition [F(1.973, 31.575) = 4.420, p = 0.021; **Figure 4D**]. The performance in the binocular condition and stronger eye condition was similar across contrast conditions, while results differed according to contrast when the observer was using the weak eye to do the task. In this latter viewing condition, duration was slightly decreased when contrast increased, and the duration was always longer than the duration in the other two eye conditions. Thus, this interaction was mainly caused by the weak eye. Post-hoc Bonferroni-corrected comparisons for interaction between contrast and eye conditions showed that the dominant/fellow eye had a strong tendency for resulting in a different outcome than the binocular viewing condition [F(1, 16) = 4.463, p = 0.051], while the weak eye had a different outcome than the binocular condition [F(1, 16) = 7.624, p = 0.014]. No other effects were significant (**Table 5**).

#### Experiment 3: Control of Contrast Effects

We performed a control experiment to cross-check the effect of contrast in a different manner. We tested 5 AMB and 6 NTE (all of whom had participated in Experiment 1 or Experiment 2) in only the vertical condition, to avoid a low number of switches, with 6 levels of contrast (0.03, 0.05, 0.1, 0.15, 0.35, 0.5). The hypothesis was that AMB should exhibit no variation with contrast, while NTE should show an increase in the number of switches at lower contrast. The results showed a clear interaction between the linear slopes of the number of switches versus contrast in AMB and NTE [group vs. contrast: F(1, 8) = 11.9, p = 0.009], with the slope for AMB not different from zero (b = 4.2, CI = [−11.74, 20.22], R<sup>2</sup> = 0.12, p = 0.502) and a significantly negative slope for NTE (b = −18.27, CI = [−26.76, −9.78], R<sup>2</sup> = 0.90, p = 0.0039; see **Figure 5**). These results were also present when analyzing overall mean percept duration vs. contrast [group vs. contrast: F(1, 8) = 9.31, p = 0.016; **Figure 5**]. The results were nearly identical when regressing in log-contrast space [number of switches vs. log-contrast, interaction group vs. contrast: F(1, 8) = 11.898, p = 0.009; percept duration vs. log-contrast, interaction group vs. contrast: F(1, 8) = 9.037, p = 0.017].
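To make the slope comparison concrete, the sketch below fits an ordinary least-squares line of switch counts against contrast and reports the slope and R<sup>2</sup>. The function name is hypothetical, and the switch counts are made-up, exactly linear numbers chosen for illustration; they are not data from this study. Confidence intervals and the group-by-contrast interaction test are not reproduced.

```python
def fit_line(x, y):
    """Ordinary least squares for y = a + b * x; returns (b, a, r_squared)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    syy = sum((yi - my) ** 2 for yi in y)
    b = sxy / sxx
    a = my - b * mx
    # R^2 is undefined when y is constant.
    r2 = (sxy ** 2) / (sxx * syy) if syy > 0 else float("nan")
    return b, a, r2

contrasts = [0.03, 0.05, 0.1, 0.15, 0.35, 0.5]   # contrast levels from the text
switches = [40.0 - 20.0 * c for c in contrasts]  # made-up illustrative values
slope, intercept, r2 = fit_line(contrasts, switches)
```

A negative slope, as in this toy example, corresponds to more switches at lower contrast; a slope indistinguishable from zero corresponds to no contrast dependence.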

In summary, as expected, we found that contrast affected percept switches and percept durations by increasing the number of switches and decreasing the durations of the percepts with lower contrasts of gratings. In line with our expectation, this effect was mainly observed in NTE, and AMB showed no clear changes in percept duration with changes in contrast. Thus, based on our original hypothesis of decreased SNR with lower stimulus contrast, AMB seemed to show weak changes in plaid motion perception when contrast of the stimulus varied.

#### Correlation Between Bistability and VA or SA

We tested the correlation of the classic visual deficits, as measured with the visual acuity (VA) and stereo acuity (SA) tests, with the strength of bistability, as measured through the number of switches. **Table 6** shows that there were no significant correlations for any monocular condition in the amblyopic group in either Experiment 1 or Experiment 2.

#### Model Predictions of Bistable Motion Perception and Consequences for the Amblyopic Visual Motion System

We used the tristable model defined by Huguet et al. (2014) to identify the plausible internal mechanisms underlying the results of Experiment 2. Because these authors argued and presented evidence that moving plaid stimuli consist of not two but three different percepts, i.e., the transparent condition with two clearly perceived sliding gratings can have two states with different depth orderings, and that there are perceptual switches across the three states, we considered this model as more relevant to our experiments even though the experimental task was only a simple dual report of either transparent or coherent motion. Their model incorporates three populations of neurons that code three possible percepts: coherence (C), transparent with the leftward (counterclockwise) moving grating on top (T<sub>L</sub>), and transparent with the rightward (clockwise) moving grating on top (T<sub>R</sub>); in the use of the model here, we considered the transparent state (T) only when the C state was not active. A schematic of the model is presented in **Figure 6**, and it contains 6 parameters (β<sub>1</sub>, β<sub>2</sub>, γ, σ, I<sub>C</sub>, I<sub>T</sub> = I<sub>TL</sub> = I<sub>TR</sub>). The model is used in a range of parameters providing winner-takes-all behavior where only one of the three populations can be active at a given time, thus representing the active percept. Competitive inhibition between the three neuronal populations, together with spike-frequency adaptation and internal noise, provides the substrate for perceptual switches between the percepts.

and eye conditions. (D) Main effect of percept type. Note there was no difference between amblyopic and normal subjects. Error bars indicate between-subjects SEM.

As described in Huguet et al. (2014), the model parameters play essential roles in determining the mean number of percept switches and their duration. We parametrically varied the parameters in order to understand their effects on the two main measures. **Figure 7** presents representative simulation results for model parameters of β<sub>1</sub> = 0.9, β<sub>2</sub> = 0.7, σ = 0.06, γ = 0.2, I<sub>C</sub> = 1, and I<sub>T</sub> = I<sub>TL</sub> = I<sub>TR</sub> = 0.9 I<sub>C</sub>, when varying one of the last four parameters. An increase in internal noise σ strongly increases the number of percept switches and concurrently decreases the durations of both the C and T states (**Figure 7A**). An increase in the adaptation strength γ also increases the number of perceptual switches but differentially affects the C and T states (**Figure 7B**): the C state duration decreases more steeply with increasing adaptation than the T state duration, making C durations longer than T durations at low γ, with the reverse pattern at stronger γ. When the input strength is varied (with the relative T-to-C input held constant; **Figure 7C**), the number of percept switches rapidly decreases at low inputs, corresponding to rapid increases in the signal-to-noise ratio. However, the number of percept switches also exhibits a minimum, after which it begins to increase again. From multiple simulations, we found that this minimum was strongly dependent on the relative input strengths (I<sub>T</sub>/I<sub>C</sub>) as well as on the inhibitory strengths (β<sub>1</sub>, β<sub>2</sub>; results not shown). The durations of the two types of percepts, C and T, changed concurrently with the strong change in the number of switches. The percepts also showed a change in their relative durations, with T states longer than C states at low input strengths and a reversal at higher input values.
Finally, a change in the relative strength between C and T inputs demonstrated a typical bell-shaped curve for the number of switches (Brascamp et al., 2015), with the maximum value near input equality, together with their concurrent C and T state duration changes (**Figure 7D**). These last effects mimicked the expected effects of relative input strengths onto the two variables as observed in previous reports (Moreno-Bote et al., 2010; Brascamp et al., 2015).

Similar observations were obtained for other inhibitory strengths (β1, β2) but with the absolute values of noise, input, adaptation, and relative input strengths correspondingly changed.

The above simulations show two important effects. First, the number of perceptual switches and percept durations are very sensitive to the internal noise and adaptation strength (**Figures 7A,B**). This observation supports the original hypothesis that plaid gratings would show differences between the two groups of subjects that putatively have different noise levels in their motion visual system (Mansouri and Hess, 2006). In contrast to this prediction, Experiment 1 did not show any differences between AMB and NTE. Second, a striking effect was present in the simulation for the absolute input strengths I<sub>C</sub> and I<sub>T</sub> that represent the inputs of the C and T states. At very low input levels, the internal noise of the system is much stronger than the input strengths and thus makes the system oscillate much faster between the two states. This effect is in line with our hypothesis that lower grating contrasts would increase the number of switches and decrease percept durations, which led us to perform Experiment 2 with the idea that AMB should exhibit an increase in the number of switches and also show a decrease in the durations of the percepts. However, the results differed from our expectation, with NTE showing the predicted effect, but AMB showing no changes with lower grating contrasts.

FIGURE 5 | Linear regression across different contrast conditions in AMB and NTE. Top showing percept switches; bottom showing percept duration. Left column graphics for AMB; right column for NTE. Solid line was the best-fit line, while the dashed line indicates the 95% confidence band of the best-fit line. Error bar indicates the SEM.

TABLE 6 | Correlation between bistability and SA or VA.

# DISCUSSION

We investigated putative differences in the visual motion system between anisometropic amblyopes and neurotypical observers through the use of bistable plaid motion perception. First, our group of amblyopes globally exhibited normal bistable perception in any viewing condition (binocular, monocular with amblyopic or fellow eye) when compared to the control group. Second, we hypothesized that lower contrast of the plaid stimulus should emphasize the internal noise differences between the two groups and thus lead to a stronger increase in percept switches and decrease in percept durations. The results confirmed this hypothesis only in the control group, while the amblyopic group exhibited no changes. These latter results are at odds with the idea of stronger noise in the amblyopic motion system, and plausible explanations of these discrepancies are discussed below.

Bistable perception of plaid square gratings was found to be normal in anisometropic amblyopes when compared to that in the neurotypical controls. These results are in agreement with previous reports of normal perception of bistable sine-grating plaids in this group of subjects (Thompson et al., 2008, 2012; Hamm et al., 2014), even when first-order contrast deficits are taken into account (Tang et al., 2012). Our study confirms these earlier reports through an analysis of perceptual bistability applied to square gratings.

Spike-frequency adaptation is present in each population. The function S() represents the sigmoidal transducer.

While bistability of the percepts was similarly seen and stochastic across eye-viewing conditions and groups of subjects, our methods and results unveiled a new and unexpected effect of contrast on plaid motion perception in amblyopes. Based on reports of possibly stronger internal noise in the amblyopic visual motion system (Simmers et al., 2003; Mansouri and Hess, 2006; Hamm et al., 2014) and theoretical insights into perceptual bistability and neural noise (Brascamp et al., 2006; Moreno-Bote et al., 2007; Shpiro et al., 2009; Huguet et al., 2014), lower contrasts of the stimulus were argued to decrease the duration of each percept in amblyopes when compared to that in the control group. This effect was found, but it was reversed between groups, with the control group showing decreased percept stability (decrease in percept durations), while the amblyopes did not exhibit such an effect.

This result is interesting in at least two aspects. First, contrast sensitivity, the reciprocal of the contrast threshold that is used to describe subjects' ability to visually detect a target, is known to be strongly affected in amblyopic eyes (Woodruff, 1991). Earlier research has shown that contrast sensitivity is strongly decreased in the amblyopic eye, especially at high spatial frequencies, but the sensitivity of the fellow eye is also affected when compared with the eyes of normal subjects (Bradley and Freeman, 1981). Interestingly, amblyopes do not exhibit clear deficits in contrast perception at suprathreshold stimulus contrasts, indicating that there is no clear contrast coding abnormality for the suprathreshold contrast range in amblyopes (Hess and Bradley, 1980; Loshin and Levi, 1983). In contrast, suprathreshold static grating perception is affected, but in a very different manner. Amblyopes staring at images of classic square gratings perceive distortions of the stimulus that can be of a static or dynamic nature (Hess et al., 1978; Sireteanu et al., 2008; Thiel and Iftime, 2016). Thus, the two facts that (1) our group of amblyopes perceived the 120-s moving plaids normally, with classic perceptual bistability and no reports of differences in perception between the weak and fellow eyes, and (2) amblyopes did not show an effect of contrast on the global bistability of the percept hint at a motion coding system in their visual pathway that uses dynamic visual input in a different way from that of neurotypical subjects. The results of neurotypical subjects experimentally confirmed the inverse "Levelt IV rule" at low contrasts (Brascamp et al., 2015), but the overall pattern of results led us to consider in further detail the models of plaid motion perception and a plausible explanation of the effects observed in amblyopes.

In analyzing and applying a model (Huguet et al., 2014), we found that input intensity indeed affected percept switches and durations as hypothesized. These effects also suggested that, for amblyopes, the contrast of the stimulus is decoupled from, or only very weakly related to, the "input" variable of the model. This suggests that the amblyopic visual system may use a different motion coding system from the neurotypical one, with the perceptual switches observed in the former arising from different mechanisms.

From a neurophysiological perspective, motion coding and decoding of plaid stimuli might not be performed at a single stage, but instead, multiple areas may be involved (Thompson et al., 2012; Villeneuve et al., 2012). Thus, the segregation of motion (transparency) or the assimilation of motion (coherency) may be coded in a distributed manner across the early cortices. The differences between our amblyopic and control groups in contrast effects might stem from the fact that, in the amblyopic system, motion coherency and transparency coding could be more widely distributed than in neurotypical subjects, as suggested by a recent study (Thompson et al., 2012). From a different and more detailed perspective, the major motion area MT is known to contain cells that can selectively respond to the pattern or components of moving plaid gratings (Rust et al., 2006) and, furthermore, has some depth coding structure (Born and Bradley, 2005) that should help to create depth ordering of different motion surfaces. Although MT cells in the macaque monkey seem to have dominance over fellow eye inputs, the distribution of cells sensitive to pattern and the components of plaid gratings were found equal (El-Shamayleh et al., 2010), thus showing global similar plaid motion coding. Therefore, we might assume that the equivalent percepts of coherence and transparency are decoded through a simple rule: to decode only one neuronal population—component or pattern cells. Because MT cells receive major input from V1 cells, the contrast dependence of all MT cells should be similar. The observation in control subjects of stronger perceptual changes at lower contrast supports the idea that pattern and component cells should be similarly activated by contrast strength. On the other hand, the lack of contrast effects in amblyopes seems to indicate that pattern and component cells have different input relations to the contrast of the stimulus. 
This difference raises an interesting possibility, but its exact nature is beyond the scope of the current study.

Importantly, the model used here is qualitative in nature: it helps to grasp essential structural differences and changes in the multistable perception of plaid motion stimuli but does not provide a realistic implementation of motion coding. Recent studies closely related to our work reported that tristable motion perception could be explained by a more detailed motion-tuned neuronal population model (Meso et al., 2016; Medathati et al., 2017) that more closely resembles MT physiology. Further investigations and theoretical modeling that also incorporate depth coding should help to unravel the plausible changes in the amblyopic motion system.

A systematic and interesting difference we found was the global direction effect. Both amblyopes and normal subjects had more percept switches when global motion direction was upward, i.e., vertical, than when it was horizontal. We did not find systematic differences in this effect between the two groups across the first two experiments. Differences between cardinal axes have already been reported in previous studies of visual motion perception in ambiguous conditions (Castet et al., 1999; Hupé and Rubin, 2004). The asymmetry in bistability between vertical and horizontal global motions may lie in the eye movement differences between these two cardinal directions. The global effect present across all observers might stem from clear differences between the dynamics of horizontal and vertical eye movements (fixational, reflexive, or voluntary pursuit eye movements) (Baloh et al., 1988; Sparks, 2002). This explanation partly supports a separate control of vertical and horizontal pursuit, which may contribute to the direction difference that is systematically reported. Furthermore, eye movements may influence the percept through retinal motion. Van Dam and van Ee demonstrated that the retinal image shift caused by a saccade can change the bistable percept (van Dam and van Ee, 2005, 2006). To clarify the exact mechanism of such a direction effect, and to determine whether amblyopes with clear changes or deficits in eye movements exhibit an effect on the perception of plaid motion, further studies are needed with proper measures and controls for eye movements in neurotypical and amblyopic groups.

In summary, by using bistable plaid motion as a probe of the visual motion system, we found a systematic and clear effect of stimulus contrast on perceptual bistability in neurotypical subjects that was not present in anisometropic amblyopes. The former effect is explained by classic models of multistability and thus hints toward a generally different motion coding and decoding system in the amblyopes.

#### REFERENCES


#### ETHICS STATEMENT

This study was carried out in accordance with the recommendations of the Ethical review of biomedical research involving human beings, Committee on Biomedical Ethics of the University of Science and Technology of China. All subjects gave written informed consent in accordance with the Declaration of Helsinki. The protocol was approved by the Committee on Biomedical Ethics of the University of Science and Technology of China.

#### AUTHOR CONTRIBUTIONS

JL carried out the experiments and wrote the manuscript with the support of TT and YZ. Both JL and TT contributed to the design and implementation of the research and to the analysis of the results. YZ supervised the project. All the authors reviewed the manuscript and conceptualized this study.

#### ACKNOWLEDGMENTS

This study was supported by the National Natural Science Foundation of China (NSFC 31230032 and 91749102 to YZ) and the Fundamental Research Funds for the Central Universities (TT).


**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2018 Liu, Zhou and Tzvetanov. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# A Retinotopic Spiking Neural Network System for Accurate Recognition of Moving Objects Using NeuCube and Dynamic Vision Sensors

#### Lukas Paulun<sup>1,2</sup>, Anne Wendt<sup>1</sup>\* and Nikola Kasabov<sup>1</sup>

<sup>1</sup> Knowledge Engineering and Discovery Research Institute, Auckland University of Technology, Auckland, New Zealand, <sup>2</sup> Mathematical Institute, Albert Ludwigs University of Freiburg, Freiburg im Breisgau, Germany

This paper introduces a new system for dynamic visual recognition that combines bio-inspired hardware with a brain-like spiking neural network. The system is designed to take data from a dynamic vision sensor (DVS) that simulates the functioning of the human retina by producing an address event output (spike trains) based on the movement of objects. The system then convolves the spike trains and feeds them into a brain-like spiking neural network, called NeuCube, which is organized in a three-dimensional manner, representing the organization of the primary visual cortex. Spatio-temporal patterns of the data are learned during a deep unsupervised learning stage, using spike-timing-dependent plasticity. In a second stage, supervised learning is performed to train the network for classification tasks. The convolution algorithm and the mapping into the network mimic the function of retinal ganglion cells and the retinotopic organization of the visual cortex. The NeuCube architecture can be used to visualize the deep connectivity inside the network before, during, and after training and thereby allows for a better understanding of the learning processes. The method was tested on the benchmark MNIST-DVS dataset and achieved a classification accuracy of 92.90%. The paper discusses advantages and limitations of the new method and concludes that it is worth exploring further on different datasets, aiming for advances in dynamic computer vision and multimodal systems that integrate visual, aural, tactile, and other kinds of information in a biologically plausible way.

Keywords: Spiking neural networks (SNN), NeuCube, dynamic vision sensor (DVS), MNIST-DVS, retinotopy, deep learning in SNN

#### Edited by:

Xavier Otazu, Universidad Autónoma de Barcelona, Spain

#### Reviewed by:

Timothée Masquelier, Centre National de la Recherche Scientifique (CNRS), France

Pablo Martinez-Cañada, Universidad de Granada, Spain

#### \*Correspondence:

Anne Wendt anne.wendt@aut.ac.nz

Received: 11 September 2017 Accepted: 24 May 2018 Published: 12 June 2018

#### Citation:

Paulun L, Wendt A and Kasabov N (2018) A Retinotopic Spiking Neural Network System for Accurate Recognition of Moving Objects Using NeuCube and Dynamic Vision Sensors. Front. Comput. Neurosci. 12:42. doi: 10.3389/fncom.2018.00042

# INTRODUCTION

During the past years, the quest for accurate image recognition systems has been one of the driving forces behind major advances in the field of artificial neural networks such as the development of convolutional neural networks (Lecun et al., 1998). Today, algorithms for image recognition are well advanced and can be found in many applications such as search engines, security systems, industrial robots, medical devices, and virtual reality. Besides the many areas of application, another reason for the fast progress in image recognition might be the vast knowledge about the human visual system. The eye is arguably the best studied human sensory organ and the visual cortex has been the main object of interest in a large number of neuroscientific studies. Findings from vision science have inspired the development of new hardware as well as novel algorithms and computational tools. High-definition and high-speed cameras have long surpassed the capacities of the human eye in terms of spatial and temporal resolution. On the software side though, it still proves to be a difficult task to extend the scope of present achievements in static image recognition to dynamic visual recognition of moving objects or a moving scene.

The benefit of accurate and fast dynamic visual recognition is apparent: each of the above-mentioned applications of image recognition constitutes a potential application area for dynamic visual recognition systems. Any kind of robot that must navigate within a three-dimensional environment or perform tasks on moving objects would benefit from an accurate and fast dynamic visual system. The popular topic of self-driving cars is only one example. Other potential implementations include security systems, automated traffic prediction and tolls, monitoring of manufacturing processes, navigational tools in air and ship traffic, or diagnostic assistants for inspections or surgery. Since the human visual system's adaptability and efficiency are still highly superior to computer systems when it comes to tasks of dynamic vision, it is natural to let biology serve as an inspiration for the development of new computational models.

Previous works have used a combination of bio-inspired visual sensors and spiking neural networks for the recognition of human postures (Perez-Carrasco et al., 2010), the extraction of car trajectories on a freeway (Bichler et al., 2012), or the control of robotic movements (Jimenez-Fernandez et al., 2009; Perez-Peña et al., 2013). We consider these very promising approaches, though the mentioned works lack benchmarking results that make them comparable.

This paper introduces a new system for dynamic visual recognition that combines a silicon retina device with a brain-like spiking neural network (SNN). As we introduce the different parts of our proposed system, we include findings from vision science that inspired us or that might provide promising approaches for future improvements. We present the setup and the results of a benchmarking experiment carried out on the MNIST-DVS dataset and show that our system achieves a classification accuracy of 92.90% on this dataset. The SNN architecture NeuCube is very flexible in terms of its connectivity and learning algorithms and allows for the visualization of the learning processes inside the SNN. After discussing the advantages and limitations of the system, we conclude by suggesting further exploration of the system's performance with modified algorithms and different datasets.

# THE PROPOSED SYSTEM ARCHITECTURE

#### The Dynamic Vision Sensor

The Dynamic Vision Sensor (DVS) was developed at the Institute for Neuroinformatics in Zürich as a fast and storage efficient silicon retina system (Delbruck, 2008). Unlike conventional frame-based video cameras that capture multiple frames per second and store a large number of pixels for each of these frames, the DVS only captures changes in the brightness of single pixels caused by movement of the scene or an object (Lichtsteiner et al., 2008). This is called an Address Event Representation (AER) since the output of the sensor consists of a time series of events together with their location (address), representing the temporal contrast of a specific pixel at a specific time. By responding to temporal contrast on the pixel-level rather than taking a continuous series of snapshots of the whole scene, the DVS mimics the functioning of the human retina much better than conventional video cameras (Purves, 2012).

Together with its focus on movements within a scene there is another reason to choose the DVS over a conventional video camera for a dynamic vision system based on a spiking neural network: the address event output of the DVS comes in the form of a series of spike trains, each spike train corresponding to one pixel of the sensor. Every single spike in the train of one specific pixel represents a change in brightness in that pixel at a specific time. However, there are two difficulties with taking the raw DVS output as spike trains and directly feeding them into a spiking neural network: firstly, the sensor can achieve a very high temporal resolution of 1 µs and a spike train for a single pixel will initially consist of many time steps, e.g., 2,000,000 time steps for a 2 s video, and a relatively small number of spikes. Feeding such a spike train into a spiking neural network would result in very low overall spiking activity and probably unsatisfying performance. Secondly, although the sensor's spatial resolution of 128 × 128 = 16,384 pixels is low compared to conventional video cameras, it is desirable to reduce computational cost by integrating the signals of multiple pixels into single input neurons for the SNN rather than creating 16,384 input neurons.

For this purpose, we propose an algorithm for the compression of time and the convolution and pooling of the DVS pixels into a total of 128 spike trains consisting of roughly 100 time steps for each second of video data that can then be fed into 128 input neurons of an SNN.

# Proposed Encoding Algorithm of DVS Data as Input Data for the SNN System

The algorithm we propose is inspired by the structure and organization of retinal ganglion cells. These cells receive information from photoreceptors on the retina and transmit them to the brain (Purves, 2012). There are different types of retinal ganglion cells, but we focus on two global properties shared by the majority of all ganglion cells: first, the distribution of retinal ganglion cells across the retina, which is used to determine which photoreceptors converge into one retinal ganglion cell and, thus, how many DVS pixels converge into one input neuron for our SNN. Second, the mechanism by which retinal ganglion cells fire and, thus, the algorithm that generates the input spike trains for the SNN.

#### Pooling of DVS Output Into 128 Input Neurons of the SNN System

Despite large differences across individuals, there are roughly 100 million photoreceptor cells on the retina and around 1 million retinal ganglion cells providing information transmission to the brain (Curcio et al., 1990). Thus, on average, one ganglion cell integrates information from roughly 100 photoreceptor cells. However, the number of photoreceptors converging into one ganglion cell depends highly on the retinal location of the photoreceptors. Ganglion cells connecting to the fovea centralis, the small central spot of the retina specialized in sharp and detailed vision, receive information from only a single photoreceptor cell, implying that information from these photoreceptors is transmitted directly to the brain without any pooling (Purves, 2012). The receptive fields of ganglion cells increase with distance from the fovea and ganglion cells connecting to peripheral parts of the retina integrate the signals of many photoreceptors at once (Croner and Kaplan, 1995).

The way our encoding algorithm pools information from multiple DVS pixels into single spike trains adapts this property of detailed information transmission from central parts of the retina and averaging over larger numbers of photoreceptors in the periphery. Overall, the algorithm generates 128 spike trains that will serve as input for the SNN. Each spike train represents one retinal ganglion cell with its own receptive field on the 128 × 128-pixel output of the DVS (**Figure 1**).

In our algorithm, the central 8 × 8 pixels of the DVS output represent the fovea (**Figure 1A**), and for each of these central 64 pixels, there is a single ganglion cell only considering the output of that single pixel. Furthermore, there are four groups of 16 ganglion cells each, with receptive fields that increase from the center to the periphery. The first group consists of the central 16 × 16 pixels, divided into 16 squares that integrate an area of four by four pixels each (**Figure 1B**). The next group consists of the central 32 × 32 pixels, again divided into 16 squares, this time with an area of 8 × 8 pixels each (**Figure 1C**). The same happens for the central 64 × 64 pixels (**Figure 1D**) and the total of 128 × 128 pixels (**Figure 1E**), resulting in 16 squares per group, of size 16 × 16 and 32 × 32, respectively. In this pooling mechanism, an average of 170.5 pixels converge into one ganglion cell. The size of the receptive fields can easily be adapted to higher or non-square video resolutions.
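The pooling layout described above can be written down in a few lines. The following is a minimal sketch, not the authors' implementation: the function name `build_receptive_fields` and the `(row0, col0, side)` tuple format are assumptions made for illustration. It enumerates the 64 foveal single-pixel cells and the four nested groups of 16 square fields, and reproduces the stated mean of 170.5 pixels per ganglion cell.

```python
SENSOR = 128  # DVS resolution in pixels per side, as stated in the text

def build_receptive_fields():
    """Return one (row0, col0, side) square per ganglion cell (hypothetical format)."""
    fields = []
    # Fovea: the central 8 x 8 pixels, one ganglion cell per pixel.
    start = SENSOR // 2 - 4
    for r in range(start, start + 8):
        for c in range(start, start + 8):
            fields.append((r, c, 1))
    # Four groups of 16 cells: the central 16, 32, 64 and 128 pixels,
    # each region divided into a 4 x 4 grid of equal squares.
    for region in (16, 32, 64, 128):
        side = region // 4
        origin = (SENSOR - region) // 2
        for i in range(4):
            for j in range(4):
                fields.append((origin + i * side, origin + j * side, side))
    return fields

fields = build_receptive_fields()
mean_pixels = sum(side * side for _, _, side in fields) / len(fields)
print(len(fields), mean_pixels)  # 128 170.5
```

Because the groups are nested, a central pixel belongs to several receptive fields at once, which is why the mean is larger than 16,384 / 128.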

Having set the distribution of the ganglion cells across the DVS output, the next step is to determine how the information of the DVS pixels is encoded into spike trains for the ganglion cells.

#### Firing Mechanism

The Dynamic Vision Sensor provides a very high temporal resolution of up to 1 µs. Preserving this detailed temporal information is desirable from a computational point of view, but as described below we reduce this resolution to 10 ms to maintain biological plausibility. While some spike encoding algorithms like Poisson models focus merely on the spike count within a given time interval and disregard the exact spike timing, it has been shown that the spike timing of mammalian retinal ganglion cells conveys several times more information than the spike count (Berry et al., 1997; van Rullen and Thorpe, 2001; Uzzell and Chichilnisky, 2004). Furthermore, retinal ganglion cells fire very briefly as a response to specific stimuli rather than emitting a high frequency of background firing. Spikes emitted by retinal ganglion cells of rabbits and salamanders, presented with random flicker, covered less than 5% of the total stimulus time (Berry et al., 1997). The maximum firing rate of retinal ganglion cells varies between different animal species and depends on the type of visual stimuli. Transient peak rates of up to 250 Hz have been observed in retinal ganglion cells of mice (Krieger et al., 2017), but for the sustained firing of human retinal ganglion cells, an upper bound of 100 Hz can be reasonably assumed (Nelson, 1995).

As described in section The Dynamic Vision Sensor, the DVS output consists of a series of events, including their timing in microseconds and their location in pixel coordinates. In fact, each event also includes a polarity of +1 or −1, depending on whether the event indicates a pixel becoming brighter or darker. Our encoding algorithm ignores the event polarity, but it might be worthwhile for future experiments to consider a translation of positive and negative events into positive and negative spikes.

Our spike encoding algorithm is illustrated in **Figure 2**. In the first step, the algorithm takes the time series of the DVS and groups it into windows of 10,000 µs or 10 ms. The new time series consists of 10 ms steps, and for every ganglion cell, it must be decided at which of these steps the cell will fire. Since each time step represents 10 ms of video data, the maximum firing rate of the ganglion cells cannot exceed 100 Hz. The encoding for the central 64 pixels that represent the fovea is straightforward: if there is at least one event for a pixel at time step t<sub>i</sub>, the ganglion cell that corresponds to that pixel will fire at t<sub>i</sub>. There are no parameters to tune for these central 64 pixels and the spike trains of the ganglion cells that correspond to these pixels are completely determined by the DVS output. For the 64 ganglion cells that integrate the events of multiple DVS pixels, the situation is slightly different. For each of these cells, the algorithm counts how many events occurred in each time window within the receptive field of that ganglion cell. If the number of events from pixels within the receptive field of cell C<sub>j</sub> at time step t<sub>i</sub> exceeds a certain threshold, C<sub>j</sub> will fire at t<sub>i</sub>.
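The encoding step can be sketched as follows. This is an illustrative sketch under stated assumptions, not the authors' code: the function name `encode_spikes`, the `(timestamp_us, row, col)` event format and the `(row0, col0, side)` field format are hypothetical, and thresholds are given as absolute event counts. A threshold of 0 reproduces the foveal rule (fire if at least one event occurred), and polarity is ignored as in the text.

```python
def encode_spikes(events, fields, thresholds, window_us=10_000):
    """Bin DVS events into 10 ms steps and threshold them per receptive field.

    events:     iterable of (timestamp_us, row, col) tuples
    fields:     list of (row0, col0, side) square receptive fields
    thresholds: one event-count threshold per field; a cell fires when the
                count in a window strictly exceeds its threshold
    Returns one binary spike train (list of 0/1) per field.
    """
    events = list(events)
    if not events:
        return [[] for _ in fields]
    n_steps = max(t for t, _, _ in events) // window_us + 1
    counts = [[0] * n_steps for _ in fields]
    for t, r, c in events:
        step = t // window_us
        for k, (r0, c0, side) in enumerate(fields):
            if r0 <= r < r0 + side and c0 <= c < c0 + side:
                counts[k][step] += 1
    return [[1 if n > thr else 0 for n in row]
            for row, thr in zip(counts, thresholds)]

# Toy example: one single-pixel cell and one 4 x 4 cell.
toy_fields = [(0, 0, 1), (0, 0, 4)]
toy_thresholds = [0, 2]
toy_events = [(500, 0, 0), (900, 1, 1), (9_000, 3, 3), (25_000, 0, 0)]
toy_trains = encode_spikes(toy_events, toy_fields, toy_thresholds)
print(toy_trains)  # [[1, 0, 1], [1, 0, 0]]
```

In the toy example the pooled cell fires only in the first window, where three events exceed its threshold of two, while the single-pixel cell fires in every window that contains an event at its pixel.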

Theoretically, this threshold can be set for each ganglion cell individually, but since the 16 cells of each group have receptive fields of the same size, our algorithm assigns the same threshold to all 16 cells of a group, resulting in a total of 4 thresholds that can be tuned. Clearly, the value of the thresholds will determine the average spike rate of the final spike trains, with higher thresholds leading to fewer spikes, and it is possible to imitate biological evidence about spike rates under certain stimuli. We discuss the tuning of the thresholds in more detail in section Model Design and Implementation.

Inspired by the structure and organization of retinal ganglion cells, our algorithm pools 128 × 128 DVS pixels into 128 ganglion cells that will serve as input neurons for the SNN. The algorithm compresses the microsecond resolution of the DVS output into time steps of 10 ms, but it preserves the timing of the DVS events instead of generating a Poisson process with random spike timing. The next section describes the structure of a brain-like SNN architecture called NeuCube, and our imitation of the retinotopic mapping of retinal ganglion cells into the visual cortex.

**Figure 1** (caption fragment) [...] receptive fields are marked with orange frames.

**Figure 2** (caption fragment) [...] receptive fields of all 128 ganglion cells are counted. If the number of DVS events within the receptive field of one ganglion cell exceeds a certain threshold, the cell fires at that time step.

# The Brain-Like SNN Neucube and the Proposed Retinotopic Mapping

The NeuCube SNN architecture incorporates several different principles of SNN and combines them into a single model for mapping, learning, and understanding of spatio-temporal data (Kasabov, 2014). Signals are processed along successive stages as shown in **Figure 3**. Before going into detail about the learning algorithms used by NeuCube, we want to focus on the three-dimensional structure of NeuCube and the bio-inspired way we mapped the 128 input neurons into this structure. Our system uses a NeuCube initialized with 732 neurons, using the MNI coordinates of neurons from the primary visual cortex (V1, Brodmann area 17), taken from the Atlas of the Human Brain (downloaded together with the xjView toolbox: http://www.alivelearn.net/xjview). The number of neurons is only bounded by computational limitations; it is possible to add further neurons from the secondary or tertiary visual cortex or to represent the whole brain. Initial connections between the neurons are based on the "small-world" paradigm, where random connections are formed within a pre-defined maximum distance of each neuron, 80% of the time as excitatory and 20% of the time as inhibitory connections. The mapping of the 128 input neurons into the 732 neurons of NeuCube mimics two important characteristics of the human visual cortex: cortical magnification and retinotopic mapping (**Figure 4**).
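The distance-limited random initialization can be illustrated with a short sketch. It is a simplified sketch, not NeuCube's actual routine: the function name `init_connections`, the connection probability `p_connect`, the weight scale and the toy grid of coordinates are all assumptions. Only the maximum-distance constraint and the 80/20 split between excitatory and inhibitory connections come from the text.

```python
import math
import random

def init_connections(coords, max_dist, p_connect=0.5, p_excitatory=0.8, seed=0):
    """Randomly connect neurons that lie within max_dist of each other.

    coords: list of 3D neuron coordinates
    Returns a dict {(pre, post): weight}; positive weights are excitatory,
    negative weights inhibitory. p_connect and the 0.1 weight scale are
    illustrative assumptions.
    """
    rng = random.Random(seed)
    weights = {}
    for i, a in enumerate(coords):
        for j, b in enumerate(coords):
            if i == j:
                continue  # no self-connections
            if math.dist(a, b) <= max_dist and rng.random() < p_connect:
                sign = 1.0 if rng.random() < p_excitatory else -1.0
                weights[(i, j)] = sign * 0.1 * rng.random()
    return weights

# Toy reservoir: a 3 x 3 x 3 grid instead of the 732 V1 coordinates.
grid = [(x, y, z) for x in range(3) for y in range(3) for z in range(3)]
conn = init_connections(grid, max_dist=1.5)
excitatory = sum(1 for w in conn.values() if w > 0)
print(len(conn), excitatory)
```

With real atlas coordinates, `max_dist` would be expressed in the same units as the coordinates.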

Cortical magnification describes the overrepresentation of foveal signals inside the primary visual cortex. Although the fovea has a diameter of only 1.2 mm (Purves, 2012), its signals are processed by almost 50% of all neurons in V1 (Krantz, 2012; Born et al., 2015). Therefore, we chose exactly 64 of our 128 input neurons to correspond to the central 64 DVS pixels with a one-to-one relationship. This way, 50% of input neurons automatically correspond to the central pixels of the DVS, just like 50% of the primary visual cortex correspond to the central photoreceptors on the retina.

The second characteristic of the primary visual cortex that we adopted in our mapping is the preservation of spatial relationships between photoreceptors on the retina and their neural representation in the primary visual cortex, the so-called retinotopy (Rosa, 2002). Signals from the top left of our visual field are mapped to the bottom right of V1 and vice versa. What humans see is flipped upside down and mirrored, but objects that appear next to each other in the visual field will still be represented next to each other in V1. Both the foveal as well as the peripheral ganglion cells follow this principle, although foveal signals are mapped into the posterior part and peripheral signals into the anterior part of V1 (Purves, 2012). **Figure 5** shows how the principle of retinotopy is applied to the mapping of the 128 input neurons to the 732 neurons of NeuCube.

# Unsupervised and Supervised Learning of Dynamic Visual Patterns in the Neucube Architecture

Learning in the NeuCube is performed in two stages: in the first step, unsupervised learning is performed to modify the initial connection weights. In our system we use pair-based multiplicative spike-timing-dependent plasticity (STDP, van Rossum et al., 2000), but in principle, the NeuCube architecture allows for a flexible implementation of different learning algorithms. The SNN will learn to activate the same groups of spiking neurons when similar input stimuli are presented and to change existing connections that preserve the spatio-temporal patterns of the input data (Kasabov and Capecci, 2015). Previous works have shown that STDP is well suited to train neurons to respond to discriminative visual features (Masquelier and Thorpe, 2007). The neurons become selective to successive coincidences of particular patterns and learn to detect them robustly even in the presence of noise (Masquelier et al., 2009). Our approach using the NeuCube differs from these works mainly in the structure of the network, which is not based on layers, but rather a three-dimensional network shaped like the primary visual cortex. However, our results are similar to those works in that certain neurons and connections can be identified that seem to play a major role in discriminating between the different classes. NeuCube allows for a visualization of the learning process and we discuss how the visualization can be used for a better understanding of the data and the neural processes after presenting our experimental results.
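To make the idea of a pair-based, weight-dependent STDP update concrete, here is a minimal sketch of one common soft-bound form. It is an assumption-laden illustration, not the rule or the constants used in NeuCube: the function name `stdp_update` and the values of `a_plus`, `a_minus`, `tau_ms` and `w_max` are hypothetical. The update depends on the time difference of a single pre/post spike pair and is scaled by the current weight.

```python
import math

def stdp_update(w, dt_ms, a_plus=0.01, a_minus=0.012, tau_ms=20.0, w_max=1.0):
    """Update weight w for one spike pair, with dt_ms = t_post - t_pre.

    Potentiation is scaled by the remaining headroom (w_max - w) and
    depression by the current weight w, so the weight stays in [0, w_max].
    All parameter values are illustrative assumptions.
    """
    if dt_ms > 0:   # pre fires before post: strengthen the connection
        return w + a_plus * (w_max - w) * math.exp(-dt_ms / tau_ms)
    if dt_ms < 0:   # post fires before pre: weaken the connection
        return w - a_minus * w * math.exp(dt_ms / tau_ms)
    return w        # simultaneous spikes: leave the weight unchanged

w_pot = stdp_update(0.5, dt_ms=10.0)
w_dep = stdp_update(0.5, dt_ms=-10.0)
print(w_pot > 0.5, w_dep < 0.5)  # True True
```

Spike pairs that are close in time change the weight more than distant ones, because the exponential factor decays with the absolute time difference.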

In the second step, supervised learning is applied to the spiking neurons in the output classification module, where the same spike trains used for the unsupervised training are now propagated again through the trained SNN and output neurons are generated and trained to classify the spiking activity of the SNN into pre-defined classes (Kasabov and Capecci, 2015). Again, the NeuCube architecture allows for the application of different algorithms for the evolving classifier. The output function we used is called the dynamic evolving SNN algorithm (deSNN, Kasabov et al., 2013), which makes use of rank-order learning (Thorpe and Gautrais, 1999). This kind of evolving classifier is computationally inexpensive and puts emphasis on the order in which input spikes arrive, making it suitable for online learning and early prediction of temporal events (Kasabov, 2014). Similar to previous works on image recognition based on reward-modulated STDP (Mozafari et al., 2017), the deSNN algorithm uses a "highest" layer of neurons to discriminate between classes. While Mozafari et al. (2017) used an existing layer of output neurons, the deSNN algorithm creates and trains one new output neuron per sample by connecting it to all 732 neurons in the network and propagating the signal through the network once more. The connection weights that are learned in this process are then classified using a K-nearest neighbor (KNN) algorithm and the labels that are known for all the samples. Here our method differs from that of Mozafari et al. (2017) in that we do not apply "anti-STDP" for misclassified samples before applying KNN. This means that the results of the deSNN's decisions are not fed back into the network since we create a new output neuron for each sample.
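The output stage can be illustrated with a simplified sketch in the spirit of rank-order learning followed by KNN. This is not the deSNN implementation used in NeuCube: the function names, the modulation factor `mod`, the `drift` value and the toy spike trains are all assumptions. Each sample yields one weight vector (one entry per reservoir neuron), initialized from the rank of each neuron's first spike and then adjusted by later spikes; a new sample is labelled by its nearest stored weight vectors.

```python
from collections import Counter
import math

def output_weights(spike_trains, mod=0.8, drift=0.005):
    """Weight vector of a new output neuron for one sample (illustrative).

    spike_trains: one binary list per reservoir neuron, all of equal length.
    A neuron's weight starts at mod ** rank, where rank is the order of its
    first spike, and then drifts up for each later spike and down for each
    later silent step. Neurons that never fire keep weight 0.
    """
    first = [t.index(1) if 1 in t else None for t in spike_trains]
    # Rank neurons by first-spike time; ties are broken by neuron index.
    order = sorted((f, i) for i, f in enumerate(first) if f is not None)
    weights = [0.0] * len(spike_trains)
    for rank, (f, i) in enumerate(order):
        weights[i] = mod ** rank
        for s in spike_trains[i][f + 1:]:
            weights[i] += drift if s else -drift
    return weights

def knn_predict(train_w, train_y, w, k=1):
    """Label w by majority vote among its k nearest stored weight vectors."""
    nearest = sorted(range(len(train_w)),
                     key=lambda n: math.dist(train_w[n], w))[:k]
    return Counter(train_y[n] for n in nearest).most_common(1)[0][0]

# Two toy training samples with three reservoir neurons each.
sample_a = [[1, 0, 0], [0, 1, 0], [0, 0, 0]]
sample_b = [[0, 0, 0], [0, 1, 0], [1, 0, 0]]
query = [[1, 1, 0], [0, 0, 1], [0, 0, 0]]
stored = [output_weights(sample_a), output_weights(sample_b)]
label = knn_predict(stored, ["A", "B"], output_weights(query))
print(label)  # A
```

The query is assigned to class A because its first spike comes from the same neuron as in `sample_a`, which dominates the rank-order weights.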

For a more detailed description of the NeuCube architecture see Kasabov (2014).

#### Summary of the Proposed Methodology

The methodology we propose for dynamic visual recognition consists of the following steps:


We present the application of this method on a benchmarking experiment with the MNIST-DVS dataset for spike-based dynamic visual recognition and go into further detail about the tuning of parameters and analysis of the SNN.

# BENCHMARKING ON THE MNIST-DVS DATASET

# Description of the MNIST-DVS Dataset

The MNIST dataset of handwritten digits (Lecun et al., 1998) has been one of the most popular benchmarking datasets for image recognition for over 20 years. With the advent of spiking neural networks, MNIST has naturally been used as a benchmark for spike-based visual recognition systems (Brader et al., 2007; Querlioz et al., 2013; Diehl and Cook, 2015; Zhao et al., 2015; Kheradpisheh et al., 2017). However, these works only account for the recognition of the static MNIST pictures and do not aim toward dynamic visual recognition of moving objects. An important part of the functioning of spiking neural networks is the dimension of time within the spike trains and on datasets that also have such a temporal dimension, spiking neural networks might be superior to classical artificial neural networks.

The NE15-MNIST database (Neuromorphic Engineering 2015 on MNIST, Serrano-Gotarredona and Linares-Barranco, 2015; Liu et al., 2016) that we used for our study is based on the original MNIST dataset. NE15-MNIST consists of four subsets that all aim to provide a benchmark for spike-based visual recognition. While the Poissonian and the FoCal subsets are synthetically generated from static MNIST images, the other two subsets are based on 128 × 128 pixel DVS recordings of the MNIST images. The MNIST-FLASH-DVS subset contains DVS recordings of MNIST digits that are flashed on a screen. Because we were interested in dynamic visual recognition of moving objects, we decided to work on the MNIST-DVS subset that consists of DVS recordings of MNIST digits that move back and forth across a screen and thereby produce temporal contrast and DVS events on the digits' edges.

The MNIST-DVS dataset is available online (Yousefzadeh et al., 2015). It consists of 30,000 recordings of 10,000 original MNIST digits recorded at three different scales each (scale-4, scale-8, and scale-16). Each recording has a time length of about 2.5 s, during which the digit moves twice from a position at the bottom left of the middle of the screen to the top right and back. The files are provided in the jAER format (Delbruck, 2008) and the dataset includes Matlab scripts for a conversion to Matlab arrays and three kinds of data preprocessing: removal of a 75 Hz timestamp harmonic produced by the LCD screen, stabilization of the digits on the center of the screen and removal of the event polarity information.

Previous classification results on the MNIST-DVS dataset are shown in **Table 1**. Henderson et al. (2015) derive a new event-based learning scheme and apply it to a layered feedforward spiking neural network, which is trained self-supervised for classification of the MNIST-DVS digits. Zhao et al. (2015) use a composite system, consisting of a convolutional spiking neural network for feature extraction and a network of tempotron neurons for spike-based classification. While these two systems are fully event-driven, Stromatias et al. (2017) use a combination of a spiking neural network and a conventional artificial neural network. A convolutional SNN is used to capture the temporal dynamics of the DVS data and create a new, frame-based dataset, which is fed into a fully-connected artificial neural network. The supervised learning itself then takes place in this non-spiking network, using a stochastic gradient descent algorithm. In our concluding remarks we suggest how this approach could be combined with our model to maintain the high classification accuracies while providing greater biological plausibility.

# Model Design and Implementation

The only preprocessing we applied to the data was the removal of the 75 Hz timestamp harmonic. Stabilizing the video data would have been contrary to our intention to develop a system for dynamic visual recognition, and in fact, preliminary experiments suggested that the system would perform better on the original unstabilized videos. To run our spike encoding algorithm on the data, we used the script provided with the dataset to convert the jAER files into Matlab arrays.

The pooling of the DVS spikes into 128 input spike trains (ganglion cells) for the SNN, as described within section The Proposed System Architecture, remained the same throughout all experiments. Inside the spike encoding algorithm, only those four thresholds were changed that determine how many pixels within the receptive field of a ganglion cell must fire within one time step to make the ganglion cell itself emit a spike. As a first step, we wanted to find out how the system would perform differently when these thresholds and, thus, the average spike rate of the input data for the SNN, were changed. As described in section Firing Mechanism, the ganglion cells' receptive fields decrease from the periphery toward the center. Starting from the periphery, ganglion cells in group 1 integrate the signal of 32 × 32 = 1,024 DVS pixels, cells in group 2 from 16 × 16 = 256 pixels, cells in group 3 from 8 × 8 = 64 pixels, and cells in group 4 from 4 × 4 = 16 pixels. Assigning the same percentage threshold to all four groups would result in very low or no activity in the peripheral ganglion cells, e.g., with a threshold of 10% it would take only two DVS events within the receptive field of a ganglion cell in group 4 to trigger a spike, but 103 DVS events within the receptive field of a ganglion cell in group 1. Especially with the MNIST-DVS dataset, where DVS events only occur at the edges of the moving digits and not in larger blobs, this would make the peripheral ganglion cells redundant. On the other hand, increasing the thresholds too much from group to group toward the center would put more emphasis on the peripheral parts of the video than intended.
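The arithmetic behind the 10% example can be checked with a short worked sketch. The helper `events_needed` is hypothetical and assumes that a cell fires when its event count strictly exceeds the given percentage of the pixels in its receptive field; under that assumption it reproduces the two and 103 events quoted above.

```python
import math

# Receptive field sizes in pixels, numbered from the periphery (group 1)
# toward the center (group 4), as in the text.
GROUP_PIXELS = {1: 32 * 32, 2: 16 * 16, 3: 8 * 8, 4: 4 * 4}

def events_needed(pixels, percent):
    """Smallest integer event count that strictly exceeds percent of pixels."""
    return math.floor(pixels * percent / 100) + 1

# Same 10% threshold for every group.
uniform = {g: events_needed(p, 10) for g, p in GROUP_PIXELS.items()}
print(uniform)  # {1: 103, 2: 26, 3: 7, 4: 2}
```

The roughly fifty-fold gap between group 1 and group 4 shows why a single percentage threshold would leave the peripheral cells nearly silent on edge-like stimuli.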

We carefully watched the MNIST-DVS videos and compared the distribution of DVS events with the average spike rates for the groups of ganglion cells that were produced by different spiking thresholds. We found that increasing the percentage thresholds by a factor of two from group to group toward the center would preserve the distribution of DVS events relatively well and not put too much emphasis on any single group. **Figure 6** shows the average spike rates for 1,000 scale-8 videos (100 per digit), produced by thresholds of 0.5% for group 1, 1% for group 2, 2% for group 3 and 4% for group 4. Since time is discrete in our model, we measure the average spike rates in %, dividing the number of time steps in which a cell fired by the total number of time steps. Most spikes occur in groups 2 and 3, consistent with the general distribution of DVS events in the scale-8 videos. The total spike average of the samples shown in **Figure 6** is 27.57%.
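The spike-rate measure defined above can be written down in a few lines. This is a minimal sketch: the function names are hypothetical, spike trains are assumed to be lists of 0/1 values with one entry per time step, and the total spike average is assumed to be the mean of the per-cell rates.

```python
def average_spike_rate(spike_train):
    """Percentage of time steps in which a cell fired (one 0/1 entry per step)."""
    if not spike_train:
        raise ValueError("spike train must contain at least one time step")
    return 100.0 * sum(spike_train) / len(spike_train)

def total_spike_average(spike_trains):
    """Mean of the per-cell rates (assumes all trains cover the same time steps)."""
    rates = [average_spike_rate(train) for train in spike_trains]
    return sum(rates) / len(rates)

# Toy example with three cells and four time steps (not data from the paper).
toy = [[1, 0, 0, 1], [0, 0, 0, 0], [1, 1, 1, 1]]
print(total_spike_average(toy))
```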


We altered the thresholds to get clearly distinguishable total spike averages. **Table 2** shows four different choices of thresholds, resulting in average spike rates of roughly 7, 14, 26, and 32% (exact numbers vary between different video scales). The last row represents the maximal achievable average spike rate with a threshold of 0% for each group. In that case, every ganglion cell fires if there is at least one DVS event in its receptive field at a given time step.

The mapping of the input spikes into the SNN NeuCube was done according to the proposed retinotopic mapping and it remained the same throughout all experiments. In all experiments NeuCube was initialized with 732 leaky integrate-and-fire (LIF) neurons, representing the primary visual cortex. For future experiments with higher video resolutions and more input neurons, NeuCube can easily be extended to include neurons that represent the secondary and the tertiary visual cortex. Initial connections are formed following "small-world" connectivity with random connections within a predefined maximum distance from each neuron. This maximum distance was set to 2.5 in all experiments.
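The distance-limited random initialization can be illustrated as follows. This is only a sketch under stated assumptions: the connection probability `p_connect`, the 80/20 split between positive and negative weights, the weight range, and the toy grid are all illustrative choices, not values from the paper or from the NeuCube software. Only the maximum distance of 2.5 comes from the text.

```python
import math
import random

def initial_connections(positions, max_distance=2.5, p_connect=0.2, seed=0):
    """Randomly connect neuron pairs that lie within max_distance of each other.

    positions: list of (x, y, z) coordinates. p_connect and the sign ratio are
    illustrative assumptions, not values from the paper.
    """
    rng = random.Random(seed)
    connections = {}
    for i, a in enumerate(positions):
        for j, b in enumerate(positions):
            if i == j or math.dist(a, b) > max_distance:
                continue
            if rng.random() < p_connect:
                # Assumed: 80% positive and 20% negative initial weights.
                sign = 1.0 if rng.random() < 0.8 else -1.0
                connections[(i, j)] = sign * rng.random() * 0.1
    return connections

# Toy 4 x 4 x 2 grid of neuron positions (the real model uses 732 neurons).
grid = [(x, y, z) for x in range(4) for y in range(4) for z in range(2)]
conns = initial_connections(grid)
conns_all = initial_connections(grid, p_connect=1.0)
```

With `p_connect=1.0` every pair within the maximum distance is connected, which makes the role of the distance limit easy to inspect.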

As described previously, unsupervised learning using STDP is performed first to learn spatio-temporal patterns by forming new connections between neurons, before the output classifier is trained in a supervised manner using the dynamic evolving SNN (deSNN) algorithm (Kasabov et al., 2013). The NeuCube architecture is a stochastic model and, therefore, sensitive to parameter settings. To find the best values for the major parameters that influence the system's performance, we applied a grid search method that tests the system on different combinations of parameters within a predefined range and used those parameter values that resulted in the best classification accuracy. For the firing threshold, the refractory time and the potential leak rate of the LIF neurons we used values of 0.5, 6, and 0.002, respectively. The STDP learning parameter was set to 0.01. The variables Mod and Drift of the deSNN classifier were set to 0.8 and 0.005. See Kasabov and Capecci (2015) for a more detailed explanation of these parameters.
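To make the role of the three LIF parameters concrete, the following sketch simulates a single neuron in discrete time with the values given above (firing threshold 0.5, refractory time 6, leak rate 0.002). The update rule is an assumption for illustration: a subtractive leak per step, a reset to zero after a spike, and inputs ignored during the refractory period. It is not claimed to be NeuCube's exact neuron model.

```python
def simulate_lif(input_current, threshold=0.5, refractory=6, leak=0.002):
    """Discrete-time leaky integrate-and-fire neuron.

    Assumptions: subtractive leak per step, reset to 0 after a spike, and
    inputs ignored while the neuron is refractory.
    """
    potential, wait, spikes = 0.0, 0, []
    for current in input_current:
        if wait > 0:                      # still refractory: no integration
            wait -= 1
            spikes.append(0)
            continue
        potential = max(0.0, potential + current - leak)
        if potential >= threshold:
            spikes.append(1)
            potential, wait = 0.0, refractory
        else:
            spikes.append(0)
    return spikes

# Constant input of 0.2 per step (an arbitrary toy value) for 20 time steps.
out = simulate_lif([0.2] * 20)
spike_times = [t for t, s in enumerate(out) if s]
```

With this toy input the neuron needs three steps to reach the threshold and then stays silent for six steps, so the refractory time directly bounds the maximal firing rate.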

#### Experimental Results

To compare the system's performance, we performed 10-fold cross-validation on 1,000 videos (first 100 of each digit), with 900 videos used for training and 100 for testing in each fold, for different video scales and average spike rates. **Table 3** summarizes the results. As a general trend, with few exceptions, the classification accuracy increased together with the average spike rate of the input neurons. For all video scales, the classification accuracy also increased when the system was run on all 10,000 videos of a given scale. The best classification results were achieved with all 10,000 videos of one scale, encoded with the highest possible spike rate (0% as spike encoding threshold for all four groups). Classification accuracies were 90.56, 92.03, and 86.09% for scale-4, scale-8, and scale-16, respectively. The best accuracy in a single run with 90% of randomly selected data samples for training and the remaining 10% for testing was 92.90% for 10,000 scale-8 videos with the highest possible spike rate. This result is comparable to previous results on the MNIST-DVS dataset, presented in **Table 1**.
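The evaluation protocol can be sketched as a plain 10-fold split. The helper names are hypothetical, and the sketch assumes a random shuffle without stratification by digit, since the text does not state how the folds were drawn.

```python
import random

def ten_fold_splits(n_samples=1000, n_folds=10, seed=0):
    """Yield (train_indices, test_indices) pairs for k-fold cross-validation.

    Assumes n_samples is divisible by n_folds and that folds are drawn by a
    simple random shuffle (no stratification).
    """
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    fold_size = n_samples // n_folds
    for k in range(n_folds):
        test = indices[k * fold_size:(k + 1) * fold_size]
        train = indices[:k * fold_size] + indices[(k + 1) * fold_size:]
        yield train, test

def mean_accuracy(fold_accuracies):
    """Average the per-fold accuracies into one cross-validation score."""
    return sum(fold_accuracies) / len(fold_accuracies)

splits = list(ten_fold_splits())
```

Each of the ten folds uses 900 videos for training and 100 for testing, and every video is tested exactly once.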

The lower accuracies on the scale-4 and the scale-16 samples reflect the fact that in these videos, the MNIST digits fill out either the whole screen (scale-16) or only a very small region in the center (scale-4). For the scale-4 digits, the signals transmitted by ganglion cells from groups 1, 2, and 3 are mostly noise and do not contain much information about the digits. In the scale-16 videos, there is almost no activity in the central region of the screen and, thus, no information is transmitted by the 64 foveal ganglion cells. Since our method puts heavy emphasis on the center of the video (50% of the input neurons represent data from only the central 64 pixels), performance on the scale-16 videos is lower.

## Model Interpretation for a Better Understanding of the Processes Inside the Visual Cortex

The main purpose of the above experiments, carried out on the MNIST-DVS dataset, is to confirm the system's classification performance on a benchmark dataset; the moving digits do not represent a real-life scene. However, we want to show how the SNN can be analyzed after being trained, to see how its connectivity changes in response to the data. **Figure 7** compares the connectivity of the SNN before and after unsupervised training on 1,000 scale-4 videos with the highest possible spike rate. Blue and red lines represent positive and negative connections, respectively. We can notice that some of the randomly created initial connections disappear during the training process. Instead, many new negative connections are created, mostly between neurons in the region that represents the posterior part of the primary visual cortex, where signals from the foveal ganglion cells arrive. Some of the new connections connect neurons over a long distance, especially in the very posterior part of the SNN, where a gap between neurons prevents the initial formation of "small-world" connections. As can be seen in **Figure 5**, the neurons on both sides of this gap represent adjacent DVS pixels, and by bridging this gap, the new connections allow for communication between these neurons. A comparison with the connectivity after training the SNN on 1,000 scale-16 videos shows that slightly fewer connections are formed between neurons processing foveal information since the scale-16 videos contain fewer DVS events in the foveal region. This effect is due to the acquisition hardware used and could be compensated for by the simulation of saccadic eye movements inside the encoding algorithm. In a biological retina, these rapid eye movements ensure that the fovea centralis focuses on salient features instead of constantly covering a less important area of the visual field. We discuss this possible improvement of the encoding algorithm in the next section.

There is also a visible difference between connections created for different digits. **Figure 8** shows the status of the network after unsupervised training using only digits 1, 5, and 8, respectively. Interestingly, the connections created for digits 5 and 8 look similar, just as the digits themselves have a similar shape. The connections created after training on digit 1, on the other hand, look distinctly different. We can, therefore, conclude that the visual characteristics of the digits are preserved in our system, just like they are in the human visual cortex.

TABLE 2 | Different choices of spike thresholds within the spike encoding algorithm and corresponding average spike rates.

# DISCUSSION OF THE SYSTEM'S ADVANTAGES AND LIMITATIONS

The proposed system achieves a classification performance on the benchmark MNIST-DVS dataset that can keep up with previous works on this dataset and is superior to those works that used a spiking neural network classifier. Every part of the system (the DVS sensor, the algorithm for encoding the DVS output into spike trains, and the SNN NeuCube) adopts features from the human visual system. This allows for future experiments in which the same stimuli are presented to humans and to the proposed system, and brain processes visualized by neuroimaging methods can be compared to the network processes of the SNN, which can be easily visualized within the NeuCube architecture.

Another advantage of the proposed system is the high flexibility of the SNN's three-dimensional structure. The NeuCube architecture is not restricted to consist of neurons that represent only the visual cortex. For example, one could map aural stimuli to input neurons representing the auditory cortex, to obtain a model that processes aural and visual information at the same time in a brain-like way. The integration of other kinds of data, such as tactile or olfactory information, within a multimodal model is conceivable as well.

We found that the system's classification performance increases together with the average spike rates of the 128 input neurons. To account for the findings of Berry et al. (1997) in retinal ganglion cells of rabbits and salamanders, we started our experiments with low spike rates of approximately 5%, but the classification accuracies were very low in these cases. However, the reported firing rates of rabbit and salamander ganglion cells were measured during the presentation of random flicker, which might yield very different firing behavior than stimuli like the moving digits. Single cell recordings of retinal ganglion cells could provide more evidence about the firing rates under specific stimuli. The parameters of the spike encoding algorithm that determine the average spike rates can then easily be tuned to mimic the behavior of real retinal ganglion cells, and it would be interesting to see if classification accuracy increases when the average spike rates conform to the biological evidence.

TABLE 3 | Results of 10-fold cross validation for different video scales and average spike rates.

Since so much is known about the human visual system and we aimed to develop a biologically plausible, yet computationally feasible implementation, there are many details not included in our model. There already exist very advanced mathematical models for the function of retinal ganglion cells (Wei and Ren, 2013), and our spike encoding algorithm by no means covers every detail of them. The receptive field of each ganglion cell, for example, is split into a center region and a surrounding region with opposite behavior toward light (Nelson, 1995). In so-called on-center cells, the center region is stimulated, whereas the surrounding region is inhibited when exposed to light. So-called off-center cells exhibit converse behavior. Including the function of on- and off-center ganglion cells inside the spike encoding algorithm would greatly increase the model's biological plausibility, but also its computational complexity. Another computational restriction of our model is that the random initial creation of excitatory and inhibitory connections causes a violation of Dale's Principle, which states that all axonal branches of a neuron perform the same chemical reaction.

One shortcoming of the DVS when compared to the human retina is its inability to process colors. The DVS only encodes temporal changes in brightness that signal motion (Delbruck, 2008), similar to the rod photoreceptors on the retina and the functionality of the magnocellular fibers in the optic nerve (Purves, 2012). However, the cone photoreceptors on the retina as well as the comparatively large number of parvocellular fibers in the optic nerve are not modeled by the DVS despite their importance for detecting and transmitting information about color and details of the perceived objects (Purves, 2012). This means that all object recognition approaches using DVS input are somewhat limited because the DVS only captures signals that the human visual system would use to detect motion and distances to objects, but not those signals necessary for recognizing objects and details.

The proposed system puts strong emphasis on the central part of the videos in both the encoding of DVS events to spike trains and the representation inside the SNN. This is justified by analogous features of the fovea centralis in the center of the human retina, responsible for focused vision. However, there is no evidence that there exist retinal ganglion cells with large receptive fields in the human retina that cover the fovea centralis in a redundant manner as in our system. Further, our system does not account for the very fast and simultaneous movement of human eyes, called saccades. Saccades help to scan a broader part of the visual field with the fovea and integrate this information into a detailed map (Purves, 2012). Human eye movement is also controlled by the visual grasp reflex that directs the eyes toward salient events in the periphery of the visual field (Monsell and Driver, 2000). These mechanisms for eye movement could be implemented in the spike encoding algorithm by changing the coordinates for the pooling of DVS pixels for each time step, and thereby virtually moving the center of the visual field. However, this would require additional features to save the movement and integrate it into the SNN.
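The idea of virtually moving the center of the visual field can be sketched as an offset applied to the pooling coordinates. Everything in this example is hypothetical: the function name, the representation of a receptive field as a square given by its top-left corner and side length, and the toy event coordinates. How the offset would be chosen per time step and passed on to the SNN is left open, as in the text.

```python
def count_events(events, field, gaze_offset=(0, 0)):
    """Count DVS events inside a receptive field shifted by gaze_offset.

    events: iterable of (x, y) pixel coordinates within one time step.
    field: (x0, y0, size), the top-left corner and side length of a square field.
    """
    dx, dy = gaze_offset
    x0, y0, size = field
    return sum(
        1 for x, y in events
        if x0 + dx <= x < x0 + dx + size and y0 + dy <= y < y0 + dy + size
    )

events = [(70, 70), (71, 72), (10, 10)]   # toy events, not real DVS data
foveal_field = (62, 62, 4)                # hypothetical 4 x 4 field near the center
centered = count_events(events, foveal_field)
shifted = count_events(events, foveal_field, gaze_offset=(8, 8))
```

In this toy case the centered field sees no events, whereas the field shifted by eight pixels in each direction captures two of them, which mimics the fovea being directed toward a salient region.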

# CONCLUSION

This paper presents a new methodology for dynamic visual recognition, inspired by different features of the human visual system. The proposed system is designed to take data from a DVS silicon retina and encodes them into spike trains using an algorithm that mimics the organization and function of retinal ganglion cells. The spike trains are then fed into the brain-like SNN NeuCube, following the retinotopic mapping of photoreceptors from the retina into their neural representations in the primary visual cortex. Two stages of learning, unsupervised and supervised, are performed by NeuCube to extract spatiotemporal patterns from the data and perform a classification task. Results on the benchmark MNIST-DVS dataset have shown that the system can keep up with the classification performance of other methods for dynamic visual recognition. Furthermore, it is possible to dynamically visualize and analyze the activity inside the SNN for a better understanding of the data and the process of their deep learning in the model.

Due to the promising benchmark results and the benefit of the visualization tools for an in-depth understanding of the data and the network processes, we endorse further research on the system. In particular, we suggest the exploration of new learning methods inside NeuCube and of different algorithms for the encoding of DVS data into spike trains.

To date, the highest classification accuracy on the MNIST-DVS dataset has been achieved by Stromatias et al. (2017), who used a spiking convolutional neural network to create a new frame-based dataset, which captures the dynamics of the DVS output and serves as input for a fully-connected classifier that uses stochastic gradient descent. The non-spiking classifier is then mapped to a spiking output layer of LIF neurons. As they mention in their paper, the non-spiking classifier and the spiking output layer can be used with any spiking neural network that has already extracted features from the data in an unsupervised manner. We propose to explore how the connectivity or spiking activity of the NeuCube after the unsupervised learning stage could be used to create a similar frame-based dataset, and how the classifier used by Stromatias et al. (2017) would perform on such a dataset. This way, the biological plausibility of our model could be combined with current state-of-the-art classification algorithms.

We also encourage the development of further benchmark datasets for spike-based visual recognition, e.g., spiking versions of the KTH and the Weizmann datasets of human actions (Laptev and Caputo, 2005; Gorelick et al., 2007). Since the NeuCube architecture is not bound to only consist of neurons representing the visual cortex, future directions can include the integration of our system for visual recognition inside a broader, multimodal methodology, e.g., for the biologically plausible processing of visual and aural data at the same time within the same system. The DVS format used for encoding visual data into spike trains is not a restriction for the proposed SNN method for retinotopic mapping. Learning and other encoding methods for different types of visual data are envisaged to be explored in the future.

#### AUTHOR CONTRIBUTIONS

LP, the main author, contributed to the spike encoding algorithm, the retinotopic mapping into NeuCube, the choice of MNIST-DVS as a benchmarking dataset, performance evaluation, and paper writing. AW contributed to the initial design of the NeuCube model and its partial implementation, and took part in discussions and reviewing the paper. NK originated the initial idea of this project, and took part in discussions and reviewing the paper.

#### ACKNOWLEDGMENTS

The authors thank the reviewers for the useful comments and suggestions. NK acknowledges his discussions with Giacomo Indiveri, Tobi Delbrück and other colleagues from INI, ETH/UZH during his Marie Curie visit in 2011/2012 and the contacts afterwards. AW is funded by a scholarship from Auckland University of Technology, and LP by a scholarship from the Baden-Württemberg Foundation for his visit to the Knowledge Engineering and Discovery Research Institute (KEDRI) at Auckland University of Technology. The authors would further like to thank Dr. Josafath Israel Espinosa Ramos for his valuable support.

#### REFERENCES

Purves, D. (ed.). (2012). Neuroscience. Sunderland, MA: Sinauer.


**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2018 Paulun, Wendt and Kasabov. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# A Model for a Filling-in Process Triggered by Edges Predicts "Conflicting" Afterimage Effects

Hadar Cohen-Duwek\* and Hedva Spitzer

*Vision Research Laboratory, School of Electrical Engineering, Tel-Aviv University, Tel-Aviv, Israel*

The goal of our research was to develop a compound computational model that predicts the "opposite" effects of the alternating aftereffects stimuli, such as the "color dove illusion" (Barkan and Spitzer, 2017), and the "filling in the afterimage after the image" (van Lier et al., 2009). The model is based on a filling-in mechanism, through a diffusion equation where the color and intensity of the perceived surface are obtained through a diffusion process of color from the stimulus edges. The model solves the diffusion equation with boundary conditions that take the locations of the chromatic edges of the chromatic inducer (chromatic stimulus) and the achromatic remaining contours into account. These contours (edges) trigger the diffusion process. The same calculations are done for both types of afterimage effects, with the only difference related to the location of the remaining contour. While a gradient toward the inducing color produces a perception of the complementary color, an opposite gradient yields the perception of the same color as that of the chromatic inducer. Furthermore, we show that the same computational model can also predict new alternating aftereffects stimuli, such as the spiral stimulus, and the averaging of colors in alternating afterimage stimuli described by Anstis et al. (2012). The suggested model is able to predict most of the additional properties related to the "conflicting" phenomena that have been recently described in the literature, and thus supports the idea that a shared visual mechanism is responsible for both the positive and the negative effects.

Keywords: afterimage effects, filling-in, diffusion, visual system mechanism, computational model

# INTRODUCTION

This study concerns two non-classical afterimage illusions, both involving a chromatic stimulus, i.e., a chromatic inducer, that is presented for a short duration and is then followed by the presentation of an achromatic remaining contour that may overlap with the inner or outer border of the chromatic region of the inducer. The location of this remaining contour can determine whether the perceived filling-in color will be the same as, or complementary to, the chromatic inducer. Two famous examples of these phenomena are the "Filling-in the Afterimage after the image" effect (van Lier et al., 2009) and the color dove illusion (Barkan and Spitzer, 2009, 2017; Macknik and Martinez-Conde, 2010). Both phenomena involve a filling-in process of surfaces between edges, and the effects are obtained with a narrow spatial inducing area and relatively short induction time. Since these two phenomena yield complementary perceived colors, derived from the very same inducer, we refer to them as "conflicting" effects.

#### Edited by:

*Qasim Zaidi, University at Buffalo, United States*

#### Reviewed by:

*Greg Francis, Purdue University, United States Jihyun Yeonan-Kim, National Institutes of Health (NIH), United States*

> \*Correspondence: *Hadar Cohen-Duwek hadarli@gmail.com*

#### Specialty section:

*This article was submitted to Perception Science, a section of the journal Frontiers in Neuroscience*

Received: *04 January 2018* Accepted: *25 July 2018* Published: *17 August 2018*

#### Citation:

*Cohen-Duwek H and Spitzer H (2018) A Model for a Filling-in Process Triggered by Edges Predicts "Conflicting" Afterimage Effects. Front. Neurosci. 12:559. doi: 10.3389/fnins.2018.00559*

In the "Filling in the Afterimage after the image" illusion (van Lier et al., 2009), the inducing stimulus is a chromatic shape that may have two or more colors. After the chromatic inducing stimulus is removed, an outline contour matching one of the shape colors is presented. The complementary afterimage color perceived depends on the shape and the location of the drawn outline contour (van Lier et al., 2009; **Figure 1**, second column). Since the color inside the contour in the perceived afterimage is complementary to the color of the inducing stimulus, we henceforth refer to this illusion as a "negative effect."

It should be noted that this negative effect is not a simple variation of the "classical" negative afterimage, in which a stimulus is removed after a relatively long (20-30 seconds) exposure and the observer then perceives the opposite chromaticity (complementary color; DeValois and Webster, 2011). It should also be noted that the colors in the classical afterimage are perceived only in the retinotopic area that was induced.

In the color dove illusion (Barkan and Spitzer, 2009, 2017), the inducing stimulus is a shape surrounded by a colored area or strip (red in **Figure 1**, first row). After the chromatic inducing stimulus is removed, an outline contour matching the original inducing stimulus is presented (**Figure 1**, second row). This gives rise to the perception of an afterimage (**Figure 1**, third row) filled with a color similar to that in the inducing stimulus (although weaker), and not the complementary color as in the negative effect. Such an effect has also been reported with objects of different shapes (Hazenberg and van Lier, 2013). Since the perceived color inside the shape is similar to that presented in the inducing stimulus, we henceforth refer to this illusion as a "positive effect" (**Figure 1**, first column).

A similar positive aftereffect was previously investigated by Anstis et al. (1978) who suggested that the positive chromatic afterimage effect is a result of the synergy of two known visual mechanisms: simultaneous contrast (Gerrits and Vendrik, 1970; Anstis et al., 1978) and colored afterimage (Daw, 1962; Wyszecki, 1986; Shimojo et al., 2001).

The alternating effects differ from a classical afterimage in their temporal and spatial properties. A classical afterimage requires a relatively long exposure time and a large spatial area of induction, in order to obtain a filling-in effect in a small region with the complementary color (Anstis et al., 1978). In the phenomena described here, preliminary results indicate that the positive effect is not abolished even if the area of the chromatic inducer is spatially thin (Hazenberg and van Lier, 2013; Barkan and Spitzer, 2017). This is in contrast to the explanation given by Anstis et al. (1978), since psychophysically, when the area of a chromatic inducer is thin, the effect of simultaneous contrast is not manifested (preliminary results). The positive and the negative effects are also distinguished from the classical aftereffect (Anstis et al., 1978) in their temporal properties. The duration of the alternating stimuli can be very short (500 ms), a period of time that is insufficient to obtain the classical afterimage effect (Anstis et al., 1978; van Lier et al., 2009; Barkan and Spitzer, 2017).

A further distinguishing characteristic of these phenomena is that, in addition to the temporal and spatial differences from the classic afterimage effect, the color in both the positive and negative effects is perceived in new areas that have not been induced or adapted previously (van Lier et al., 2009). It has to be noted that even though the positive and the negative effects share several common properties, they are still phenotypically different and therefore they can be seen as "conflicting effects."

Hazenberg and van Lier (2013) investigated "alternating watercolors," which have the same spatial and chromatic structure as the classical watercolor stimuli. These types of stimuli can be considered as the positive and the negative stimuli, with the same classical watercolor stimulus used as the chromatic inducer for both the positive and the negative aftereffects. In this case, the remaining contours are located at the inner or the outer contours of the chromatic edges of the inducer stimulus. The reported results (Hazenberg and van Lier, 2013) indicated that the positive and negative effects were affected differently by a number of parameters, including the luminance of the area inside the shape and the luminance of the remaining contour.

At present, the visual mechanisms responsible for the recently described positive and negative effects are still unknown and there are no successful computational models for the phenomena. This is less surprising in view of the fact that there remains a lack of consensus concerning the mechanism of even the classical afterimage, despite the wealth of research in the literature. The physiological mechanisms commonly proposed as responsible for the classical negative afterimages range from bleaching of cone photo-pigments to cortical adaptation (Williams and Macleod, 1979; Shimojo et al., 2001; Clair et al., 2007; van Lier et al., 2009; Zaidi et al., 2012; Webster, 2015; Zeki et al., 2017). A recent paper suggested a different mechanism for the van Lier et al. (2009) effect and attributed the filling-in process to the perception of a transparency cue and to cortical mechanisms (On and van Boxtel, 2017).

Additional recent research (Zaidi et al., 2012) has suggested that the classical and the negative afterimage effects are derived from the retinal ganglion mechanism, which yields the neuronal rebound effect. According to this mechanism, the ganglion neurons can fire bursts if inhibited and then released from inhibition (Spitzer et al., 1993; Grunfeld and Spitzer, 1995; Francis, 2010; Zaidi et al., 2012). It should be noted that while the rebound effect may modulate the creation of complementary colors, it cannot be responsible for either the negative or the positive effects in their entirety.

Previous computational models have been reported to describe both the complementary perceived color and the filling-in components (Grossberg and Todorovic, 1988; Francis and Rothmayer, 2003; Francis and Ericson, 2004; Francis and Schoonveld, 2005; Wede and Francis, 2006, 2007; Van Horn and Francis, 2008). These models were based on the original "Form And Color And Depth" (FACADE) model (Grossberg and Mingolla, 1985), which described two main visual processing systems: a boundary contour system (BCS) that processes boundary or edge information, and a feature contour system (FCS) that uses information from the BCS to control the spreading (filling-in) of surface properties, such as color and brightness. According to the FACADE model, the filling-in stage requires the FCS networks to diffuse signals containing feature information about color and brightness across the surface, while boundaries in the BCS block the spreading.
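The general idea of diffusion that is blocked by boundaries can be illustrated with a one-dimensional toy example. This is a generic boundary-gated diffusion sketch and not the FACADE equations: the function name, the explicit update scheme, the rate, and the number of steps are all illustrative assumptions.

```python
def fill_in(feature, boundary, steps=200, rate=0.25):
    """Diffuse a 1-D feature signal between neighboring cells.

    boundary[i] = True blocks the flow between cell i and cell i + 1, so the
    signal spreads only within the compartment enclosed by boundaries.
    """
    values = list(feature)
    for _ in range(steps):
        updated = values[:]
        for i in range(len(values) - 1):
            if boundary[i]:
                continue                  # a boundary blocks the spreading
            flow = rate * (values[i + 1] - values[i])
            updated[i] += flow
            updated[i + 1] -= flow
        values = updated
    return values

# A feature signal at the left end and a boundary between cells 3 and 4.
feature = [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
boundary = [False, False, False, True, False, False, False]
filled = fill_in(feature, boundary)
```

In this toy case the signal spreads evenly over the four cells on the left of the boundary, while the cells on the right remain empty.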

The FACADE model and its variations succeed in predicting the afterimage effects of the MacKay modal complementary afterimages (MCAI) phenomena (MacKay, 1957; Vidyasagar et al., 1999). This effect involves sequential viewing of two orthogonally related patterns (the first one a constant pattern and the second one a flickering contrast reversal pattern). The result is an afterimage percept that is related to the first pattern (Francis and Rothmayer, 2003; Francis and Ericson, 2004; Francis and Schoonveld, 2005; Wede and Francis, 2006, 2007; Van Horn and Francis, 2008). A number of studies have examined the different spatial and temporal properties of the MCAI effect, for example the spatial and temporal frequency of the two gratings from the first and second presentations (Francis and Rothmayer, 2003), the gap width (Francis and Ericson, 2004), the split gratings (Francis and Schoonveld, 2005), the duration between the two grating presentations and the blank presentation (Wede and Francis, 2006), attentional properties (Wede and Francis, 2007), and the role of the different orientations of the constant and the flickering grating (Van Horn and Francis, 2008). Francis and colleagues compared their computational model's predictions with the perceived results.

It should be noted that the MCAI and its variations discussed in these Francis papers are not necessarily related to the positive and negative aftereffects phenomena described in our current report. The main differences between the MCAI (MacKay, 1957; Vidyasagar et al., 1999) phenomena and the positive and the negative effects concern the different types of stimulus components in these two groups of effects. The stimulus differences relate to the orientation gratings and contrast reversal flickering patterns used to produce the MCAI effect versus the chromatic inducer shape and remaining contour that trigger the positive and negative effects. These differences in the type of stimuli might imply distinct mechanisms that involve additional different components, even though both models can basically be attributed to diffusion processes.

Francis (2010) applied a diffusion model similar to that described previously in Francis and Rothmayer (2003) in order to address the negative effect of van Lier's illusion (van Lier et al., 2009), and succeeded with the model's predictions. At a later stage, Kim and Francis (2011) conducted a series of psychophysical experiments designed to prove that a simple diffusion model (Francis, 2010) cannot account for the additional properties that characterize the negative aftereffect. They tested the hypothesis, for example, that a contour traps the perceived afterimage color, by adding additional remaining contours. Their model simulations predicted that these additional remaining contours would block the spread of a color to the middle of the surface (**Figure 4**).

However, contrary to Francis's predictions (Francis, 2010), the results of the psychophysical experiments showed that additional remaining contours blocked color spreading only when they overlapped with the inducer edges, but not when they were drawn away from the inducer edges (Kim and Francis, 2011), **Figure 4**. More important to our discussion is the fact that the FACADE model did not and cannot model the positive effect. In this study, we present a computational model that can predict both the negative and the positive effects, and postulate that these effects are derived from the same mechanism. We also test whether the model can predict additional afterimage phenomena beyond the two described effects.

# MODEL

The following sections describe a unified computational model that can predict the two known "conflicting" (opposite) phenomena, the positive "color dove illusion" and the negative "filling-in afterimage after the image" illusion. The model is also able to predict additional variations of the positive and the negative effects. We suggest here that, despite differences in their spatial and temporal properties, these two types of phenomena are produced by a very similar (mutual) mechanism. The model considers several factors that are crucial for the perceived temporal effects, and these are presented in **Figure 2**.

# Model Assumptions

The model is based on the following assumptions: (a) An edge triggers a diffusion process in its complementary color. (b) A contour can be a perceived contour and not necessarily a physical spectral gradient. (c) The diffusion process depends on the correspondence between the chromatic stimulus gradients and the remaining contours. (d) The positive and the negative effects are always present, while the dominant perceived color is determined by the location of the remaining contours.

# The Stimulus: The Chromatic Inducer and the Remaining Contours

The input of the model is composed of two temporal components. The first is a chromatic stimulus, I<sub>0</sub> in **Figure 2**, and the second relates to the remaining contours, I<sub>1a</sub> and I<sub>1b</sub> in **Figure 2**. The remaining contours can appear in different possible locations, and these locations determine whether the perceived result will be a positive effect or a negative effect.

#### Chromatic Gradients

The building blocks of the model are designed to simulate components of the visual system, and in this case, the opponent and double-opponent receptive fields. The color coding opponent receptive fields encode color contrast, but not spatial contrast. In other words, the color opponent receptive fields are able to differentiate between colors, but cannot detect spatial gradients or edges (Barkan et al., 2008). The double opponent receptive fields, however, are sensitive to both spatial and chromatic gradients and have color opponent receptive fields both at the center and in the surround receptive field regions (Shapley and Hawken, 2011). This opponency in both spatial and chromatic properties produces a spatio-chromatic edge detector.

For the sake of simplicity, we compute the response of the opponent receptive fields as color-opponent only, where, in this simplified case, each chromatic encoder has the same spatial resolution. This is computed by an opponent color transformation (Sande et al., 2010), Equation (1). This transformation converts each pixel of the image I<sub>0</sub>, in each of the chromatic channels R, G, and B, into the opponent color space via the transformation matrix O (Sande et al., 2010), I<sub>OPPONENT</sub> = OPPONENT{RGB}, as follows:

$$I\_{\text{OPPONENT}} = \begin{pmatrix} O\_{RG} \\ O\_{YB} \\ O\_{BW} \end{pmatrix} = \begin{pmatrix} \frac{1}{\sqrt{2}} & \frac{-1}{\sqrt{2}} & 0 \\ \frac{1}{\sqrt{6}} & \frac{1}{\sqrt{6}} & \frac{-2}{\sqrt{6}} \\ \frac{1}{\sqrt{3}} & \frac{1}{\sqrt{3}} & \frac{1}{\sqrt{3}} \end{pmatrix} \begin{pmatrix} R \\ G \\ B \end{pmatrix} \tag{1}$$

where O<sub>RG</sub>, O<sub>YB</sub>, and O<sub>BW</sub> are the new channels of the transformed image I<sub>OPPONENT</sub>. R, G, and B are the red, green, and blue channels of I, respectively.
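As an illustration of Equation (1), the following minimal sketch applies the opponent matrix to single pixels. It is not the authors' implementation (which was written in MATLAB); the function name `rgb_to_opponent` and the assumption that RGB values lie in [0, 1] are ours.

```python
import math

# Rows of the opponent transformation matrix O from Equation (1).
OPPONENT_MATRIX = [
    [1 / math.sqrt(2), -1 / math.sqrt(2), 0.0],               # O_RG
    [1 / math.sqrt(6), 1 / math.sqrt(6), -2 / math.sqrt(6)],  # O_YB
    [1 / math.sqrt(3), 1 / math.sqrt(3), 1 / math.sqrt(3)],   # O_BW
]


def rgb_to_opponent(rgb):
    """Map one (R, G, B) pixel to (O_RG, O_YB, O_BW) via Equation (1).

    Hypothetical helper; assumes RGB values are scaled to [0, 1].
    """
    return tuple(sum(w * c for w, c in zip(row, rgb)) for row in OPPONENT_MATRIX)


# An achromatic (white) pixel has no chromatic opponent signal.
white = rgb_to_opponent((1.0, 1.0, 1.0))
# A pure red pixel has a positive red-green opponent component.
red = rgb_to_opponent((1.0, 0.0, 0.0))
```

For the white pixel, the two chromatic channels vanish and only O<sub>BW</sub> is non-zero, which is why a blank white image can serve as a neutral background in the opponent space.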

In order to implement the double-opponent response, DO, on an image, we subtract the surround region of the receptive field, O<sub>surround</sub>, from its center, O<sub>center</sub>, at the same spatial location:

$$DO = \begin{array}{c} \text{O}\_{center} \ - \text{ O}\_{surround} \end{array}$$

The structure of the double-opponent receptive field can be seen as a filter that acts as a second derivative in both the spatial and chromatic domains (Conway, 2001; Conway and Livingstone, 2006). For the sake of simplicity and clarity of the calculations, we use a discrete Laplace operator, L, which is commonly used as an approximation to the Difference of Gaussians (DOG) function (Marr, 1982). The discrete Laplace operator, L, is (Weickert, 1998):

$$L = \begin{pmatrix} 0 & \frac{-1}{4} & 0 \\ \frac{-1}{4} & 1 & \frac{-1}{4} \\ 0 & \frac{-1}{4} & 0 \end{pmatrix} \tag{2}$$

The responses of the color-coding receptive fields, DO<sub>response</sub>, to the aftereffect stimuli are presented in Equation (3). The double-opponent response, DO<sub>response</sub>, is calculated as a convolution of each opponent channel of I<sub>OPPONENT</sub> with the discrete Laplace operator, Equation (2).

$$DO\_{response}(\text{stimulus on}) = \nabla^2 I\_{\text{OP}} \approx I\_{\text{OP}} \ast L \tag{3}$$

**Figure 2B** demonstrates the responses of the receptive fields to the original stimulus (**Figure 2A**) at time t0, Equation (3).
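The convolution in Equation (3) can be sketched on a toy opponent channel as follows. This is only an illustration under stated assumptions: the helper `convolve3x3`, the replicated-border handling, and the 4 × 4 step-edge image are ours and are not taken from the original simulations.

```python
# Discrete Laplace operator L from Equation (2): centre minus mean of 4 neighbours.
L = [
    [0.0, -0.25, 0.0],
    [-0.25, 1.0, -0.25],
    [0.0, -0.25, 0.0],
]


def convolve3x3(channel, kernel):
    """Convolve a 2D list with a 3x3 kernel.

    Hypothetical helper; pixels outside the image are taken from the
    nearest border pixel (an assumption, not specified in the paper).
    """
    rows, cols = len(channel), len(channel[0])
    out = [[0.0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            acc = 0.0
            for dr in (-1, 0, 1):
                for dc in (-1, 0, 1):
                    rr = min(max(r + dr, 0), rows - 1)
                    cc = min(max(c + dc, 0), cols - 1)
                    # Convolution flips the kernel; L is symmetric, so the
                    # flip does not change the result here.
                    acc += kernel[1 - dr][1 - dc] * channel[rr][cc]
            out[r][c] = acc
    return out


# Toy opponent channel: a vertical step edge between columns 1 and 2.
channel = [[0.0, 0.0, 1.0, 1.0] for _ in range(4)]
do_response = convolve3x3(channel, L)
# Each row is [0, -0.25, 0.25, 0]: the response is non-zero only at the edge.
```

The response is zero in the uniform regions and takes opposite signs on the two sides of the edge, which is the edge-detecting behavior attributed to the double-opponent receptive fields.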

#### The Perceived Gradients—The Responses of the Receptive Fields to the Aftereffects

The model suggests that after the chromatic stimulus disappears, the chromatic gradients obtain the opposite sign. We refer to this condition as the "off response," a term commonly used in electrophysiology (Kandel et al., 2012). The physiological mechanism behind this behavior is still a matter of discussion (Williams and Macleod, 1979; Spitzer et al., 1993; Shimojo et al., 2001; Clair et al., 2007; van Lier et al., 2009; Francis, 2010; Zaidi et al., 2012; Webster, 2015; Zeki et al., 2017). This response has also been termed the rebound response, and a variety of models and mechanisms have been suggested to explain how this rebound phenomenon yields a reversed type of response (Spitzer et al., 1993; Grunfeld and Spitzer, 1995; Francis, 2010; Zaidi et al., 2012). **Figure 2B** demonstrates the responses of the simulated receptive fields before and after the chromatic stimulus is removed, at times t<sub>0</sub> and t<sub>1</sub>, Equation (4).

$$DO\_{response}(\text{stimulus off}) = I\_{\text{OP}} \ast (-L) \tag{4}$$

In other words, in this case, the sign of the chromatic gradient, DO<sub>response</sub>, is reversed. Note that the disappearance of the chromatic stimulus, which causes the sign of the edge to be reversed, is in accordance with the model's assumption (a) (section Model Assumptions). There are also experimental results that support this assumption (Zaidi et al., 2012).

This operation of edge reversal is realized in the model through reversing the sign of the DO receptive field responses, Equation (3). This reversed chromatic gradient triggers the diffusion process, **Figure 2C**, Equation (5).
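Because convolution is linear, the relation between Equations (3) and (4) can be written as a single identity. This is a short worked restatement of the two equations, not an additional model assumption:

$$DO\_{response}(\text{stimulus off}) = I\_{\text{OP}} \ast (-L) = -\left(I\_{\text{OP}} \ast L\right) = -DO\_{response}(\text{stimulus on})$$

In practice, the off response can therefore be obtained by negating the on response, without a second convolution.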

#### Filling-in as a Diffusion Process

The diffusion process is expressed by the diffusion (or heat) Equation (5) (Weickert, 1998). The model assumes that the suggested diffusion of the filling-in process is similar to physical diffusion, where the signals spread in all directions until "blocked" by contours or edges. This type of filling-in process is referred to in the literature as the "isomorphic filling-in theory" (von der Heydt et al., 2003). The choice of this type of filling-in implies that the borders (chromatic or achromatic) do not function primarily as blockers, but instead play a role as heat sources for the diffusion. When the diffusion spreads in the direction opposite to (colliding with) that of an additional heat source, the spread will actually be blocked by that heat source. These principles are applied in our model through the well-known diffusion equation (Weickert, 1998), as follows:

$$\frac{\partial I(\mathbf{x}, \mathbf{y}, t)}{\partial t} - D \nabla^2 I\left(\mathbf{x}, \mathbf{y}, t\right) = h\_c \tag{5}$$

where I(x, y, t) denotes the image at the space-time location (x, y, t), D is the diffusion (or heat) coefficient, and h<sub>c</sub> represents a heat source. The time course of the perceived image is assumed to be very fast, in accordance with previous reports (van Lier et al., 2009; Barkan and Spitzer, 2017). This time course is also termed "immediate filling-in" (von der Heydt et al., 2003).

Following this assumption, for the sake of simplicity, we can ignore the fast dynamic stages of the diffusion equation and compute only its steady state, in which the time derivative vanishes. Consequently, the diffusion (heat) Equation (5) is reduced to the Poisson Equation (6), with the constant D absorbed into the source term:

$$\nabla^2 I\left(\mathbf{x}, \mathbf{y}, t\right) = -h\_c \tag{6}$$

# THE CHROMATIC EDGES AND THE REMAINING CONTOURS

In order to maintain and enhance this diffusion effect and, as a byproduct, to trap it, there is a "requirement" for a border. The model suggests that the chromatic diffusion can be "trapped" only when the achromatic remaining contour, ∂Ω<sub>1</sub> (**Figure 2A**), overlaps the original edges of the chromatic stimulus, DO<sub>response</sub>. Support for this assumption is also provided by the psychophysical results of Kim and Francis (2011).

Whether the remaining contour, ∂Ω<sub>1</sub>, is an inner or an outer contour (see, for example, **Figure 2**) determines the perceived color of the effect. When the remaining contour is the outer contour, the reversed contour, i.e., the complementary contour [**Figure 2A**, Equation (4)], triggers a diffusion color that is complementary to the color of the inducer, i.e., red in the specific case of **Figure 2B**. The outer contour, ∂Ω<sub>1</sub>, determines that the fill-in color will be complementary to the inducer (negative effect), whereas the inner contour, ∂Ω<sub>2</sub>, determines that the fill-in color will be the same as that of the inducer (positive effect). It should be noted that the mechanism detects the chromatic edges and does not treat the inner or outer edges separately. The configuration and the locations of the remaining contours, and not the model, determine the predicted perceived colors.

It is clear that a remaining contour that overlaps the chromatic gradient plays a role as a diffusion trigger and at the same time as a "blocker." However, our preliminary results suggest that the original chromatic gradient, DO<sub>response</sub>, also plays a role as a diffusion trigger and "blocker," even though it has a weaker effect when it does not overlap the remaining contours. This observation is also supported by the findings of Hazenberg and van Lier (2013), who concluded that the chromatic border in the negative effect "apparently prevented the colored afterimage of the chromatic contour from spreading."

This minor effect of additional blockage, derived from the chromatic edges, has been integrated into the model by applying different weight functions to the chromatic and achromatic borders. The model assumes that the remaining contour also plays a role as an enhancer of the reversed chromatic edges, −DO<sub>response</sub>. Therefore, if the remaining edge, ∂Ω<sub>1</sub>, overlaps the original gradient edge (the chromatic gradients of the inducing stimulus, −DO<sub>response</sub>), it will enhance these chromatic edges. This role is expressed mathematically by the weight functions α and β:

$$
\nabla^2 O\_{p} = -\text{DO}\_{\text{response}} \cdot \left(\alpha \, \partial \Omega\_1 + \beta\right), \text{ where } \alpha > \beta \tag{7}
$$

where O<sub>p</sub> is the perceived image in the opponent color space (Sande et al., 2010), and α and β are constants that can be further extended to be functions.
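The right-hand side of Equation (7) can be sketched as a per-pixel weighting of the reversed gradient. In this illustration we assume that the remaining contour ∂Ω<sub>1</sub> is represented as a binary mask (1 on the contour, 0 elsewhere); this representation, the function name `poisson_source`, and the toy values are our assumptions. The weights α = 1.3 and β = 0.1 are the values reported in the Simulation Details section.

```python
ALPHA = 1.3  # weight of the remaining contour (value reported in the paper)
BETA = 0.1   # weight of the chromatic stimulus edge alone (value reported in the paper)


def poisson_source(do_response, contour_mask, alpha=ALPHA, beta=BETA):
    """Right-hand side of Equation (7): -DO_response * (alpha * contour + beta).

    Hypothetical helper; `contour_mask` is assumed to be 1 where a remaining
    contour is drawn and 0 elsewhere.
    """
    return [
        [-d * (alpha * m + beta) for d, m in zip(do_row, mask_row)]
        for do_row, mask_row in zip(do_response, contour_mask)
    ]


# Toy 1 x 4 example: two chromatic edges with equal DO response, of which
# only the first is overlapped by a remaining contour.
do_response = [[0.25, 0.0, 0.0, 0.25]]
contour_mask = [[1, 0, 0, 0]]
source = poisson_source(do_response, contour_mask)
```

With these weights, an edge that is overlapped by a remaining contour drives the diffusion with a factor of α + β = 1.4, compared with β = 0.1 for an edge without a contour, so the overlapped edge dominates while the bare chromatic edge retains a weak contribution.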

Solving Equation (7) yields the perceived afterimage, O<sub>p</sub>, given the reversed gradients −DO<sub>response</sub>(x, y, i), Equation (4), under specific initial constraints. **Figure 2D** represents the perceived afterimage, O<sub>p</sub>, but with an additional technical stage of transforming the opponent color space O<sub>p</sub> to the RGB color space, I<sub>P(rgb)</sub>, Equation (11).

The interpretation of the solution as suggested above is that a very similar mechanism is responsible for both the negative and the positive effects, although it is possible that the two phenomena do not stem from the exact same visual mechanism. The model may separate the positive and the negative effects into two channels. One channel is for the chromatic area, where the negative effect is more dominant, while the other channel serves the achromatic area, where the positive effect is more dominant. Since the negative effect is given by a response from the chromatically induced region, whereas with the positive effect there is a perceived response in an area that has not been induced with color, we assumed that the weight function of the negative effect should be higher than that of the positive effect (Equation 10). This separation can be justified by analogy to the visual system. The existence of separate Magno, Parvo, and Konio visual pathways suggests that separating chromatic and achromatic calculations in this way may be a true reflection of visual system processing (Shevell, 2003).

We implemented the two separate channels for the positive and negative effects by calculating the diffusion Equation (5) separately for the chromatic and achromatic zones in the original image (I<sub>0</sub>). The positive effect, O<sub>p,positive</sub>, occurs in the achromatic zones of the initial image I<sub>0</sub>, **Figure 2A**, and the negative effect, O<sub>p,negative</sub>, occurs in the chromatic zones of the initial image I<sub>0</sub>, **Figure 2A**. Accordingly, the equation is solved separately for the negative effect, O<sub>p,negative</sub>, and for the positive effect, O<sub>p,positive</sub> (see the section above). The simulation result is calculated as:

$$\nabla^2 O\_{p, \text{negative}}(\mathbf{x}, \mathbf{y}, \mathbf{i}) = -\text{DO}\_{\text{response}}(\mathbf{x}, \mathbf{y}, \mathbf{i}) \cdot (\alpha \,\partial \Omega\_1 + \beta) \,\text{in}\,\Omega\tag{8}$$

$$\nabla^2 O\_{p, \text{positive}}(\mathbf{x}, \mathbf{y}, \mathbf{i}) = -\text{DO}\_{\text{response}}(\mathbf{x}, \mathbf{y}, \mathbf{i}) \cdot (\alpha \,\partial \Omega\_1 + \beta) \,\text{in}\,\overline{\Omega} \tag{9}$$

i = RG, YB, BW, where each opponent channel is solved separately.

$$O\_P = \frac{O\_{p,positive} + O\_{p,negative}}{\max\_{all\\_channels} \left\{ I\_{p,positive} \right\} + \max\_{all\\_channels} \left\{ I\_{p,negative} \right\}} \tag{10}$$

$$I\_{P(rgb)} = OPPONENT^{-1} \left\{ O\_{\mathcal{P}} \right\} \tag{11}$$

where max<sub>all\_channels</sub>{I} is the maximum value over all channels of the image I (max{I} is a scalar). α and β represent the weights of the remaining contours and the chromatic stimulus edges, respectively.
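The inverse transformation in Equation (11) can be sketched by noting that the rows of the matrix in Equation (1) are orthonormal, so its inverse is simply its transpose. The sketch below relies on that property; the function names are ours, and the normalization step of Equation (10) is not included.

```python
import math

# Opponent transformation matrix O from Equation (1).
OPPONENT_MATRIX = [
    [1 / math.sqrt(2), -1 / math.sqrt(2), 0.0],               # O_RG
    [1 / math.sqrt(6), 1 / math.sqrt(6), -2 / math.sqrt(6)],  # O_YB
    [1 / math.sqrt(3), 1 / math.sqrt(3), 1 / math.sqrt(3)],   # O_BW
]


def rgb_to_opponent(rgb):
    """Forward transform of Equation (1) for a single pixel (hypothetical helper)."""
    return tuple(sum(w * c for w, c in zip(row, rgb)) for row in OPPONENT_MATRIX)


def opponent_to_rgb(opp):
    """Inverse transform of Equation (11) for a single pixel (hypothetical helper).

    The rows of O are orthonormal, so the inverse of O equals its transpose.
    """
    return tuple(
        sum(OPPONENT_MATRIX[k][j] * opp[k] for k in range(3)) for j in range(3)
    )


# Round trip on an arbitrary pixel recovers the original RGB values
# up to floating-point error.
original = (0.2, 0.6, 0.9)
recovered = opponent_to_rgb(rgb_to_opponent(original))
```

Using the transpose avoids a general matrix inversion, although any standard linear-algebra routine would give the same result.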

In order to calculate the perceived afterimage from both the negative, O<sub>p,negative</sub>, and the positive, O<sub>p,positive</sub>, effects, Equations (8–10), we need to define (a) the boundary conditions and (b) the initial values. We shall henceforth denote the inducing stimulus (the original color image) by I<sub>0</sub>, where Ω is an area in the image I<sub>0</sub> and ∂Ω is the border of Ω, **Figure 2A**. I<sub>1</sub> is the remaining contour image and ∂Ω<sub>1</sub> or ∂Ω<sub>2</sub> are the remaining contours (the remaining boundaries), although the boundaries in I<sub>1</sub> might be different from those in I<sub>0</sub>. Therefore, the chromatic edges, ∂Ω, do not necessarily overlap the remaining contours ∂Ω<sub>1</sub> or ∂Ω<sub>2</sub>. The boundary conditions of the perceived image I<sub>p</sub> and the initial state (initial conditions) are chosen to be an achromatic white color on the output image border. Thus, the boundary condition is O<sub>p</sub>|<sub>border</sub> = 1, **Figure 2A**, and the initial image is a blank white image (R = G = B = 1). These conditions are selected in order to enable the generation of the perceived afterimage on a white image, as in the original stimuli (Barkan and Spitzer, 2009; van Lier et al., 2009), **Figure 2D**.

#### RESULTS

#### Simulation Details

The simulations are produced by assigning the conditions (boundary conditions and initial values) as described above, and applying the Gauss-Seidel method. The equations are solved in a similar way to that reported in "Methods for Solving Equations" (Simchony et al., 1990) or "Poisson Image Editing" (Pérez et al., 2003). The simulations were implemented in MATLAB.
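A Gauss-Seidel solution of a Poisson equation with the boundary and initial conditions described above can be sketched as follows. This is a generic textbook scheme written in Python rather than the authors' MATLAB code; the function name, the unit grid spacing, the fixed number of sweeps, and the toy source are all assumptions made for illustration.

```python
def gauss_seidel_poisson(source, border_value=1.0, sweeps=500):
    """Solve laplacian(u) = source with u fixed to border_value on the image border.

    Hypothetical helper using the standard 5-point stencil with unit grid
    spacing; a fixed sweep count stands in for a convergence test.
    """
    rows, cols = len(source), len(source[0])
    # Initial state: a blank image equal to the boundary value everywhere.
    u = [[border_value] * cols for _ in range(rows)]
    for _ in range(sweeps):
        for r in range(1, rows - 1):
            for c in range(1, cols - 1):
                neighbours = u[r - 1][c] + u[r + 1][c] + u[r][c - 1] + u[r][c + 1]
                # In-place update: neighbours refreshed earlier in this sweep
                # are reused immediately, which distinguishes Gauss-Seidel
                # from Jacobi iteration.
                u[r][c] = (neighbours - source[r][c]) / 4.0
    return u


# Toy example: a single negative source term in the centre of a 7 x 7 grid.
size = 7
source = [[0.0] * size for _ in range(size)]
source[3][3] = -1.0
u = gauss_seidel_poisson(source)
```

In this toy case the solution peaks at the source location and decays smoothly toward the fixed border value, which is the qualitative spreading behavior that the filling-in stage relies on.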

The only parameters in the model are α and β, which represent the weights of the remaining contours and the chromatic stimulus edges, respectively. The parameters were chosen as follows: α = 1.3 and β = 0.1, as a result of trial and error. These parameters are constant for all the simulations, except in **Supplementary Figure 1**, where we intended to slightly enhance the result for demonstration.

#### Model's Simulation and Predictions

The simulation results are divided into three parts. The first part presents the model predictions for both the negative and the positive afterimage phenomena (Barkan and Spitzer, 2009; van Lier et al., 2009). The second part presents the predictions of the model for two remaining edge variations, as presented in previous studies (Francis, 2010; Kim and Francis, 2011). The third part presents the model predictions for additional aspects of the afterimage phenomenon, where one relates to the color perceived when the remaining edge of the image is not complete (open boundaries, spiral image), and the second relates to spatial averaging of colors (Anstis et al., 2012).

#### Negative and Positive Afterimages

We tested the model on the same stimuli as in the study of van Lier et al. (2009) (**Figure 3**, first row), and for the general case of the chromatic stimuli I<sub>0</sub>, **Figure 2**. **Figure 3** presents the model's predictions for a single colored ring as inducer (second and third rows). It can be seen that the model correctly predicts that the remaining contours can generate a negative or a positive effect depending on their location. Of note, the model correctly predicted the filling-in process of the achromatic area with respect to both negative and positive effects, with the results in accordance with the psychophysical findings reported previously (van Lier et al., 2009; Hazenberg and van Lier, 2013). Having different weight functions for the positive and negative effects (Equation 11) enables us to control the predicted effect of a stronger diffusion in the inner than in the outer region of the remaining contours (**Figure 3**). These studies showed that the perceived afterimage has the complementary color when the outer contour remains (**Figure 3**, second row), while the same color is perceived when the inner contour remains (**Figure 3**, third row).

#### The Role of the Remaining Edges: Comparison to Previous Results

We also tested our model on the same variations of the van Lier et al. (2009) stars stimulus that were tested by Francis (2010) and Kim and Francis (2011). These variations are related to the location and shape of the remaining contour. **Figure 4** presents a comparison between the predictions of our model and that of Francis for two possibilities of drawn remaining contours (**Figures 4C,D**, respectively). In one case, the remaining edges overlap the chromatic gradients that exist in the inducing stimuli (**Figure 4**, first row), while in the second case, there is no overlap (**Figure 4**, second row).

The predictions of both models yielded the same results when the boundaries overlapped (**Figure 4**, first row, **C,D**), and these results agree with the experimental perceived results reported previously (Kim and Francis, 2011). However, the predictions of the models differed when the boundaries were non-overlapping. **Figure 4**, second row, shows that the inner rectangle is reddish (**Figure 4D**) according to our model, but gray according to the predictions of Francis's model (**Figure 4C**). Notably, the psychophysical findings (Kim and Francis, 2011) support our model, which predicts that remaining contours that do not overlap the chromatic gradient do not block the diffusion process.

#### Model Predictions for a New Stimulus With Different Variations of Remaining Edges

Having successfully tested our model on previously described stimuli, we proceeded to further challenge the simulations with new spiral stimuli, which have not been described previously or experimentally tested. The new stimuli can simultaneously generate both positive and negative effects because they have both inner and outer borders. This type of stimulus enables us to test a critical property regarding the effect of closed or open remaining edges, on the relevant aftereffects.

The model's results for the spiral stimuli indicated that the dominant color perceived in the afterimage depends on whether the remaining edges are the inner or outer edge (first and second rows of **Figure 5**, respectively). The dominant color predicted by our model can therefore be either complementary or similar to that of the inducer color, where the outer border produces a dominant positive effect, while the inner border produces a dominant negative effect (**Figure 5C**). These predictions are supported by preliminary psychophysical results (manuscript in preparation).

As a further test, we examined the ability of our model to predict the psychophysical results of the aftereffects that can be perceived from performing spatial averaging within the remaining contours (Anstis et al., 2012). This question was tested by our model simulation under two configurations representing variations of the negative and positive effects (**Figures 6**, **7**). While the negative stimuli are as previously reported (Anstis et al., 2012), the positive stimuli are new and are designed to induce the positive effect. **Figures 6**, **7** demonstrate the model's predictions for the negative and positive versions of the averaging effect, respectively. Note that only the positive configuration (**Figure 7**) induces a classical filling-in, since this is the only configuration where there is an achromatic area that can be filled with color.

FIGURE 4 | A comparison of our model's predictions to those of Francis for the two locations of the remaining edge that Francis tested experimentally. (A) The chromatic stimulus. (B) The remaining contours. (C) The simulation results as reported in (Francis, 2010; Kim and Francis, 2011). (D) The simulation results of the suggested model. In the first row, the inner drawn contours (B) overlap the chromatic gradients that exist in the inducing stimulus (A). In the second row, the inner drawn contours do not overlap the chromatic gradients of the inducing stimulus. The results in (D) are in agreement with psychophysical experiments (Kim and Francis, 2011).

FIGURE 5 | The model predictions for the spiral stimulus with variations in the remaining contours. (A) The chromatic stimulus. (B) The remaining contours. (C) Model's predictions. In this figure, the chromatic stimulus is the same in all the rows, column (A), but the remaining contours are different. In the first row, the drawn contour is a full spiral. In the second row, the outer edge of the spiral shape is presented, and in the third row the drawn contour is the inner edge of the spiral shape, column (B). Our model predicts that both cyan and red colors are dominant in the full spiral (first row). When the remaining contour is the outer one, the dominant perceived color is reddish (second row), while, when the remaining contour is on the inner side, the dominant color is cyan (third row).

### DISCUSSION

The suggested model involves several stages that can be regarded as a cascade of component mechanisms and responses, i.e., a short-duration chromatic stimulus, cessation of this stimulus, and creation of complementary chromatic edges which trigger a diffusion process. The suggested model predicts afterimage phenomena, some of which might appear to be "opposite" ("conflicting") effects, through the same mechanism and therefore the same equations.

FIGURE 6 | Model predictions for averaging of negative afterimage colors (Anstis et al., 2012). (A) The chromatic stimulus. (B) The remaining contours with two different locations. (C) The model's predictions. It appears that colors of the remaining contours determine the role of color averaging.

there is also an averaging of colors in the positive effect with these new averaging color stimuli.

We present here a model that is able to predict both the negative and the positive effects, i.e., where the illusory filled-in color is either the same color as, or complementary to, that of the inducer. The model can therefore also predict both the famous "filling-in the afterimage after the image" illusion and the "color dove illusion" (van Lier et al., 2009; Barkan and Spitzer, 2017). In addition, the model can predict both the positive and the negative versions of the effect in shapes that possess non-closed remaining edges, and it successfully predicted a recently reported predominantly negative afterimage effect related to averaging of colors (Anstis et al., 2012), **Figure 6**.

It might be claimed that diffusion models have been previously suggested to predict the aftereffect in general, and also to predict the alternating aftereffect (Grossberg and Mingolla, 1985; Grossberg and Todorovic, 1988; Francis and Rothmayer, 2003; Francis and Ericson, 2004; Francis and Schoonveld, 2005; Wede and Francis, 2006; Van Horn and Francis, 2008). However, in contrast to previous models, such as FACADE, in our model the trigger for the diffusion mechanism is a "heat source," implemented through the diffusion (or heat) equation with a source term, i.e., the Poisson equation (Weickert, 1998). In other words, in our model, the edges are the only trigger for the diffusion process, and have no other role, for example as direct blockers of the diffusion process, as presented in the FACADE model (Grossberg and Mingolla, 1985; Grossberg and Todorovic, 1988; Francis and Rothmayer, 2003; Francis and Ericson, 2004; Francis and Schoonveld, 2005; Wede and Francis, 2006; Van Horn and Francis, 2008). This difference in rationale between FACADE and our model leads to a different structure of diffusion models (Equation 7). While the FACADE model is composed of two separate components, (1) the Boundary Contour System (BCS) and (2) the Feature Contour System (FCS), our model consists of a single component. This component includes both the borders and the diffusion mechanism, which are computed in the same process (Equation 7). It is not surprising that such differences give rise to different predictions in the two types of models, as will be described below.

The model described by Francis (2010) succeeded in predicting the negative effect (van Lier et al., 2009), in which the visual afterimage could spread across regions that were not colored in the inducing stimulus. He also showed, by applying the FACADE model (Grossberg and Mingolla, 1985), that the perceived color and shape of the afterimage could be manipulated by remaining contours that apparently trapped the spread of afterimage color signals. However, this model also mistakenly predicts that a remaining edge will block the spread of color even if there is no overlap with the chromatic gradient edge border (**Figure 1B** in: Francis, 2010). This prediction is in disagreement with the psychophysical findings of the experiments conducted by Kim and Francis (2011). In contrast, our simulations indicate that the diffusion process is not blocked when the achromatic remaining contours do not overlap the chromatic contours.

In addition, as already claimed in the introduction, Francis's model cannot predict the positive effect, since his model assumes that the spread of complementary color across a surface will be blocked by the remaining contour. According to the Francis model (Francis, 2010), the positive effect is predicted to be negated, due to the role of the remaining contour, which prevents diffusion of the color to the inner part of the shape. Consequently, the model cannot predict the possibility of obtaining a result that shows perception of the same color as that of the inducer at a different spatial location. Our model, on the other hand, can predict the positive effect (**Figure 3**), since it assumes that the main role of the contours is to trigger the diffusion process and not primarily to block it.

It should be clarified that the FACADE model has been implemented with a number of different diffusion algorithms. Francis, for example, implemented the filling-in process by using a Connected-Component algorithm (Francis and Rothmayer, 2003; Francis, 2010). In other FACADE models, the diffusion process is implemented with an iterative algorithm, in which each pixel is averaged with adjacent pixels only if the neighbors are not edges (Grossberg and Todorovic, 1988; Francis and Ericson, 2004; Francis and Schoonveld, 2005; Wede and Francis, 2006; Van Horn and Francis, 2008). In additional studies (Francis and Ericson, 2004; Francis and Schoonveld, 2005), the diffusion model was extended in order to predict additional properties that are related to the MCAI effect. Consequently, the investigators suggested a "non-diffusion" filling-in mechanism, built from directional operations. It should be noted that in order to predict the MCAI effect, a special component was added to the FACADE model, which expresses the inhibition between orthogonally oriented grids (Francis and Ericson, 2004; Francis and Schoonveld, 2005; Wede and Francis, 2006; Van Horn and Francis, 2008). One important question is whether any of these previous diffusion implementations of the FACADE model (Grossberg and Mingolla, 1985; Grossberg and Todorovic, 1988; Francis and Rothmayer, 2003; Francis and Ericson, 2004; Francis and Schoonveld, 2005; Wede and Francis, 2006; Van Horn and Francis, 2008; Francis, 2010) can successfully predict the positive effect and its variations.

Since the FACADE models mentioned above share the same BCS, which traps the diffusion process and prevents diffusion of the color to the inner part of the shape, they wrongly predict blockage of the diffusion process in the inner shape, contrary to what was found experimentally (Kim and Francis, 2011). They also cannot predict the possibility of obtaining the same color as the inducer at different spatial locations and thus cannot predict the positive effect.

While both types of model (ours and the FACADE) assume that the filling-in process is performed by the isomorphic diffusion mechanisms, other groups have suggested that the symbolic mechanism might determine the diffusion process (von der Heydt et al., 2003; Komatsu, 2006; On and van Boxtel, 2017). According to the symbolic theory, "early visual areas extract only the contrast information at the surface border, while the color and shape of the surface are reconstructed in higher areas on the basis of this information" (Komatsu, 2006). Komatsu (2006) reported, however, that neuronal activity of V1 and V2 plays a role in most of the filling-in phenomena such as filling-in at the blind spot, the Craik–O'Brien–Cornsweet illusion, or neon color spreading.

A recent experimental study (On and van Boxtel, 2017) suggested a symbolic mechanism for the negative effect seen in the "stars" of van Lier et al. (2009). They hypothesized that transparency cues play an important role in the filling-in process of the negative effect and attempted to validate this suggestion through psychophysical experiments. Their results indicated that transparency cues are a prerequisite for the perceived filling-in effect. When the transparency cues were eliminated by removing one color from the star, the new stimulus contained only one color (**Figure 1B**, in: On and van Boxtel, 2017, **Figure 8**), and the filling-in effect indeed vanished. However, there is a different and even simpler explanation for their psychophysical results.

**Figure 8** demonstrates our model's prediction for this specific star stimulus. The rationale for this correct prediction is that when the negative and the positive effects act on the same spatial location, they cancel each other out, because complementary colors are induced simultaneously at that location (**Figure 8**; Hazenberg and van Lier, 2013). The original star stimulus of van Lier et al. (2009) consisted of a similar combination of the negative and the positive effects, although in that case the two effects enhanced each other. This enhancement was due to the fact that the stars contain two complementary colors (cyan and reddish). When the cyan four-point star is located inside the remaining contour, the negative effect is produced and the perceived complementary color is reddish. In this case, however, because the reddish four-point star is located outside the remaining contour, it gives rise to the positive effect, where the perceived color would also be reddish. The perceived reddish color is therefore enhanced by the combination of the positive and the negative effects.
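The cancellation and enhancement argument above can be illustrated with a toy calculation. The sketch below is not the model's equations: it assumes a single signed opponent axis (a hypothetical encoding in which reddish is +1 and cyan is -1), treats the negative effect as a sign flip and the positive effect as a sign-preserving copy, and simply adds the two induced colors at one location.

```python
# Toy illustration only: one signed opponent axis, reddish = +1, cyan = -1.
# The encoding and the additive combination are assumptions made for this
# sketch; they are not the equations of the model discussed in the text.
REDDISH = 1.0
CYAN = -1.0


def negative_effect(inducer):
    """Induced color is the complement of the inducer (sign flip)."""
    return -inducer


def positive_effect(inducer):
    """Induced color is the same as the inducer."""
    return inducer


def perceived(inside_inducer, outside_inducer):
    """Add the color induced by the inducer inside the remaining contour
    (negative effect) to the one induced from outside (positive effect)."""
    return negative_effect(inside_inducer) + positive_effect(outside_inducer)


# Two complementary colors: both effects induce reddish, so they add up.
two_color_star = perceived(CYAN, REDDISH)

# One color only: the two effects induce complementary colors and cancel.
one_color_star = perceived(CYAN, CYAN)
```

Under these assumptions the two-color configuration yields a doubled reddish signal, whereas the one-color configuration yields zero, which mirrors the qualitative argument in the text.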

It is interesting to consider the stages of analysis of the proposed model as related to components of the visual system. The formation of a complementary or opponent chromatic edge following the cessation of chromatic stimulus (**Figure 2**) has recently been described in the literature as being attributable to a rebound response (Off response), evoked as a burst of spikes from neurons released from the period of inhibition (Spitzer et al., 1993; Grunfeld and Spitzer, 1995; Francis, 2010; Zaidi et al., 2012). The mechanism by which this produces the perception of the complementary color was suggested to be through cross inhibition between opponent channels (Grossberg, 1972; Francis, 2010), or through fast adaptation from the first order (Spitzer and Semo, 2002; Spitzer and Barkan, 2005). The mechanism suggested for the rebound model of Grunfeld and Spitzer (1995) includes the parameters required for the rebound effect, such as the duration of adaptation, the rate and the intensity of the offset of the stimulus. The current model does not include these additional stimulus parameters, but we plan to include these parameters in future.

The development of a further stage of the model has to be discussed in relation to the visual system and to other models. After the rebound response creates the complementary color, the diffusion process is triggered by different components in each model. According to the FACADE model (Grossberg and Mingolla, 1985; Grossberg and Todorovic, 1988; Francis and Rothmayer, 2003; Francis and Ericson, 2004; Francis and Schoonveld, 2005; Wede and Francis, 2006; Van Horn and Francis, 2008), the trigger for the diffusion process is the color of the surface at each location. This was described as "color spreads all across the surface within the boundary" (Kim and Francis, 2011). In contrast, in our model, the borders (the chromatic edges, i.e., double opponent, in the chromatic stimulus and the remaining contours, as a modulation to the chromatic edges) are the trigger for the diffusion process (Equation 7).

The experimental results of Hazenberg and van Lier (2013) appear to support our model with regard to the trigger for the diffusion process. These researchers demonstrated experimentally that the location of the remaining contour that overlaps the chromatic edge can determine whether the result will be a positive or a negative effect. In fact, our model suggests that the perceived chromatic edge triggers an isomorphic filling-in process, according to isomorphic filling-in theory (von der Heydt et al., 2003). It should be noted that the idea that an afterimage of the chromatic contours triggers the isomorphic diffusion process has been raised previously by Hazenberg and van Lier (2013). It has also been suggested that the color signals in this type of filling-in process spread in all directions except across borders formed by contour activity (Gerrits and Vendrik, 1970; Cohen and Grossberg, 1984; Arrington, 1994; von der Heydt et al., 2003). The role of the remaining contour is therefore in agreement with the previous suggestion that the contours act as diffusion barriers (Cohen and Grossberg, 1984; von der Heydt et al., 2003). However, according to the current model, this remaining contour is effective as a barrier only when it overlaps with the original chromatic edge of the inducer stimulus. Our model therefore suggests that the remaining contour fulfills two functions: (a) enhancing the effect of the inverted chromatic edge (Equation 4), and (b) trapping the diffusion. This dual role is supported by the isomorphic filling-in theory of von der Heydt et al. (2003), who suggested that the chromatic or achromatic receptive field plays a role in the filling-in process. The chromatic-edge receptive fields receive additional activation through horizontal connections (Gilbert and Wiesel, 1979), which keep the border activity high. Their suggestion is general and was not specifically related to the visual effects discussed here (the positive and negative effects).
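The notion of contours acting as diffusion barriers can be made concrete with a minimal one-dimensional sketch. It is an illustration under simplifying assumptions and not the model's diffusion equation: the function name, the `rate` and `steps` parameters, and the representation of a barrier as a blocked link between two neighboring cells are all hypothetical. The additional condition that a contour only acts as a barrier where it overlaps a chromatic edge is not modeled here.

```python
def diffuse_with_barriers(signal, barriers, steps=500, rate=0.25):
    """Spread a 1-D color signal between neighboring cells.

    barriers[i] is True when the link between cell i and cell i + 1 is
    blocked (a contour lies between them), so no color crosses that link.
    """
    values = list(signal)
    n = len(values)
    for _ in range(steps):
        change = [0.0] * n
        for i in range(n - 1):
            if barriers[i]:
                continue  # the contour traps the diffusion here
            flow = rate * (values[i + 1] - values[i])
            change[i] += flow
            change[i + 1] -= flow
        values = [v + c for v, c in zip(values, change)]
    return values


# A single colored cell (index 2) and one contour between cells 3 and 4.
signal = [0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0]
barriers = [False, False, False, True, False, False, False]
filled = diffuse_with_barriers(signal, barriers)
```

In this sketch the color spreads evenly over the four cells on its own side of the contour and never reaches the cells beyond it, while the total amount of color is conserved.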

In addition to the crucial role of the remaining contour, which overlaps the chromatic gradients, the chromatic edges (by themselves) also play a role in the perceived afterimage (Equation 11). This assumption is supported by the findings of Hazenberg and van Lier (2013), who reported that the filling-in process (in their version for the positive effect) should be influenced less by the chromatic gradients (Anstis et al., 1978; Hazenberg and van Lier, 2013).

Since the model takes into account the role of the chromatic edges, albeit with less weight than the remaining contour, it predicts that the diffusion at the negative effect will be partially blocked by the original chromatic gradient of the inducing stimulus. As a result, it predicts that the diffusion will not spread to the central area in the negative effect stimuli, **Figure 3**.

Our model predicts that if a border does not exist in the original inducing stimulus, it will not block the diffusion process, as found psychophysically (Kim and Francis, 2011). After conducting psychophysical experiments, Kim and Francis (2011) formulated a qualitative rule that additional contours block color spreading when these contours overlap the inducer edges, but not when they are separated (**Supplementary Figure 1**). It has to be noted that our model's predictions of these results also agree with the qualitative arguments of Hazenberg and van Lier (2013) that there has to be a match (or overlap) between the chromatic edges and the remaining contours. This is derived from a repeated activation of orientation selective neurons that also code for color (von der Heydt et al., 2003).

We also investigated the question of whether it is necessary for the remaining contour to be closed or whether an open spiral stimulus (**Figure 5**) can produce the effect. Preliminary results are in agreement with our model predictions that the effect can exist in open boundary conditions (**Figure 5**). It should be noted that Francis's simulations cannot predict the negative effect in open boundary conditions, such as in the spiral stimulus (**Figure 5**), because his model depends on a boundary that traps the spread of color (Francis, 2010). However, by applying a previous diffusion model, as in Grossberg and Todorovic (1988), a correct prediction can be achieved, but only for the negative parts of the spiral illusion (i.e., only the configuration where the inner border of the spiral is displayed, third row of **Figure 5**). This is because this case involves a diffusion process rather than a Connected Component algorithm as in the Francis implementation (Francis and Rothmayer, 2003; Francis, 2010). However, this modification still cannot predict the positive effect in the spiral illusion (second row of **Figure 5**).

A further question was whether the aftereffects can be perceived from spatial averaging within the area of remaining contours. Anstis et al. (2012) showed that colors can undergo spatial averaging within, but not across, contours, but tested this effect only on the negative aftereffect. Our model's simulations (**Figures 6**, **7**) are in agreement with the experiments conducted by Anstis et al. (2012). We believe that even if the Francis model were able to predict this averaging effect, it could do so only for the negative configuration of the effect.

Our results thus far suggest that the same basic mechanism is responsible for both the negative and the positive effects, but there remains a question as to whether there are additional mechanism components that differentiate between these two effects. The recent study of Hazenberg and van Lier (2013) can shed light on this issue, since they investigated several properties of the positive and the negative effects on the afterimage watercolor stimuli. Specifically, they examined the role of the intensity of the inner area of the inducer stimulus and the remaining contour with reference to the positive and the negative effects.

The results of their study indicated that the filling-in effect was stronger in the negative effect under conditions where the inner area of the inducer stimulus was gray (iso-luminance with the chromatic borders) rather than white. This preference was not found in the positive effect. Hazenberg and van Lier (2013) interpreted these findings as the result of the influence of the luminance border between the inner chromatic contour and the interior area. This luminance border was presumed to prevent the colored afterimage of the chromatic contour from spreading. However, under iso-luminance conditions, the luminance borders do not exist, and indeed, the filling-in process is more prominently perceived. Our model can be modified, by taking into account a combination of the chromatic and the achromatic gradients of the chromatic stimulus, in order to predict this influence on the inner area intensity. Due to the differences related to the positive and negative effects, our model predicts that the negative effect will be more prominent with regard to the degree of saturation, while the positive effect will be more prominent in its ability to perform a filling-in task. This prediction should be confirmed by psychophysical experiments.

In order to test the role of the intensity of the remaining contour, Hazenberg and van Lier (2013) used thick contours colored either light or dark gray as the remaining contours. They reported that the filling-in effect was perceived only when the contours were gray and not black, and only in the case of the positive effect (i.e., where the perceived color is the same as the inducer).

We now suggest that, according to our model (**Figure 3**), both gray and black contours can create a complementary color effect, but only in the near vicinity of the chromatic border in the original chromatic stimulus. It is possible that the lack of filling-in color in the positive effect (**Figure 8** in: Hazenberg and van Lier, 2013) was a consequence of the thickness of the remaining contours, since in the positive effect the color has to diffuse through the remaining contour. The border contrast with a gray contour is weaker, and therefore reveals a partial filling-in effect. We suggest that the negative effect was not observed in the reported experiments (Hazenberg and van Lier, 2013) because they were looking mainly at the central area of the stimulus. Such a filling-in color is not expected in the inner white area (**Figure 7** in: Hazenberg and van Lier, 2013) because it is blocked by the luminance border, which contributes to the blockage of the filling-in process [Equation (7), **Figure 3**].

Additional factors that might affect the degree of the aftereffect include, e.g., the size of the inducer and induced area, the shape curvature of the chromatic edge, and the exposure duration of the chromatic stimulus. These factors should be separately investigated experimentally for their influence on the positive and the negative effects. Psychophysical experiments are important in order to detect differences in the mechanisms acting in these two types of effects. In addition, psychophysical experiments are required for cases where the remaining contours that trigger the filling-in effect are illusory contours, such as those in the Kanizsa effect and the neon color spreading effects (Van Tuijl, 1975; Kanizsa, 1976). This should be tested separately for the positive and negative effects. Our model predicts that for an illusory contour stimulus (which replaces the achromatic remaining contour), the chromatic and the illusory remaining contour have to overlap. However, we believe that the mechanism which creates the illusory contour (as produced in a Kanizsa illusion) is different from the filling-in mechanism. Different computational models have been suggested in the literature for the Kanizsa illusion (Grossberg and Mingolla, 1985, 1987; Heitger et al., 1998; Ron and Spitzer, 2011). In order to include the prediction of the filling-in effect triggered by illusory contours, we will need to combine the different mechanisms of the illusory contours and the filling-in mechanism, and will therefore need to add a model component that detects the illusory contours.

The MCAI effect (MacKay, 1957; Vidyasagar et al., 1999) is an alternating aftereffect, but it differs from the positive and the negative aftereffects in that it contains an additional component, which relates to a different mechanism. This component enables oriented adaptation in the MCAI oriented stimulus (more specifically, of the flickering grid in the relevant stimulus). We expect that our filling-in model will predict this MCAI effect, but only if an additional component that describes such an oriented adaptation mechanism is added to the model.

Even though the present model does not permit predictions of the behavior of all the free parameters that play a role in the negative and positive effects, this is the first time that a computational model has been able to make crucial predictions on both the positive and the negative effects. In other words, our model succeeds in predicting apparently conflicting phenomena, i.e., those producing the complementary or same color aftereffect, and implies that the same mechanisms function in both effects despite the different manifestations. An important conclusion of this study is that a different appearance does not necessarily imply a difference in the causative mechanisms and driving forces.

The proposed model has several possible applications. For example, it has the potential to serve as an algorithm for the restoration of corrupted old images and videos. Such an algorithm may be able to make an educated guess for the filling-in color based on partial information, such as when only the remaining contours are available.

#### AUTHOR CONTRIBUTIONS

All authors listed have made a substantial, direct and intellectual contribution to the work, and approved it for publication.

#### ACKNOWLEDGMENTS

A short version of this model was presented at a conference (Cohen-Duwek and Spitzer, 2017).

#### SUPPLEMENTARY MATERIAL

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fnins.2018.00559/full#supplementary-material


**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2018 Cohen-Duwek and Spitzer. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# Predicting Illusory Contours Without Extracting Special Image Features

Albert Yankelovich<sup>1</sup> \* and Hedva Spitzer <sup>2</sup>

*<sup>1</sup> Department of Biomedical Engineering, Faculty of Engineering, Tel Aviv University, Tel Aviv, Israel, <sup>2</sup> Faculty of Engineering, School of Electrical Engineering, Tel Aviv University, Tel Aviv, Israel*

Boundary completion is one of the desired properties of a robust object boundary detection model, since in real-world images the object boundaries are commonly not fully and clearly seen. An extreme example of boundary completion occurs in images with illusory contours, where the visual system completes boundaries in locations without an intensity gradient. Most illusory contour models extract special image features, such as L and T junctions, although this task is known to be difficult in real-world images. The proposed model uses a functional optimization approach, in which a cost value is assigned to any boundary arrangement in order to find the arrangement with minimal cost. The functional accounts for basic object properties, such as alignment with the image, object boundary continuity, and boundary simplicity. The encoding of these properties in the functional does not require the extraction of special features, since the alignment with the image only requires extraction of the image edges. The boundary arrangement is represented by a border ownership map, holding object boundary segments in discrete locations and directions. The model finds multiple possible image interpretations, which are ranked according to the probability that they will be perceived. This is achieved by using a novel approach that represents the different image interpretations by multiple local minima of the functional. The model is successfully applied to objects with real and illusory contours. In the case of the Kanizsa illusion, the model predicts both the illusory and the real (pacman) image interpretations. The model is a proof of concept and is currently restricted to synthetic gray-scale images with solid regions.

#### Edited by:

*Jonathan D. Victor, Weill Cornell Medicine, Cornell University, United States*

#### Reviewed by:

*Mikhail Katkov, Weizmann Institute of Science, Israel; Leila Montaser-Kouhsari, Columbia University, United States*

\*Correspondence:

*Albert Yankelovich alberovich@gmail.com*

Received: *14 July 2018* Accepted: *13 December 2018* Published: *18 January 2019*

#### Citation:

*Yankelovich A and Spitzer H (2019) Predicting Illusory Contours Without Extracting Special Image Features. Front. Comput. Neurosci. 12:106. doi: 10.3389/fncom.2018.00106*

Keywords: figure ground segregation, illusory contours, functional minimization, multiple perceptions, computational Gestalt

# INTRODUCTION

An important and non-trivial task in the process of image understanding is the detection of object boundaries, also termed figure-ground segregation or image segmentation. This task is especially difficult in conditions where the object boundary is not fully visible. The human visual system, in many cases, is able to construct the whole object boundary (Kanizsa, 1955). An extreme example of such a completion is demonstrated by illusory contours (**Figures 1A,B**), where the visual system "creates" object boundaries in locations without any intensity gradient (Schumann, 1900; Ehrenstein, 1925; Kanizsa, 1955; Gregory, 1972; Kennedy and Lee, 1976; Day and Jory, 1980; Prazdny, 1983; Bradley, 1987; Kennedy, 1988).

While numerous models for performing image segmentation have been reported (Leclerc, 1989; Nitzberg and Mumford, 1990; Pal and Pal, 1993), relatively few are designed to incorporate illusory contours. Most of the models are capable of generating illusory contours by extracting special image features, such as L-junctions, T-junctions, and line-ends (**Figure 1C**), and using them as keypoints to create the illusory contours (Finkel and Edelman, 1989; Guy and Medioni, 1993; Williams and Hanson, 1994; Gove et al., 1995; Williams and Jacobs, 1995; review: Lesher, 1995; Kumaran et al., 1996; Heitger et al., 1998; Kogo et al., 2002; Ron and Spitzer, 2011). This approach is supported by psychophysical evidence that the existence of special image features is required for illusory contours to emerge (Rubin, 2001). Many of these models exploit neurophysiological knowledge about neuronal mechanisms of the visual system. For example, in the model of Heitger et al. (1998), the responses of end-stopped cells that detect L-junctions and line-ends are grouped and added to the responses of simple cells, which detect image edges (image intensity gradient), to produce the illusory contour.

Extracting special features is a difficult task in real-world images, since in order to decide which junctions are significant relative to others, the structure of the scene in the image needs to be understood (Nitzberg and Mumford, 1990). In addition, the fact that only a small fraction of the image is exploited for special feature extraction (the image region around the special feature point) makes this approach less robust.

A widely accepted explanation of illusory contours is the perception of relative depth, where the illusory contour represents the boundary of an object located at a different depth from the region around it (Kanizsa, 1955; Coren, 1972; Gregory, 1972; Lesher, 1995). According to this point of view, the illusory contours are just regular object boundaries, with the object intensity being the same as that of the background. The object with the illusory contour is revealed by the objects that are occluded behind it, as in **Figure 1B**. The special image features, such as L-junctions and line ends, can provide a clue for object occlusion. Extracting special features, however, means making a specific effort to detect illusory contours. In this case the illusory contours are not treated as regular contours. We prefer not to extract special features and instead to use a common way to detect both real and illusory object boundaries. Detection of illusory contours without using special image features is very challenging, since it requires the prediction of contours ex nihilo, without using the occlusion clues.

An approach that has the potential to avoid extracting special image features is functional optimization, used by some boundary detection models capable of generating illusory contours (Kass et al., 1988; Madarasmi et al., 1994; Williams and Hanson, 1994; Geiger et al., 1996; Saund, 1999; Gao et al., 2007). The functional is used to give a score to each contour configuration, and the final contours are not "constructed" by the model, but rather "come out" as the minimizer of the functional. Special features extraction is not necessarily required in these models, since the demand that the resultant boundaries match the input image can be expressed in the functional without it. An additional significant advantage of the functional optimization approach is that giving a preference score to a given contour configuration is much simpler than constructing the correct contour configuration. The optimization approach is a computational realization of Gestalt psychology (Koffka, 1935), since it derives the contours from contour configuration preference rules ("grouping rules" in Gestalt psychology). In this way it accounts for both real and illusory contours with a general unified approach.

Kass et al. (1988) applied the snakes algorithm of energy-minimizing splines to track image edges. The continuity and elasticity properties of the snakes enable the illusory contours to emerge. This model indeed does not extract special image features; however, it is not fully automatic, since user interaction is required to draw the initial contour. One might argue that automatic initial contours, such as a matrix of small circles, could be used; in this case, however, illusory contours will be extracted even for images that actually lack them. For example, the model will predict illusory contours for a Kanizsa illusion configuration with solid circle inducing elements, although in this case the illusory contour is not perceived. Currently there is no fully automatic boundary detection model that does not require special features extraction for illusory contours generation.

The proposed model is a proof of concept and is restricted to gray-scale images with solid non-textured regions and without lines. The emphasis of the model is not on the way the Gestalt rules are encoded, nor on the rules themselves, but on the mere possibility of predicting real and illusory boundaries based solely on general boundary formation rules.

## METHODS

# Model Rationale

The basic idea of the model was inspired by the assumption that object detection is one of the intelligent tasks performed by the visual system. This task uses a set of simple assumptions, based on our natural perception of an object's appearance, to provide the most reasonable "explanation" of what is presented in the image. With several possible perceptions of what we see, a critical question is what makes us prefer one perception over another. We are especially interested in revealing the reason for the perception of illusory contours. As an example, let us consider the Kanizsa triangle (Kanizsa, 1955) in **Figure 2** and examine the factors responsible for the perception of an illusory contour in this case.

The perception in **Figure 2A** is that of three "pacman" objects and the perception in **Figure 2B** is of a triangular object above three circular objects. In the pacman perception the boundary of each pacman has three corners (or bends): two convex and one concave. In the triangle perception, on the other hand, there is only one bend instead of three per pacman, since the circle is perceived as continuing under the triangle. Moreover, the concave bend at the circle center is replaced by a convex bend of the triangle vertex. The conclusion is that in the illusory interpretation the object's boundary is less bent and the bends are more convex. Both criteria can be derived from a preference for the simplest description (van Tuijl, 1975). The preference for convex bends also explains why, in the image containing a square (**Figure 7D**), we perceive a square object more readily than a square hole.
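The bend-counting argument above can be written as a toy cost comparison. The sketch below is only an illustration of the reasoning and is not the functional of the proposed model: the function `boundary_cost` and its weights `w_bend` and `w_concave` are hypothetical, and the bend counts are taken directly from the description of the two Kanizsa interpretations.

```python
def boundary_cost(convex_bends, concave_bends, w_bend=1.0, w_concave=1.0):
    """Toy cost: every bend is penalized, and concave bends are
    penalized once more (weights are arbitrary illustrative values)."""
    total_bends = convex_bends + concave_bends
    return w_bend * total_bends + w_concave * concave_bends


# Pacman interpretation: three pacmen, each with two convex bends
# and one concave bend.
pacman_cost = boundary_cost(convex_bends=3 * 2, concave_bends=3 * 1)

# Illusory interpretation: one convex bend (a triangle vertex) per inducer,
# and the circles continue under the triangle without bends.
triangle_cost = boundary_cost(convex_bends=3 * 1, concave_bends=0)

preferred = "triangle" if triangle_cost < pacman_cost else "pacman"
```

With any positive weights the illusory interpretation receives the lower cost in this toy setting, since it has both fewer bends and no concave bends.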

Although the functional optimization approach enables us to avoid special features extraction, it has the drawback of a tremendous search space of possible solutions. To overcome this issue, we use an "economic" boundary representation called a border ownership map, holding boundary segments in discrete locations and discrete directions. Our representation is inspired by the neural findings of Zhou et al. (2000), who discovered V1 visual cortical cells that respond to an edge only when the object is located on one of the edge sides. This ability was already termed border ownership by Nakayama and Shimojo (1990). Using the border ownership map makes the number of free variables of the problem much smaller than using, for example, contour parametrization.

An additional difficulty is that a functional that accounts for several object boundary properties and depends on many variables has a large number of local minima. To overcome this, the functional was smoothed and its minimizers were found by a gradual relaxation technique (Lee, 1995). This reduces the number of minima by smoothing out the shallow minima, so that only prominent stable minima are found.
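The effect of smoothing followed by gradual relaxation can be sketched on a one-dimensional toy cost. This is a generic graduated-smoothing illustration and not the model's functional or the specific technique of Lee (1995): the cost 0.1x² + sin(3x), the smoothing schedule, and the step size are all assumptions chosen for the example. Gaussian smoothing of this cost has a closed form, which keeps the sketch short.

```python
import math

K = 3.0  # frequency of the ripple that creates the local minima


def smoothed_gradient(x, s):
    """Gradient of f(x) = 0.1*x**2 + sin(K*x) after Gaussian smoothing
    with standard deviation s. Smoothing damps the sine term by
    exp(-(K*s)**2 / 2); s = 0 gives the gradient of f itself."""
    damping = math.exp(-0.5 * (K * s) ** 2)
    return 0.2 * x + damping * K * math.cos(K * x)


def descend(x, s, step=0.05, iters=2000):
    """Plain gradient descent on the cost smoothed at level s."""
    for _ in range(iters):
        x -= step * smoothed_gradient(x, s)
    return x


def gradual_relaxation(x0, schedule=(2.0, 1.0, 0.5, 0.25, 0.0)):
    """Minimize a heavily smoothed cost first, then reduce the smoothing
    step by step, restarting each descent from the previous minimizer."""
    x = x0
    for s in schedule:
        x = descend(x, s)
    return x


def cost(x):
    return 0.1 * x * x + math.sin(K * x)


x_plain = descend(4.0, 0.0)        # gets stuck in a nearby shallow minimum
x_relaxed = gradual_relaxation(4.0)  # tracks the prominent minimum
```

In this toy case, descent on the unsmoothed cost stops at the local minimum closest to the starting point, whereas the relaxed schedule ends at a minimum with a lower cost.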

In the proposed model we define a functional that accounts for basic object properties, such as boundary continuity and convexity, and demands that the object boundaries match the input image. The object boundaries are found as the minimizer of the functional. The illusory contours are predicted in the same way as the real contours, by being the most probable object boundaries matching the input image. This is the first time that the perception of illusory contours has been shown computationally to arise from a general object boundary detection task.

The minimizers of the functional are compared to the expected perception, known from psychophysical evidence. Because we assume that the visual system is actually finding the best solution to object formation rules, we are not obliged to use the mechanisms of the visual system (which are also not fully known) to find that solution. Nevertheless, the model exploits some of the physiological knowledge of low-level mechanisms of the visual system, such as a simplification of the visual cell receptive fields that perform edge detection [section Border ownership at image edges (F<sup>A</sup>)], a logical "and" operation (Appendix section 1.2), and cell response grouping (Appendix section 1.1). In addition, the model includes the crucial component of the border ownership map (section Boundary Representation).

Using functional minimization in the model has an additional important benefit. Usually, there are several possible object configurations that can explain a single image (**Figure 2**). Multiple image interpretations are present even in the simplest image of a white square on a black background (**Figure 7D**). This image can be interpreted as a white square object over a black background, or as a black frame with a square hole through which a white background is seen. The illusory Kanizsa triangle (**Figure 1B**) also has several possible interpretations. The most prominent is the illusory interpretation of a white triangle occluding three black solid circles and a black boundary triangle (Ringach and Shapley, 1996). An additional easily perceived interpretation does not include an illusory triangle, but consists of three cut-out circles ("pacmans") and three V-shaped figures. For real-world images there may be numerous plausible configurations of objects. The desired interpretation may be chosen, for example, by applying higher-level knowledge, such as object recognition. The ability to predict multiple possible perceptions of the image is therefore a desired property of a robust boundary detection model. The multiple possible image interpretations described above are represented in our model by multiple minima of the functional.

FIGURE 3 | Border ownership map illustration. (A) Schematic illustration. The bold arrows are the border ownership vectors, while the vector length indicates the edge strength and the vector direction indicates the edge direction. The length of the ownership vector represents the probability that there is an object edge that passes through the vector origin in an orientation perpendicular to the vector. The object is located on the side pointed to by the border ownership vector. *l* is the discrete direction index, ranging from 0 to (*L* − 1). Here and in all other border ownership maps *L* = 12. At each coordinate there is a border ownership vector for all possible directions, although for the sake of clarity only some of the vectors are shown here. The dashed lines represent the discrete grid over the image of a square (the dark area). The dotted diagonal lines going out of the origin of the *l* = 0 vector show all the possible discrete directions. The ellipse around the vector origin illustrates the area in which the object edge is represented by the border ownership vector that is relevant. This resembles the receptive field of a V1 neuron (Hubel and Wiesel, 1959). (B) The output of the border ownership map of the model for an input image of a square object. The border ownership vectors point inside the square (solid line). The vector with the greatest length at a point is directed perpendicular to the square edge.

# Model Overview

The model consists of four main parts:


The main challenge in identifying illusory contours as the solution of a minimization problem is coping with the huge size of the solution space. We addressed this problem by choosing a compact boundary representation and by applying several types of smoothing to the functional in order to reduce the number of local minima. The smoothing leaves only the stable minima. We also developed a method for finding different local minima of the cost functional (section Finding Multiple Local Minima). Each local minimum corresponds to a possible image interpretation, with a lower cost for a more probable (pop-out) interpretation.

In the notation below, the subscript of a variable denotes the discrete coordinate at which the variable is measured. For example, *f<sub>xy</sub>* is a filter intensity at coordinate *x*, *y*, for integer *x* and *y*. There are no continuous coordinates in the model. We omit the comma between the coordinates for brevity. The superscript of a variable is part of the variable name. For example, *σ<sup>X</sup>* is a constant. In the following we describe the model parts in more detail.

#### Boundary Representation

The border ownership map (**Figure 3A**) represents the probability that an object edge passes through a discrete coordinate in some discrete direction. The orientation of the object edge is perpendicular to the discrete direction, and the object resides on the side to which the direction points. As an example, **Figure 3B** shows the border ownership map of a square object. At each discrete coordinate, the border ownership is specified for a discrete set of *L* equally spaced directions (**Figure 3A**). Note that opposite directions have two different border ownership values. The border ownership is not strictly a probability value; only the relative values of border ownership are important. We choose to interpret positive and negative values of border ownership in the same way, since avoiding negative values would require additional effort in the minimization process. To achieve this interpretation, the border ownership always appears squared in the functional.
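The representation can be pictured as a three-index array. The following is a minimal sketch, assuming a nested-list layout `b[x][y][l]`; the helper names (`direction_angle`, `opposite`, `empty_map`, `strength`) are hypothetical and are not taken from the model's implementation.

```python
import math

L = 12  # number of discrete directions, the value used in the figures

def direction_angle(l, L=L):
    """Angle in radians of discrete direction index l."""
    return 2.0 * math.pi * (l % L) / L

def opposite(l, L=L):
    """Index of the direction opposite to l (assumes L is even)."""
    return (l + L // 2) % L

def empty_map(width, height, L=L):
    """Border ownership map b[x][y][l], initialised to zero."""
    return [[[0.0] * L for _ in range(height)] for _ in range(width)]

def strength(b, x, y, l):
    """Only the squared value enters the functional, so the sign is irrelevant."""
    return b[x][y][l] ** 2

b = empty_map(4, 4)
b[1][2][0] = -0.5            # ownership toward direction 0; the sign is ignored
b[1][2][opposite(0)] = 0.1   # weaker, competing ownership on the other side
```

Storing separate values for `l` and `opposite(l)` reflects the point above that opposite directions carry two different border ownership values.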

#### Cost Functional

The functional that depends on the border ownership map is designed to measure the extent to which the expected properties of the object boundary configuration are satisfied. Each property is allocated a cost functional component, and the overall cost functional is a weighted sum of all the components.

$$\begin{aligned} F\left(\overrightarrow{b}\right) &= \alpha^A F^A\left(\overrightarrow{b}\right) + \alpha^R F^R\left(\overrightarrow{b}\right) + \alpha^V F^V\left(\overrightarrow{b}\right) \\ &+ \alpha^N F^N\left(\overrightarrow{b}\right) + \alpha^C F^C\left(\overrightarrow{b}\right) + \alpha^E F^E\left(\overrightarrow{b}\right) \end{aligned} \tag{1}$$

where *F<sup>type</sup>* are the cost functional components, which depend on the border ownership map

$$
\overrightarrow{b} = \left\{ b_{xyl} \right\}_{x,y,l} \tag{2}
$$

and *α<sup>type</sup>* are weight parameters. *x*, *y* are discrete coordinates and *l* is the discrete direction index. The first three components *F<sup>A</sup>*, *F<sup>R</sup>* and *F<sup>V</sup>* are responsible for the appearance of border ownership at image edges. The other components encode the expected object boundary properties, and therefore depend only on the border ownership map and not on the input image. The component *F<sup>N</sup>* is designed to make sure that the object is located on only one side of a boundary. *F<sup>C</sup>* is responsible for object boundary continuity. *F<sup>E</sup>* penalizes bending in the object boundary, with concave bends receiving a greater penalty (section Model Rational). The cost components are visualized in **Figures 4**, **5** and are described in the following sections. Since the full definitions of the components *F<sup>C</sup>* and *F<sup>E</sup>* are more involved and lengthy, their details are provided in the **Appendix** in Supplementary Material.

length denoted by 1 (pink arrow) in the image, the cost is as pointed in point 1 on the chart. For bigger border ownership denoted by 2 (pink arrow) in the image, the cost is as pointed in point 2, and is lower than the cost at point 1. (B) Illustration of border ownership limitation cost component *F<sup>R</sup>*. The cost increases with increasing vector value of the border ownership, in order to limit the unbounded growth of the vector value caused by cost component *F<sup>A</sup>*. The polynomial degree of *F<sup>A</sup>* in (A) is 2, while the polynomial degree of *F<sup>R</sup>* is 4, which ensures that the border ownership value is limited. (C) Illustration of cost component *F<sup>V</sup>*, which penalizes border ownership in places with no edge in the image. The cost increases with increasing border ownership at a location with no edge in the image. (D) Cost component *F<sup>N</sup>* discourages border ownership in opposite directions, since an object is expected to be on only one side of the edge.

#### Border Ownership at Image Edges (*F<sup>A</sup>*)

This section describes how border ownership is induced from image edges. Where the input image contains an intensity edge with a specific orientation, border ownership in the perpendicular direction is encouraged. Since we do not know on which side of the edge the object is situated, border ownership is encouraged in both directions perpendicular to the edge. The cost component sums, with a negative sign, the product of the squared border ownership *b<sub>xyl</sub>* and the squared intensity of the image edge in the orientation perpendicular to *l*, termed *A<sub>xyl</sub>*. This "encourages" border ownership perpendicular to the edge in the input image (**Figure 4A**).

$$F^A = \frac{1}{T} \sum_{x,y,l} -A_{xyl}^2\, b_{xyl}^2 \tag{3}$$

where

$$A_{xyl} = I_{xy} \ast f^A_{xyl} \tag{4}$$

The operation marked by ∗ is a discrete cross-correlation (or filtering), given by:

$$I_{xy} \ast f_{xyl}^{A} = \sum_{x', y'} I_{(x + x')(y + y')}\, f_{x'y'l}^{A} \tag{5}$$

The filter *f<sup>A</sup><sub>xyl</sub>* detects an image edge at point *x*, *y* with orientation perpendicular to *l*. It is defined by rotating the function *f<sup>A</sup><sub>xy</sub>* by 2π*l*/*L*.

$$f_{xy}^A = \frac{1}{2\pi\left(\sigma^A\right)^2}\, s(x)\, e^{-\frac{x^2 + y^2}{2\left(\sigma^A\right)^2}} \tag{6}$$

where the function *s*(*x*) is a sign function that gives zero for values close to zero

$$s(x) = \begin{cases} 0, & |x| \le 0.001 \\ \frac{x}{|x|}, & \text{else} \end{cases} \tag{7}$$

and *σ<sup>A</sup>* is a constant. The constant *T* is used to normalize the cost to be per coordinate and orientation and is given by:

$$T = I^X I^Y L \tag{8}$$

where *I<sup>X</sup>* and *I<sup>Y</sup>* are the width and height of the input image. The border ownership value *b<sub>xyl</sub>* in (3) is squared so that positive and negative values of border ownership have the same cost (section Boundary Representation).
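Equations (3)-(8) can be sketched in a few lines. This is an illustrative sketch rather than the authors' implementation: the filter radius, the value of `sigma_A`, the clamping of out-of-range pixels and the `I[x][y]` indexing are all assumptions made here.

```python
import math

def s(x):
    """Sign function of Eq. (7): zero for |x| <= 0.001."""
    return 0.0 if abs(x) <= 0.001 else x / abs(x)

def edge_filter(l, L=12, sigma_A=1.0, radius=3):
    """Eq. (6) rotated by 2*pi*l/L, sampled on a (2*radius + 1)^2 grid."""
    angle = 2.0 * math.pi * l / L
    taps = {}
    for dx in range(-radius, radius + 1):
        for dy in range(-radius, radius + 1):
            # Coordinate of the grid point along direction l.
            u = dx * math.cos(angle) + dy * math.sin(angle)
            gauss = math.exp(-(dx * dx + dy * dy) / (2.0 * sigma_A ** 2))
            taps[(dx, dy)] = s(u) * gauss / (2.0 * math.pi * sigma_A ** 2)
    return taps

def edge_intensity(I, taps, x, y):
    """A_xyl of Eqs. (4)-(5) at one pixel; out-of-range pixels are clamped (an assumption)."""
    W, H = len(I), len(I[0])
    total = 0.0
    for (dx, dy), w in taps.items():
        xx = min(max(x + dx, 0), W - 1)
        yy = min(max(y + dy, 0), H - 1)
        total += I[xx][yy] * w
    return total

def cost_A(I, b, L=12):
    """F^A of Eq. (3), normalised by T = width * height * L (Eq. 8)."""
    W, H = len(I), len(I[0])
    total = 0.0
    for l in range(L):
        taps = edge_filter(l, L)
        for x in range(W):
            for y in range(H):
                total -= edge_intensity(I, taps, x, y) ** 2 * b[x][y][l] ** 2
    return total / (W * H * L)

# Toy image: dark for x < 4, bright for x >= 4 (a vertical edge).
I = [[1.0 if x >= 4 else 0.0 for y in range(8)] for x in range(8)]
A_right = edge_intensity(I, edge_filter(0), 4, 4)  # direction toward the bright side
A_left = edge_intensity(I, edge_filter(6), 4, 4)   # opposite direction
A_along = edge_intensity(I, edge_filter(3), 4, 4)  # direction along the edge
```

In this toy case `A_left` is the negative of `A_right`, so after squaring both directions perpendicular to the edge are encouraged equally, while the direction along the edge receives essentially no support.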

#### Border Ownership Is Limited (*F<sup>R</sup>*)

If *F<sup>A</sup>* were the only component of the functional, the border ownership at image edges would grow without bound to make the cost lower. The following cost component is added to ensure that the value of border ownership is limited:

$$F^R = \frac{1}{T} \sum_{x,y,l} b_{xyl}^4 \tag{9}$$

The reason for raising the border ownership to the power of 4 is to make *F<sup>R</sup>* stronger than *F<sup>A</sup>*, which is of degree 2, at high border ownership values. The cost component *F<sup>R</sup>* is illustrated in **Figure 4B**.
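As a rough illustration of this balance, consider a single element *b* = *b<sub>xyl</sub>* with edge intensity *A* = *A<sub>xyl</sub>*, and ignore the common factor 1/*T*, the smoothing, and every component other than *F<sup>A</sup>* and *F<sup>R</sup>*. These simplifications, and the assumption of positive weights, are made here only for the sketch and are not part of the model:

$$\begin{aligned} c(b) &= -\alpha^A A^2 b^2 + \alpha^R b^4 \\ \frac{dc}{db} &= -2\alpha^A A^2 b + 4\alpha^R b^3 = 0 \quad\Rightarrow\quad b = 0 \ \text{ or } \ b^2 = \frac{\alpha^A A^2}{2\alpha^R} \\ c_{\min} &= -\frac{\left(\alpha^A\right)^2 A^4}{4\alpha^R} \end{aligned}$$

Under these assumptions the border ownership settles at a finite magnitude proportional to the edge intensity, and *b* = 0 is a local maximum of the cost whenever *A* ≠ 0.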

#### Suppress Border Ownership in the Absence of Image Edge (*F<sup>V</sup>*)

An illusory contour introduces border ownership also at places with no intensity gradient in the image. To avoid spurious illusory contours, this component adds a penalty for border ownership in places with no edge in the input image (**Figure 4C**).

$$F^V = \frac{1}{T} \sum_{x,y,l} \frac{\varepsilon^V}{A_{xyl}^2 + \varepsilon^V}\, b_{xyl}^2 \tag{10}$$

where *ε<sup>V</sup>* is a small constant and *A<sub>xyl</sub>* is the intensity of the image edge (4), as used in component *F<sup>A</sup>*. Note that the equations and the rationale of *F<sup>A</sup>* (3) and *F<sup>V</sup>* (10) are similar but have opposite trends: in *F<sup>A</sup>* a strong edge lowers the cost of border ownership, whereas in *F<sup>V</sup>* a weak edge raises it. The only functional components that depend on the input image are *F<sup>A</sup>* and *F<sup>V</sup>*. They depend only on image edges and not on special image features, as required in previous models (section Introduction).
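The behaviour of the weight in (10) is easy to tabulate. The snippet below is a small sketch; the value of `eps_V` and the function names are illustrative assumptions, since the parameter values are not given in this section.

```python
def absence_weight(A, eps_V=1e-3):
    """Weight eps^V / (A^2 + eps^V) from Eq. (10); eps_V is an illustrative value."""
    return eps_V / (A * A + eps_V)

def cost_V_term(A, b, eps_V=1e-3):
    """Contribution of a single (x, y, l) element to the sum in F^V."""
    return absence_weight(A, eps_V) * b * b

for A in (0.0, 0.05, 0.3):
    print(f"A={A:4.2f}  weight={absence_weight(A):.4f}")
```

The weight equals 1 where there is no edge and falls toward 0 as the edge intensity grows, so border ownership is penalized mainly where the image offers no support for it.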

#### Object on One Side (*F<sup>N</sup>*)

The model assumes that the object usually resides on only one side of an edge. Hence, if there is border ownership in a specific direction, border ownership in the opposite direction is discouraged (**Figure 4D**). If there is significant border ownership in direction *l*, border ownership in the opposite direction *l* + *L*/2 is not expected (section Boundary Representation). Border ownership is also not expected in directions close to *l* + *L*/2; therefore, we add a cost for border ownership vectors with deviation *m* from *l* + *L*/2. We also consider border ownership in the spatial vicinity of the border ownership vector origin *x*, *y* by filtering the border ownership map in space. The filtered border ownership map is termed *B<sup>N</sup><sub>xyl</sub>*.

$$F^N = \frac{1}{T\,T^N} \sum_{x,y,l}\ \sum_{m=-\left(\frac{L}{4}-1\right)}^{\frac{L}{4}-1} \cos^2\left(2\pi\frac{m}{L}\right) B^N_{xyl}\, B^N_{xy\left(l+\frac{L}{2}+m\right)} \tag{11}$$

where

$$B_{xyl}^{N} = b_{xyl}^2 \ast f_{xy}^{N} \tag{12}$$

and

$$f_{xy}^{N} = \frac{1}{2\pi\left(\sigma^N\right)^2}\, e^{-\frac{x^2+y^2}{2\left(\sigma^N\right)^2}} \tag{13}$$

For a larger deviation *m*, the cost increase should be smaller, thus a weight factor cos<sup>2</sup>(2π*m*/*L*) is added accordingly. The maximum deviation considered is *L*/4 − 1, since this is the largest deviation whose angle is less than π/2. The term *T<sup>N</sup>* in (11) normalizes the contributions from all deviations and is given by

$$T^N = \sum_{m=-\left(\frac{L}{4}-1\right)}^{\frac{L}{4}-1} \cos^2\left(2\pi\frac{m}{L}\right) \tag{14}$$
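The inner sums of (11) and the normalization (14) can be sketched for a single coordinate. This sketch assumes that *L* is divisible by 4 and that the spatial filtering of (12) has already been applied; the function names are hypothetical.

```python
import math

def deviation_weights(L=12):
    """cos^2(2*pi*m/L) for m = -(L/4 - 1) .. L/4 - 1, and their sum T^N (Eq. 14)."""
    M = L // 4 - 1
    w = {m: math.cos(2.0 * math.pi * m / L) ** 2 for m in range(-M, M + 1)}
    return w, sum(w.values())

def one_side_cost_at(B, L=12):
    """Inner sums of Eq. (11) at one coordinate, divided by T^N.

    B[l] stands for the filtered squared ownership B^N_xyl at a fixed (x, y).
    """
    w, T_N = deviation_weights(L)
    total = 0.0
    for l in range(L):
        for m, wm in w.items():
            total += wm * B[l] * B[(l + L // 2 + m) % L]
    return total / T_N

one_sided = [0.0] * 12
one_sided[0] = 1.0          # ownership in a single direction
two_sided = list(one_sided)
two_sided[6] = 1.0          # plus ownership in the opposite direction
```

With *L* = 12 the deviations run from −2 to 2. A map with ownership in one direction only incurs no cost, whereas adding ownership in the opposite direction yields a positive cost.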

#### Object Boundary Continuity (*F<sup>C</sup>*)

One of the basic properties of an object is the continuity of its boundary; the boundary is not expected to end abruptly unless it is occluded by the boundary of another object. To encourage object boundary continuity, we require that when an object edge ends at a coordinate, there should be an object edge originating from the same coordinate (**Figure 5A**). In the case of occlusion, the occluding object edge plays the role of the originating edge for the ending edge of the occluded object (**Figure 5B**). The main innovation of the model is the possibility of predicting illusory contours without extracting special features, by following the functional optimization approach. Since the full details of this component are quite lengthy and the exact functional definition is not the main aim of the model, the details are provided in Appendix section 1.1.

#### Object Boundary Bending (*F<sup>E</sup>*)

We concluded in section Model Rational that the preferred perception is the one with fewer bends, and that if there are bends, convex bends are preferable. Taking this preference into account, we assign a positive cost to bends in the object boundary, with an increased penalty for concave bends (**Figure 5C**). The details of this component are also lengthy, hence they are provided in Appendix section 1.2.

#### Cost Functional Smoothing

The cost functional (1), which accounts for several object boundary properties and depends on many variables, has a large number of local minima, not all of which represent expected image interpretations. The problem is then how to "get rid" of these redundant local minima. We assume that the redundant local minima are shallower than the desirable ones. To avoid becoming trapped in shallow local minima, four types of smoothing are applied, as described in the following sections.

#### Border Ownership Map Smoothing in Angle and Space

To make the cost functional less sensitive to small changes in border ownership, the border ownership map $\overrightarrow{b}$ is smoothed in angle and space. The result $\overrightarrow{b}^S$ is used as input to the cost functional (1).

$$b_{xyl}^{S} = \left[ \sum_{j=-\left(\frac{L}{2}-1\right)}^{\frac{L}{2}} b_{xy(l+j)}\, f_j^{\text{SA}} \right] \ast f_{xy}^{\text{SX}} \tag{15}$$

where *f<sup>SA</sup><sub>j</sub>* and *f<sup>SX</sup><sub>xy</sub>* are Gaussians in angle (A) and space (X) coordinates, respectively:

$$f_j^{\text{SA}} = \frac{1}{\beta^{\text{SA}}}\, e^{-\frac{j^2}{2\left(\sigma^{\text{SA}}\right)^2}} \tag{16}$$

$$f_{xy}^{\text{SX}} = \frac{1}{2\pi\left(\sigma^{\text{SX}}\right)^2}\, e^{-\frac{x^2 + y^2}{2\left(\sigma^{\text{SX}}\right)^2}} \tag{17}$$

with *σ<sup>SA</sup>* and *σ<sup>SX</sup>* constants, and *β<sup>SA</sup>* a normalization constant:

$$\beta^{\text{SA}} = \sum_{m=-\left(\frac{L}{2}-1\right)}^{\frac{L}{2}} e^{-\frac{m^2}{2\left(\sigma^{\text{SA}}\right)^2}} \tag{18}$$
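The angular part of (15) can be sketched as follows. The sketch assumes that direction indices wrap around modulo *L* and uses an illustrative value for `sigma_SA`; the spatial filtering with *f<sup>SX</sup>* is omitted for brevity.

```python
import math

def angular_kernel(L=12, sigma_SA=1.0):
    """f^SA_j of Eq. (16) for j = -(L/2 - 1) .. L/2, normalised by beta^SA (Eq. 18)."""
    js = range(-(L // 2 - 1), L // 2 + 1)
    raw = {j: math.exp(-j * j / (2.0 * sigma_SA ** 2)) for j in js}
    beta_SA = sum(raw.values())
    return {j: v / beta_SA for j, v in raw.items()}

def smooth_in_angle(b_xy, sigma_SA=1.0):
    """Bracketed sum of Eq. (15) at one coordinate; indices wrap modulo L (an assumption)."""
    L = len(b_xy)
    kernel = angular_kernel(L, sigma_SA)
    return [sum(b_xy[(l + j) % L] * w for j, w in kernel.items()) for l in range(L)]

b_xy = [0.0] * 12
b_xy[3] = 1.0                      # a single sharp direction
smoothed = smooth_in_angle(b_xy)   # spread over neighbouring directions
```

Because the kernel is normalized by *β<sup>SA</sup>*, the smoothing redistributes border ownership among neighbouring directions without changing its total.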

#### Spatial Filters Smoothing

The cost functional calculation uses various spatial filters. To make the cost smoother and less dependent on the discrete grid step, we sum up the cost components on multiple spatial scales.

$$G^{\text{type}} = \frac{1}{N} \sum_{n=0}^{N-1} F_n^{\text{type}} \tag{19}$$

where *N* is the number of scales and *F<sup>type</sup><sub>n</sub>* is the same as *F<sup>type</sup>* (1), except that it uses spatial filters derived by scaling the original filters by the factor

$$
\mu^{n} \tag{20}
$$

where *µ* > 1 is a scaling constant. The smoothed components *G<sup>type</sup>* (19) are used in the functional instead of the components *F<sup>type</sup>* (1).
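Equation (19) is a plain average over scales. The sketch below assumes that scaling a filter by *µ<sup>n</sup>* amounts to multiplying its spatial width by that factor; `component` is a hypothetical callable standing in for any *F<sup>type</sup>*, and the default values of `N` and `mu` are illustrative.

```python
def multiscale_cost(component, base_sigma, N=3, mu=1.5):
    """G^type of Eq. (19): mean of a cost component over N filter scales.

    component(sigma) is assumed to evaluate F^type with its spatial filters
    scaled to width sigma; scale n uses base_sigma * mu**n.
    """
    return sum(component(base_sigma * mu ** n) for n in range(N)) / N
```

For example, with a stand-in component that simply returns its width, `multiscale_cost(lambda sigma: sigma, 1.0, N=3, mu=2.0)` averages the widths 1, 2 and 4.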

#### Ramp Function Smoothing

The ramp function

$$r(x) = \begin{cases} 0, & x \le 0 \\ x, & x > 0 \end{cases} \tag{21}$$

is used in components *F<sup>C</sup>* and *F<sup>E</sup>* to take into account positive values only. Smoothing the ramp function *r*(*x*) has two benefits. The first is that the smoothed function is differentiable at *x* = 0, and the second is that the cost functional also becomes smoother, which reduces the number of local minima. The smoothed function is obtained by filtering *r*(*x*) with a Gaussian function:

$$\frac{1}{\sqrt{2\pi\left(\sigma^{RP}\right)^2}}\, e^{-\frac{x^2}{2\left(\sigma^{RP}\right)^2}} \tag{22}$$

where *σ<sup>RP</sup>* is a constant.
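For a continuous argument, filtering the ramp with a zero-mean Gaussian has a closed form in terms of the Gaussian density and cumulative distribution. The sketch below uses that closed form as an illustration; whether the model evaluates the smoothing analytically or numerically is not stated in this section, and the function names are hypothetical.

```python
import math

def ramp(x):
    """Ramp function of Eq. (21)."""
    return x if x > 0 else 0.0

def smooth_ramp(x, sigma):
    """Ramp convolved with a zero-mean Gaussian of width sigma (closed form)."""
    z = x / sigma
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
    return x * cdf + sigma * pdf

def smooth_ramp_derivative(x, sigma):
    """Derivative of the smoothed ramp: the Gaussian CDF, continuous at x = 0."""
    return 0.5 * (1.0 + math.erf(x / (sigma * math.sqrt(2.0))))
```

Far from zero the smoothed function approaches the original ramp, while near zero the corner is rounded and the derivative passes smoothly through 1/2.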

#### Gradual Relaxation-Find the Minimum at Coarse to Fine Scale

In order to avoid becoming trapped in shallow local minima, the minimum is found first at a coarse scale and then at progressively finer scales, a method called gradual relaxation (Lee, 1995). First, the minimum of the functional is found on a broad scale. Then, the border ownership found is used as the initial point for finding the minimum on a finer scale. This process is repeated until the desired detailed scale is reached. The details of this process are as follows. A scale parameter *s* is initially set to *s<sup>0</sup>* > 0. To proceed to a more detailed scale, *s* is multiplied by a constant *s<sup>R</sup>* with 0 < *s<sup>R</sup>* < 1. The process is finished when the desired resolution *s<sup>M</sup>* is reached. At the scale *s<sup>M</sup>* the smoothed functional is close to the functional without smoothing. The scale parameter *s* influences the model as follows.

The border ownership smoothing scale *σ<sup>SX</sup>* (17) is multiplied by:

$$s^{B0} + s^{BS}s \tag{23}$$

where *s<sup>B0</sup>* and *s<sup>BS</sup>* are constants. The scale *µ<sup>n</sup>* (20) of the spatial filters smoothing is multiplied by:

$$s^{X0} + s^{XS}s \tag{24}$$

where *s<sup>X0</sup>* and *s<sup>XS</sup>* are constants. The width of the Gaussian (22) used for the ramp function smoothing is multiplied by:

$$s^{R0} + s^{RS}s \tag{25}$$

where *s<sup>R0</sup>* and *s<sup>RS</sup>* are constants.
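The scale schedule and the three multipliers can be sketched as below. Ending exactly at *s<sup>M</sup>* is an assumption, since the text only says that the process finishes when that resolution is reached; the function names and the dictionary of constants are hypothetical.

```python
def relaxation_scales(s0, sR, sM):
    """Yield the scale parameters s0, s0*sR, s0*sR^2, ... and finally sM."""
    if not (s0 > 0 and 0 < sR < 1 and sM > 0):
        raise ValueError("expected s0 > 0, 0 < sR < 1 and sM > 0")
    s = s0
    while s > sM:
        yield s
        s *= sR
    yield sM  # finish exactly at the desired resolution (an assumption)

def scaled_widths(s, sigma_SX, mu_n, sigma_RP, c):
    """Apply the multipliers of Eqs. (23)-(25); c holds the six constants."""
    return (sigma_SX * (c["B0"] + c["BS"] * s),
            mu_n * (c["X0"] + c["XS"] * s),
            sigma_RP * (c["R0"] + c["RS"] * s))
```

In a coarse-to-fine loop, the minimum found at each scale would serve as the starting point for the next scale produced by `relaxation_scales`.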

#### Finding the Local Minimum

The search for a minimum starts from a random border ownership map $\overrightarrow{b}^R$, with component values drawn from a uniform random distribution in the range [0.01, 0.02]. The reason for starting with a random border ownership map rather than a zero vector is to avoid being trapped in a saddle point. For each scale parameter *s* (section Gradual Relaxation-Find the Minimum at Coarse to Fine Scale), the method used to search for the local minimum is a variant of gradient descent (Curry, 1944). Suppose that at gradient descent iteration *i* the current border ownership map is $\overrightarrow{b}^i$. We find the derivative of the cost functional at $\overrightarrow{b}^i$ with respect to each border ownership component *b<sub>xyl</sub>*:

$$
\overrightarrow{D} = \frac{\partial F}{\partial \overrightarrow{b}}\left(\overrightarrow{b}^i\right) = \left\{ \frac{\partial F}{\partial b_{xyl}}\left(\overrightarrow{b}^i\right) \right\}_{x,y,l} \tag{26}
$$

$\overrightarrow{D}$ is the gradient of *F* (1) and points in the direction of its greatest increase. To move toward the minimum of *F*, we need to move in the opposite direction, $-\overrightarrow{D}$. Near the minimum, the functional *F* is roughly second order (see Appendix section 1.3). Based on this, we approximate the values of *F* along $-\overrightarrow{D}$ by a parabola and move to its minimum. The details of this process are specified in Appendix section 1.3.
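One way to realize the parabolic step is to sample *F* at three points along $-\overrightarrow{D}$ and jump to the vertex of the interpolating parabola. The sketch below takes that route; the sampling distance `h`, the fallback step, the numerical gradient and the toy quadratic are all assumptions made here, and the actual procedure is the one given in Appendix section 1.3.

```python
def numerical_gradient(F, b, eps=1e-6):
    """Central-difference estimate of D (Eq. 26) for a flat list b."""
    grad = []
    for i in range(len(b)):
        up = b[:i] + [b[i] + eps] + b[i + 1:]
        down = b[:i] + [b[i] - eps] + b[i + 1:]
        grad.append((F(up) - F(down)) / (2.0 * eps))
    return grad

def parabolic_step(F, b, D, h=1e-2):
    """Move from b along -D to the vertex of a parabola through three samples of F."""
    def at(t):
        return F([bi - t * di for bi, di in zip(b, D)])
    f0, f1, f2 = at(0.0), at(h), at(2.0 * h)
    curvature = f0 - 2.0 * f1 + f2             # 2*a*h^2 for F ~ a*t^2 + c*t + f0
    if curvature <= 0.0:
        t = h                                  # not convex along -D: take a small step
    else:
        slope_h = (-3.0 * f0 + 4.0 * f1 - f2) / 2.0   # c*h
        t = -h * slope_h / curvature           # vertex at t = -c / (2a)
    return [bi - t * di for bi, di in zip(b, D)]

def minimise(F, b, iters=100, tol=1e-10):
    """Repeat parabolic steps until the squared gradient norm drops below tol."""
    for _ in range(iters):
        D = numerical_gradient(F, b)
        if sum(d * d for d in D) < tol:
            break
        b = parabolic_step(F, b, D)
    return b

# Toy quadratic standing in for the cost functional near a minimum.
toy = lambda b: (b[0] - 1.0) ** 2 + 2.0 * (b[1] + 0.5) ** 2
result = minimise(toy, [0.015, 0.012])
```

On an exactly quadratic function the parabola is exact along each search line, so every step performs an exact line minimization.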

#### Finding Multiple Local Minima

The multiple local minima of the cost functional correspond to different interpretations of the image (section Introduction). Although there are several well-established methods for finding a single minimum of a functional, there are relatively few studies on how to find multiple minima. The main question is how to escape from the first local minimum, at which the minimization process stopped. We address this problem by positioning a "repulsive particle" at the location of the first local minimum, where by location we mean the border ownership map of the minimum. The repulsive particle acts like an electric charge that repels the border ownership map being searched and prevents it from coming too close to the repulsive particle location. This is achieved by adding to the cost functional (1) a component that increases for border ownership maps that are close to the first local minimum. This component is described in detail in Appendix section 1.4, and it resembles an electric potential. The process of finding multiple local minima is performed as follows.

The gradient descent starts from some random border ownership map $\overrightarrow{b}^R$ (section Finding the Local Minimum) and reaches a local minimum at the border ownership map $\overrightarrow{b}^1$ (**Figure 6**). To find an additional local minimum, we place a repulsive particle at the $\overrightarrow{b}^1$ position (red $\overrightarrow{b}^1$ in **Figure 6**) and restart the search for a new local minimum from $\overrightarrow{b}^R$. Suppose that the new local minimum is $\overrightarrow{b}^{2'}$ (magenta $\overrightarrow{b}^{2'}$). The repulsive particle at $\overrightarrow{b}^1$ causes $\overrightarrow{b}^{2'}$ to be displaced from the actual local minimum of the cost functional. To find the actual local minimum, we start a new search for the minimum of the functional, without the repulsive particle component, from location $\overrightarrow{b}^{2'}$. Suppose this search reaches the minimum $\overrightarrow{b}^2$.

If $\overrightarrow{b}^2$ is sufficiently far from $\overrightarrow{b}^1$, then $\overrightarrow{b}^2$ is added as a new interpretation and a repulsive particle is added at $\overrightarrow{b}^2$. To measure how close $\overrightarrow{b}^1$ is to $\overrightarrow{b}^2$, the following simple distance measure is used:

$$\frac{1}{T} \sum_{x,y,l} \left| \left(b_{xyl}^{1}\right)^2 - \left(b_{xyl}^{2}\right)^2 \right| \tag{27}$$

where *T* is defined in (8). If this distance is above a threshold level *d<sup>T</sup>*, the two minima are considered different. If $\overrightarrow{b}^2$ is close to $\overrightarrow{b}^1$ according to (27), then the optimization has been trapped in a local minimum that has already been identified. Since the search was trapped twice in the same local minimum, we increase the force of the repulsive particle by multiplying the repulsive term by a constant factor *τ* > 1. In order to avoid reaching the same location $\overrightarrow{b}^{2'}$ again, an additional repulsive particle is added at the $\overrightarrow{b}^{2'}$ location, and the search for the minimum is repeated from the start point $\overrightarrow{b}^R$ (**Figure 6**). After finding this minimum we perform a new search, without the repulsive particle component, in order to find the actual local minimum of the original functional. If a new minimum is found, it is added as an additional interpretation, the repulsive force is returned to its initial strength (without multiplication by *τ*), and a search for a further minimum is performed. If, on the other hand, no new minimum is found, the repulsive force factor is multiplied again by *τ*. The repulsive force multiplication factor is increased until a maximum factor *τ<sup>max</sup>* is reached. If no new minimum is found even for the maximum multiplication factor, the process of finding multiple local minima is stopped.
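The search loop can be illustrated on a toy one-dimensional function with two minima. Everything specific in this sketch is an assumption: the toy function `f`, the form of the repulsive potential *c*/(*d*<sup>2</sup> + *e*), plain fixed-step gradient descent, and all parameter values. The model's actual repulsive component is the one defined in Appendix section 1.4.

```python
def f(x):
    """Toy one-dimensional 'functional' with two minima (not the model's F)."""
    g = (x - 1.0) * (x - 2.5)
    return g * g + 0.3 * x

def f_grad(x):
    g = (x - 1.0) * (x - 2.5)
    return 2.0 * g * (2.0 * x - 3.5) + 0.3

def repulsion_grad(x, p, c, e=0.1):
    """Gradient of the assumed repulsive potential c / ((x - p)^2 + e)."""
    d = x - p
    return -2.0 * c * d / (d * d + e) ** 2

def descend(grad, x, lr=0.01, iters=5000):
    """Plain fixed-step gradient descent."""
    for _ in range(iters):
        x -= lr * grad(x)
    return x

def distance(a, b):
    """Eq. (27) for a one-element map (T = 1)."""
    return abs(a * a - b * b)

def find_minima(x_start=1.6, c0=0.05, tau=2.0, tau_max=16.0, d_T=0.1):
    minima = [descend(f_grad, x_start)]
    particles = list(minima)
    factor = 1.0
    while factor <= tau_max:
        c = c0 * factor
        def repelled_grad(x):
            return f_grad(x) + sum(repulsion_grad(x, p, c) for p in particles)
        shifted = descend(repelled_grad, x_start)   # minimum with repulsion
        candidate = descend(f_grad, shifted)        # refine without repulsion
        if all(distance(candidate, m) > d_T for m in minima):
            minima.append(candidate)                # new interpretation
            particles.append(candidate)
            factor = 1.0                            # restore the initial strength
        else:
            particles.append(shifted)               # avoid the same trap again
            factor *= tau                           # strengthen the repulsion
    return minima

minima = find_minima()
```

In this toy setting the first search ends in the lower-cost minimum, the repelled search is pushed over the barrier into the second one, and the loop stops once the repulsion factor exceeds its maximum without revealing anything new.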

#### Retrieving Object Shape by Contour Evolution

At this stage, the output of the model is a border ownership map (2) that assigns border ownership strength values to each discrete location and direction. To show that the actual object shape can be easily and automatically retrieved from the border ownership map, we designed a simple contour evolution algorithm that finds the top-most object in the scene. The contour evolution method finds a contour that maximizes a given functional that depends on the contour. The maximizing contour is found by moving some initial contour toward the contour that maximizes the functional. In the level set approach, the contour is represented by the intersection of a two-dimensional function *ψ* with the *x*-*y* plane, that is, by the zero level of the function *ψ*. The contour motion is described and performed in terms of the function *ψ*. For further details see Osher and Sethian (1988).

We start with a simple small object (e.g., circular contour) which is adjacent to the border ownership vector with the biggest value. The contour representing the object boundary is then moved to maximize the border ownership vectors having direction perpendicular to the contour. Following Malladi et al. (1995), the contour dynamics is defined by:

$$
\overrightarrow{C}_t = \left(k - \nu\right) g\, \overrightarrow{N} \tag{28}
$$

$\overrightarrow{C}_t$ is the velocity of the moving contour $\overrightarrow{C}$, and $\overrightarrow{N}$ is the contour normal vector, pointing toward the inner area of the object. The contour is moved in the direction of the normal. The velocity magnitude is (*k* − *ν*)*g* (28). This function is designed to cause the contour to grow until it reaches the highest values of the border ownership vectors, and to keep the contour as simple as possible. The term *k* is the contour curvature; including this term makes the contour tend to be as straight as possible. This is because at a point with positive curvature, that is a convex point, the contour is "encouraged" to move inward, which decreases the curvature. For negative curvature the contour is "encouraged" to move outward, decreasing the absolute curvature and again making the contour straighter. *ν* is a constant called the balloon force, which gives the contour the tendency to grow. The contour friction term *g* causes the contour to stop when it reaches a high value of border ownership vectors in the direction perpendicular to the contour. *g* is a thresholded version of another function *h*:

$$g_{xy} = \begin{cases} 0, & h_{xy} < g^T \\ h_{xy}, & \text{else} \end{cases} \tag{29}$$

$$h_{xy} = \frac{1}{1 + \frac{q_{xy}}{R^2}} \tag{30}$$

where *R* is a constant and *q<sub>xy</sub>* measures the strength of the border ownership in a direction roughly perpendicular to the contour. *h<sub>xy</sub>* is designed to be small at locations where the value of the border ownership perpendicular to the contour is high. Since *h<sub>xy</sub>* is small at these locations, *g<sub>xy</sub>* will be zero and the contour evolution will stop. *q<sub>xy</sub>* is given by:

$$q_{xy} = \sum_{l=1}^{L} \left| w_l\, b_{xyl} \right|^2 \tag{31}$$

The weighting factor *w<sub>l</sub>* measures how close the direction *l* is to the direction of the contour normal:

$$w_l = e^{-\frac{\beta_l^2}{2\left(\sigma^Q\right)^2}} \tag{32}$$

where *σ<sup>Q</sup>* is a constant, and *β<sub>l</sub>* is the angle between the direction of index *l* and the contour normal, pointing toward the inner area of the object:

$$\beta_l = \cos^{-1}\left(\overrightarrow{u}_l \cdot \overrightarrow{N}\right) \tag{33}$$

and $\overrightarrow{u}_l$ is the unit vector in the direction of index *l*:

$$\overrightarrow{u}_l = \left(\cos\alpha_l, \sin\alpha_l\right), \quad \alpha_l = 2\pi\frac{l}{L} \tag{34}$$

Further details of the level set approach to curve evolution can be found in Osher and Sethian (1988).
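The friction term of (29)-(34) can be evaluated at a single contour point as in the sketch below. The values of `R`, `sigma_Q` and `g_T` are illustrative rather than the model's parameters, the direction index runs over 0 to *L* − 1 (equivalent to 1 to *L* by periodicity), and the function names are hypothetical.

```python
import math

def friction(b_xy, normal, R=0.1, sigma_Q=0.5, g_T=0.2):
    """Evaluate q, h and g of Eqs. (29)-(34) at one contour point.

    b_xy   : list of L border ownership values at the point
    normal : unit inward normal (nx, ny) of the contour
    """
    L = len(b_xy)
    q = 0.0
    for l in range(L):
        alpha = 2.0 * math.pi * l / L
        dot = math.cos(alpha) * normal[0] + math.sin(alpha) * normal[1]
        beta = math.acos(max(-1.0, min(1.0, dot)))      # clamp against rounding
        w = math.exp(-beta ** 2 / (2.0 * sigma_Q ** 2))
        q += (w * b_xy[l]) ** 2
    h = 1.0 / (1.0 + q / R ** 2)
    g = 0.0 if h < g_T else h
    return q, h, g

def normal_speed(k, nu, g):
    """Velocity magnitude (k - nu) * g along the inward normal (Eq. 28)."""
    return (k - nu) * g

empty = [0.0] * 12
aligned = list(empty)
aligned[0] = 1.0       # strong ownership along the inward normal (1, 0)
opposed = list(empty)
opposed[6] = 1.0       # strong ownership pointing away from the normal
```

In this sketch a contour point stops only where there is strong border ownership aligned with its inward normal; ownership pointing the opposite way leaves the friction term close to 1.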

#### RESULTS

The model was tested on various simple synthetic grayscale images with non-textured regions. The same set of model parameters was used for all tests and stimuli. The parameters were chosen by trial and error.

The first image contains two adjacent regions separated by a straight line (**Figure 7A**). Two local minima were found for this image, one corresponding to a black object on the right side over a white background (**Figure 7B**), and the other corresponding to a white object on the left side over a black background (**Figure 7C**). Note that the two interpretations have an equal cost of −53.1, since there is no preference for the object to be on the right or on the left side.

The next tested image was a square (**Figure 7D**), which also has two interpretations. The first interpretation was of a square object (**Figure 7E**), and the second was of a frame with a square hole (**Figure 7F**). The square object interpretation has a cost of −117, while the interpretation of a square hole in a frame has a higher cost of −102. This is consistent with the fact that the square interpretation is perceived more readily than the square hole interpretation (section Model Rational). In all results the interpretations are presented in order from lower to higher cost. The model behaves in the same manner for a larger square with a size of 20 pixels (results not shown). For a more complex image of an object with both convex and concave vertices (**Figure 7G**), the model identifies two interpretations, the first corresponding to a C-shaped object (**Figure 7H**), and the second to a frame with a C-shaped hole (**Figure 7I**).

The main goal of the study was to show that objects with illusory contours can be detected without extracting special image features. To show this, the model was applied to Kanizsa squares of different sizes. One of the essential factors that determine the strength of the illusory contour is the ratio between the visible edge length and the total edge length, termed the support ratio (Shipley and Kellman, 1992; **Figure 8A**). The illusory object is perceived when the support ratio values are close to 1. The model was tested on images corresponding to a broad range of support ratios. The first example is a prominent illusory contour image (**Figure 8A**), with a relatively high support ratio of 0.67. The first interpretation, which has the smallest cost of −67.3, is that of an illusory square (**Figures 8B,C**). The second interpretation, which has a higher cost of −64.6, is of four pacmans (**Figure 9**). These two interpretations are consistent with our expectations from the model.

Additional higher cost interpretations have been found, and are not presented here. The smallest support ratio for which the illusory square is still detected for this pacman radius is 0.57. **Figure 10** shows the first interpretation for this support ratio. For a smaller support ratio of 0.53 the first interpretation is of four pacmans (the border ownership map is not shown, but has the same structure as the interpretation in **Figure 9**). For this support ratio there is no illusory interpretation at all, as expected.

To verify that the illusory square border ownership map (**Figure 10**) can be interpreted as a square over four circles, we applied a level set optimization method to extract the nearest object (section Retrieving Object Shape by Contour Evolution). The result of the object extraction is shown in **Figure 11**. It shows detection of the square object with a partially illusory boundary.

#### DISCUSSION

The proposed model successfully extracts both real and illusory contours in various synthetic images (**Figures 7**–**10**). The model is generic: it was not specifically designed to detect illusory contours and it does not extract special image features. Illusory contour detection was achieved by introducing only simple desired object properties, and the illusory parts of the object boundary were generated as the most reasonable image "description" obtained by the functional minimization. The model shows that illusory contours can be viewed as derived from the general object detection task performed by the visual system. Although this idea is not new (Gregory, 1972), this is the first time that the possibility of deriving illusory contours from a general object boundary detection task has been demonstrated computationally.

Moreover, multiple possible image perceptions were predicted here and ranked by perception probability. In the case of the Kanizsa square illusion image, the most probable perception predicted by the model is an illusory square (**Figures 8B,C**), and the second perception is of four pacman objects (**Figure 9**). Both predictions are consistent with psychophysical findings (Rubin, 2001). Detecting different plausible solutions of a problem by finding multiple local minima of the functional is a novel approach.

FIGURE 7 | (A) The simplest input image, with size 20 × 20 pixels. (B) The border ownership map of the first model interpretation of the image (A). The object is located on the right side of the edge between the white and the black area in the input image. In all border ownership maps shown in the following figures, the edges in the input image are marked by green lines for reference. The border ownership vectors with a value above 80% of the maximum border ownership vector value in the current map are colored magenta. Other border ownership vectors are black. The small red crosses depict the discrete grid of the input image. Note that only part of the border ownership map is shown, in order to make the view clearer. (C) The second model interpretation represents an object on the left side of the boundary between the white and the black regions in the input image (A). (D) Input image with a white square of 8 × 8 pixels on a black background. (E) The first model interpretation of the image in (D) represents a white square object on a black background. This interpretation has the lowest cost, −117. (F) The second model interpretation of the image in (D) represents a black frame with a square hole through which a white background is seen. This interpretation has a cost of −102, higher than the first interpretation, meaning it is less probable. (G) Input image of a C-shaped object. A similar image was applied in the original study of border ownership neurons (Zhou et al., 2000). (H) The first model interpretation of image (G) represents a C-shaped object. (I) The second model interpretation of image (G) represents a C-shaped hole in a frame.

There are numerous models that predict illusory contours in the Kanizsa square image (Williams and Hanson, 1994; Heitger et al., 1998; Kogo et al., 2002; Ron and Spitzer, 2011). The approach of the presented model, however, is essentially different from that of most of these models, since it is not oriented toward detecting illusory contours or locations of object occlusion. The model defines general preference rules for object boundaries and finds a stable minimizer of these rules. The illusory contours emerge as a by-product of the minimization. Since the essential approach of the model is the prediction of illusory contours based on a general boundary detection approach, the model results cannot be compared to those of models that use a specific mechanism for constructing illusory contours. That a model does not follow a general boundary detection approach is manifested by its extraction of special image features.

Most of the existing models do extract special image features. For example, Madarasmi et al. (1994) use stochastic minimization of a functional to predict real and illusory contours of objects at different depth planes. The model is successfully applied to the Kanizsa square illusion, where it detects both the illusory square and the overlapped inducer objects. The model, however, extracts special image features, namely L and T junctions, and only a single image interpretation is predicted. On the other hand, the model of Kass et al. (1988) detects real and illusory contours using energy-minimizing splines. The model does not require extraction of special features, and both edge-induced and line-end-induced illusory contours are detected. However, the model is not fully automatic, since user interaction is required to draw the initial contour (see section Introduction). In addition, only a single image interpretation is predicted in their model.

Functional optimization is usually used to obtain the best solution to a problem, and only the global minimum is considered important (Figueiredo et al., 2003). Local minima are often considered to be disruptive and efforts are made to avoid them (Lee, 1995). The idea of a functional that has multiple minima is strongly related to the Gestalt psychology concept of Prägnanz: a simple and stable grouping (Koffka, 1935). Since simplicity is measured by the cost functional, a local minimum of the functional indeed represents a simple and stable interpretation. Moreover, the values of the functional achieved at the different minima provide a general method to compare the solutions at these minima. The multiple interpretations of the image are found in our model as the multiple stable minima of a functional. Thus, expressing multiple plausible solutions of a problem as multiple local minima of a functional is a new approach in the framework of functional optimization.

half the size of the illusory square). (B) The first model interpretation representing the square object, with partially illusory contours, occluding four circular objects. (C) Zoom-in into illusory boundary region between two pacmans, marked with dotted square in (B).

FIGURE 11 | An optimization test showing that a square object can be determined from the border ownership map found by the model. The object extraction is for the first interpretation of the Kanizsa square with support ratio 0.57 (Figure 10). Four optimization stages at different numbers of iterations are shown. In the images, the white region is the object at the depicted iteration. The green lines show the input image edges, which are shown for reference.

The method used to avoid minima that were already found in a functional (section Finding multiple local minima) is related to the filled function method (Renpu, 1990), which has been used to find the global minimizer of a functional. In that method, an identified local minimum is replaced by a maximum in the functional. The main difference between the methods is the nature of the change in the function. The filled function depends on the functional in a complicated way, while in the proposed method the repulsive term is simply added to the cost functional. In addition, our minimization is always initiated from the same point, whereas the filled function method requires trials over a set of directions, which is computationally less efficient.
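The repulsive-term strategy can be illustrated with a minimal sketch on a one-dimensional toy cost, which is not the functional of the model. The cost `f`, the Gaussian form of the repulsion, its `height` and `width`, the step size, and the final refinement on the unmodified cost are all assumptions made for illustration only.

```python
import math

def f(x):
    # Hypothetical toy cost with two local minima (not the model's functional).
    return (x * x - 1.0) ** 2 + 0.3 * x

def repulsion(x, found, height=2.0, width=1.0):
    # Sum of Gaussian bumps centered on the minima already found (assumed form).
    return sum(height * math.exp(-((x - m) / width) ** 2) for m in found)

def minimize(cost, x0, step=0.01, iters=5000, h=1e-5):
    # Plain gradient descent with a central-difference gradient estimate.
    x = x0
    for _ in range(iters):
        grad = (cost(x + h) - cost(x - h)) / (2 * h)
        x -= step * grad
    return x

def find_minima(x0=0.0, n=2):
    found = []
    for _ in range(n):
        # The repulsive term is simply added to the cost; the start point is fixed.
        x = minimize(lambda t: f(t) + repulsion(t, found), x0)
        # Refine on the original cost so the bump does not bias the location.
        found.append(minimize(f, x))
    return found

minima = find_minima()
costs = [f(m) for m in minima]
```

In this toy example the two minima can then be ranked by the values in `costs`, in the same spirit as comparing interpretations by the value of the functional.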

The level set method (section Retrieving object shape by contour evolution) can be used to find not only the boundary of the top-most object, but also the boundaries of additional objects. To do this, the initial small object should be placed adjacent to part of the boundary of the other object. This makes it possible, for example, to complete the boundary of an occluded object.

The constants in the model were chosen by trial and error. Since the presented model proposes a new approach to the boundary detection task and is already quite complex at this stage, it is hard to also make it fully robust. Previous models introducing a new conception also did not supply a parameter sensitivity test at the first stage (Geiger et al., 1996). In any case, the same set of parameters was used for all experiments; hence, in our experience, the model is not very sensitive to the parameter choice.

The proposed proof-of-concept model is restricted to gray scale images with solid non-textured regions and without lines. In its current version the model is not yet applicable to contour integration or to the detection of illusory lines, such as those defined by abutting gratings, since it does not include components dealing with lines or texture. Dealing with such images will require extending the measure of "description length" in the functional (van Tuijl, 1975) to include textured regions. It would be very interesting to compare the model to available psychophysical data, such as classification images obtained from human participants (Murray et al., 2005); however, this is currently out of the scope of the presented preliminary model.

Future work is planned to develop a robust model for object detection in real-world images. For this purpose, the object-boundary-based approach of the current model should probably be replaced by an area-based approach. We expect that this change will make the model much simpler since, for example, matching the image by regions does not even require extraction of edges in the image. This change can also enable us to account for region-based effects in the Kanizsa illusion (Kanizsa, 1976; Grossberg and Mingolla, 1987; Spehar, 2000; Ron and Spitzer, 2011).

# AUTHOR CONTRIBUTIONS

AY developed and tested the model. HS supervised the work, made contributions to the model and reviewed the paper.

#### SUPPLEMENTARY MATERIAL

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fncom.2018.00106/full#supplementary-material


**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2019 Yankelovich and Spitzer. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# In Praise of Artifice Reloaded: Caution With Natural Image Databases in Modeling Vision

#### Marina Martinez-Garcia<sup>1,2</sup>, Marcelo Bertalmío<sup>3</sup> and Jesús Malo<sup>1</sup> \*

<sup>1</sup> Image Processing Lab, Universitat de València, Valencia, Spain, <sup>2</sup> CSIC, Instituto de Neurociencias, Alicante, Spain, <sup>3</sup> Departamento de Tecnologías de la Información y las Comunicaciones, Universidad Pompeu Fabra, Barcelona, Spain

Subjective image quality databases are a major source of raw data on how the visual system works in naturalistic environments. These databases describe the sensitivity of many observers to a wide range of distortions of different nature and intensity seen on top of a variety of natural images. Data of this kind seems to open a number of possibilities for the vision scientist to check the models in realistic scenarios. However, while these natural databases are great benchmarks for models developed in some other way (e.g., by using the well-controlled artificial stimuli of traditional psychophysics), they should be carefully used when trying to fit vision models. Given the high dimensionality of the image space, it is very likely that some basic phenomena are under-represented in the database. Therefore, a model fitted on these large-scale natural databases will not reproduce these under-represented basic phenomena that could otherwise be easily illustrated with well selected artificial stimuli. In this work we study a specific example of the above statement. A standard cortical model using wavelets and divisive normalization tuned to reproduce subjective opinion on a large image quality dataset fails to reproduce basic cross-masking. Here we outline a solution for this problem by using artificial stimuli and by proposing a modification that makes the model easier to tune. Then, we show that the modified model is still competitive in the large-scale database. Our simulations with these artificial stimuli show that when using steerable wavelets, the conventional unit norm Gaussian kernels in divisive normalization should be multiplied by high-pass filters to reproduce basic trends in masking. Basic visual phenomena may be misrepresented in large natural image datasets but this can be solved with model-interpretable stimuli. This is an additional argument in praise of artifice in line with Rust and Movshon (2005).

Keywords: natural stimuli, artificial stimuli, subjective image quality databases, wavelet + divisive normalization, contrast masking

# 1. INTRODUCTION

In the age of big data one may think that machine learning applied to representative databases will automatically lead to accurate models of the problem at hand. For instance, the problem of modeling the perceptual difference between images showed up in the discussion of eventual challenges at the NIPS-11 Metric Learning Workshop (Shakhnarovich et al., 2011). However, despite its interesting implications in visual neuroscience, the subjective metric of the image space was dismissed as a trivial regression problem because there are subjectively-rated image quality databases that can be used as training set for supervised learning.

#### Edited by:

Hedva Spitzer, Tel Aviv University, Israel

#### Reviewed by:

Sophie Wuerger, University of Liverpool, United Kingdom Kendrick Norris Kay, University of Minnesota Twin Cities, United States

> \*Correspondence: Jesús Malo jesus.malo@uv.es

#### Specialty section:

This article was submitted to Perception Science, a section of the journal Frontiers in Neuroscience

Received: 21 December 2017 Accepted: 07 January 2019 Published: 18 February 2019

#### Citation:

Martinez-Garcia M, Bertalmío M and Malo J (2019) In Praise of Artifice Reloaded: Caution With Natural Image Databases in Modeling Vision. Front. Neurosci. 13:8. doi: 10.3389/fnins.2019.00008

Subjective image and video quality databases (such as VQEG, LIVE, TID, CID, CSIQ)<sup>1</sup> certainly are a major source of raw data on how the visual system works in naturalistic environments. These databases describe the sensitivity of many observers to a wide range of distortions (of different nature and with different suprathreshold intensities) seen on top of a variety of natural images. These databases seem to open a number of possibilities to check the models in realistic scenarios.

Following a tradition that links the image quality assessment problem in engineering with human visual system models (Sakrison, 1977; Watson, 1993; Wang and Bovik, 2009; Bodrogi et al., 2016), these subjectively rated image databases have been used to fit models coming from classical psychophysics or physiology (Watson and Malo, 2002; Laparra et al., 2010; Malo and Laparra, 2010; Bertalmio et al., 2017). Given the similarity between these biological models (Carandini and Heeger, 2012) and feed-forward convolutional neural nets (Goodfellow et al., 2016), an interesting analogy is possible. Fitting the biological models to reproduce the opinion of the observers in the database is algorithmically equivalent to the learning stage in deep networks. This deep-learning-like use of the databases is a convenient way to train a physiologically-founded architecture to reproduce a psychophysical goal (Berardino et al., 2017; Laparra et al., 2017; Martinez-Garcia et al., 2018). When using these biologically-founded approaches, the parameters found have a straightforward interpretation as for instance the frequency bandwidth of the system or the extent of the interaction between sensors tuned to different features.

On the other hand, pure machine-learning (data-driven) approaches have also been used to predict subjective opinion. In this case, after extracting features with reasonable statistical meaning or perceptual inspiration, generic regression techniques are applied (Moorthy and Bovik, 2010, 2011; Saad et al., 2010, 2012, 2014), even though this regression has no biological grounds.

## 1.1. Eventual Problems With Databases

The problem with the above uses of naturalistic image databases is the conventional concern about training sets in machine learning: is the training set a balanced representation of the range of behaviors to be explained?

If it is not the case, the resulting model may be biased by the dataset and it will have generalization problems. This overfitting risk has been recognized by the authors of image quality metrics based on generic regression (Saad et al., 2012). Perceptually meaningful architectures impose certain constraints on the flexibility of the model, as opposed to generic regressors. These constraints could be seen as a sort of Occam's Razor in favor of lower-dimensional models. However, even in the biologically meaningful cases, there is a risk that the model found by fitting the naturalistic database misses well-known texture perception facts.

Accordingly, Laparra et al. (2010) and Malo and Laparra (2010) used artificial stimuli after the learning stage to check the Contrast Sensitivity Function and some properties of visual masking. Similarly, after training the deep network on the dataset, Ma et al. (2018) had to show model-related stimuli to human observers to check whether the results were meaningful (and to rule out possible over-fitting).

# 1.2. The Regression Hypothesis Questioned

In this work we question the hypothesis suggested at the NIPS Metric Learning Workshop (Shakhnarovich et al., 2011) that assumes that pure regression on naturalistic databases will lead to sensible vision models.

Of course, training any regression model with subjectively rated natural images to predict human opinion is a perfectly fine approach to tackle the restricted image quality problem. Actually, sometimes disregarding any prior knowledge about how the visual system works is seen as a plus (Bosse et al., 2018): the quantitative solution to this specific problem may gain nothing from understanding the elements of a successful regression model in terms of properties of actual vision mechanisms.

However, from a broader perspective, models intended to understand the behavior of the visual system should be more ambitious: they should be interpretable in terms of the underlying mechanisms and be able to reproduce other behavior. Our message here is that large-scale naturalistic databases should not be the only source of information when trying to fit vision models. Given the high dimensionality of the image space, it is very likely that some basic phenomena (e.g., the visibility of certain distortions in certain environments) are under-represented in the database. As a result, the model is not forced to reproduce these under-represented phenomena. And more importantly, the use of model-interpretable artificial stimuli is useful to determine the values of specific parameters in the model.

In particular, we study a specific example of the generalization risk suggested above and the benefits of model-based artificial stimuli. We show that a wavelet+divisive normalization layer of a standard cascade of linear+nonlinear layers fitted to maximize the correlation with subjective opinion on a large image quality database (Martinez-Garcia et al., 2018), fails to reproduce basic cross-masking. Here we point out the problem and we outline a solution using well selected artificial stimuli. Then, we show that the model corrected to account for these extra artificial tests is also a competitive explanation for the large-scale naturalistic database. This example is interesting because showing convincing Maximum Differentiation stimuli, as done in Berardino et al. (2017), Martinez-Garcia et al. (2018), and Ma et al. (2018), may not be enough to guarantee that the model reproduces related behaviors and points out the need to explicitly check with artificial stimuli.

<sup>1</sup>A non exhaustive list of references and links to subjective quality databases includes (Webster et al., 2001; Ponomarenko et al., 2009, 2015; Larson and Chandler, 2010; Pedersen, 2015; Ghadiyaram and Bovik, 2016).

# 1.3. In Praise of Artifice: Interpretable Models and Interpretable Stimuli

In line with Rust and Movshon (2005), our results in this work, namely pointing out the misrepresentation of basic visual phenomena in subjectively-rated natural image databases and the proposed procedure to fix it, are additional arguments in praise of artifice: the artificial model-motivated stimuli in classical visual neuroscience are helpful to (a) point out the problems that remain in models fitted to natural image databases, and (b) to suggest intuitive modifications of the models.

Regarding interpretable models, we propose a modification for the considered Divisive Normalization (Carandini and Heeger, 2012) that stabilizes its behavior. As a result of this stabilization, the model is easy to tune (even by hand) to qualitatively reproduce cross-masking. Interestingly, as a consequence of this modification and analysis with artificial stimuli, we show that the conventional unit-norm kernels in divisive normalization may have to be re-weighted depending on the selected wavelets.

It is important to note that the observations made in this work are not restricted to the specific image quality problem. Following seminal ideas based on information theory (Attneave, 1954; Barlow, 1959), theoretical neuroscience considers explanations of sensory systems based on statistical learning as alternative to physiological and psychophysical descriptions (Dayan and Abbott, 2005). Therefore, the points made below on natural image datasets, artificial stimuli from interpretable models, and optimization goals in statistical learning, also apply to a wider range of computational explanations.

The paper is organized as follows: section 2 describes the visual stimuli and introduces the cortical models considered in the work. First it illustrates the intuition that can be obtained from proper artificial stimuli as opposed to the not-so-obvious interpretation of natural stimuli. Then, it presents the structure of wavelet-like responses in V1 cortex and two standard neural interaction models: **Model A** (intra-band), and **Model B** (inter-band). Section 3 shows that, although **Model A** is tuned to maximize the correlation with subjective opinion in a large-scale naturalistic image quality database, it fails to reproduce basic properties of visual masking. Simulations with artificial stimuli allow intuitive tuning of **Model B** to get the correct contrast response curves while preserving the success on the large-scale naturalistic database. Finally, as suggested by the failure-and-solution example considered in this work, in section 4 we discuss the opportunities and precautions of the use of natural image databases to fit vision models, and the relevance of artificial stimuli based on interpretable models.

# 2. MATERIALS AND METHODS

Here we present the visual stimuli and the cortical interaction models considered throughout the work. The use of model-inspired artificial stimuli is critical to point out the limitations of simple models and to tune the parameters of more general models.

# 2.1. Natural vs. Artificial Stimuli

**Figure 1** shows a representative subset of the kind of patterns subjectively rated in image quality databases. This specific example comes from the TID2008 database (Ponomarenko et al., 2008). In these databases, natural scenes (photographic images with uncontrolled content) are corrupted by noise sources of different nature. Some of the noise sources are stationary and signal independent, while others are spatially variant and depend on the background. Ratings depend on the visibility of the distortion seen on top of the natural background. The considered distortions come in different suprathreshold intensities. In some cases these intensities have controlled (linearly spaced) energy or contrast, but in general, they come from arbitrary scales. Examples include different compression ratio or color quantization coarseness with no obvious psychophysical meaning. This is because the motivation of the original databases (e.g., VQEG or LIVE) was the assessment of distortions occurring in image processing applications (e.g., transmission errors in digital communication) and not necessarily to be a tool for vision science. More recent databases include more accurate control of luminance and color of both the backgrounds and the distortions (Pedersen, 2015), or report the intensities of the distortions in JND units (Alam et al., 2014). Perceptual ratings in such diverse sets certainly provide a great ground truth to check vision science models in naturalistic conditions.

However, the result of such variety is that the backgrounds and the tests seen on top have no clear interpretation in terms of specific perceptual mechanisms or controlled statistics in a representation with physiological meaning. Even though not specifically directed against subjectively rated databases, this was also the main drawback pointed out in Rust and Movshon (2005) against the use of generic natural images in vision science experiments.

In this work we go a step further in that criticism: due to the uncontrolled nature of the natural scenes and the somewhat arbitrary distortions found in these databases, the different aspects of a specific perceptual phenomenon are not fully represented in the database. Therefore, these databases should be used carefully when training models because this misrepresentation will have consequences when fitting the models.

For instance, let's consider pattern masking (Foley, 1994; Watson and Solomon, 1997). It is true that some distortions in the databases introduce relatively more noise in high contrast regions, which seems appropriate to illustrate masking. This is the case of the JPEG or JPEG2000 artifacts, or the so-called masked noise in the TID database. See for instance the third example in the first row of **Figure 1**. These deviations on top of high contrast regions are less visible than equivalent deviations of the same energy on top of flat backgrounds. This difference in visibility is due to the inhibitory effect of surround in masking (Foley, 1994; Watson and Solomon, 1997). Actually, perceptual improvements of image coding standards critically depend on using better masking models that allow using fewer bits in those regions (Malo et al., 2000a, 2001, 2006; Taubman and Marcellin, 2001). Appropriate prediction of the visibility of these distortions in the database should come from an accurate model of texture masking. However, a systematic set of examples illustrating the different aspects of masking is certainly not present in the databases. For example, there are no stimuli showing cross-masking between different frequencies in different backgrounds. Therefore, this phenomenon is under-represented in the database.

constitute the ground truth that should be predicted by vision models from the variation of the responses due to the distortions.

Such basic texture perception facts can be easily illustrated using artificial stimuli. Artificial stimuli can be designed with a specific perceptual phenomenon in mind, using patterns which have specific consequences in models, e.g., stimulation of certain sensors of the model. Model/phenomenon-based stimuli are the standard approach in classical psychophysics and physiology. **Figure 2** is an example of the power of well controlled artificial stimuli: it represents a number of major texture perception phenomena in a single figure.

This figure shows two basic tests (low-frequency vertical and high-frequency horizontal) of increasing contrast from left to right. These series of tests are shown (a) on top of no background, and (b) on top of backgrounds of controlled frequency and orientation.
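A series of this kind is straightforward to generate. The following is a minimal sketch, not the code used to produce **Figure 2**: the image size, the numbers of cycles, the contrast values, and the helper names are hypothetical, and pixel values are deviations around the mean luminance rather than calibrated luminances.

```python
import math

def grating(size, cycles, orientation_deg, contrast):
    # Zero-mean sinusoid with `cycles` periods across the image.
    # orientation_deg = 0 gives vertical bars, 90 gives horizontal bars.
    theta = math.radians(orientation_deg)
    cx, cy = math.cos(theta), math.sin(theta)
    return [[contrast * math.sin(2 * math.pi * cycles * (x * cx + y * cy) / size)
             for x in range(size)] for y in range(size)]

def superimpose(a, b):
    # Pixel-wise sum of two images of the same size.
    return [[p + q for p, q in zip(row_a, row_b)] for row_a, row_b in zip(a, b)]

def euclidean(a, b):
    # Euclidean distance between two images.
    return math.sqrt(sum((p - q) ** 2
                         for row_a, row_b in zip(a, b)
                         for p, q in zip(row_a, row_b)))

size = 64
contrasts = [0.0, 0.05, 0.1, 0.2, 0.4]
# Low-frequency vertical background and high-frequency horizontal test.
background = grating(size, cycles=4, orientation_deg=0, contrast=0.3)
series = [superimpose(background, grating(size, 16, 90, c)) for c in contrasts]
distances = [euclidean(background, img) for img in series]
```

Here `distances` grows with the contrast of the test regardless of the background, which is exactly what a plain Euclidean measure can and cannot capture.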

First, of course we can see that the visibility of the tests (or the response of the mechanisms that mediate visibility) increases with contrast, from left to right. This is why even the trivial Euclidean distance between the original and the distorted images is positively correlated with subjective opinion of distortion.

Second, the visibility, or the responses, depend(s) on the frequency of the test. Note that the lower frequency test is more visible than the high frequency test at reading distance. This illustrates the effect of the Contrast Sensitivity Function (Campbell and Robson, 1968).

Third, the response increase is non-linear with contrast. Note that for lower contrasts (e.g., from the second picture to the third in the series) the increase in visibility is bigger than for higher contrasts (e.g., between the pictures at the right-end). This means that the slope of the mechanisms mediating the response is high for lower amplitudes and saturates afterwards. This sort of Weber-like behavior for contrast is a distinct feature of contrast masking (Legge, 1981).

Finally, the visibility (or response) decreases with the background energy depending on the spatio-frequency similarity between test and background. Note for instance that the low frequency test is less visible on top of the low frequency background than on top of the high frequency background. Importantly for the example considered throughout this paper, note that the visibility of the high frequency test behaves the other way around: it is bigger on top of the low frequency background. Moreover, this masking effect is bigger for bigger contrasts of the background. This adaptivity of the nonlinearity is a distinct feature of the masking effect (Foley, 1994; Watson and Solomon, 1997), and more importantly, it is a distinct feature of real neurons (Carandini and Heeger, 1994, 2012) with regard to the simplified neurons used in deep learning (Goodfellow et al., 2016).

As a result, just by looking at **Figure 2**, one may imagine how the visibility (or response) curves vs. the contrast of the test should be for the series of stimuli presented. **Figure 3** shows an experimental example of the kind of response curves obtained in actual neurons in masking situations. Note the saturation of the response curves and how they are attenuated when the background is similar to the test. Even this qualitative behavior highlighted in green (saturation and attenuation) may be used to discard models that do not reproduce the expected behavior, i.e., that do not agree with what we are seeing.

More importantly, the relative visibility of these artificial stimuli can also be used to intuitively tune the parameters of a model to better reproduce the visible behavior. This can be done because these artificial stimuli were crafted to have a clear interpretation in a standard model of texture vision: a set of V1-like wavelet neurons (oriented receptive fields tuned to different frequency scales). **Figure 4** illustrates this fact: note how the test patterns considered in the figure mainly stimulate a specific subband of a 3-scale 4-orientation steerable wavelet pyramid (Simoncelli et al., 1992), which is a commonly used model of V1 sensors. As a result, it is easy to select the set of sensors that will drive the visibility descriptor in the model: see the highlighted wavelet coefficients in the diagrams at the right of **Figure 4**.

The same intuitive energy distribution over the pyramid is true for the backgrounds, which stimulate the corresponding subband (scale and orientation). As a result, given the distribution of test and backgrounds in the pyramid, it is easy to propose intuitive cross-band inhibition schemes to lead to the required decays in the response.

The intuitions obtained from artificial model-oriented stimuli about response curves and possible cross-masking schemes are fundamental both to criticize the results obtained from blind learning from a database, and to propose intuitive improvements of the model.

#### 2.2. Cortical Interaction Models: Structure and Response

In this work we analyze the behavior of standard retina-cortex models that follow the program suggested in Carandini and Heeger (2012) i.e., cascades of isomorphic linear+nonlinear layers, each focused on a different psychophysical factor:



This family of models represents a system, S, that depends on some parameters, Θ, and applies a series of transforms on the input radiance vector, **x**<sup>0</sup>, to get a series of intermediate response vectors, **x**<sup>i</sup> (Equation 1).

Each layer in this sequence accounts for the corresponding psychophysical phenomenon outlined above and is the concatenation of a linear transform, L, and a nonlinear transform, N (Equation 2).

Here, in each layer we use convolutional filters for the linear part and the canonical Divisive Normalization for the nonlinear part. The mathematics of this type of models required to set their parameters are detailed in Martinez-Garcia et al. (2018).
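A compact way to write the cascade described above is sketched next. The notation is an assumption made here for illustration: S<sup>(i)</sup> denotes the i-th layer, L<sup>(i)</sup> and N<sup>(i)</sup> its linear and nonlinear parts, and **y**<sup>i</sup> the intermediate linear response of that layer.

$$\mathbf{x}^0 \xrightarrow{S^{(1)}} \mathbf{x}^1 \xrightarrow{S^{(2)}} \mathbf{x}^2 \xrightarrow{S^{(3)}} \mathbf{x}^3 \xrightarrow{S^{(4)}} \mathbf{x}^4$$

$$\mathbf{x}^{i-1} \xrightarrow{\mathcal{L}^{(i)}} \mathbf{y}^{i} \xrightarrow{\mathcal{N}^{(i)}} \mathbf{x}^{i}$$

Under this reading, the vector **y** that enters the divisive normalization of the last layer is the output of the wavelet filter-bank, and **x**<sup>4</sup> is the response used to compute visibility.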

In this kind of model, the psychophysical behavior (visibility of a test) is obtained from the behavior of individual units (increment of responses) through some sort of summation. The visibility of a test, Δ**x**<sup>0</sup>, seen on top of a background, **x**<sup>0</sup>, is given by the perceptual distance between background and background+test. Specifically, this perceptual distance, d<sub>p</sub>, may be computed through the q-norm of the vector with the increment of responses in the last neural layer (Watson and Solomon, 1997; Laparra et al., 2010; Martinez-Garcia et al., 2018). In the 4-layer model of Equation 1, we have ‖Δ**x**<sup>4</sup>‖<sub>q</sub>:

$$d_{p}(\mathbf{x}^0, \mathbf{x}^0 + \Delta \mathbf{x}^0) = \|\Delta \mathbf{x}^4\|_{q} = \left(\sum_{j} |\Delta x_j^4|^q\right)^{\frac{1}{q}}\tag{3}$$

There is a variety of summation schemes: one may choose to use different summation exponents for different features (e.g., splitting the sum over j in space, frequency, and orientation), and order of summation matters if the exponents for the different features are not the same. Besides, there is no clear consensus on the value of the summation exponents either (Graham, 1989): the default quadratic summation choice, q = 2 (Teo and Heeger, 1994; Martinez-Garcia et al., 2018), has been questioned proposing bigger (Watson and Solomon, 1997; Laparra et al., 2010) and smaller (Laparra et al., 2017) summation exponents.

More important than all the above technicalities, the key points in Equation (3) are: (a) it clearly relates the visibility with the response of the units, and (b) for q ≥ 2, the visibility is driven by the response of the units that undergo bigger variation, |Δx<sub>j</sub><sup>4</sup>|, such as the ones highlighted in **Figure 4**. Therefore, in this kind of model, analyzing the visibility curves or the response curves of the units tuned to the test is qualitatively the same. In the simulations we do the latter since we are interested in direct observation of the effect of the interaction parameters on the curves; and this is clearer when looking at the response of selected subsets of units such as those highlighted in **Figure 4**.
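The summation in Equation (3) and the dominance of the most strongly modulated units can be checked with a small sketch. The vector `delta` of response increments is made up for illustration, and `share_of_largest` is a hypothetical helper, not a quantity defined in the cited works.

```python
def perceptual_distance(delta, q=2.0):
    # Minkowski q-norm of the response increments in the last layer (Equation 3).
    return sum(abs(d) ** q for d in delta) ** (1.0 / q)

def share_of_largest(delta, q):
    # Fraction of the summed |increment|^q contributed by the single largest unit.
    powered = [abs(d) ** q for d in delta]
    return max(powered) / sum(powered)

# Made-up increments: two units respond strongly to the test, three barely change.
delta = [0.05, -0.02, 0.6, 0.01, -0.5]
shares = [share_of_largest(delta, q) for q in (1.0, 2.0, 8.0)]
distance = perceptual_distance(delta, q=2.0)
```

The values in `shares` increase with the exponent, so for larger q the distance is increasingly determined by the few units tuned to the test.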

In this work we compare two specific examples of this family of models. These two models will be referred to as **Model A** and **Model B**. They have identical layers 1–3, and they only differ in the nonlinear part of the fourth layer: the stage describing the interaction between cortical oriented receptive fields. In **Model A** we only consider interactions between the sensors tuned to the same subband (scale and orientation) because we proved that this simple scheme is appropriate to obtain good performance in subjectively rated databases (Laparra et al., 2010; Malo and Laparra, 2010). In **Model B**, on top of the intra-band relation, we also consider inter-band relations according to a standard unit-norm Gaussian kernel over space, scale and orientation (Watson and Solomon, 1997). In addition to the classical inter-band generalization, we also include extra weights and a stabilization constant that makes the model easier to understand. The software implementing **Model A** and **Model B** is available at "http://isp.uv.es/docs/BioMultiLayer\_L\_NL\_a\_and\_b.zip".

Let's consider the differences between the models in more detail. Assuming that the output of the wavelet filter-bank is the vector **y**, and assuming that the vector of energies of the coefficients is obtained by coefficient-wise rectification and exponentiation, **e** = |**y**|<sup>γ</sup>, the vector of responses after divisive normalization in the last layer of **Model A** is:

$$\mathfrak{x} = \text{sign}(\mathfrak{y}) \odot \frac{\mathfrak{e}}{\mathfrak{b} + H \cdot \mathfrak{e}} \tag{4}$$

where ⊙ stands for element-wise Hadamard product and the division is also an element-wise Hadamard quotient where the energy of each linear response is divided by a linear combination of the energies of the neighboring coefficients in the wavelet pyramid. This linear combination (that attenuates the response) is given by the matrix-on-vector product H · **e**. Note that, for simplicity, in Equation 4 we omitted the indices referring to the 4th layer [as opposed to the more verbose formulation in the Appendix (**Supplementary Material**)].

The i-th row of this matrix, H, tells us how the responses of neighbor sensors in the vector **e** attenuate the response of the i-th sensor in the numerator, e<sub>i</sub>. The attenuating effect of these linear combinations is moderated by the semisaturation constants in vector **b**.
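Equation (4) translates directly into a few lines of array code. The sketch below is only illustrative: the function name `model_a_response`, the three-sensor kernel, and the values of **b** and γ are assumptions chosen for readability, not the fitted parameters of **Model A**.

```python
import numpy as np

def model_a_response(y, H, b, gamma):
    """Equation (4): x = sign(y) * e / (b + H @ e), with e = |y|**gamma."""
    e = np.abs(y) ** gamma
    return np.sign(y) * e / (b + H @ e)

# Toy kernel for three sensors (hypothetical values).
H = np.array([[1.0, 0.5, 0.0],
              [0.5, 1.0, 0.5],
              [0.0, 0.5, 1.0]])
b = np.full(3, 0.1)

# Sensor 0 stimulated alone, and with extra energy in its neighbor (sensor 1).
x_alone = model_a_response(np.array([0.5, 0.0, 0.0]), H, b, gamma=1.0)
x_masked = model_a_response(np.array([0.5, 0.4, 0.0]), H, b, gamma=1.0)

# The sign of the linear response is preserved by the nonlinearity.
x_signed = model_a_response(np.array([0.5, -0.2, 0.1]), H, b, gamma=1.0)
```

The response of sensor 0 is lower in `x_masked` than in `x_alone` because the energy of the neighbor enters the denominator through the first row of H.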

The structure of these vectors and matrices is relevant to understand the behavior on the stimuli. First, one must consider that all the vectors, **y**, **e**, and **x**, have wavelet-like structure. **Figure 4** shows this subband structure for specific artificial stimuli and **Figure 5** shows it for natural stimuli.

The i-th coefficient has a 4-dimensional spatio-frequency meaning, i ≡ (**p**<sub>i</sub>, f<sub>i</sub>, φ<sub>i</sub>), where **p** is a two-dimensional location, f is the modulus of the spatial frequency, and φ is orientation.

In **Model A** we only consider Gaussian intra-band relations. This means that the interactions in H decay with spatial distance and are zero between sensors tuned to different frequencies and orientations. This implies a block-diagonal structure in H with zeros in the off-diagonal blocks. In Martinez-Garcia et al. (2018) the norm of each Gaussian neighborhood (or row) in H was optimized to maximize the correlation with subjective opinion.
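The block-diagonal structure can be made concrete with a small example. This is a minimal sketch, assuming that the norm of each row is its sum; the helper names (`grid`, `gaussian_spatial_block`, `intra_band_kernel`), the grid sizes, and the widths are hypothetical.

```python
import numpy as np

def grid(n):
    """Positions of an n x n lattice of sensors in the unit square."""
    axis = np.linspace(0.0, 1.0, n)
    xx, yy = np.meshgrid(axis, axis)
    return np.stack([xx.ravel(), yy.ravel()], axis=-1)

def gaussian_spatial_block(positions, sigma, norm=1.0):
    """Gaussian interactions within ONE subband; each row sums to `norm`."""
    d2 = np.sum((positions[:, None, :] - positions[None, :, :]) ** 2, axis=-1)
    G = np.exp(-d2 / (2.0 * sigma ** 2))
    return norm * G / G.sum(axis=1, keepdims=True)

def intra_band_kernel(blocks):
    """Assemble a block-diagonal H: zero interaction between subbands."""
    n = sum(B.shape[0] for B in blocks)
    H = np.zeros((n, n))
    start = 0
    for B in blocks:
        stop = start + B.shape[0]
        H[start:stop, start:stop] = B
        start = stop
    return H

# Two hypothetical subbands: a fine one (16 sensors) and a coarse one (4 sensors).
fine = gaussian_spatial_block(grid(4), sigma=0.1, norm=1.0)
coarse = gaussian_spatial_block(grid(2), sigma=0.4, norm=0.5)
H = intra_band_kernel([fine, coarse])
```

Here the per-subband `norm` and `sigma` play the role of the amplitude and width that are optimized per subband.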

It is important to stress that the specific distribution of responses of natural images over the subbands of the response vector (green line in **Figure 5**) is critical to reproduce the good behavior of the model on the database. Note that this is not a regular (linear) wavelet transform, but the (nonlinear) response vector. Therefore, this distribution tells us both about the statistics of natural images and about the behavior of the visual system. On the one hand, natural images have relatively more energy in the low-frequency end. But, on the other hand, it is visually relevant that the response of sensors tuned to the high frequency details is much lower than the response of the sensors tuned to the low frequency details. The latter is in line with the different visibility of the artificial stimuli of different frequency shown in **Figure 2**, and it is probably due to the effect of the Contrast Sensitivity Function (CSF) in earlier stages of the model. This is important because keeping this relative magnitude between subbands is crucial to have good alignment with subjective opinion in the large-scale database.

In the case of **Model B**, we consider (a) a more general interaction kernel in the divisive normalization, and (b) a constant diagonal matrix to control the dynamic range of the responses. Specifically, the vector of responses is:

$$\mathfrak{x} = \operatorname{sign}(\mathfrak{y}) \odot \left[ \kappa \odot \frac{\mathfrak{b} + H_G \cdot \mathfrak{e}^\star}{\mathfrak{e}^\star} \right] \odot \frac{\mathfrak{e}}{\mathfrak{b} + H_G \cdot \mathfrak{e}}.\tag{5}$$

Here the response still follows a nonlinear divisive normalization because **e**<sup>⋆</sup> is just a fixed vector (not a variable), and hence the term in brackets is just another constant vector. In **Model B**, following Watson and Solomon (1997), we consider a generalized interaction kernel H<sub>G</sub> that consists of separable Gaussian functions which depend on the distance between the location of the sensors, H<sub>**p**</sub>, and on the difference between their scales and orientations, H<sub>f</sub> and H<sub>φ</sub>. Moreover, we extend the unit-norm Gaussian kernel already proposed in Watson and Solomon (1997) with additional weights in case extra inter-band tuning is needed:

$$H_G = \mathbb{D}_{\mathfrak{c}} \cdot \left[ H_{\mathfrak{p}} \odot H_{f} \odot H_{\phi} \odot C_{\mathrm{int}} \right] \cdot \mathbb{D}_{\mathfrak{w}},\tag{6}$$

where C<sub>int</sub> is a subband-wise full matrix, D<sub>**w**</sub> is a diagonal matrix with vector **w** in the diagonal, and the normalization of each row of the kernel is controlled by a diagonal matrix D<sub>**c**</sub>, which contains the vector of normalization constants, **c**, in the diagonal. This means that the elements c<sub>i</sub> normalize each interaction neighborhood, and the elements w<sub>j</sub> control the relative relevance of the energies e<sub>j</sub> before these are considered for the interaction.
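A possible construction of the kernel in Equation (6) is sketched below. It rests on several assumptions: the product of Gaussians is normalized to unit row sum before applying **c**, the scale distance is measured in octaves, the orientation distance wraps at 180 degrees, and C<sub>int</sub> is expanded to one entry per pair of sensors. The function names and the four-sensor example are hypothetical.

```python
import numpy as np

def gaussian(d, sigma):
    return np.exp(-d ** 2 / (2.0 * sigma ** 2))

def generalized_kernel(p, f, phi, sigma_p, sigma_f, sigma_phi, c, w, C_int=None):
    """Equation (6): H_G = D_c . [H_p * H_f * H_phi * C_int] . D_w."""
    dp = np.linalg.norm(p[:, None, :] - p[None, :, :], axis=-1)  # spatial distance
    df = np.abs(np.log2(f[:, None] / f[None, :]))                # scale distance (octaves)
    dphi = np.abs(phi[:, None] - phi[None, :]) % 180.0
    dphi = np.minimum(dphi, 180.0 - dphi)                        # orientation wraps at 180 deg
    G = gaussian(dp, sigma_p) * gaussian(df, sigma_f) * gaussian(dphi, sigma_phi)
    if C_int is not None:
        G = G * C_int
    G = G / G.sum(axis=1, keepdims=True)                         # unit-norm rows (assumption)
    return c[:, None] * G * w[None, :]                           # rows scaled by c, columns by w

# Four hypothetical sensors: (location in deg, frequency in cpd, orientation in deg).
p = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.0], [0.0, 0.0]])
f = np.array([4.0, 4.0, 8.0, 8.0])
phi = np.array([0.0, 0.0, 0.0, 90.0])

ones = np.ones(4)
H_unit = generalized_kernel(p, f, phi, 0.2, 1.0, 30.0, ones, ones)

c = np.array([1.0, 1.0, 2.0, 2.0])
w = np.array([0.5, 0.5, 1.5, 1.5])
H_tuned = generalized_kernel(p, f, phi, 0.2, 1.0, 30.0, c, w)
```

Multiplying by `c[:, None]` and `w[None, :]` is equivalent to the left and right products with the diagonal matrices, so **c** acts on the rows (interaction neighborhoods) and **w** on the columns (incoming energies).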

Adelson, 1990), data in the vector are organized from high-frequency (fine scales at the left) to low-frequency (coarse scales at the right), wavelet vector (E). Abscissas indicate the wavelet coefficient. The specific scale of the ordinate axis is not relevant. Solid vertical lines in red indicate the limits of the different scales. Within each scale, the dashed lines in pink indicate the limits of the different orientations. The different coefficients within each scale/orientation block correspond to different spatial locations. The line in green shows the average amplitude per subband for a set of natural images. As discussed in the text, this specific energy distribution per scale is relevant for the good performance of the model.

In addition to the generalized kernel, the other distinct difference of **Model B** is the extra constant K(**e**<sup>⋆</sup>) = κ ⊙ [(**b** + H<sub>G</sub> · **e**<sup>⋆</sup>) / **e**<sup>⋆</sup>], where the quotient is element-wise. This constant has a relevant qualitative rationale: it keeps the response bounded regardless of the choice for the other parameters.

Note that, when the input energy, **e**, reaches the reference value, **e**<sup>⋆</sup>, the response of **Model B** reduces to the vector κ regardless of the model parameters. This simplifies the qualitative control of the dynamic range of the system because one may set a desired output κ (e.g., certain amplitudes per subband) for some relevant reference input **e**<sup>⋆</sup> regardless of the other parameters. This stabilization constant, K(**e**<sup>⋆</sup>), does not modify the qualitative effect of the relevant parameters of the divisive normalization, but, as it constrains the dynamic range, it allows the modeler to freely play with the relevant parameters γ, **b**, and H<sub>G</sub>, and still preserve the relative amplitude of the subbands. And this freedom is particularly critical to understand the kind of modifications needed in the parameters to reproduce a given experimental trend.
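The stabilization property follows from substituting **e** = **e**<sup>⋆</sup> in Equation (5): the two quotients cancel and only κ remains. The sketch below checks this numerically with randomly drawn (hypothetical) parameters; `model_b_response` is an illustrative name, not a function of the released code.

```python
import numpy as np

def model_b_response(y, H_G, b, gamma, kappa, e_star):
    """Equation (5): x = sign(y) * K(e_star) * e / (b + H_G @ e)."""
    e = np.abs(y) ** gamma
    K = kappa * (b + H_G @ e_star) / e_star     # constant vector fixed by e_star
    return np.sign(y) * K * e / (b + H_G @ e)

rng = np.random.default_rng(0)
n = 6
H_G = rng.uniform(0.0, 1.0, size=(n, n))       # arbitrary kernel (hypothetical)
b = rng.uniform(0.01, 0.5, size=n)
kappa = np.linspace(1.0, 0.2, n)                # desired output amplitudes
e_star = rng.uniform(0.1, 1.0, size=n)          # reference input energies
gamma = 0.7

y_ref = e_star ** (1.0 / gamma)                 # input whose energy equals e_star
x_ref = model_b_response(y_ref, H_G, b, gamma, kappa, e_star)

# Changing b and the kernel does not move the response at the reference input.
x_ref_other = model_b_response(y_ref, 3.0 * H_G, 0.1 * b, gamma, kappa, e_star)
```

Both `x_ref` and `x_ref_other` equal `kappa` up to rounding, which is why the kernel and the semisaturation can be modified by hand without losing the relative amplitude of the subbands at the reference input.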

Here we propose that **e**<sup>⋆</sup> is related to the average energy of the input to this nonlinear neural layer. Similarly, we propose to set the global scaling factor, κ, according to a desired dynamic range in the output of this neural layer. These stabilization settings simplify the use of the model, making it possible to obtain the desired qualitative behavior even when modifying the parameters by hand. Interestingly, this freedom to explore will reveal the modulation required in the conventional unit-norm Gaussian kernel.

## 3. RESULTS

In this section we show the performance of **Model A** and **Model B** in two scenarios: (a) reproducing subjective opinion in large-scale naturalistic databases using quadratic summation in Equation 3, and (b) obtaining meaningful contrast response curves for artificial stimuli.

The parameters of **Model A** are those obtained in Martinez-Garcia et al. (2018) to provide the best possible fit to the mean opinion scores on a large natural image database. These parameters of **Model A** are kept fixed throughout the simulations in this section. On the contrary, in the case of **Model B**, we start from a base-line situation in which we import the parameters from **Model A**, but afterwards, this naive guess is fine tuned to get reasonable response curves for the artificial stimuli considered above. Our goal is checking if the models account for the trends of masking described in **Figures 2**, **3**: we are not fitting actual experimental data but just refuting models that do not follow the qualitative trend.

In this model verification context, the fine tuning of **Model B** is done by hand: we just want to stress that while **Model A** cannot account for specific inter-band interactions, the interpretability of **Model B** when using the proper artificial stimuli makes it very easy to tune. And this intuitive tuning is possible thanks to the stabilization effect of the constant K(**e**<sup>⋆</sup>) proposed above.

Nevertheless, it is important to stress that the Jacobian with regard to the parameters of **Model B** given in the Appendix (**Supplementary Material**) is implemented in the code associated with the paper. Therefore, although the exploration of the responses in this section is just qualitative, the code of **Model B** is ready for gradient descent tuning if one decides to measure the contrast incremental thresholds for the proper artificial stimuli.

Accurate control of spatial frequency, luminance, contrast and appropriate rendering of artificial stimuli can be done using the generic routines of VistaLab (Malo and Gutiérrez, 2014). In order to do so, one has to take into account a sensible sampling frequency (e.g., bigger than 60 cpd to avoid aliasing at visible frequencies) and the corresponding central frequencies and orientations of the selected wavelet filters in the model. The specific software used in this paper to generate the stimuli and to compute the response curves is available at: "http://isp.uv.es/docs/ArtificeReloaded.zip".
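As an illustration of the kind of control involved, the sketch below generates a sinusoidal test on top of a sinusoidal background with prescribed frequency, contrast and orientation. It is not the VistaLab code: the function names, the mean luminance, the frequencies, and the sampling frequency of 64 samples per degree are assumptions made for this example.

```python
import numpy as np

def grating(size_deg, fs, freq_cpd, contrast, orientation_deg=0.0, mean_lum=50.0):
    """Sinusoidal luminance grating; `fs` is the sampling frequency in samples/deg."""
    if freq_cpd >= fs / 2.0:
        raise ValueError("frequency above the Nyquist limit: the grating would alias")
    n = int(round(size_deg * fs))
    coords = np.arange(n) / fs                      # positions in degrees
    xx, yy = np.meshgrid(coords, coords)
    theta = np.deg2rad(orientation_deg)
    phase = 2.0 * np.pi * freq_cpd * (xx * np.cos(theta) + yy * np.sin(theta))
    return mean_lum * (1.0 + contrast * np.sin(phase))

def michelson_contrast(lum):
    return (lum.max() - lum.min()) / (lum.max() + lum.min())

fs = 64.0                                           # samples/deg (hypothetical choice)
test = grating(1.0, fs, freq_cpd=4.0, contrast=0.3)
background = grating(1.0, fs, freq_cpd=4.0, contrast=0.2, orientation_deg=90.0)

# Test superimposed on the background, keeping the same mean luminance.
stimulus = test + background - 50.0
```

The explicit Nyquist check makes the sampling constraint visible: a frequency of half the sampling frequency or more cannot be represented on the grid.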

### 3.1. Success of "Model A" in Naturalistic Databases

Optimization of the width and amplitude of the Gaussian kernel, H, in each subband as well as the semisaturation parameters **b** in each subband of **Model A** led to the results in **Figure 6**. This was referred to as optimization phase I in Martinez-Garcia et al. (2018). Even though optimization phase II using the full variability in **b** led to higher correlations, here we restrict ourselves to optimization phase I because we want to keep the number of parameters small. Note that **b** has 2.5·10<sup>4</sup> elements but restricting to a single semisaturation per subband we only have 14 free parameters. In the optimization phase I only 1/25 of the TID database was used in the training.

As stated above, spatial-only intra-band relations lead to symmetric block-diagonal kernels. Optimization acted on the width and amplitude of these kernels per subband. Similarly, optimization led to bigger semisaturation for low frequencies except for the low-pass residual.

The performance of the resulting model on the naturalistic database is certainly good: compare the correlation of **Model A** with subjective opinion in **Figure 6** with that of the widely used Structural SIMilarity index (Wang et al., 2004), in red, considered here just as a useful reference. Given the improvement in correlation with regard to SSIM, one can certainly say that **Model A** is highly successful in predicting the visibility of uncontrolled distortions seen on naturalistic backgrounds.

### 3.2. Relative Failure of "Model A" With Artificial Stimuli

Despite the reasonable formulation of **Model A** and its successful performance in reproducing subjective opinion in large-scale naturalistic databases, a simple simulation with the kind of artificial stimuli presented in section 2.1 shows that it does not reproduce all the aspects of basic visual masking.

Specifically, we computed the response curves of the highlighted neurons in **Figure 4** for low-frequency and high-frequency tests like those illustrated in **Figure 2** as a function of their contrast. We considered four different contrasts for the background. Different orientations of the background (vertical, diagonal and horizontal) were also considered.

**Figure 7** presents the results of such simulation. This figure highlights some of the good features of **Model A**, but also its shortcomings.

On the positive side we have the following. First, the response increases with contrast as expected. Second, the response for the low frequency test is bigger than the response for the high frequency test (see the scale of the ordinate axis for the high frequency response). This is in agreement with the CSF. Third, the response saturates with contrast as expected. And also, increasing the contrast of the background decreases the responses.

However, contrary to what we can see when looking at the artificial stimuli, the response for the high-frequency test does not decay more on top of high-frequency backgrounds. While the decay behavior is qualitatively ok for the low-frequency test, it is definitely not ok for the high-frequency test. Compare the decays of the signal at the circles highlighted in red in **Figure 7**: the response of the sensors tuned to the high-frequency test decays by the same amount when they are presented on top of low-frequency backgrounds as when the background also has high frequency. The model is failing here despite its good performance in the large database.

### 3.3. Success of "Model B" With Natural and Artificial Stimuli

The starting point of our heuristic exploration with **Model B** is a straightforward translation of **Model A** into **Model B**. We will refer to this as **Model B naive**. This starting point consists of importing the values of the parameters from **Model A** except for the modulations depending on the scale and orientation. Following Watson and Solomon (1997) we assumed reasonable interaction lengths of one octave (for scales) and 30 degrees (for orientation). We used no extra weights to break the symmetry (C<sub>int</sub> = 1 is an all-ones matrix, and D<sub>**w**</sub> = I is the identity). And the values for **c** and **b** also come from **Model A**. The parameters of this **Model B naive** are shown in **Figure 8** (left panels). The idea of this starting point, **Model B naive**, is to reproduce the behavior of **Model A** and to build on from there.

Results in **Figure 9** (top) and **Figure 10** (left) show that **Model B naive** certainly reproduces the behavior of **Model A**: both the success in the natural image database and the relative failure with artificial stimuli.

On top of kernel generalization, there is a second relevant intuition: modifications in the kernel may be ineffective if the semisaturation constants are too high. Note that the denominator of Divisive Normalization, Equation 4, is a balance between the linear combination H · **e** and the vector **b**. This means that some elements of **b** should be reduced for the subbands where we want to act. Increasing the corresponding elements of vector **c**, leads to a similar effect.

With these intuitions one can start playing with H<sub>G</sub> and **b**. However, while the effect of the low-frequency is easy to reduce using the above ideas (thus solving the problem highlighted in red in **Figure 7**), the relative amplitude between the responses to low and high frequency inputs is also easily lost. This quickly ruins the low-pass CSF-like behavior and reduces the performance on the large-scale database. We should not lose the relative amplitudes of the responses of **Model A** to natural images (i.e., green lines in **Figure 5**) to keep its good performance. Unfortunately **Model A** is unstable under this kind of modification, making it difficult to tune. That is why it is necessary to include the constant [κ ⊙ (**b** + H<sub>G</sub> · **e**<sup>⋆</sup>) / **e**<sup>⋆</sup>] in **Model B** to control the dynamic range of the responses.

**Figure 8** (right panel) shows the fine-tuned parameters according to the heuristic suggested above: reduce semisaturation in certain bands and control the amplitude of the kernel in certain bands. This heuristic comes from the meaning of the blocks in the kernel and from the subbands that are activated by the different artificial stimuli. Note that we strongly reduced **b** and we applied bigger reductions for the high-frequency bands (which correspond to the sensors we want to fix). In the same vein we increased the values for the global scale of the kernels of high frequencies **c** while reducing substantially these amplitudes for low frequencies to preserve previous behavior, which was ok for low frequencies. Finally, and more importantly, we moderated the effect of the low frequencies in masking by using small weights for the low-frequency scales in **w**, while increasing the values for high frequency. Note how this reduces the columns corresponding to the low-frequency subbands in the final kernel H<sub>G</sub>, and the other way around for the high-frequency scales. This implies a bigger effect of high-frequency backgrounds in the attenuation of high-frequency sensors and reduces the effect of the low-frequency.
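The role of the column weights **w** can be illustrated with a three-sensor toy example. This is only a sketch under strong simplifications (one low-frequency sensor, two high-frequency sensors, a symmetric unit-norm kernel, and hand-picked energies and weights); it is not a simulation of the actual models.

```python
import numpy as np

def normalized_response(e, H, b):
    """Element-wise divisive normalization of the energies e."""
    return e / (b + H @ e)

# Sensors: 0 = low frequency, 1 = high frequency (tuned to the test),
#          2 = high frequency (driven by a high-frequency background).
G = np.array([[1.0, 0.3, 0.3],
              [0.3, 1.0, 0.3],
              [0.3, 0.3, 1.0]])
G = G / G.sum(axis=1, keepdims=True)        # unit-norm rows, symmetric interactions
b = np.full(3, 0.05)

test_low_bg = np.array([0.5, 0.2, 0.0])     # test + low-frequency background
test_high_bg = np.array([0.0, 0.2, 0.5])    # test + high-frequency background

# Symmetric kernel: both backgrounds attenuate the test sensor equally.
naive_low = normalized_response(test_low_bg, G, b)[1]
naive_high = normalized_response(test_high_bg, G, b)[1]

# Column weights w (right product with a diagonal matrix): small for low frequency.
w = np.array([0.2, 1.0, 1.8])
H_tuned = G * w[None, :]
tuned_low = normalized_response(test_low_bg, H_tuned, b)[1]
tuned_high = normalized_response(test_high_bg, H_tuned, b)[1]
```

With the symmetric kernel the response of the test sensor is the same for both backgrounds, whereas with the weighted columns the high-frequency background attenuates it more than the low-frequency one.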

Results in **Figure 9** show that this fine-tuning fixes the qualitative problem detected in **Model A**, which was also present in **Model B naive**. We successfully modified the response of high-frequency sensors: see the decay in the green circles compared to the behavior in the red circles. Moreover, we introduced no major difference in the low-frequency responses, which already were qualitatively correct.

Moreover, **Figure 10** shows that the fine-tuned version of **Model B** not only works better for artificial stimuli but it also preserves the success in the natural image database. The latter is probably due to the positive effect of setting the relative magnitude of the responses in **Model B** as in **Model A** using the appropriate K(**e**<sup>⋆</sup>) (setting the output κ for the average input **e**<sup>⋆</sup>).

It is interesting to stress that the solution to get the right qualitative behavior in the responses didn't require any extra weight in C<sub>int</sub>, which remained an all-ones matrix. We only operated row-wise and column-wise with the diagonal matrices D<sub>**c**</sub> and D<sub>**w**</sub>, respectively.

In summary, in order to fix the qualitative problems of **Model A** with masking of high-frequency patterns, the obvious use of generalized unit-norm inter-band kernels, as in Watson and Solomon (1997), was not enough: we had to consider the activation of the different subbands due to controlled artificial stimuli to tune the weights in the left and right diagonal matrices that modulate the unit-norm Gaussian kernels, H<sub>G</sub> = D<sub>**c**</sub> · [H<sub>**p**</sub> ⊙ H<sub>f</sub> ⊙ H<sub>φ</sub>] · D<sub>**w**</sub>. It was necessary to include high-pass filters in **c** and **w** (see **Figure 8**, fine-tuned) to moderate the effect of the low-frequency backgrounds on the masking of sensors tuned to high frequencies.

FIGURE 7 | Relative success and failures of Model A optimized on the large-scale database. Model-related stimuli such as the low-frequency and high-frequency tests shown on the top panel simplify the reproduction of results from model outputs and allow simple visual interpretation of results. In this simulation the response curves at the bottom panel are computed from the variation of the responses of the low-frequency and high-frequency sensors of the 4th layer highlighted in green in Figure 4. In each case, the variation of the response is registered as the contrast of the corresponding stimulus is increased. That is why we plot Δ**x**<sup>4</sup> vs. the contrast of the input, C. The different line styles represent the response for different contrast of the background, C<sub>b</sub>. Simple visual inspection of the stimuli is enough to discard some of the predicted curves (e.g., those in red circles): the low frequency backgrounds do not mask the high frequency test more than the high frequency backgrounds.

The need for these extra filters can be interpreted in an interesting way: there should be a balanced correspondence between the linear filters and the interaction neighborhoods in the nonlinearity. Note that different choices for the filters to model the linear receptive fields in the cortex imply different energy distributions over the subbands<sup>2</sup>. In this situation, if the energy in a certain subband is overemphasized by the choice of the filters, the interaction neighborhoods should discount this fact.

<sup>2</sup>For instance, analyzing images by choosing Gabors or different wavelets, and by choosing different ways to sample the retinal and the frequency spaces, definitely leads to different distributions of the energy over the subbands.

Of course, more accurate tuning of **Model B** on actual exhaustive contrast incremental data of different tests+backgrounds may lead to more sophisticated weights in C<sub>int</sub>. However, the simple toy simulation presented here using artificial stimuli with clear interpretation was enough (a) to discard **Model A**, (b) to point out the balance problem between the assumed linear cortical filters and the assumed interaction kernel in divisive normalization, and even (c) to propose an intuitive solution for the problem.

## 4. DISCUSSION

The relevant question is: is the failure of **Model A** something that we could have expected? And the unfortunate answer is, yes: the failure is not surprising given the (almost necessarily) imbalanced nature of large-scale databases. Note that it is not only that **Model A** is somewhat rigid<sup>3</sup> , the fundamental problem is that the specific phenomenon is not present in the database with enough frequency or intensity to force the model to reproduce it in the learning stage.

Of course, this problem is hard to solve because it is not obvious how to decide in advance the kind of phenomena (and the right amount of each one) that should be present in the database(s): as a result, databases are almost necessarily imbalanced and biased by the original intention of the creators of the database.

Here we made a full analysis (problem and route-to-solution) on texture masking, but note that focus on masking was just one important but arbitrary example to stress the main message. There are equivalent limitations affecting other parts of the optimized model that may come from the specific features of the database. For instance, the luminance-to-brightness transform (first layer in models A and B) is known to be strongly nonlinear and highly adaptive (Wyszecki and Stiles, 1982; Fairchild, 2013). It can be modeled using the canonical divisive normalization (Hillis and Brainard, 2005; Abrams et al., 2007) but also other alternative nonlinearities (Cyriac et al., 2016), and this nonlinearity has been shown to have relevant statistical effects (Laughlin, 1983; Laparra et al., 2012; Laparra and Malo, 2015; Kane and Bertalmio, 2016). However, when fitting layers 1st and 4th simultaneously to reproduce subjective opinion over the naturalistic database in Martinez-Garcia et al. (2018), even though we found a consistent increase in correlation, in the end, the behavior for the first layer turned out to be almost linear. The constant controlling the effect of the anchor luminance turned out to be very high. As a result, the nonlinear effect of the luminance is small. Again, one of the reasons for this result may be that the low dynamic range of the database did not require a stronger nonlinearity at the front-end given the rest of the layers. Similar effects could be obtained with the nonlinearities of color channels if the statistics is biased (MacLeod, 2003; Laparra and Malo, 2015).

The case studied here is not only a praise of artificial stimuli, but also a praise of interpretable models. When models are interpretable, it is easier to fix their problems from their failures on synthetic model-interpretable stimuli. For example, the solution we described here based on considering extra interaction between the sensors is not limited to divisive models of adaptation. Following Bertalmio et al. (2017), it may be also applied to other interpretable models such as the subtractive Wilson-Cowan equations (Wilson and Cowan, 1972; Bertalmio and Cowan, 2009). In this subtractive case one should tune the matrix that describes the relations between sensors. This kind of intuitive modification in the architecture of the models would have been more difficult, if possible at all, with nonparametric data-driven methods. In fact, there is an active debate about the actual scientific gain of non-interpretable models, such as blind regression (Castelvecchi, 2016; Bohannon, 2017).

<sup>3</sup> It is true that **Model A** only included intra-band relations, but note also that, even though we wanted to introduce more general kernels in **Model B** for future developments, the solution to the qualitative problem considered here basically came from including D<sub>**w**</sub> in H (not from sophisticated cross-subband weights). The other ingredients, **b** and **c**, were already present in **Model A**.

Finally, the masking curves considered in this paper also illustrate the fact that beyond the limitations of the database or the limitations of the architecture, the learning goal is also an issue. Note that, even using the same database and model, different learning goals may have different predictive power. For instance, other learning goals applied to natural images also give rise to cross-masking. Examples include information maximization (Schwartz and Simoncelli, 2001; Malo and Gutiérrez, 2006), and error minimization (Laparra and Malo, 2015). A systematic comparison between these different learning goals on the same database for a wide range of frequencies is still needed.

### 4.1. Consequence for Linear + Nonlinear Models: The Filter-Kernel Balance

Related to model interpretability, the results of our exploration with artificial stimuli suggest an interesting conclusion when dealing with linear+nonlinear models: matching linear filters and nonlinear interaction is not trivial. Remember the wavelet-kernel balance problem described at the end of the results. Therefore, in building these models, one should not take filters and kernels off the shelf.

One may take this balance problem as just another routine parameter to tune. However, this balance problem may actually question the nature of divisive normalization in terms of other models. For instance, in Malo and Bertalmio (2018) we show that the divisive normalization may be seen as the stationary solution of lower-level Wilson-Cowan dynamics that do use a sensible unit-norm Gaussian interaction between units. These kinds of questions are only raised, and solutions may be proposed, when testing interpretable models with model-related stimuli.

### 4.2. Is Using Naturalistic Databases Always a Problem?

Our criticism of naturalistic databases because of their possible imbalance and the problem in interpreting complicated stimuli in terms of models does not mean that we advocate an absolute rejection of these naturalistic databases. The case we studied here only suggests that one should not use the databases blindly as the only source of information, but in appropriate combination with well-selected artificial stimuli.

The use of carefully selected artificial stimuli may be considered as a safety-check of biological plausibility. Of course, our intention with the case studied here was not exhausting the search possibilities to claim that we obtained some sort of optimal solution. Instead, we just wanted to stress the fact that using the appropriate stimuli it is easy to propose modifications of the model that go in the right (biologically meaningful) direction, and still represent a competitive solution for the naturalistic database. This is an intuitive way to jump to other local minima which may be more biologically plausible in a very different region of the parameter space.

A sensible procedure would be alternating different learning epochs using natural and artificial data: while the large-scale naturalistic databases coming from the image processing community may enforce the main trends of the system, the specific small-scale artificial stimuli coming from the vision science community will fine-tune that first order approximation so that the resulting model has the appropriate features revealed by more specific experiments. In this context, standardization efforts such as those done by the CIE and the OSA organizations are really important to make this double-check. Examples include the data supporting the standard color observer (Smith and Guild, 1931; Stockman, 2017) and the standard spatial observer (Ahumada, 1996).

From a more general perspective, image processing applications do have a fundamental interest in visual neuroscience because these applications put into a broader context the relative relevance of the different phenomena described by classical psychophysics or physiology. For instance, one can check the variations in performance by testing vision models of different complexity, e.g., with or without this or that nonlinearity accounting for some specific perceptual effect/ability. This approach oriented to check different perceptual modules in specific applications has been applied in image quality databases (Watson and Malo, 2002), but also in other domains such as perceptual image and video compression (Malo et al., 2000a,b, 2001, 2006), or in perceptual image denoising and enhancement (Gutiérrez et al., 2006; Bertalmio, 2014). These different applications show the relative relevance of improvements in masking models with regard to better CSFs or including more sensible motion estimation models in front of better texture perception models.

### 4.3. Are All the Databases Created Equal?

The case analyzed in this work illustrates the effect of (naively) using a database where texture masking is probably underrepresented. The lesson to learn is that one has to take into account the phenomena for which the database was created, or, equivalently, the absence of specific phenomena to address.

With this in mind, one could imagine what kind of artificial stimuli are needed to improve the results. Or alternatively, which other naturalistic databases are required as a complementary check since they are more focused on other kinds of perceptual behavior.

Some examples to illustrate this point: databases with controlled observation distance or accurate chromatic calibration such as Pedersen (2015) are more appropriate to set the spatial frequency bandwidth of the models in achromatic and chromatic channels. Databases with spectrally controlled illumination pairs (Laparra et al., 2012; Gutmann et al., 2014; Laparra and Malo, 2015) are appropriate to address chromatic adaptation models. Databases with high-dynamic range (Korshunov et al., 2015; Cerda-Company et al., 2016) will be more appropriate to point out the need of the nonlinearity of brightness perception. Finally, databases where visibility of incremental patterns was carefully controlled in contrast terms (Alam et al., 2014) are the best option to fit masking models as opposed to generic subjectively-rated image distortion databases.

### 4.4. Final Remarks

Previous literature (Rust and Movshon, 2005) criticized the use of too complex natural stimuli in vision science experiments because the statistics of such stimuli are difficult to control and conclusions may be biased by the interaction between this poorly controlled input and the complexities of the neural model under consideration.

In line with such precautions on the use of natural stimuli, here we make a different point: the general criticism of the blind use of machine learning in large-scale databases (related to the proper balance in the data) also applies when using subjectively rated image databases to fit vision models. Using a variety of natural scenarios and distortions cannot guarantee that specific behaviors are properly represented, so they may remain hidden in the vast amount of data. In such a situation, models that seem to have the right structure may miss these basic phenomena. Instead of trying to explicitly include model-oriented artificial stimuli in the large database to fix the imbalance, it is easier to address the issue by using the model-oriented artificial stimuli in illustrative experiments specifically intended to test some parameters of the model.

The case study considered here suggests that artificial stimuli, motivated by specific phenomena or by features of the model, may help both (a) to stress the problems that remain in models fitted to imbalanced natural image databases, and (b) to suggest modifications in the models. Incidentally, this is also an argument in favor of interpretable parametric models as opposed to data-driven pure-regression models. A sensible procedure to fit general purpose vision models would be to alternate between different fitting strategies using (a) uncontrolled natural stimuli, but also (b) well-controlled artificial stimuli to check the biological plausibility at each point.

In conclusion, predicting subjective distances between images may be a trivial regression problem, but using these large-scale databases to fit plausible models may take more than that: for instance, a vision scientist in the loop doing the proper fine-tuning of interpretable models using the classical artificial stimuli.

# AUTHOR CONTRIBUTIONS

JM conceived the work, prepared the data and code for the experiments, and contributed to the interpretation of the results and manuscript writing. MM-G ran the experiments. MB contributed to the manuscript writing and to the criticism of blind machine-learning-like approaches.

# FUNDING

This work was partially funded by the Spanish and EU FEDER fund through the MINECO/FEDER/EU grants TIN2015-71537-P and DPI2017-89867-C2-2-R; and by the European Union's Horizon 2020 research and innovation programme under grant agreement number 761544 (project HDR4EU) and under grant agreement number 780470 (project SAUCE).

# ACKNOWLEDGMENTS

This work was conceived in La Fabrica de Hielo (Malvarrosa) after the reaction of Dr. C.A. Parraga to VanRullen (2017): scientists cannot be easily substituted by machines.

# SUPPLEMENTARY MATERIAL

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fnins.2019.00008/full#supplementary-material

### REFERENCES


Graham, N. (1989). Visual Pattern Analyzers. Oxford, UK: Oxford University Press.


**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2019 Martinez-Garcia, Bertalmío and Malo. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# An Extreme Value Theory Model of Cross-Modal Sensory Information Integration in Modulation of Vertebrate Visual System Functions

Sreya Banerjee<sup>1</sup>, Walter J. Scheirer<sup>1</sup> and Lei Li<sup>2</sup>\*

*<sup>1</sup> Department of Computer Science and Engineering, University of Notre Dame, Notre Dame, IN, United States, <sup>2</sup> Department of Biological Sciences, University of Notre Dame, Notre Dame, IN, United States*

We propose a computational model of vision that describes the integration of cross-modal sensory information between the olfactory and visual systems in zebrafish based on the principles of the statistical extreme value theory. The integration of olfacto-retinal information is mediated by the centrifugal pathway that originates from the olfactory bulb and terminates in the neural retina. Motivation for using extreme value theory stems from physiological evidence suggesting that extremes and not the mean of the cell responses direct cellular activity in the vertebrate brain. We argue that the visual system, as measured by retinal ganglion cell responses in spikes/sec, follows an extreme value process for sensory integration and the increase in visual sensitivity from the olfactory input can be better modeled using extreme value distributions. As zebrafish maintains high evolutionary proximity to mammals, our model can be extended to other vertebrates as well.

#### Edited by:

*Hagit Hel-Or, University of Haifa, Israel*

#### Reviewed by:

*Yin Tian, Chongqing University of Posts and Telecommunications, China Timothy Matthew Otchy, Boston University, United States*

\*Correspondence:

*Lei Li li.78@nd.edu*

Received: *29 August 2018* Accepted: *16 January 2019* Published: *26 February 2019*

#### Citation:

*Banerjee S, Scheirer WJ and Li L (2019) An Extreme Value Theory Model of Cross-Modal Sensory Information Integration in Modulation of Vertebrate Visual System Functions. Front. Comput. Neurosci. 13:3. doi: 10.3389/fncom.2019.00003*

Keywords: cross-modal sensory integration, statistical extreme value theory, classification, olfaction, vision, zebrafish

## 1. INTRODUCTION

The brain perceives the external world through an integration of stimuli received from different sensory modalities like vision, olfaction, and audition via the centrifugal pathway. A recent study taking inspiration from Cajal's original work on brain mapping (Gire et al., 2013) describes current knowledge of the centrifugal olfactory and visual pathways in mammalian species as being incomplete. While, for instance, the signaling pathways mediating brain feedback in human olfaction have been characterized, the origins and effects of signals to visual system functions remain to be examined. In this work, we seek to understand the modulation of the circuits between sensory modalities. A crucial observation from our own work is that, owing to olfacto-visual sensory integration, measures of visual performance or behavior in response to multi-sensory input are enhanced when a stimulus in one modality is ambiguous or undetermined. In fact, in all vertebrate species (e.g., teleosts, reptiles, birds, rodents, primates) examined thus far, the retina receives brain feedback through the centrifugal visual pathways (Harter and Aine, 1984; Mick et al., 1993; Gastiner et al., 2004). Depending on the species under consideration, the centrifugal pathways may originate from different parts of the brain, such as the pre-tectal cortex, isthmo-optic nucleus, thalamus, or olfactory bulb.

In zebrafish (Danio rerio), the olfacto-retinal centrifugal (ORC) pathway originates from terminalis neurons (TNs) in the olfactory bulb (OB) and terminates in the retina. TNs (**Figure 1A**) synthesize gonadotropin-releasing hormone (GnRH) as a major neurotransmitter. In the retina, TN fibers synapse with dopaminergic interplexiform cells (DA-IPCs), retinal ganglion cells (RGCs), and possibly other retinal cell types. Insights from relatively recent research (Li and Dowling, 2000; Huang et al., 2005) have shown that the function of the ORC pathway is directly regulated by the olfactory input. TN input alters GnRH signaling transduction and decreases dopamine release in the retina, thereby increasing outer retinal sensitivity and inner retinal activity (e.g., firing of ganglion cells). Specifically, the olfactory input mediated by the ORC pathway decreases the light threshold (i.e., the minimum light intensity required to fire evoked action potentials) of retinal ganglion cells, and thereby increases retinal sensitivity. Together, these effects mean that the olfactory input amplifies behavioral visual sensitivity (Maaswinkel and Li, 2003).

Zebrafish maintain high evolutionary proximity to mammals, and their retinas share great similarities with those of humans (e.g., structure, cellular organization, neural circuitry and signaling transmission) (Li, 2001; Vacaru et al., 2014). While much progress has been made to understand the anatomy of cross-modal circuitry in zebrafish, our knowledge of the underlying regulatory mechanism and physiological roles of centrifugal input to the retina is still in its nascent stage. Interestingly, Huang et al. (2005) demonstrate how the visual sensitivity in zebrafish is increased in the presence of olfactory signals whereas disrupting the ORC pathway impairs visual function. An important observation in that work reveals the importance of olfactory signals for vision. According to Huang et al. (2005), under normal conditions the minimum threshold light intensity to invoke a retinal ganglion cell response (measured in spikes/sec) in a dark-adapted zebrafish embryo may decrease 1–2 log units after olfactory stimulation. This demonstrates the dramatic impact of olfactory signals on vision.

Such a sudden gain in visual sensitivity through olfactory stimulation is an intriguing target for a computational model. We argue that visual sensitivity follows the statistical Extreme Value Theory (EVT). The mean visual sensitivity does not clearly explain the increased sensitivity due to olfactory signals since that scenario is able to sense a stimulus that is an extreme aberration from the norm, i.e., retinal ganglion cell responses without any olfactory stimulation. EVT lays solid groundwork for modeling as it is independent of the underlying distribution of data (all of the cell responses) and is only applicable to the tails of the distribution (the extremes) such that samples which have the least, or no possible, probability of occurrence under a central tendency model are distinguished, providing greater discrimination while requiring few statistical assumptions.

At a deeper level, one can ask the following question: is there a theoretical justification for using EVT for neural modeling? Our key insight is that the characterization of the firing behavior of a neuron as repeated integration/thresholding within a circuit suggests a positive answer to this question. Neurons are generally modeled as an electro-chemical process integrating input (ions) and eventually crossing a threshold whereby they fire and release ions. We posit that this inherently leads to an EVT-based model because the distribution of samples that exceed a threshold T likely yields an extreme value distribution (EVD). If all neurons use a fixed threshold T, the inputs to subsequent neurons in the circuit must follow an EVD, with each neuron integrating data from such a distribution and thresholding it. Thus, EVT can provide a plausible consistent multi-layer neuron model.

Beyond the merits of cultivating a better understanding of the operation of cross-modal sensory information integration in vertebrates, there is the possibility that an accurate computational model for this phenomenon could translate into a general algorithm for pattern recognition tasks in computer science. A direct application of this method lies in the development of novel information fusion algorithms that leverage inputs from multiple sensory modalities, i.e., vision and audition (Nagrani et al., 2018). Another practical application is the invention of innovative sensors capable of detecting changes in the environment and then re-configuring on the fly to change operational parameters and power consumption requirements. Currently, sensors are typically designed to sense a single type of physical property such as temperature, pressure, radiation, motion or proximity. But with a biologically-consistent model they could be remodeled to use multiple observations from the environment for more agile operation. The work presented in this article is in this spirit of leveraging biological observations to forward engineer algorithms that can operate in a general context.

In the following sections, we provide a detailed explanation of our work. Section 2 describes the single unit cell recording procedure from which our analysis is derived and the definition of EVT on which the proposed model is based. Section 3 goes on to describe the exact specification of that model. Section 4 describes our experiments and Section 5 presents the corresponding results. Finally, Section 6 concludes by putting this research into a larger biological and computational context.

## 2. MATERIALS AND METHODS

In this section, we explain the methods we use that are crucial for understanding our computational model of cross-modal sensory information integration. This includes the physical experiments that were conducted to collect the source data, as well as the formal elements of EVT.

#### 2.1. Single-Unit Recordings and Odor Stimulation

This research builds upon the previous work of Huang et al. (2005). An overview is provided in **Figure 1B**. Traces of RGC responses were recorded before and after odor stimulation (the sites of odor treatment are indicated by numbers 1 and 2 in **Figure 1B**), or when dopamine and/or GnRH signaling transduction was manipulated by the application of receptor agonists or antagonists (indicated by numbers 3–8 in **Figure 1B**). For electrophysiological recordings, zebrafish were anesthetized with 0.04% 3-amino benzoic acid and immobilized by intraperitoneal injections of 3–5 µl of 0.5 mg ml<sup>−1</sup> gallamine triethiodide dissolved in phosphate-buffered saline (PBS), and then placed on a wet sponge with most of the body covered by a wet paper towel. A slow stream of system water (distilled water with ocean salt added, 3 g gal<sup>−1</sup>, pH 7.0) was directed into the mouth to keep the fish oxygenated. The eye was slightly pulled out of its socket and held in place by glass rods, thus exposing the optic nerve. Single-unit RGC responses (determined by the spike waveform) were recorded from the optic nerve by using a tungsten microelectrode (resistance, 5–10 MΩ). Electrical signals were filtered with a band-pass filter between 30 and 3,000 Hz.

To test the effect of olfactory stimulation on visual sensitivity, we measured the light threshold required to evoke RGC responses before and after olfactory stimulation. Each fish was dark adapted for 30 min before the first RGC recording was made. The light stimuli (full-field dim white light, generated by a halogen bulb) were directed to the fish eye via a mirror system. The intensity of the unattenuated light beam (log I = 0) measured in front of the fish eye was 670 µW cm<sup>−2</sup> (Optical Power Meter, UDT Instruments, MD, USA). To determine the threshold, the light intensity was first set below threshold level (e.g., log I = −6.0) and then increased by 0.5 log-unit steps until the first light-evoked RGC responses were recorded (criteria, 20% above or below the rate of spontaneous firing). This light intensity was noted as the threshold. For each recording, 10 stimuli (600 ms flashes) were delivered at 3 s intervals.
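The threshold search described above can be sketched in a few lines of Python. This is only an illustration of the stepping rule, not the acquisition software used in the recordings: the function `find_light_threshold`, the callback `firing_rate_at`, and the toy cell are hypothetical names and values introduced here.

```python
def find_light_threshold(firing_rate_at, spontaneous_rate,
                         start_log_i=-6.0, step=0.5, max_log_i=0.0,
                         criterion=0.20):
    """Return the lowest tested log-intensity whose evoked rate differs from
    the spontaneous rate by at least `criterion` (20%), or None if none does."""
    n_steps = int(round((max_log_i - start_log_i) / step))
    for k in range(n_steps + 1):
        log_i = start_log_i + k * step
        rate = firing_rate_at(log_i)
        # "20% above or below the rate of spontaneous firing"
        if abs(rate - spontaneous_rate) >= criterion * spontaneous_rate:
            return log_i
    return None


def toy_cell(log_i, spontaneous=10.0):
    # Hypothetical cell: no light-evoked change below log I = -4.5.
    return spontaneous + (8.0 if log_i >= -4.5 else 0.0)


threshold = find_light_threshold(toy_cell, spontaneous_rate=10.0)
```

In a real recording, `firing_rate_at` would presumably average the response over the 10 flashes delivered at each intensity.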

Amino acids (methionine) were chosen to stimulate the olfactory neurons to activate the ORC pathway. Previous studies have demonstrated that amino acids are strong odors for zebrafish (Edwards and Michel, 2002). Among the amino acids tested in zebrafish, methionine produced the most obvious and dose-dependent responses on visual function (Maaswinkel and Li, 2003). In this study, odors (methionine, 0.5, 2, and 5 mM; total 8–10 µl per stimulation) were delivered to the nostril through a glass pipette. The light threshold required to evoke RGC responses was measured before the application of methionine, and was measured again within 10 s following the application of methionine. Thereafter, the measurement was repeated at 1 min intervals for 10 min. In total, 24 cells were recorded from 24 animals (1 cell per animal). Among these 24 animals, in response to odor stimulation, 17 showed increased visual sensitivity. Of the remaining 7 animals, 6 showed no changes in visual sensitivity and 1 showed decreased visual sensitivity.

#### 2.2. Extreme Value Theory

The extreme value theorem (Coles, 2001) that underpins EVT (**Figure 1C**) is very similar to the central limit theorem (Jaynes, 2003). Both theorems involve limiting behaviors of distributions of independent and identically distributed random variables as n, the number of random variables, tends to ∞. However while the central limit theorem is concerned with the behavior of entire distributions of random variables, the extreme value theorem only applies to the random variables at the tails of those distributions.

To state this difference precisely, if x<sub>1</sub>, x<sub>2</sub>, ..., x<sub>n</sub> represent i.i.d. random variables from a distribution, then the central limit theorem describes the limiting behavior of the suitably normalized sum x<sub>1</sub> + x<sub>2</sub> + ... + x<sub>n</sub>, while the extreme value theorem describes the limiting behavior of the extremes: max(x<sub>1</sub>, x<sub>2</sub>, ..., x<sub>n</sub>) or min(x<sub>1</sub>, x<sub>2</sub>, ..., x<sub>n</sub>) (Coles, 2001). EVT encompasses a number of distributions that apply to extrema.

An extreme value distribution is a limiting model for the maximums and minimums of a dataset. A limiting distribution simply models how large (or small) the data to be modeled will probably get. It is widely used in applications where there is interest in not only estimating the average, but also the maximum or minimum (Weibull, 1951, 1952; Galambos, 1994; Castillo et al., 2005). For example, when designing a dam, engineers might not only be interested in the average yearly flood which foretells the amount of water to be stored in the reservoir, but also in the maximum flood, the maximum intensity of earthquakes in the region during the past decade, or maximum strength of concrete to be used in building the dam to mitigate the possibility of a disaster. Castillo et al. (2005) list a number of applications where extreme value distributions can be used.

Now that the preliminaries have been covered, we can formally define an extreme value theorem (Fisher and Tippett, 1928):

Let (s<sub>1</sub>, s<sub>2</sub>, ..., s<sub>n</sub>) be a sequence of independent and identically distributed samples and let M<sub>n</sub> = max(s<sub>1</sub>, s<sub>2</sub>, ..., s<sub>n</sub>). If a sequence of pairs of real numbers (a<sub>n</sub>, b<sub>n</sub>) exists such that each a<sub>n</sub> > 0 and

$$\lim_{n \to \infty} P\left(\frac{M_n - b_n}{a_n} \le x\right) = F(x) \tag{1}$$

then if F(x) is a non-degenerate distribution function, it belongs to one of three extreme value distributions: Gumbel, Fréchet or Reverse Weibull.
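As a brief worked illustration of Equation (1), which is not part of the original derivation, suppose the samples are exponentially distributed with unit rate, so that each sample is at most s with probability 1 − e<sup>−s</sup>. Choosing a<sub>n</sub> = 1 and b<sub>n</sub> = ln n (a choice made only for this example), the normalized maximum converges to the Gumbel distribution:

$$P\left(M_n - \ln n \le x\right) = \left(1 - e^{-(x + \ln n)}\right)^n = \left(1 - \frac{e^{-x}}{n}\right)^n \xrightarrow[n \to \infty]{} \exp\left(-e^{-x}\right)$$

The first equality uses the independence of the samples and holds for x > −ln n; the limit follows from (1 − c/n)<sup>n</sup> → e<sup>−c</sup>.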

In contrast to the Gumbel or Fréchet distributions which are used for unbounded data, the Weibull distribution applies to data that are bounded from below, and its shape (k) and scale (λ) parameters are positive (the Reverse Weibull is the mirror image of the Weibull, obtained by negating the variable, and therefore applies to data bounded from above). Moreover, the Weibull is used for modeling minima. In order to use it for modeling data that fall in the upper tail of a distribution, a minor adjustment needs to be made by flipping the data such that maxima become minima before applying the Weibull distribution. The probability density function of the two-parameter Weibull distribution is given as:

$$f(x; \lambda, k) = \begin{cases} \frac{k}{\lambda} \left(\frac{x}{\lambda}\right)^{k-1} e^{-\left(\frac{x}{\lambda}\right)^k}, & \text{if } x \ge 0 \\ 0, & \text{if } x < 0 \end{cases} \tag{2}$$
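A minimal Python sketch of the two ingredients just described, the two-parameter Weibull density and the flipping of upper-tail data, is given below. The reflection about an upper bound used in `flip_for_upper_tail` is one possible way to turn maxima into minima and is an assumption of this sketch; the function names and the spike rates are hypothetical.

```python
import math


def weibull_pdf(x, lam, k):
    """Two-parameter Weibull density with scale lam > 0 and shape k > 0."""
    if x < 0:
        return 0.0
    if x == 0 and k < 1:
        return math.inf  # the density diverges at the origin when k < 1
    return (k / lam) * (x / lam) ** (k - 1) * math.exp(-((x / lam) ** k))


def flip_for_upper_tail(values, bound=None):
    """Reflect the data about an upper bound so that maxima become minima."""
    if bound is None:
        bound = max(values)  # assumption: use the largest observed value
    return [bound - v for v in values]


# Hypothetical upper-tail responses in spikes/s.
tail = [62.0, 75.0, 81.0, 90.0]
flipped = flip_for_upper_tail(tail)
```

After flipping, the largest response sits at zero and the data are bounded from below, which is the setting in which the Weibull applies.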

Note that there are other types of extreme value theorems one can make use of, such as the Pickands-Balkema-de Haan Theorem (Pickands, 1975). We limit ourselves to the theorem in Equation (1) in this work for the modeling of explicit tail data, but we will invoke the Pareto distribution, which is derived from the Pickands-Balkema-de Haan Theorem, in the modeling of the overall distribution. This is described in the next section.

## 3. A MODEL FOR CROSS-MODAL SENSORY INFORMATION INTEGRATION

Now that the relevant background has been introduced, we formally define our computational model for cross-modal sensory information integration (**Figure 2**). It is motivated by the following hypothesis: The tuning curves for RGC responses with and without olfactory signals are different. The extreme values in the tails of the distributions underlying those curves contribute to the determination of the visual sensitivity of zebrafish and should not be discarded as outliers.

The single unit recordings that we used for our experiments can be regarded as samples from a large population. One way to infer more about the population statistics is to extrapolate from the available samples by fitting distributions to them and sampling additional data. However, fitting a known distribution to available data can be difficult because of limited sample sizes, leaving one to make a "best guess" based on prior information about the behavior of large sample statistics. The best guess can come from making an assumption (for example, a null hypothesis as a starting place), or a more rigorous method of model selection using some metric.

If n represents the sample size, then n → ∞ as the number of RGC responses acquired from an animal grows while it senses its environment over time, and the distribution of mean RGC responses calculated throughout the animal's entire life cycle becomes Gaussian. This assumption directly follows from the central limit theorem. So perhaps the underlying distribution of measured responses is also Gaussian (a typical assumption in such modeling). Because our experiments involve two different sets of RGC responses, with and without olfaction, we can hypothesize that each set is normally distributed with different parameters. This null hypothesis can be tested through commonly used measures of normality; if it is rejected, we can look for alternative distributions using a model selection approach.

In statistical modeling, statisticians are often faced with the task of selecting a suitable model (a distribution, in our case) among a set of viable and finite candidates. There are several metrics or selection criteria one can use to determine the best explanatory model given the data. The Bayesian Information Criterion (BIC) (Schwarz et al., 1978; Neath and Cavanaugh, 2012) serves as a canonical method for model selection when priors are hard to state precisely. In a large sample setting the model found by BIC is equivalent to the candidate model that is a posteriori most probable, given the available data. It primarily amounts to maximizing the likelihood function separately for each candidate model and then choosing the one for which the log-likelihood is the largest after applying a penalty term that grows with the number of free parameters.
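To make the selection rule concrete, the following Python sketch computes BIC = p ln n − 2 ln L for two candidate distributions and keeps the one with the lowest score. It is a simplified stand-in for illustration only: just two candidates (Gaussian and exponential) are fit in closed form, and the response values are hypothetical.

```python
import math


def gaussian_loglik(data):
    """Maximized Gaussian log-likelihood (variance MLE divides by n)."""
    n = len(data)
    mu = sum(data) / n
    var = sum((x - mu) ** 2 for x in data) / n
    return -0.5 * n * (math.log(2.0 * math.pi * var) + 1.0)


def exponential_loglik(data):
    """Maximized exponential log-likelihood (rate MLE = 1 / mean)."""
    n = len(data)
    rate = n / sum(data)
    return n * math.log(rate) - rate * sum(data)


def bic(loglik, n_params, n_obs):
    """Bayesian Information Criterion; lower values indicate a better model."""
    return n_params * math.log(n_obs) - 2.0 * loglik


# Hypothetical responses in spikes/s, tightly clustered around 50.
rates = [48.0, 50.0, 52.0, 49.0, 51.0, 50.0, 47.0, 53.0]
candidates = {"gaussian": (gaussian_loglik, 2),
              "exponential": (exponential_loglik, 1)}
scores = {name: bic(fit(rates), n_params, len(rates))
          for name, (fit, n_params) in candidates.items()}
best = min(scores, key=scores.get)
```

For these clustered values the Gaussian obtains the lower BIC despite its extra parameter, because its likelihood is far higher than that of the exponential.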

To identify a good distribution to fit to non-normally distributed empirical data, we used a Matlab implementation of BIC<sup>1</sup>. A large set of valid parametric distributions was fit to the data and sorted using the output of the BIC metric to compare the goodness of the fits. The overall process returns a set of fitted distributions with their respective parameters. The list of distributions that were tried includes: Beta, Birnbaum-Saunders, Exponential, Extreme Value, Gamma, Generalized Extreme Value, Generalized Pareto, Inverse Gaussian, Logistic, Log-Logistic, Log-Normal, Nakagami, Rayleigh, Rician, t Location-Scale, and Weibull. It was assumed that all data were continuous.


Our initial assumption that the overall data representing RGC responses without olfactory signals are normally distributed was rejected by the normality tests at the 1% significance level (a detailed description of the normality tests is given in section 4). Using the BIC method, the distribution that best fit the overall RGC response data without olfactory stimulation was found to be the Generalized Pareto distribution (see **Supplementary Material**). Interestingly, this distribution is considered to be in the EVT family. The null hypothesis that the overall RGC responses with olfactory stimulus are normally distributed was not rejected at the 1% significance level by the normality tests, thus we fit a Gaussian distribution to those data.

<sup>1</sup> github.com/dcherian/tools/blob/master/misc/allfitdist.m

Suppose we have n observations, or number of RGC responses. If x<sub>i</sub> represents the i-th RGC response where i ∈ (1, 2, ..., n), the population statistics (mean µ and variance σ<sup>2</sup>) of the RGC response data with olfactory signal are found as the unbiased estimates of the distribution parameters and are given by the following equations:

$$\mu = \sum_{i=1}^{n} \frac{x_i}{n} \tag{3}$$

$$\sigma^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \mu)^2 \text{ for all } i \in (1, 2, 3, \dots, n) \tag{4}$$

The probability density function for the Generalized Pareto distribution with shape parameter k, scale parameter σ and threshold parameter τ is given by the following equation:

$$y = f\left(x \mid k, \sigma, \tau\right) = \left(\frac{1}{\sigma}\right) \left(1 + k \frac{x - \tau}{\sigma}\right)^{-1 - \frac{1}{k}} \tag{5}$$

We used maximum likelihood to estimate the parameters k and σ of the two-parameter Generalized Pareto distribution by fitting it to the RGC responses without olfaction<sup>2</sup>.
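For readers who want to experiment with Equation (5), the sketch below evaluates the Generalized Pareto density and draws samples by inverting its cumulative distribution function. The parameter values are placeholders rather than the fitted values (those are reported in the Supplementary Material), and the function names are hypothetical.

```python
import math
import random


def gpd_pdf(x, k, sigma, tau=0.0):
    """Generalized Pareto density with shape k, scale sigma, threshold tau."""
    z = (x - tau) / sigma
    if z < 0:
        return 0.0
    if k == 0:
        return math.exp(-z) / sigma  # exponential limit as k -> 0
    base = 1.0 + k * z
    if base <= 0:
        return 0.0  # beyond the upper end point when k < 0
    return base ** (-1.0 - 1.0 / k) / sigma


def gpd_sample(k, sigma, tau=0.0, rng=random):
    """Draw one value by inverting the Generalized Pareto CDF."""
    u = rng.random()  # uniform on [0, 1)
    if k == 0:
        return tau - sigma * math.log(1.0 - u)
    return tau + sigma * ((1.0 - u) ** (-k) - 1.0) / k


# Placeholder parameters chosen for illustration only.
rng = random.Random(1)
draws = [gpd_sample(0.5, 2.0, rng=rng) for _ in range(1000)]
```

Inverse-CDF sampling is used here only because it is short and exact for this distribution; it is not the sampling scheme described in the text.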

Having access to a model of the entire population facilitates generative sampling, which in turn allows for better tail modeling, and support for heightened visual sensitivity under certain conditions. Such generative processes in the brain may be responsible for a number of different phenomena, as they facilitate generalization in learning from limited sampling (Rao et al., 2002). We use random sampling and the Metropolis-Hastings algorithm, a Markov chain Monte Carlo (MCMC) sampling method (Hastings, 1970), to generate in total 100,000 simulated RGC responses with and without olfaction, respectively. The maximum (or the minimum) RGC response values within these samples follow an EVD. For our analysis, we concentrate only on the maximum RGC responses from the distributions described above because the lowest possible RGC response can be 0 spikes per second, indicating no response. Since the RGC responses (both with and without olfactory signals) can be assumed to be i.i.d. samples from continuous distributions that are bounded from below, the Weibull distribution is the correct choice for modeling them. We expect the Weibull cumulative distribution functions (CDFs) for RGC responses with and without olfaction to be widely separated and the threshold RGC response value for an olfactory signal to shift sensitivity leftward (see **Figure 3** for an example), indicating that the cells are now more sensitive. This effect, replicated within the model, would confirm in a more rigorous sense that the presence of olfactory signals increases the fish's sensitivity toward its surroundings and almost endows it with night vision that would be otherwise impossible in the absence of those signals.
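The sampling step can be illustrated with a short random-walk Metropolis-Hastings sampler followed by the extraction of block maxima. This is a sketch under stated assumptions rather than the procedure actually run: the target density (a Gaussian shape truncated at 0 spikes/s), its parameters, the proposal width, and the block size are all hypothetical, and fewer samples are drawn than in the text to keep the example fast.

```python
import math
import random


def metropolis_hastings(log_density, x0, n_samples, step, seed=0):
    """Random-walk Metropolis-Hastings sampler for a 1-D target density."""
    rng = random.Random(seed)
    x, logp = x0, log_density(x0)
    chain = []
    for _ in range(n_samples):
        proposal = x + rng.gauss(0.0, step)
        logp_new = log_density(proposal)
        # The Gaussian proposal is symmetric, so the acceptance probability
        # reduces to min(1, p(proposal) / p(x)).
        if logp_new >= logp or rng.random() < math.exp(logp_new - logp):
            x, logp = proposal, logp_new
        chain.append(x)
    return chain


def block_maxima(samples, block_size):
    """Largest value in each consecutive, non-overlapping block."""
    return [max(samples[i:i + block_size])
            for i in range(0, len(samples) - block_size + 1, block_size)]


def log_target(rate, mu=80.0, sd=10.0):
    # Hypothetical target: Gaussian-shaped density, zero below 0 spikes/s.
    if rate < 0:
        return -math.inf
    return -0.5 * ((rate - mu) / sd) ** 2


responses = metropolis_hastings(log_target, x0=80.0, n_samples=20_000, step=10.0)
maxima = block_maxima(responses, block_size=100)
```

Successive MCMC samples are correlated, so in practice one might thin the chain or use large blocks before treating the maxima as approximately independent.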

This process is analogous to the super-additivity phenomenon in the multi-sensory superior colliculus of higher-order organisms like mammals, where the presence of two weak sensory signals from the environment enhances the animal's neural response toward that environment (Holmes and Spence, 2005). The RGC threshold value represents an average RGC response for visual sensitivity, which changes throughout an animal's entire life cycle as it adapts to an ever-changing environment. However, the threshold varies (decreases or increases) in the presence or absence of a sensory stimulus other than visual input. This leads us to the possibility of the existence of some decision making mechanism in the fish's brain that toggles between two different distributions to adjust the tuning of the RGCs based on sensory input. Mathematically, this decision making procedure can be implemented as an indicator function I(x). If θ represents the parameters of an RGC distribution, i.e., the prior information available for RGC responses with or without olfactory signals, and x represents a new RGC response due to a stimulus from the environment such that x ∈ ℝ<sup>n</sup> (here n = 22, as we successfully retrieved 22 dimensions representing RGC spikes over time after stimulation of the olfactory neurons from the wet-bench experiments of Huang et al.; for further explanation, see section 4), then the indicator function I(x) can be represented as:

$$I\left(\mathbf{x}\mid\theta\right) = \begin{cases} 1, \text{if } \text{olfactory signal is present} \\ 0, \text{otherwise} \end{cases} \tag{6}$$

We speculate that the actual neural computation for the overall phenomenon is far more complex and is not restricted to just two modalities. However, given the recordings available for this study, we limit our model to just one particular circuit.

#### 3.1. Choices for an Indicator Function

For the indicator function, we address the following problem: given a set of vectors representing RGC responses in spikes/s with and without olfactory signals, is it possible for an indicator function to identify whether a new RGC response has been triggered after an olfactory signal or not? Our intuition behind using an indicator function is that such a process exists in some capacity in the brain where the presence of one signal enhances the other signal, thereby eliciting responses much different from the situation when the signal is not present. In essence, this task can be formulated as a binary classification problem with two possible outcomes: presence or absence of olfactory signals. Ideally, any discriminative supervised learning method can easily solve the problem. For our analysis, we examine the utility of support vector machines and an artificial neural network which, to some extent, mimics the functions of a biological neuron and is closer to the mechanism that the brain uses to process such signals. The motivation for choosing these particular classifiers is their simplicity: we desire an indicator function with an efficient training regime that can operate over thousands of multi-dimensional data points, such as a large collection of RGC responses. Other classifiers (e.g., decision trees, random forests, logistic regression) may also be suitable.

#### 3.1.1. Support Vector Machine

The Support Vector Machine (SVM) is a supervised learning approach that is widely used for classification and regression analysis (Cortes and Vapnik, 1995). Since our data is numeric and high-dimensional, SVM is a natural choice as it has been found to be extremely efficient in high-dimensional spaces for large-scale classification problems. SVMs use a subset of training points in the decision function, which form the "support vectors" that define the decision boundary between classes. As a consequence, they have been found to be memory efficient and to have fast execution times if the data are normalized. For analysis, we assumed our data to be linearly separable and used a linear SVM formulation. We normalize all data using min-max normalization.

<sup>2</sup>For finding the maximum likelihood estimates of the Generalized Pareto distribution, we used the Matlab function gpfit, which only returns the estimates of the shape k and scale σ parameters of a two-parameter Generalized Pareto distribution. The function makedist was then used to create a probability distribution object reflecting where samples are taken from, using the parameters k and σ.

An SVM model with a set of labeled training data tries to find an optimal hyperplane for classifying new samples based on some constraints. Given a training dataset D = {(x<sub>i</sub>, y<sub>i</sub>)} of size m, where each x<sub>i</sub> = (x<sub>i1</sub>, x<sub>i2</sub>, ..., x<sub>in</sub>) is an n-dimensional feature/attribute vector and y<sub>i</sub> = −1 or +1 is its label, the SVM classifier can formally be defined as the following quadratic optimization problem:

$$\min \|\boldsymbol{w}\|^2 \quad \text{s.t.} \quad y_i(\boldsymbol{w}^T \boldsymbol{x}_i + b) \ge 1 \text{ for all } i \tag{7}$$

where w = (w1, w2, ..., wn) is a weight vector and b is the bias.

An important consideration when training an SVM model is the parameter C that dictates the trade-off between having a wide margin and correctly classifying training data.

$$\min \|\boldsymbol{w}\|^2 + C \sum_{i=1}^{m} \xi_i \quad \text{s.t.} \quad y_i(\boldsymbol{w}^T \boldsymbol{x}_i + b) \ge 1 - \xi_i, \ \xi_i \ge 0 \text{ for all } i \tag{8}$$

A larger value of C penalizes misclassified training samples more heavily, which reduces their number but makes the model more prone to overfitting.
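To make the role of C concrete, the sketch below evaluates the soft-margin objective of Equation (8) for a fixed hyperplane, using the smallest slack that satisfies the constraints, max(0, 1 − y<sub>i</sub>(w<sup>T</sup>x<sub>i</sub> + b)). The function name and the toy data are hypothetical; this only evaluates the objective and is not a solver.

```python
def soft_margin_objective(w, b, C, samples, labels):
    """Evaluate ||w||^2 + C * sum(xi_i), with xi_i the smallest feasible slack."""
    slack_total = 0.0
    for x, y in zip(samples, labels):
        margin = y * (sum(wj * xj for wj, xj in zip(w, x)) + b)
        # The constraints y_i (w.x_i + b) >= 1 - xi_i and xi_i >= 0 give
        # xi_i = max(0, 1 - margin) as the smallest admissible slack.
        slack_total += max(0.0, 1.0 - margin)
    return sum(wj * wj for wj in w) + C * slack_total


# Toy 1-D data (hypothetical): the last point of class +1 sits on the wrong side.
samples = [[2.0], [1.0], [-1.0], [-2.0], [-0.5]]
labels = [1, 1, -1, -1, 1]
w, b = [1.0], 0.0

small_c = soft_margin_objective(w, b, 0.1, samples, labels)
large_c = soft_margin_objective(w, b, 10.0, samples, labels)
```

With the same hyperplane, the single violating sample contributes little to the objective when C is small and dominates it when C is large, which is why a large C pushes the optimizer toward fitting every training point.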

#### 3.1.2. Artificial Neural Network

We also consider a multi-layer perceptron (MLP) neural network as the indicator function. Similar to SVM, MLP is a supervised learning algorithm that learns a non-linear mapping from input x ∈ R<sup>n</sup>, where n represents the number of dimensions, to y ∈ R<sup>m</sup>, where m can be any number m < n, depending on the number of classes in the training dataset. However, unlike SVMs, a simple MLP includes one or more hidden layers consisting of artificial neurons. The hidden layers act as feature detectors and gradually discover the salient features of the training data through backpropagation (Rumelhart et al., 1986; Werbos, 1990). Each neuron includes a non-linear and differentiable activation function and is connected to every neuron in the previous layer, exhibiting a high degree of connectivity between layers. As a result, due to the distributed nature of non-linearities, the learning process is difficult to visualize. However, neural networks are usually assumed to be non-parametric functions, i.e., they can be used as function approximators without any prior information about the distribution of the input or training dataset, and hence are well suited to represent the indicator function. If x represents a p-dimensional input vector such that x = (x<sub>1</sub>, x<sub>2</sub>, x<sub>3</sub>, ..., x<sub>p</sub>) with labels y ∈ {+1, −1} and g : R → R as the activation function, then the equation for a single neuron is given by:

$$y = g\left(b + \sum_{i=1}^{p} w_i x_i\right) \tag{9}$$

where w = (w<sub>1</sub>, w<sub>2</sub>, w<sub>3</sub>, ..., w<sub>p</sub>) represents the weights learned through backpropagation.
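The single-neuron computation in Equation (9) can be sketched as follows. The choice of tanh as the default activation g, the function name, and the numeric values are assumptions made for illustration; the text does not state which activation was used.

```python
import math


def neuron_output(x, w, b, g=math.tanh):
    """Compute y = g(b + sum_i w_i * x_i) for a single neuron."""
    if len(x) != len(w):
        raise ValueError("x and w must have the same length")
    pre_activation = b + sum(wi * xi for wi, xi in zip(w, x))
    return g(pre_activation)


# Hypothetical 3-dimensional input, weights and bias.
x = [0.2, 0.5, 0.9]
w = [1.0, -2.0, 0.5]
b = 0.1
y = neuron_output(x, w, b)
```

Passing a different callable as `g` (for example a logistic function) changes the non-linearity without touching the weighted sum.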

## 4. EXPERIMENTS

#### 4.1. Data Collection and Representation

As stated above, the first step in building a computational model of this nature is to attempt to define the underlying distribution of the data one is trying to explore. We use the data from a study by Huang et al. (2005) for our analysis. The data consist of single-unit RGC responses measured in spikes/s before and after olfactory stimulation under varying light intensity (see **Figure 2** from Huang et al.). In terms of raw data organization, it is primarily a histogram with the x-axis representing the visual sensitivity of fish binned into approximately 22 positions representing a timestamp and their corresponding frequency measured in spikes/s on the y-axis. Under normal conditions, the minimum threshold light intensity to evoke a retinal ganglion cell response in a dark-adapted zebrafish embryo is 10<sup>−5</sup>. However, with olfactory stimulation with methionine, the threshold light intensity decreases to 10<sup>−6</sup>. We calculated the minimum RGC response threshold to be 75 spikes/s. Hence, the data can be separated into two parts: one with olfactory stimulus and the other without it. In total, there were 22 RGC responses across time with olfactory stimulus and 29 without olfactory stimulation.

#### 4.2. Experiment 1

The first experiment was to check whether the raw data we collected from the experiments confirms our hypothesis that the EVT can be applied to build an accurate model. We posit that since the RGC responses with olfactory stimulation represent extreme aberration from the baseline and are nonnegative integers, the Weibull distribution is the right candidate for modeling our data. But how differently does our data fit with the Weibull distribution vs. a central tendency model like the Gaussian distribution? We explore this by comparing the CDFs of the Weibull and Gaussian distributions with parameters derived from our data.
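The comparison described here amounts to evaluating two cumulative distribution functions at the same response value. A minimal sketch, assuming the standard two-parameter Weibull and Gaussian forms, is given below; the parameter values are hypothetical and are not the ones fitted in the study.

```python
import math


def weibull_cdf(x, shape, scale):
    """F(x) = 1 - exp(-(x / scale)^shape) for x > 0, and 0 otherwise."""
    if x <= 0:
        return 0.0
    return 1.0 - math.exp(-((x / scale) ** shape))


def gaussian_cdf(x, mean, std):
    """Normal CDF expressed through the error function."""
    return 0.5 * (1.0 + math.erf((x - mean) / (std * math.sqrt(2.0))))


# Hypothetical parameters, chosen only to illustrate the comparison.
threshold = 75.0  # spikes/s, the response threshold quoted in Section 4.1
p_weibull = weibull_cdf(threshold, shape=4.0, scale=100.0)
p_gauss = gaussian_cdf(threshold, mean=100.0, std=10.0)
```

Evaluating both functions on a grid of response values and plotting them gives curves of the kind compared in this experiment; how far apart they lie depends entirely on the fitted parameters.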

#### 4.3. Tests of Normality and Synthetic Data Generation

Using the data collected from wet-bench experiments as a basis, we simulated an expansive data space by fitting distributions over the original data. The goal was to generate as much evidence as possible for statistical inference. However, in order to fit distributions to generate more samples from the existing data, we need to make some assumptions about the underlying distribution. Initially, as described above in section 3, we assumed a null hypothesis that the distribution of RGC responses in a zebrafish throughout its entire lifecycle is Gaussian. Since our work involves two different sets of RGC responses (one with olfactory stimulus and the other without it), under this assumption the distributions underlying each should be Gaussian with different parameters. To test this, we performed several commonly used tests of normality: the Kolmogorov-Smirnov test (Massey, 1951), the Shapiro-Wilk test (Shapiro and Wilk, 1965), and the Lilliefors test (Lilliefors, 1967, 1969; Conover and Conover, 1980)<sup>3</sup>. Due to the small sample size (n = 22 or 29), we preferred the Shapiro-Wilk test over Kolmogorov-Smirnov and Lilliefors. For datasets that failed the normality test, the BIC selection criterion was deployed to find another distribution with the best fit. Afterwards, we generated 100,000 non-negative samples of RGC responses from the respective distributions for further analysis.
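Model selection by BIC can be sketched as follows: each candidate distribution is fitted by maximum likelihood, and the candidate with the lowest value of k ln(n) − 2 ln(L) is kept. The two candidates used here (Gaussian and exponential) and the synthetic data are stand-ins chosen because their likelihoods have closed forms; the set of candidates actually compared in the study is not listed in the text.

```python
import math
import random


def bic(log_likelihood, n_params, n_samples):
    """Bayesian information criterion: k ln(n) - 2 ln(L); lower is better."""
    return n_params * math.log(n_samples) - 2.0 * log_likelihood


def gaussian_log_likelihood(data):
    """Maximized Gaussian log-likelihood (MLE mean and variance)."""
    n = len(data)
    mean = sum(data) / n
    var = sum((x - mean) ** 2 for x in data) / n  # MLE variance divides by n
    return -0.5 * n * (math.log(2.0 * math.pi * var) + 1.0)


def exponential_log_likelihood(data):
    """Maximized exponential log-likelihood (MLE rate = 1 / sample mean)."""
    n = len(data)
    mean = sum(data) / n
    return -n * (math.log(mean) + 1.0)


# Synthetic, strongly skewed data (hypothetical, not from the study).
rng = random.Random(1)
data = [rng.expovariate(1.0 / 30.0) for _ in range(200)]

scores = {
    "gaussian": bic(gaussian_log_likelihood(data), 2, len(data)),
    "exponential": bic(exponential_log_likelihood(data), 1, len(data)),
}
best = min(scores, key=scores.get)
```

The penalty term k ln(n) is what distinguishes BIC from a plain likelihood comparison: a candidate with more parameters must improve the fit enough to pay for them.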

#### 4.4. Experiment 2

The second experiment was to check whether the points we sampled confirm our hypothesis that the EVT can be applied in a generative scenario. In order to verify this, we fit a Weibull distribution to the top n RGC responses to understand how the curves vary when olfactory input is present as opposed to when it is not. The value n was selected via empirical observation. The sampling methods used were random sampling and MCMC sampling. Since EVDs like the Weibull only apply to samples at the tails of distributions, the fit is independent of the underlying distribution of the data as a whole. Hence, irrespective of the overall data distribution and sampling process, the results of Experiment 2 for the Weibull distributions for the top n responses should ideally be similar to Experiment 1. We expect the Weibull cumulative distribution functions for data with and without olfactory stimulus to be widely separated, with the curve for data with olfaction shifting leftward, giving higher probability scores to RGC responses that would be improbable under conditions where olfaction is not engaged.
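One way to read "fit a Weibull distribution to the top n RGC responses" is as a maximum likelihood fit of a two-parameter Weibull to the n largest samples. The sketch below makes that assumption explicit. The function names, the bisection solver, and the synthetic data are illustrative only and should not be taken as the authors' implementation.

```python
import math
import random


def fit_weibull_mle(values, tol=1e-10):
    """Two-parameter Weibull MLE for strictly positive data; returns (shape, scale)."""
    if len(values) < 2 or min(values) <= 0:
        raise ValueError("need at least two strictly positive values")
    top = max(values)
    z = [v / top for v in values]  # rescale so powers z**k cannot overflow
    logs = [math.log(v) for v in z]
    mean_log = sum(logs) / len(z)
    if mean_log == 0.0:
        raise ValueError("values must not all be identical")

    def score(k):
        # Likelihood equation for the shape parameter; its root is the MLE.
        powers = [v ** k for v in z]
        weighted = sum(p * l for p, l in zip(powers, logs))
        return weighted / sum(powers) - 1.0 / k - mean_log

    lo, hi = 1e-6, 1.0
    while score(hi) < 0.0:  # grow the bracket until it encloses the root
        hi *= 2.0
    while hi - lo > tol * hi:  # bisection; score() is increasing in k
        mid = 0.5 * (lo + hi)
        if score(mid) < 0.0:
            lo = mid
        else:
            hi = mid
    shape = 0.5 * (lo + hi)
    scale = top * (sum(v ** shape for v in z) / len(z)) ** (1.0 / shape)
    return shape, scale


def fit_top_n(values, n):
    """Fit a Weibull to the n largest values only."""
    tail = sorted(values, reverse=True)[:n]
    return fit_weibull_mle(tail)


# Synthetic responses (hypothetical): Weibull with scale 100 and shape 3.
rng = random.Random(0)
responses = [rng.weibullvariate(100.0, 3.0) for _ in range(5000)]
full_shape, full_scale = fit_weibull_mle(responses)
tail_shape, tail_scale = fit_top_n(responses, 50)
```

Because the top-n values occupy a narrow range far from zero, the tail fit generally has a much larger shape parameter than a fit to the whole sample, so the fitted curve depends strongly on the choice of n.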

#### 4.5. Experiment 3

Additionally, we wanted to corroborate whether we can define a deterministic indicator function such that given some RGC response it is possible for the function to identify if an olfactory stimulus is present or not. In essence, this task becomes a binary classification problem where the presence of olfactory signals can be labeled as 1 and the absence as 0. As described above in section 3, we use a linear SVM or a multi-layer perceptron as our binary classifier. For consistency in the operation of the indicator function, we limit the dimensionality of all vectors to the dimensionality of RGC responses with olfactory stimulus (n = 22). We use the 100,000 samples we generated for each scenario (with olfactory stimulus and without olfactory stimulus), dividing the sets into 80% training and 20% testing partitions.
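The evaluation protocol (an 80/20 split of labeled 22-dimensional vectors, followed by a test-set accuracy) can be sketched as below. The data are placeholders with made-up class means, and the decision rule is a simple threshold on the mean response that stands in for the linear SVM or MLP; all names are hypothetical.

```python
import random


def train_test_split(samples, labels, train_fraction=0.8, seed=0):
    """Shuffle (sample, label) pairs and split them into train and test lists."""
    pairs = list(zip(samples, labels))
    random.Random(seed).shuffle(pairs)
    cut = round(train_fraction * len(pairs))
    return pairs[:cut], pairs[cut:]


def accuracy(predict, pairs):
    """Fraction of pairs whose predicted label matches the true label."""
    return sum(1 for x, y in pairs if predict(x) == y) / len(pairs)


# Placeholder data: 22-dimensional vectors with made-up class means.
rng = random.Random(42)
DIM = 22
with_stimulus = [[rng.gauss(60.0, 5.0) for _ in range(DIM)] for _ in range(100)]
without_stimulus = [[rng.gauss(90.0, 5.0) for _ in range(DIM)] for _ in range(100)]
samples = with_stimulus + without_stimulus
labels = [1] * 100 + [0] * 100  # 1 = olfactory stimulus present, 0 = absent

train, test = train_test_split(samples, labels)


def class_mean(pairs, label):
    """Average of the per-vector means for one class."""
    rows = [x for x, y in pairs if y == label]
    return sum(sum(x) / len(x) for x in rows) / len(rows)


# Stand-in decision rule (NOT an SVM or MLP): threshold on the mean response,
# placed halfway between the two class means estimated on the training set.
threshold = 0.5 * (class_mean(train, 1) + class_mean(train, 0))
low_class = 1 if class_mean(train, 1) < class_mean(train, 0) else 0


def predict(x):
    return low_class if sum(x) / len(x) < threshold else 1 - low_class


test_accuracy = accuracy(predict, test)
```

The threshold is estimated from the training partition only, so the test accuracy is computed on vectors the rule has not seen.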

In summary, the entire modeling effort is encapsulated in the following steps (also depicted in **Figure 2**):

1. **Data collection and representation.** This step consists of collecting and representing data based on the wet-bench experiments for control (without any stimulation) and experimental (with olfactory stimulation) zebrafish as a histogram and collecting the statistics for further analysis.


## 5. RESULTS

#### 5.1. Experiment 1

**Figure 3** depicts the result of Experiment 1, which was conducted to examine the difference between central tendency modeling and EVT modeling. The data for this experiment were what was directly collected from the wet-bench experiments for both control (without olfaction) and experimental (with olfaction) zebrafish.

As can be seen in the figure, with olfactory stimulation the visual sensitivity in zebrafish shifts leftward, making the RGC responses below the normal threshold of 75 spikes/s probable, as indicated by the physiology experiments of Huang et al. (2005). Moreover, if we look closely, the Weibull distributions (represented by the red and blue solid and dashed lines) are a better fit to the data because the RGC responses with olfactory stimulation represent a set of extreme responses as opposed to RGC responses without any stimulation. If we fix our attention at the threshold RGC response at 75 spikes/s, the Weibull curves provide a better explanation for getting an RGC response below 75 spikes/sec for olfactory stimulation in comparison to the normal distribution, which makes those values more improbable. In other words, the tuning becomes more sensitive if we use the Weibull distribution. We plotted the curves by varying n (n = 3, 8) of the top-n RGC responses. The tuning becomes more sensitive as n becomes smaller.

<sup>3</sup>We used the following Matlab implementations of the normality tests: lillietest (for the Lilliefors test), swtest (from Matlab central for the Shapiro-Wilk test), kstest (for the one-sample Kolmogorov-Smirnov test). Each of these tests returns a decision (1 or 0) for the null hypothesis that the data comes from a distribution in the normal family, against the alternative that it does not come from such a distribution. A result of 1 rejects the null hypothesis at the 5% significance level (default). For our experiments, we set the significance level to 1%.

#### 5.2. Tests of Normality and Synthetic Data Generation

The null hypothesis that the data without olfactory stimulus are normally distributed was rejected at the 1% significance level for all of the tests. However, the other assumption of normality for data with olfactory stimulus was not rejected at the 1% significance level. Based on these results, we fit a Gaussian distribution to the data with olfactory stimulus. Using the BIC selection criterion to find the best fit, the distribution for the data without olfactory stimulus was determined to be Generalized Pareto. We then collected non-negative samples simulating RGC responses via random sampling or MCMC sampling (100,000 samples from each sampling method), to be used for fitting a Weibull distribution to the top n samples in order to understand how the curves vary when olfactory input is present (i.e., when the overall distribution is Gaussian) as opposed to when it is not (i.e., when the overall distribution is Pareto).
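A minimal sketch of the random-sampling step is given below. It draws from a two-parameter Generalized Pareto distribution by inverse-transform sampling and from a Gaussian by redrawing negative values. The parameter values are hypothetical, and redrawing is only one assumed way to obtain non-negative Gaussian samples; the text does not say how non-negativity was enforced.

```python
import math
import random


def sample_generalized_pareto(rng, shape, scale):
    """Inverse-transform draw from a Generalized Pareto with location 0."""
    u = rng.random()  # uniform on [0, 1), so 1 - u is never zero
    if shape == 0.0:
        return -scale * math.log(1.0 - u)
    return scale * ((1.0 - u) ** (-shape) - 1.0) / shape


def sample_nonnegative_gaussian(rng, mean, std):
    """Redraw until the Gaussian sample is non-negative (simple rejection)."""
    while True:
        x = rng.gauss(mean, std)
        if x >= 0.0:
            return x


rng = random.Random(7)
N = 10_000  # the study reports 100,000; fewer are drawn here to keep the sketch fast

# Hypothetical parameters, not the fitted values from the study.
without_stimulus = [sample_generalized_pareto(rng, shape=0.1, scale=20.0) for _ in range(N)]
with_stimulus = [sample_nonnegative_gaussian(rng, mean=80.0, std=25.0) for _ in range(N)]
```

With location 0, the Generalized Pareto draw is non-negative by construction, so only the Gaussian side needs an explicit non-negativity step.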

#### 5.3. Experiment 2

**Figures 4**, **5** show the models of visual sensitivity calculated over the simulated data from random sampling and MCMC sampling<sup>4</sup>. Similar results are achieved for both sampling methods. An important observation to note here is that tuning is always more sensitive when olfactory stimulus is present. The values of n in this experiment are much larger (n = 50, 250) due to the increased availability of data, but still represent a small number of points from the tail of the overall distribution. The CDF curves for data with and without olfactory stimulation are widely separated and the width of separation increases as n grows larger. This reflects how the visual sensitivity threshold can change throughout a fish's life cycle as it is exposed to an ever-changing environment and acquires new RGC responses for modulating its internalized model of visual sensitivity. Note that zebrafish build new cells within their nervous systems via a neurogenesis process, meaning the number of responses available at a point in time can change in a non-stimulus-dependent way. Our proposed model supports this phenomenon.

#### 5.4. Experiment 3

With respect to testing the possible indicator functions I(x), we began by considering a linear binary SVM classifier trained using 80,000 generated samples and tested using 20,000 generated samples. With random sampling, we achieved a testing accuracy of 95.5 (± 0.163) percent, but with MCMC sampling accuracy decreased to 93.925 (± 0.123) percent. With a multi-layer perceptron classifier, we obtained an accuracy of 95.25 (± 0.007) percent using the same training-testing split and data from MCMC sampling<sup>5</sup>. The success of this experiment establishes that the two different classes of RGC responses are separable. Thus it is possible, in a statistical learning sense, to have a mechanism to toggle between RGC tuning configurations when an olfactory stimulus is present and when it is not. One possibility for why the classification was successful in these experiments is that the indicator function implicitly learns that the data are distributed differently in the two classes (Generalized Pareto for data without olfactory stimulus and Gaussian for the data with olfactory stimulus). That the two classes of data are distributed differently lends further support to our hypothesis that an indicator function is involved in the integration of cross-modal sensory information: the distributional difference facilitates a very straightforward pattern recognition process to separate the classes.

<sup>4</sup>We ran experiments 1 and 2 ten times. In each of those trials, the leftward shift of the distribution after olfactory stimulation was preserved.

<sup>5</sup>Each of these experiments was run ten times. The numbers in parentheses represent standard error.

FIGURE 4 | Experiment 2. Cumulative Distribution Functions for zebrafish with and without olfactory stimulation at light intensity 10<sup>−5</sup> and 10<sup>−6</sup>, respectively, with data points generated through random sampling. The curves labeled "Control" in the legend describe the Weibull distributions (as represented by the solid blue and red lines) without olfactory stimulus. As can be seen, tuning is most sensitive when an olfactory stimulus is involved. Best viewed in color.

FIGURE 5 | Experiment 2. Cumulative Distribution Functions for zebrafish with and without olfactory stimulation at light intensity 10<sup>−5</sup> and 10<sup>−6</sup>, respectively, with data points generated through MCMC sampling. The curves labeled "Control" in the legend describe the Weibull distributions (as represented by solid blue and red lines) without olfactory stimulus. The result is very similar to random sampling: the tuning is more sensitive when an olfactory stimulus is involved. Best viewed in color.

## 6. DISCUSSION

As vertebrates evolved, sensory organs adapted to the ever-changing environment. In many vertebrate species, at any given time the brain integrates and processes multisensory information. In humans, for example, the functions of the olfactory and visual systems are influenced by sensory input from each organ. Most mammals have specialized multimodal neurons in the superior colliculus that are capable of integrating multiple stimuli from the environment and providing a uniform reaction. In lower vertebrates such as fish, however, such advanced mechanisms are absent. In zebrafish, the integration of sensory information from the olfactory system facilitates signal transduction in the visual pathway. As a consequence, retinal neural activities such as the firing of retinal ganglion cells are increased. This is particularly important for wild-type animals that live under natural environmental conditions. For example, zebrafish normally mate in the early morning hours before the sun comes up, during which time the light illumination is low. It is conceivable that under such conditions stimulation of olfactory neurons may increase visual sensitivity and thereby facilitate the process of mating. While the system mechanisms underlying this olfacto-retinal sensory integration have been well characterized, statistical models that describe the phenomenon at the cellular level have not been described. In this paper, we have described a computational model that supports the research into how the visual system integrates information from other sensory modalities.

The idea of building computational models for multisensory input has been explored previously (Anastasio et al., 2000; Driver and Noesselt, 2008; Angelaki et al., 2009). When it comes to determining the statistical relationship between sensory responses among different sensory organs, the Bayesian model has been a preferred framework. However, almost all of the existing work focuses on higher vertebrates such as mammals. Angelaki et al. (2009) attempted to reconcile the difference between the traditional physiological studies on multisensory integration and computational and psychological studies using Bayesian inference on the visual-vestibular system for the perception of self-motion in macaques. They describe how the multimodal neurons represent probabilistic information defined by multiple stimuli and propose that special neurons accomplish near-optimal cue integration through a linear summation of input signals.

With respect to models of simpler animals, Wessnitzer and Webb (2006) explore multimodal sensory integration for navigation from the physiological perspective of the insect's nervous system. In zebrafish, using a similar linear model (Hughes et al., 1998) the contribution of different types of cone photoreceptor cells to photopic spectral visual sensitivity was determined. This was done by re-modeling the electroretinographic data recorded from the cornea, which include absorbance spectrum of four types of cone photoreceptor cells (cone cells that are sensitive to ultra-violet light, blue light, green light, and red light, respectively) given as the visual pigment template for the appropriate maximum absorption, neural signals obtained from different cone cell types, relative fraction of the individual cone cells across the retina, and linear gains for each cone type (Cameron, 2002). The model incorporates the first-order cellular and biophysical aspects of cone photoreceptor cells and thereby predicts the second-order physiological functions of cone cell-mediated visual sensitivity. Using this model, linear gains that represent the strength of four different types of cone cell-derived neural signals onto four different inferred cone processes in the whole retina can be assessed.

Turning to extreme value theory, the objective of nearly all extant models in computational neuroscience has been to discard the extreme values located at the tails of distributions as noise and concentrate on the mean or average. However, evidence suggests that extremes, and not means, of cell responses direct activity in the brain. For example, the ability of primates, like macaque monkeys, to identify individual faces can be localized to a group of special neurons that fire in response to specific regions of the face (Freiwald et al., 2009). An interesting finding that came out of that study was that neurons were tuned to the geometry of extreme facial features. Previous investigations along this line concentrated on how the brain fundamentally adapts itself to the statistics of the sensory world, extracting relevant information from sensory inputs by modeling the distribution of inputs that are encountered by the organism (Simoncelli and Olshausen, 2001; Simoncelli, 2003). This led to the advent of "sparse coding," which attempts to explain how neurons encode sensory information using a small number of active neurons at any given point in time (Olshausen and Field, 1997). A direct extension of this work suggests that sparse coding is an all-pervasive phenomenon used by all types of sensory neurons in different modalities across different species (Olshausen and Field, 2004). EVT builds upon these concepts but is more specialized.

Much prior work related to EVT modeling has focused on various non-biological applications from trend detection in ground-level ozone (Smith, 1989) to quantifying extreme precipitation levels using Generalized Pareto distributions (Cooley et al., 2007). Other applications of EVT include, but are not limited to, finance, telecommunications, the environment (Finkenstadt and Rootzén, 2003), and hydrology (Katz et al., 2002). Recent work in computer vision and machine learning has extensively used the concept of EVT (Shi et al., 2008; Broadwater and Chellappa, 2010; Scheirer et al., 2011, 2014; Fragoso et al., 2013). For instance, for biometric verification systems, Shi et al. (2008) used the Generalized Pareto Distribution to model the genuine and impostor scores and made a significant observation that the tails of each score distribution contain the most relevant information that helps in defining each distribution considered for prediction and the associated decision boundaries, which are often difficult to model.

Our research extends this theory to multi-sensory inputs through a model that demonstrates strong neural fidelity. With a biologically-consistent information fusion algorithm based on retinal circuits in the zebrafish, we believe that we have access to a better general solution to the problem at hand and possibly many other information processing problems of interest. In this article, we have developed a neural computation model that simulates the process of multi-organ sensory integration and predicts the consequence of sensory integration in higher-order brain functions. In contrast to Gaussian modeling, we propose that EVT models of the extrema found in the tails of the data can form a powerful basis for cross-modal sensory information integration, facilitating heightened sensitivity in targeted modalities that have been influenced by a stimulus in the environment. This resulted in the development of a computational EVT-based framework for multi-organ sensory integration in the zebrafish that is not only an explanatory model in neuroscience, but also shows promise for applications in machine learning and neuromorphic systems.

### DATA AVAILABILITY

The datasets analyzed for this study and the source code used for modeling have been released for reproducibility and can be downloaded from https://github.com/sbanerj2/Zebrafish\_EVT.


### ETHICS STATEMENT

An ethical review process was not required for our study. All data used in this article come from a previously published study (Huang et al., 2005). All experimental procedures in that paper adhered to the NIH guidelines for animals in research.

### AUTHOR CONTRIBUTIONS

WS and LL initially conceived of the idea. LL was responsible for conducting the wet-bench experiments and preparing the source data. SB designed, analyzed, implemented the model and wrote the paper. WS supervised the entire modeling effort.

#### FUNDING

The research was funded in part by the Department of Defense (Army Research Laboratory) under the contracts W911NF-16-1-0316 and W911NF-18-1-0292.

## SUPPLEMENTARY MATERIAL

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fncom.2019.00003/full#supplementary-material


**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2019 Banerjee, Scheirer and Li. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# A Compound Computational Model for Filling-In Processes Triggered by Edges: Watercolor Illusions

Hadar Cohen-Duwek\* and Hedva Spitzer

Vision Research Laboratory, School of Electrical Engineering, Tel-Aviv University, Tel Aviv, Israel

The goal of our research was to develop a compound computational model with the ability to predict different variations of the "watercolor effects" and additional filling-in effects that are triggered by edges. The model is based on a filling-in mechanism solved by a Poisson equation, which considers the different gradients as "heat sources" after the gradients modification. The biased (modified) contours (edges) are ranked and determined according to their dominancy across the different chromatic and achromatic channels. The color and intensity of the perceived surface are calculated through a diffusive filling-in process of color triggered by the enhanced and biased edges of stimulus formed as a result of oriented double-opponent receptive fields. The model can successfully predict both the assimilative and non-assimilative watercolor effects, as well as a number of "conflicting" visual effects. Furthermore, the model can also predict the classic Craik–O'Brien–Cornsweet (COC) effect. In summary, our proposed computational model is able to predict most of the "conflicting" filling-in effects that derive from edges that have been recently described in the literature, and thus supports the theory that a shared visual mechanism is responsible for the vast variety of the "conflicting" filling-in effects that derive from edges.

#### Edited by:

Haluk Ogmen, University of Denver, United States

#### Reviewed by:

C. Alejandro Párraga, Autonomous University of Barcelona, Spain Greg Francis, Purdue University, United States

> \*Correspondence: Hadar Cohen-Duwek hadarli@gmail.com

#### Specialty section:

This article was submitted to Perception Science, a section of the journal Frontiers in Neuroscience

Received: 13 December 2018 Accepted: 26 February 2019 Published: 22 March 2019

#### Citation:

Cohen-Duwek H and Spitzer H (2019) A Compound Computational Model for Filling-In Processes Triggered by Edges: Watercolor Illusions. Front. Neurosci. 13:225. doi: 10.3389/fnins.2019.00225

Keywords: computational models, watercolor effect, filling-in, diffusion process, visual system mechanism

# INTRODUCTION

One of the most important goals of the higher levels of visual system processing is to reconstruct an appropriate representation of a surface after edge detection is performed by early vision. Such tasks are attributed to the opponent receptive fields in the retina and in the lateral geniculate nucleus (LGN). The visual system processing involves the cortical double-opponent as well as the simple and complex receptive fields, which perform non-oriented and oriented edge detection of both chromatic and non-chromatic edges (von der Heydt et al., 2003).

There are a number of visual phenomena and illusions that can provide information about the mechanisms that enable the reconstruction of surfaces from their edges. These include the watercolor illusions (Pinna et al., 2001) and the Craik-O'Brien-Cornsweet illusion (Cornsweet, 1970). In this study we will concentrate mainly on developing a computational model for the watercolor illusions to include a prediction of "conflicting" watercolor effects.

The Watercolor Effect described in the literature refers to a phenomenon involving assimilative color spreading into an achromatic area, produced by a pair of heterochromatic contours surrounding an achromatic surface area (Pinna et al., 2001; Pinna, 2008; Devinck and Spillmann, 2009). The coloration extends up to about 45° of visual angle and is approximately uniform (Pinna et al., 2001).

Many studies have investigated the chromatic and luminance parameters required for the two inducing contours, and for the inducing contours and background, of the watercolor effect (Pinna et al., 2001; Devinck et al., 2005, 2006, 2014; Pinna and Grossberg, 2005; Pinna and Reeves, 2006; Tanca et al., 2010; Cao et al., 2011; Devinck and Knoblauch, 2012; Hazenberg and van Lier, 2013; Coia and Crognale, 2014; Coia et al., 2014). The conclusion was that even though many color combinations can produce the effect, the strongest result is induced by a combination of complementary colors. The studies of Pinna et al. (2001) and Devinck et al. (2005, 2006) characterized these findings as assimilation effects (i.e., the perceived color is similar to the color of the nearest inducer). Reversing the colors of the two inducing contours reverses the resulting perceived colors accordingly (Pinna, 2008).

However, a non-assimilation effect of coloration has also been discussed (Pinna, 2006; Kitaoka, 2007). Pinna (2006) reported that if one of the inducers is achromatic, while the other is chromatic, the induced color can be complementary to that of the chromatic inducer. Kitaoka (2007) demonstrated that a combination of red-magenta or green-cyan can give rise to a yellowish coloration, indicating that the perceived effect may not be completely attributable to assimilation effects. Indeed, an achromatic watercolor effect has been recently proved to exist, albeit with a lower magnitude than the chromatic watercolor effect (Cao et al., 2011).

The only computational model that has been reported to explain the watercolor effect is called the "Form And Color And Depth" (FACADE) model (Grossberg and Mingolla, 1985) and is based on neurophysiological evidence from neurons in the cortical areas V1–V4 (Pinna and Grossberg, 2005). This model also attempts to explain a number of other visual phenomena including the Kanizsa illusion (Kanizsa, 1976), neon color spreading (van Tuijl and Leeuwenberg, 1979), simultaneous contrast, and assimilation effects. FACADE describes two main visual processing systems: a boundary contour system (BCS) that processes boundary or edge information; and a feature contour system (FCS) that uses information from the BCS to control the spreading (filling-in) of surface properties such as color and brightness. According to this model, higher-contrast boundaries in the BCS inhibit lower-contrast boundaries, thereby enabling color to flow out through weaker boundaries.

A number of studies have proposed the FACADE model as a possible mechanism for predicting the watercolor effect since it explains some of the properties of the phenomenon (Grossberg et al., 2005; Pinna and Grossberg, 2005; Pinna, 2006; Tanca et al., 2010). However, neither the mathematical equations of the FACADE model nor other previous studies have succeeded in simulating and predicting all the experimental findings concerning the watercolor effect. Moreover, the FACADE model cannot predict the non-assimilative version of the watercolor effect (Pinna et al., 2001; Kitaoka, 2007; Hazenberg and van Lier, 2013; Kimura and Kuroki, 2014a). Kitaoka (2007) observed that in the non-assimilative watercolor effect, the induced color becomes more prominent when the outer contour has a higher luminance (and thus a lower-contrast with respect to the white background) than the inner contour. In this case, the BCS in the FACADE model would be expected to inhibit the boundaries of the lower-contrast outer contour and permit the color of the outer contour to spread out. This prediction is not supported by the actual perceived color as demonstrated in **Figure 5**, where a yellowish color spreads in and there is no perceived magenta color that spreads out, as the FACADE model would predict.

At present, the visual mechanisms responsible for the watercolor effect are still unknown and the watercolor effect "presents a significant challenge to any complete model of chromatic assimilation" (Devinck et al., 2014).

In their study on the watercolor effect, Knoblauch et al. (Devinck et al., 2014) summarized the requirements for a future computational model: "In a hierarchical model, two other steps need to be considered, surface detection then color filling-in."

In this study, we present a computational model, which detects edges through biological receptive fields, modifies them, and then applies them as a trigger for a diffusive filling-in process. The objective of the model is to predict both the assimilative and the non-assimilative configurations of the watercolor effect.

# COMPUTATIONAL MODEL

The main building blocks of the model are: (A) the inducing stimulus; (B) the chromatic and achromatic opponent receptive fields (RFs); (C) the oriented double-opponent RFs, which detect chromatic and achromatic edges; (D) calculation of the modification value by determining the dominant chromatic/achromatic stimulus edge among several edges of different spatial scales; (E) calculation of the new modified edges that trigger a diffusive filling-in process; (F) the filling-in process, performed by solving the Poisson equation; and (G) the perceived afterimage of both the assimilative and the non-assimilative watercolor effects (**Figures 1A-G**).

### Model Assumptions

The model is based on the following assumptions: (A) The visual system needs to reconstruct surfaces that are not represented in the early vision stages, which perform chromatic and achromatic edge detection (in the retina and the cortical V1 and V2 areas). In addition, we assume that in cases such as the watercolor stimuli, the visual system performs filling-in processes in order to make an "educated guess" and to reconstruct surfaces. (B) Each edge triggers a diffusion process and determines its color (Cohen-Duwek and Spitzer, 2018). (C) The trigger for the diffusion process is determined by the interactions between the gradients of the image, i.e., the gradients between the inner contour (IC) and the outer contour (OC), the gradients between the IC and the background, and the gradients between the OC and the background. The exact contribution of each gradient is determined automatically according to the chromatic and achromatic stimulus. (D) The visual system uses separate chromatic opponent channels [L/M, (L+M)/S and achromatic] in order to process each color contrast pathway separately (Kandel et al., 2012). This assumption is in agreement with experimental studies which claimed that the (L/M) and S-cones are regulated differently with respect to the watercolor effect (Devinck et al., 2005; Kimura and Kuroki, 2014a,b). (E) The chromatic channels are mediated by the luminance channel (the achromatic channel). This assumption is supported by the observation that there is color spreading in response to a stimulus where both the IC and OC have the same color (hue) but a different luminance (Devinck et al., 2006).

### Rationale for the Model

The early stages of the visual system, namely the retina and the early visual areas V1 and V2, have receptive fields (RFs) that mainly detect edges. In the retina, for example, the opponent receptive fields perform a Difference of Gaussians (DOG) operation, which is approximately a second spatial derivative, while the chromatic retinal opponent RFs perform derivatives in the color domain. The simple and complex RFs in the V1 and V2 areas perform oriented edge detection. It has been assumed that at higher visual processing levels, the system acts to reconstruct the surfaces that are not represented by the early visual areas. In order to perceive the physical world and not only its edges/gradients, the visual system needs to reconstruct the image from its edges (von der Heydt et al., 2003). To mimic the original surfaces, the system could use the image's original gradients, in a similar fashion to that used in the engineering world, i.e., by solving the Poisson equation or by any parallel method (Bertalmio et al., 2000; Pérez et al., 2003). However, we believe that the visual system also performs additional tasks, which can be regarded as "educated guesses," in order to enhance important information in the scene. Examples of such "educated guesses" include edge completion, detection of occluded objects in the image, and the interpretation of specific gradients as indicative of adjacent surfaces. The watercolor stimulus is an example of such specific edges, where the visual system supplies a guess regarding the chromatic surface. We suggest here that this educated guess is computed by modifying the image gradients and their weights. In addition, we describe a set of rules that determine how the weights are calculated in the context of the stimulus.

In order to produce the chromatic (or the achromatic) diffusion process, the visual system needs to enhance or change the original gradients in order to obtain an image which creates the perception and avoids a return to the original image. Based on psychophysical findings, the model assumes that the chromatic edges, which determine the filling-in effect, are significantly influenced by the intensity and by the chromaticity of the contours (IC and OC) (Pinna et al., 2001; Devinck et al., 2005, 2006; Pinna and Grossberg, 2005; Pinna and Reeves, 2006; Cao et al., 2011; Hazenberg and van Lier, 2013; Coia and Crognale, 2014; Kimura and Kuroki, 2014a,b).

#### The Watercolor Stimulus

The input of the model comprises the watercolor stimulus and its variations, which are composed of a pair of heterochromatic contours surrounding achromatic surface areas, **Figure 1A**.

#### Chromatic and Achromatic Opponent RF

The first component of the model (**Figure 1B**) is designed to simulate the opponent receptive fields (Nicholls et al., 2001). The spatial response profile of the retinal ganglion RF is expressed by the commonly used DOG. The "center" signals for the three spectral regions, L, M, and S, (Long, Medium, and Short wavelength sensitivity, respectively) that feed the retinal ganglion cells, are defined as the integral of the cone quantum catches, Lcone, Mcone, and Scone with a Gaussian decaying spatial weight function (Shapley and Enroth-Cugell, 1984; Spitzer and Barkan, 2005):

$$\begin{aligned} i_c &= i_{cone} \ast f_c; \quad i \in \{L, M, S\} \\ i_s &= i_{cone} \ast f_s; \quad i \in \{L, M, S\} \\ f_j &= \frac{\exp\left(\frac{-(x^2 + y^2)}{\rho_j^2}\right)}{\pi \rho_j^2}, \quad j \in \{c, s\} \end{aligned} \tag{1}$$

Where L<sub>c</sub>, M<sub>c</sub>, and S<sub>c</sub> represent the responses of the center area of the receptive field of each cell type, Equation (1), and L<sub>s</sub>, M<sub>s</sub>, and S<sub>s</sub> represent the responses of the surround sub-region of these receptive fields. ρ<sub>c</sub> and ρ<sub>s</sub> represent the radii of the center and the surround regions of the receptive field of the color-coding cells, respectively. f<sub>c</sub> and f<sub>s</sub> are the center and surround Gaussian profiles, respectively, and ∗ represents the convolution operation.

For the center-surround cells, the opponent responses are expressed as OP<sub>L<sup>+</sup>M<sup>−</sup></sub>, OP<sub>S<sup>+</sup>(L+M)<sup>−</sup></sub>, and Y (the summation of the L, M, and S channels, which expresses the luminance channel):

$$\begin{aligned} OP_{RG}: \quad OP_{L^{+}M^{-}} &= L_c - M_s && \text{(Red-Green channel)} \\ OP_{BY}: \quad OP_{S^{+}(L+M)^{-}} &= S_c - (L + M)_s && \text{(Blue-Yellow channel)} \\ Y &= L_c + M_c + S_c && \text{(Luminance channel)} \end{aligned} \tag{2}$$

Where L<sub>c</sub>, M<sub>c</sub>, S<sub>c</sub>, L<sub>s</sub>, M<sub>s</sub>, and S<sub>s</sub> are the cell responses of the receptive field sub-regions (center and surround), Equation (1).
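To make Equations (1) and (2) concrete, the following Python sketch samples the Gaussian profiles on an integer pixel grid and evaluates the three opponent responses at a single pixel. It is only an illustration under stated assumptions, not the implementation used in this study: the function names, the radii (ρ<sub>c</sub> = 1 and ρ<sub>s</sub> = 3 pixels), the kernel truncation at four radii, and the border clamping are all hypothetical choices.

```python
import math

def gaussian_profile(rho, radius):
    """Sample f_j(x, y) = exp(-(x^2 + y^2) / rho^2) / (pi * rho^2), Eq. (1)."""
    coords = range(-radius, radius + 1)
    return [[math.exp(-(x * x + y * y) / rho ** 2) / (math.pi * rho ** 2)
             for x in coords] for y in coords]

def convolve_at(image, kernel, row, col):
    """Value of (image * kernel) at one pixel; out-of-range pixels are clamped."""
    radius = len(kernel) // 2
    height, width = len(image), len(image[0])
    total = 0.0
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            r = min(max(row - dy, 0), height - 1)
            c = min(max(col - dx, 0), width - 1)
            total += kernel[dy + radius][dx + radius] * image[r][c]
    return total

def opponent_responses(l_cone, m_cone, s_cone, row, col, rho_c=1.0, rho_s=3.0):
    """Eq. (2) at one pixel. rho_c and rho_s are hypothetical radii in pixels."""
    f_c = gaussian_profile(rho_c, radius=int(4 * rho_c))
    f_s = gaussian_profile(rho_s, radius=int(4 * rho_s))
    l_c = convolve_at(l_cone, f_c, row, col)
    m_c = convolve_at(m_cone, f_c, row, col)
    s_c = convolve_at(s_cone, f_c, row, col)
    l_s = convolve_at(l_cone, f_s, row, col)
    m_s = convolve_at(m_cone, f_s, row, col)
    op_rg = l_c - m_s            # L+ center, M- surround
    op_by = s_c - (l_s + m_s)    # (L + M)_s = L_s + M_s by linearity of convolution
    lum = l_c + m_c + s_c        # luminance channel Y
    return op_rg, op_by, lum

# A uniform gray field: equal L and M catches give a near-zero red-green response.
flat = [[0.5] * 7 for _ in range(7)]
rg, by, lum = opponent_responses(flat, flat, flat, row=3, col=3)
```

On a uniform field the red-green response is close to zero because both profiles integrate to approximately one, which matches the statement that the opponent RFs encode color contrast rather than absolute level.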

#### Oriented Double-Opponent RF

The color coding of the opponent receptive fields, Equation (2), encodes color contrast, but not spatial contrast. In other words, the color opponent receptive fields are able to differentiate between colors, but cannot detect spatial gradients or edges (Conway, 2001; Spitzer and Barkan, 2005; Conway and Livingstone, 2006; Conway et al., 2010). The double-opponent receptive fields, however, are sensitive to both spatial and chromatic gradients (Spitzer and Barkan, 2005) since they have color opponent receptive fields both at the center and in the surround RF regions (Shapley and Hawken, 2011). A large number of studies have reported that many double-opponent neurons are also orientation-selective (Thorell et al., 1984; Conway, 2001; Johnson et al., 2001, 2008; Horwitz et al., 2007; Solomon and Lennie, 2007; Conway et al., 2010). Accordingly, the model applies the oriented double-opponent RF, ODO, to the three opponent RF channels, OP<sub>L<sup>+</sup>M<sup>−</sup></sub>, OP<sub>S<sup>+</sup>(L+M)<sup>−</sup></sub>, and Y (Conway and Livingstone, 2006), Equation (2). We modeled this chromatic RF structure, ODO<sub>L<sup>+</sup>M<sup>−</sup></sub>, ODO<sub>S<sup>+</sup>(L+M)<sup>−</sup></sub>, and OY, by a convolution between the Gabor function and the opponent responses, Equation (3), **Figure 1C**. It should be noted that previous work indicates that by using the linear Gabor function, we neglect some non-linearities in the neuronal responses, e.g., half-wave rectification in the simple cells and full rectification in the complex cells (Movshon et al., 1978; Spitzer and Hochstein, 1985).

$$\begin{aligned} ODO_{L^{+}M^{-}} &= OP_{L^{+}M^{-}} \ast Gabor_{odd,\theta,\sigma} \\ ODO_{S^{+}(L+M)^{-}} &= OP_{S^{+}(L+M)^{-}} \ast Gabor_{odd,\theta,\sigma} \\ OY &= Y \ast Gabor_{odd,\theta,\sigma} \end{aligned} \tag{3}$$

$$\begin{aligned} Gabor_{odd,\theta,\sigma} &= \exp\left(\frac{-\left(x'^2 + y'^2\right)}{2\sigma^2}\right)\sin(2\pi x') \\ Gabor_{even,\theta,\sigma} &= \exp\left(\frac{-\left(x'^2 + y'^2\right)}{2\sigma^2}\right)\cos(2\pi x') \\ \text{where:} \quad x' &= x\cos(\theta) + y\sin(\theta) \\ y' &= -x\sin(\theta) + y\cos(\theta) \end{aligned} \tag{4}$$

This opponency in both spatial and chromatic properties produces a spatio-oriented-chromatic edge detector, Equation (3).

Where θ represents the orientation of the normal to the parallel stripes of a Gabor function and σ is the standard deviation of the Gaussian envelope of the Gabor function.
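The odd and even Gabor functions defined above can be sampled as in the following Python sketch. This is an illustrative sketch only: the function name and the `step` parameter (the pixel spacing in the units of x and y) are assumptions introduced here, as are the example values of σ and the kernel radius. A spacing smaller than one is needed because the carrier has unit frequency, so with unit spacing and θ = 0 the odd carrier sin(2πx') would vanish at every sample.

```python
import math

def gabor_kernel(theta, sigma, radius, step, odd=True):
    """Sample the odd or even Gabor function on a (2*radius+1)^2 grid.

    `step` is a hypothetical pixel spacing in the units of x and y.
    """
    kernel = []
    for iy in range(-radius, radius + 1):
        row = []
        for ix in range(-radius, radius + 1):
            x, y = ix * step, iy * step
            # Rotate the coordinates by the orientation theta.
            xr = x * math.cos(theta) + y * math.sin(theta)
            yr = -x * math.sin(theta) + y * math.cos(theta)
            envelope = math.exp(-(xr * xr + yr * yr) / (2.0 * sigma ** 2))
            phase = 2.0 * math.pi * xr
            row.append(envelope * (math.sin(phase) if odd else math.cos(phase)))
        kernel.append(row)
    return kernel

odd_kernel = gabor_kernel(theta=0.0, sigma=0.4, radius=8, step=0.125, odd=True)
even_kernel = gabor_kernel(theta=0.0, sigma=0.4, radius=8, step=0.125, odd=False)
```

The odd kernel is antisymmetric about its center, so it sums to approximately zero and responds to edges (a first derivative), whereas the even kernel is symmetric and peaks at its center.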

#### Gradient Weights

We chose to express this property of gradient modification by adding weighted functions to the Oriented-double-opponent RF (**Figure 1D**). The model modifies the original gradients (Equation 3) by multiplying the double-opponent responses by the weight function, Equation (6), **Figure 1D**. In order to calculate the weight functions, several Gabor-filters on different scales [different standard deviations, σ, Equation (5)] are calculated and the maximum response to a specific Gabor RF scale is chosen as the weight function for each channel separately, Equation (6). This maximum response represents the dominant gradient in the image, which is used by the model to determine the strongest effect on the diffusion process. This determination of the strongest effect (i.e., the strongest edge in the stimulus) is in agreement with previously reported psychophysical findings (Pinna et al., 2001; Devinck et al., 2005, 2006; Kimura and Kuroki, 2014a,b). The multiplication operation of the chosen weight is done with a 2D Gabor filter, Equation (5). (It should be noted that we could also obtain good results by making a summation of the responses from all scales).

$$\begin{aligned} R_{RG,i} &= \left| OP_{RG} \ast Gabor_{even}\left(\theta, \sigma_i\right) \right| \\ R_{BY,i} &= \left| OP_{BY} \ast Gabor_{even}\left(\theta, \sigma_i\right) \right| \\ R_{Luminance,i} &= \left| Y \ast Gabor_{even}\left(\theta, \sigma_i\right) \right| \end{aligned} \tag{5}$$

Where σ<sub>i</sub> represents different standard deviations of the Gaussian envelope (different scales).

$$\begin{aligned} W_{RG}(i,j) &= \max\left\{ R_{RG,1}(i,j), R_{RG,2}(i,j), \dots, R_{RG,N}(i,j) \right\} \\ W_{BY}(i,j) &= \max\left\{ R_{BY,1}(i,j), R_{BY,2}(i,j), \dots, R_{BY,N}(i,j) \right\} \\ W_Y(i,j) &= \max\left\{ R_{Luminance,1}(i,j), R_{Luminance,2}(i,j), \dots, R_{Luminance,N}(i,j) \right\} \end{aligned} \tag{6}$$

Where W<sub>RG</sub>, W<sub>BY</sub>, and W<sub>Y</sub> are the maximal responses among the several scales in each channel.

This calculation is done separately for the chromatic channels and the achromatic channel (RG, BY, and Y). After determining which scale yields the strongest response in each channel, the three responses are summed across the channels, Equation (7), to reflect a combination of all the edges at each spatial location. In other words, the weight function W, for each spatial location in the image (or stimulus), is taken as the normalized sum of the maxima (the values from the strongest response scale) across all the channels, Equation (7).

$$W = W_{RG} + W_{BY} + W_Y \tag{7}$$

This calculation procedure can detect the middle chromatic (or achromatic) edge between the two contours (IC and OC), which are the triggers for the diffusion process. This detection is possible because in most cases, the dominant edge is a coarse edge, which contains the edge that is adjacent to the inner and the outer region. The center of this coarse region is often the edge between the two chromatic contours in the watercolor stimuli.
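The selection of the dominant scale, Equation (6), and the combination across channels, Equation (7), can be sketched in Python as follows. The function names and the response values are hypothetical and serve only to illustrate the two operations. The sketch assumes that the responses of all scales have already been brought to a common resolution, and it omits the normalization mentioned in the text because its exact form is not specified here.

```python
def dominant_scale_weight(responses_per_scale):
    """Eq. (6): at each location keep the largest response over all scales."""
    return [max(values) for values in zip(*responses_per_scale)]

def combined_weight(w_rg, w_by, w_y):
    """Eq. (7): sum the per-channel maxima at each location (no normalization)."""
    return [rg + by + y for rg, by, y in zip(w_rg, w_by, w_y)]

# Hypothetical rectified responses R_{c,i}: three scales, four locations each.
r_rg = [[0.0, 0.5, 0.25, 0.0], [0.0, 0.25, 1.0, 0.0], [0.125, 0.25, 0.5, 0.0]]
r_by = [[0.0, 0.25, 0.25, 0.0], [0.0, 0.0, 0.5, 0.0], [0.0, 0.125, 0.25, 0.0]]
r_y = [[0.0, 0.0, 0.5, 0.0], [0.0, 0.25, 0.25, 0.0], [0.0, 0.0, 0.0, 0.0]]

w_rg = dominant_scale_weight(r_rg)
w_by = dominant_scale_weight(r_by)
w_y = dominant_scale_weight(r_y)
w = combined_weight(w_rg, w_by, w_y)
```

In this toy example the third location carries the largest weight in every channel, so it would be treated as the dominant edge.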

#### The Diffusion Triggers (Second Derivative)

The trigger for the diffusion process consists of the sum of two components: the modification component (β) and the "real" (α) oriented double-opponent RF component, Equation (8). These modification components are added separately for each orientation direction, and the modified gradients are then convolved again with an odd Gabor filter (in the same orientation, θ), Equation (10), in order to perform a second derivative. Both derivative directions (x and y axes, θ = 0 and θ = π/2) are then summed in order to create the divergence, Equation (10), **Figure 1F**, which is used as the trigger for the diffusion process in all the required directions across each of the channels. The trigger for the diffusion process is the oriented double-opponent response, Equation (3), multiplied by the weight function (W) in each individual channel, **Figure 1E**, Equation (8).

$$\begin{aligned} Trig_{RG} &= ODO_{RG} \cdot \left(\alpha + \beta \, W(x, y)\right) \\ Trig_{BY} &= ODO_{BY} \cdot \left(\alpha + \beta \, W\right) \\ Trig_Y &= OY \cdot \left(\alpha + \beta \, W\right) \end{aligned} \tag{8}$$

Where α and β are constants and α > β. Trig<sub>RG</sub>, Trig<sub>BY</sub>, and Trig<sub>Y</sub> are the diffusion triggers in each channel.

Note that the results of the above equations change only the weights of the ODO (Equation 3) responses, and therefore their spatial properties and polarities are retained. According to the suggested model, the prominent gradient makes the strongest contribution to the filling-in process, Equation (7). However, the other two gradients also contribute to the filling-in process, due to the chromatic and achromatic strength of their gradients. This consideration of the different gradients is in agreement with the Weber contrast rule (Kimura and Kuroki, 2014a).

#### Filling-In Process

The filling-in process is expressed by the diffusion (or heat) equation, Equation (9) (Weickert, 1998), and is determined according to the weighted triggers, Equation (8), **Figure 1E**. The model assumes that the filling-in process represents "isomorphic diffusion" (von der Heydt et al., 2003; Cohen-Duwek and Spitzer, 2018), although it does not necessarily negate other possible filling-in mechanisms, such as "edge integration" (Rudd, 2014). This filling-in process is reminiscent of the physical diffusion process, where the signals spread in all directions until "blocked" by another heat source (image edges). We would like to emphasize that this type of filling-in implies that the borders (chromatic or achromatic) do not function primarily as blockers, but instead act as heat sources that can trigger the diffusion, and then spread in opposite directions and thus trap the diffused color. The diffusion spread, therefore, will be blocked by the heat source in such a case. These principles are applied in our model through the well-known diffusion equation (Weickert, 1998):

$$\frac{\partial I\left(x, y, t\right)}{\partial t} - D \nabla^2 I\left(x, y, t\right) = h_s = -\text{div}\left(Trig_c\right); \quad \text{where } c \in \{L^{+}M^{-},\ S^{+}(L+M)^{-},\ Y\} \tag{9}$$

where I(x, y, t) denotes the image at the space-time location (x, y, t), D is the diffusion (or heat) coefficient, and h<sub>s</sub> represents a heat source. The time course of the perceived image is assumed to be very fast, in accordance with previous reports (Pinna et al., 2001). This time course is also termed "immediate filling-in" (von der Heydt et al., 2003).

Following this assumption, for the sake of simplicity, we can ignore the fast-dynamic stages of the diffusion equation, and therefore compute only the steady-state stage of the diffusion process. Consequently, the diffusion (heat) Equation (9) is reduced to the Poisson Equation (10).

$$D\nabla^2 I = -h_s = \text{div}\left(Trig_c\right); \quad \text{where } c \in \{RG, BY, Y\} \tag{10}$$

$$D\nabla^2 I = \text{div}\left((\alpha + \beta \, W) \cdot ODO\right) \tag{11}$$

The "heat sources" are the weighted second derivative of an opponent channel, **Figure 1E** (weighted oriented double-opponent). The heat equation (diffusion equation) with heat sources requires second derivatives, reflecting the "heat generation rate," which is the second derivative of a heat source. Because the edges play the role of heat sources, the values near the edges do not decay over time. Since the two adjacent edges operate as heat sources with opposite signs, they operate in opposite directions, and therefore the diffusion process of one color (one heat source) cannot diffuse in the "other" direction. This approach is not consistent with previous reports that the edges function as borders that prevent the colors from spreading (Cohen and Grossberg, 1984; Grossberg and Mingolla, 1985, 1987; Pinna and Grossberg, 2005). In the suggested model, the derivatives trigger a positive diffusion process toward one side of the spatial derivative and a "negative diffusion" process toward the other side. **Figure 2** demonstrates this type of diffusion, which is considered separately for each color channel.
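The way in which weighted edges act as sources in the steady-state Equation (10) can be illustrated in one dimension. The Python sketch below is a simplification under stated assumptions, not the solver used in this study: it treats a single image row of one opponent channel, replaces the odd Gabor filter by a first difference, sets D = 1, fixes the two end values, and uses a hand-set weight W on the edge between the two contours instead of the weight of Equation (7). All function names are hypothetical.

```python
def solve_poisson_1d(rhs, left, right):
    """Solve I[i-1] - 2*I[i] + I[i+1] = rhs[i] with fixed end values (Thomas algorithm)."""
    m = len(rhs)
    b = list(rhs)
    b[0] -= left           # move the known boundary values to the right-hand side
    b[-1] -= right
    diag = [-2.0] * m
    for i in range(1, m):  # forward elimination (sub- and super-diagonal are 1)
        factor = 1.0 / diag[i - 1]
        diag[i] -= factor
        b[i] -= factor * b[i - 1]
    x = [0.0] * m
    x[-1] = b[-1] / diag[-1]
    for i in range(m - 2, -1, -1):  # back substitution
        x[i] = (b[i] - x[i + 1]) / diag[i]
    return [left] + x + [right]

def fill_in_1d(channel, weight, alpha=1.0, beta=0.5):
    """Steady-state filling-in along one row of a single opponent channel."""
    n = len(channel)
    odo = [channel[i + 1] - channel[i] for i in range(n - 1)]     # edge response
    trig = [g * (alpha + beta * w) for g, w in zip(odo, weight)]  # Eq. (8)
    div = [trig[i] - trig[i - 1] for i in range(1, n - 1)]        # second derivative
    return solve_poisson_1d(div, channel[0], channel[-1])         # Eq. (10), D = 1

# Cross-section: background 0, OC = -1, IC = +1, enclosed area 0 (mirrored).
row = [0.0, 0.0, 0.0, -1.0, 1.0, 0.0, 0.0, 0.0, 0.0, 1.0, -1.0, 0.0, 0.0, 0.0]
w = [0.0] * (len(row) - 1)
w[3] = w[9] = 1.0  # hand-set weight on the OC/IC edge (assumption)
unmodified = fill_in_1d(row, [0.0] * (len(row) - 1))
filled = fill_in_1d(row, w)
```

With unmodified gradients (W = 0) the solution returns the original row, i.e., the image is simply reconstructed from its edges. When the edge between the contours is amplified by (α + βW) = 1.5, the enclosed area settles at β·W·(IC − OC) = 1, taking the sign of the inner contour, while the outer background remains at 0.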

# METHODS

In this section we describe each stage of the model's implementation in detail.

#### Opponent RF

For the sake of simplicity, we compute the opponent response of the opponent receptive fields as color-opponent only, where each chromatic encoder has the same spatial resolution. This is computed by an opponent color transformation (van de Sande et al., 2010), Equation (12). This transformation converts each pixel of the image I<sub>0</sub>, in each chromatic channel R, G, and B, into the opponent color space via the transformation matrix O (van de Sande et al., 2010). In order to obtain a more perceptual value in the luminance channel, we have slightly modified the transformation matrix O, and use a = 0.2989, b = 0.5870, and c = 0.1140, instead of a = b = c = 1/√3 as originally reported (van de Sande et al., 2010). These values are taken from the Y channel in the YUV (or YIQ) color space, where Y represents the luma information: Y = 0.2989R + 0.5870G + 0.1140B. I<sub>OPPONENT</sub> = OPPONENT{RGB} is computed as follows:

$$I_{OPPONENT} = \begin{pmatrix} O_{RG} \\ O_{YB} \\ O_Y \end{pmatrix} = \begin{pmatrix} 1/\sqrt{2} & -1/\sqrt{2} & 0 \\ 1/\sqrt{6} & 1/\sqrt{6} & -2/\sqrt{6} \\ a & b & c \end{pmatrix} \begin{pmatrix} R \\ G \\ B \end{pmatrix} \tag{12}$$
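A per-pixel version of Equation (12) can be sketched in Python as follows. The function and constant names are hypothetical, and the sketch assumes RGB values scaled to the range 0 to 1, which the text does not state.

```python
import math

A, B, C = 0.2989, 0.5870, 0.1140  # luma weights quoted in the text

OPPONENT_MATRIX = [
    [1 / math.sqrt(2), -1 / math.sqrt(2), 0.0],               # O_RG
    [1 / math.sqrt(6), 1 / math.sqrt(6), -2 / math.sqrt(6)],  # O_YB
    [A, B, C],                                                # O_Y
]

def rgb_to_opponent(r, g, b):
    """Apply the matrix of Eq. (12) to one RGB pixel; returns (O_RG, O_YB, O_Y)."""
    return tuple(m[0] * r + m[1] * g + m[2] * b for m in OPPONENT_MATRIX)

gray = rgb_to_opponent(0.5, 0.5, 0.5)
red = rgb_to_opponent(1.0, 0.0, 0.0)
```

For an achromatic pixel both chromatic channels vanish and only the luminance channel is non-zero, whereas a pure red pixel gives a positive O<sub>RG</sub> response.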

Another perceptual option for the opponent transformation matrix is to use the transformation presented by Wandell (1995),

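$$\begin{aligned} I_{OPPONENT} &= M_{OpponentW}\{M_{LMS}\{M_{XYZ}\{RGB\}\}\} \\ M_{XYZ} &= \begin{pmatrix} 0.4124 & 0.3576 & 0.1805 \\ 0.2126 & 0.7152 & 0.0722 \\ 0.0193 & 0.1192 & 0.9505 \end{pmatrix} \\ M_{LMS} &= \begin{pmatrix} 0.2430 & 0.8560 & -0.0440 \\ -0.3910 & 1.1650 & 0.0870 \\ 0.0100 & -0.0080 & 0.5630 \end{pmatrix} \\ M_{OpponentW} &= \begin{pmatrix} 1 & 0 & 0 \\ -0.59 & 0.80 & -0.12 \\ -0.34 & -0.11 & 0.93 \end{pmatrix} \end{aligned} \tag{13}$$

$$I_{OPPONENT} = \begin{pmatrix} O_Y \\ O_{RG} \\ O_{YB} \end{pmatrix} = \begin{pmatrix} 0.2814 & 0.6938 & 0.0638 \\ -0.0971 & 0.1458 & -0.0250 \\ -0.0930 & -0.2529 & 0.4665 \end{pmatrix} \begin{pmatrix} R \\ G \\ B \end{pmatrix} \tag{14}$$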

These matrix values are calculated from the linear conversion of the RGB color space to the XYZ color space, which is then converted to the LMS color space to which we apply the opponent transformation from Wandell (1995), Equation (13).
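The chain of conversions can be checked numerically. The Python sketch below (with hypothetical names) multiplies the three matrices of Equation (13) in the stated order and compares the product with the combined matrix reported in Equation (14); the two agree to within rounding of the fourth decimal place.

```python
def matmul(a, b):
    """Product of two matrices stored as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

M_XYZ = [[0.4124, 0.3576, 0.1805],
         [0.2126, 0.7152, 0.0722],
         [0.0193, 0.1192, 0.9505]]
M_LMS = [[0.2430, 0.8560, -0.0440],
         [-0.3910, 1.1650, 0.0870],
         [0.0100, -0.0080, 0.5630]]
M_OPPONENT_W = [[1.0, 0.0, 0.0],
                [-0.59, 0.80, -0.12],
                [-0.34, -0.11, 0.93]]

# RGB -> XYZ -> LMS -> opponent, composed into a single matrix.
composite = matmul(M_OPPONENT_W, matmul(M_LMS, M_XYZ))

REPORTED = [[0.2814, 0.6938, 0.0638],      # Eq. (14) as printed
            [-0.0971, 0.1458, -0.0250],
            [-0.0930, -0.2529, 0.4665]]

max_difference = max(abs(composite[i][j] - REPORTED[i][j])
                     for i in range(3) for j in range(3))
```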

where O<sub>RG</sub>, O<sub>YB</sub>, and O<sub>Y</sub>, Equations (12–14), are the new channels of the transformed image I<sub>OPPONENT</sub>. R, G, and B are the red, green, and blue channels of the input image I, respectively.

#### Oriented Opponent and Double-Opponent RF

The oriented opponent RFs are modeled as a convolution between each opponent channel and an odd Gabor function, Equation (4). For the sake of simplicity, we discretized the Gabor function: instead of computing the exact Gabor functions, we used a discrete derivative filter in two directions, vertical (y-axis, θ = 0) and horizontal (x-axis, θ = π/2), Equations (15–16) (Gonzalez and Woods, 2002).

$$Gabor_{odd,x} \approx G_{odd,x} = [-1, 1]; \quad Gabor_{odd,y} \approx G_{odd,y} = \begin{bmatrix} -1 \\ 1 \end{bmatrix} \tag{15}$$

$$Gabor_{even,x} \approx G_{even,x} = [-1, 2, -1]; \quad Gabor_{even,y} \approx G_{even,y} = \begin{bmatrix} -1 \\ 2 \\ -1 \end{bmatrix} \tag{16}$$

The above discretizations of the Gabor filters, G<sub>odd,x</sub> and G<sub>odd,y</sub>, also represent the discrete gradient operator ∇:

$$
\nabla I = (\nabla_x I, \nabla_y I) = (I \ast G_{odd,x}, I \ast G_{odd,y}) \tag{17}
$$

The structure of the oriented-double-opponent receptive field can be seen as a filter which acts as a second derivative in both the spatial and chromatic domains.

#### Weights of Modified Edges

In order to calculate the response of an opponent channel to a Gabor RF on different scales, Equation (5), we use a Gaussian Pyramid (Adelson et al., 1984). In this way, the image is downsampled instead of up-sampling the Gabor filter.

$$R_{c,i} = \left| GaussianPyramid\{OP_c\}_{\sigma_i} \ast Gabor_{even}\left(\theta\right) \right| \tag{18}$$
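The idea of Equation (18), i.e., keeping the filter fixed and shrinking the signal, can be sketched in one dimension as follows. The Python code is an illustration under stated assumptions: the 5-tap binomial smoothing kernel, the clamped borders, the factor-of-two downsampling, and the function names are choices made here, and the text does not specify which of them were used. The discrete even filter [-1, 2, -1] of Equation (16) stands in for the even Gabor filter.

```python
def reduce_level(signal):
    """One pyramid step: smooth with a 5-tap binomial kernel, then keep every 2nd sample."""
    kernel = [1 / 16, 4 / 16, 6 / 16, 4 / 16, 1 / 16]
    n = len(signal)
    blurred = []
    for i in range(n):
        acc = 0.0
        for k, tap in enumerate(kernel):
            j = min(max(i + k - 2, 0), n - 1)  # clamp indices at the borders
            acc += tap * signal[j]
        blurred.append(acc)
    return blurred[::2]

def even_response(signal):
    """Rectified response to the discrete even filter [-1, 2, -1]."""
    n = len(signal)
    return [abs(-signal[max(i - 1, 0)] + 2 * signal[i] - signal[min(i + 1, n - 1)])
            for i in range(n)]

def pyramid_responses(signal, levels):
    """Responses R_{c,i} of one channel at successive pyramid levels."""
    responses = []
    current = list(signal)
    for _ in range(levels):
        responses.append(even_response(current))
        current = reduce_level(current)
    return responses

step_edge = [0.0] * 8 + [1.0] * 8
responses = pyramid_responses(step_edge, levels=3)
flat_responses = pyramid_responses([0.5] * 16, levels=3)
```

Each level halves the number of samples, so the responses would have to be resampled to a common resolution before the maximum over scales in Equation (6) is taken; that step is not shown here.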

#### Filling-In Process

The divergence operator, div, Equation (10), is computed as:

$$\text{div}(F) = \frac{\partial F_x}{\partial x} + \frac{\partial F_y}{\partial y} = F_x \ast G_{odd,x} + F_y \ast G_{odd,y} \tag{19}$$

Where F = (F<sub>x</sub>, F<sub>y</sub>) is the input, with one image-sized component for each derivative direction.

Therefore, Equation (10) can be written as:

$$
\Delta I_{op} = \nabla^2 I_{op} = \text{div}(Trig) = Trig_x \ast G_{odd,x} + Trig_y \ast G_{odd,y} \tag{20}
$$
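The relationship between the discrete gradient, Equation (17), and the divergence used in the Poisson equation can be checked with a short Python sketch. The function names are hypothetical, and the sketch makes one assumption about sign and alignment that the text leaves open: the gradient is taken as a forward difference and the divergence as a backward difference, so that their composition is the standard 5-point Laplacian at interior pixels.

```python
def forward_diff_x(img):
    """Horizontal forward difference, I[r][c+1] - I[r][c]; last column set to 0."""
    return [[row[c + 1] - row[c] for c in range(len(row) - 1)] + [0.0] for row in img]

def forward_diff_y(img):
    """Vertical forward difference, I[r+1][c] - I[r][c]; last row set to 0."""
    width = len(img[0])
    rows = [[img[r + 1][c] - img[r][c] for c in range(width)]
            for r in range(len(img) - 1)]
    return rows + [[0.0] * width]

def divergence(fx, fy):
    """Backward-difference divergence of the field (fx, fy)."""
    height, width = len(fx), len(fx[0])
    out = [[0.0] * width for _ in range(height)]
    for r in range(height):
        for c in range(width):
            dx = fx[r][c] - (fx[r][c - 1] if c > 0 else 0.0)
            dy = fy[r][c] - (fy[r - 1][c] if r > 0 else 0.0)
            out[r][c] = dx + dy
    return out

# I(r, c) = r^2 + c^2 has a constant Laplacian of 4.
img = [[float(r * r + c * c) for c in range(5)] for r in range(5)]
lap = divergence(forward_diff_x(img), forward_diff_y(img))
```

In the model, the two gradient components would be multiplied by (α + βW) before the divergence is taken, which is what turns plain reconstruction into the modified filling-in.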

#### Parameters

We performed a set of simulations in order to determine the constants α and β. We found that increasing the β parameter (i.e., increasing the weight of the modified gradient, ODO, Equation 8) increases the saturation of the predicted result, since the level of the relevant gradient is increased. This means that choosing a higher value for β increases the saturation of the filled-in predicted color and also increases its intensity while preserving its hue. The α parameter affects the magnitude of the original gradient of the original stimulus. We concluded that the ratio between α and β determines the level of the filled-in predicted saturation. In all the simulations presented here, α = 1 and β = 0.5.

#### Comparison to Psychophysical Findings

In order to compare the predictions of the model to psychophysical findings we created sets of images that contain the same color values that have been used in previous psychophysical experiments (Devinck et al., 2005; Kimura and Kuroki, 2014b). Each color value used in the stimulus was converted from the CIE Lu'v' 1976 color space to the sRGB color space, in order to create the input images for the model. The model was then applied to each image stimulus, and the predicted colors were calculated and converted back to the CIE Lu'v' 1976 color space. These CIE Lu'v' 1976 color values are presented in the results section.
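One direction of this conversion, from sRGB to the u'v' chromaticity coordinates, can be sketched in Python as follows. This is a sketch of the standard formulas and not necessarily the exact procedure used in this study: it assumes sRGB values in the range 0 to 1, the standard sRGB transfer function, and the RGB-to-XYZ matrix M<sub>XYZ</sub> of Equation (13). The function names are hypothetical.

```python
def srgb_to_linear(value):
    """Undo the sRGB transfer function for one component in [0, 1]."""
    if value <= 0.04045:
        return value / 12.92
    return ((value + 0.055) / 1.055) ** 2.4

def srgb_to_uv(r, g, b):
    """Return the CIE 1976 (u', v') chromaticity of an sRGB color."""
    rl, gl, bl = (srgb_to_linear(v) for v in (r, g, b))
    # Linear RGB -> XYZ with the matrix M_XYZ of Eq. (13).
    x = 0.4124 * rl + 0.3576 * gl + 0.1805 * bl
    y = 0.2126 * rl + 0.7152 * gl + 0.0722 * bl
    z = 0.0193 * rl + 0.1192 * gl + 0.9505 * bl
    denom = x + 15.0 * y + 3.0 * z
    if denom == 0.0:
        raise ValueError("chromaticity is undefined for pure black")
    return 4.0 * x / denom, 9.0 * y / denom

u_white, v_white = srgb_to_uv(1.0, 1.0, 1.0)
```

For sRGB white this gives approximately (0.198, 0.468), the chromaticity of the D65 white point, which is a convenient sanity check before converting stimulus or predicted colors.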

# RESULTS

The results present the simulations of the model through its equations (according to the Methods section), implemented in MATLAB. The model's equations were solved in a similar way to that reported in "Methods for Solving Equations" (Simchony et al., 1990); an alternative option is "Poisson Image Editing" (Pérez et al., 2003).

#### Model's Simulation and Predictions

The model and simulation results (**Figure 1G**) are divided into three parts. The first part presents the model predictions for the assimilative (classic) watercolor effect. The second part presents the predictions of the model for the non-assimilative (nonclassic) watercolor effect, while the third part presents the model predictions that relate to additional properties of the watercolor effect: the influence of the background luminance, and the effect of the inner color luminance on the perceived hue and the perceived brightness (Devinck et al., 2005, 2006; Cao et al., 2011; Kimura and Kuroki, 2014a,b).

#### Predictions—Assimilative (Classic) Watercolor Effect

The model simulations were tested on a large number of classic stimuli with a variety of thin chromatic polygonal curves (e.g., star shapes) that produce the watercolor effect. **Figure 3** shows that the model succeeded in predicting the correct coloration of the classic assimilative watercolor effect. Note that in most of the assimilative watercolor stimuli, the IC and the OC colors are complementary. Our model indeed predicts a strong filling-in color response to such stimuli, **Figures 3A–C**.

**Figure 3** demonstrates that the filling-in perceived color is more prominent in the predicted result (right side), which represents the model prediction for the corresponding original stimulus (left side). The filling-in effect of the stimuli with orange and purple polygonal edges was obtained as expected, **Figure 3A**, as was that of the stimuli with reddish and cyan edges, **Figure 3B**. The level of saturation in the simulation results can be controlled by the parameters α and β, Equation (8). We also tested our model with an achromatic watercolor stimulus. **Figure 3C** shows that the model correctly predicts a perceived darker or lighter inner area, according to the luminance of the inner contour.

#### **Comparison to psychophysical findings**

We confronted our model predictions with quantitative psychophysical results (Devinck et al., 2005). **Figure 4** presents the predictions of the model in CIE Lu'v' (1976) coordinates instead of RGB images, see Methods. In order to enable the comparison between the model predictions and the psychophysical results, we applied the same set of colors as described in Devinck et al. (2005), as parameters to our model, see Methods.

FIGURE 3 | The model's predictions for assimilative watercolor stimuli. (A) The classic watercolor stimulus (left) and the model's predictions (right). (B) Additional example of an assimilative watercolor stimulus (left), with different colors, and the model's predictions (right). (C) An example of achromatic watercolor stimulus (left) and the model's predictions (right). Our model predicts that in the assimilative watercolor stimuli, the inner contour color is spread to the inner area of the stars.

**Figure 4** compares the model prediction with the findings of Devinck et al. (2005), who tested the assimilative effect on three pairs of colors: orange and purple, red and green, and blue and yellow. Note that the psychophysical findings were obtained from a hue cancellation test and therefore represent the complementary colors of the perceived colors, whereas our results represent the predicted perceived colors. Most of the predicted colors, **Figure 4A**, are in agreement with the psychophysical findings, **Figure 4B**. Only for the orange and purple pair is the predicted color slightly more yellowish than in the psychophysical findings for the IC: Orange OC: Purple stimulus (**Figure 4A** top left) and slightly more bluish than in the psychophysical findings for the IC: Purple OC: Orange stimulus (**Figure 4A** top right).

#### Predictions—Non-assimilative (Non-classic) Watercolor Effect

We also tested two known versions of the non-assimilative watercolor effect (Pinna, 2006; Kimura and Kuroki, 2014a). In this case, we chose to test the three chromatic stimuli colors as tested originally by Kimura and Kuroki (2014a) for the non-assimilative watercolor effect. The stimuli in these versions have chromatic and achromatic edges/contours (**Figure 5A**) or specific pairs of colors (**Figures 5B,C**).

FIGURE 4 | Comparison between the predictions of the model and the psychophysical findings of the assimilative effect, both presented in u'v' (CIELu'v' 1976) color space. The prediction of the model (A) and the chromatic cancelation data (B) that are taken from Devinck et al. (2005). Each row (A,B) presents a pair of IC and OC colors, which are orange–purple, red–green, and blue–yellow, respectively. The colored dots (A) represent the predicted results. The colored lines (A) represent the hue line of the IC contour color that was used in each pair of contours.

Kimura and Kuroki (2014a,b) psychophysically tested stimuli similar to those in **Figures 5A**,**B** and found that the induced colors were yellowish. The psychophysical results also demonstrated that a stimulus such as that in **Figure 5A** (left star), yielded a complementary color (yellowish) to the OC (bluish). Our model correctly predicts these complementary perceived coloration effects (filling-in effect), **Figure 5A** (left star).

Again, in accordance with psychophysical findings, our model could also correctly predict the influence exerted by the location of the chromatic contours, as to whether the same or complementary filling-in color is perceived in the inner area (Pinna, 2006; Kimura and Kuroki, 2014a), **Figure 5A**.

Kimura and Kuroki (2014a) observed that the perceived colors were not necessarily the "same" as or "complementary" to the IC/OC, but could be a combination of the IC and OC colors, **Figures 5B,C** (left stars). In agreement, the model results (**Figure 5**II) indeed show that the perceived color is determined by a combination of the outer and the inner contours. In **Figure 5B** (left star), for example, the red IC contributes the same (red) color to the coloration effect, while the magenta OC contributes its complementary color (green). An additive combination of red and green colors yields a perceived yellowish coloration (Berns, 2000). These results are consistent with the model principles and equations [Filling-in process; Equation (10)], such that both the IC and OC contours contribute as triggers to the filling-in process. The model correctly predicts the general trend that has been shown in previously reported experimental results (Pinna and Reeves, 2006), where the perceived chromatic filling-in color was determined by the combined influence of the chromatic and achromatic edges.

#### **Comparison to psychophysical findings**

Furthermore, we confronted our model predictions with quantitative psychophysical results (Kimura and Kuroki, 2014b). In order to enable the comparison between the model predictions and the non-assimilative watercolor effect experiment results, we applied the same set of colors as described in the results of Kimura and Kuroki (2014b), as parameters to our model, see Methods.

The psychophysical experiments of Kimura and Kuroki (2014b) investigate both the assimilative and the non-assimilative effects as well as the role of intensity in the perceived effect. **Figure 6** presents the model predictions and the results of Kimura and Kuroki (2014b) on a large repertoire of stimuli.

**Figure 6** presents the predicted (A) and experimental (B) results for stimuli that share the same IC color within each sub-figure, while the experiment tested 8 different OC colors. The top row presents the results for the red IC color and the bottom row presents the results for the achromatic IC color, while the OC was presented with different chromatic colors. The left column presents the results when the IC has a higher luminance level than the OC and the right column presents the results when the IC has a lower luminance level than the OC.

The stimuli with higher luminance of the red IC (**Figure 6B**) yielded perceived colors that ranged from red to orange. This trend therefore shows an assimilative reddish color effect. The predicted result (**Figure 6A**) shows assimilative effects close to the red line. However, the predicted color is more reddish than the orange found in the experimental results (**Figure 6B**). The stimuli with lower luminance of the red IC (**Figure 6B**) yielded an oval shape adjacent to the -S line. Our result also predicts an oval shape, but the shape is adjacent to the L line. This difference is addressed in the Discussion. The stimuli with higher luminance of the achromatic IC yielded a small magnitude of the perceived effects, in both the experimental (**Figure 6B**) and the predicted (**Figure 6A**) results. However, in the experimental results the effects tend to be slightly yellowish, while in the predicted results the effect is almost invisible (no filling-in effect). The stimuli with lower luminance of the achromatic IC also yielded a yellowish perceived color in the experimental results. In the predicted result, the predicted colors are the complementary colors of the OC. It has to be noted that the achromatic configuration of the experiment was also tested in additional studies, such as Pinna (2006) and Hazenberg and van Lier (2013), and their trends are in better agreement with the prediction of the model (**Figure 6A**), see Discussion.

#### The Role of the Luminance Contrast Between the IC and the OC

Having discussed the model's predictions for highly saturated stimuli from the literature with different variations of chromatic properties (**Figures 3**, **5**), we then tested the model's predictions for stimuli with different luminance as well as different chromatic properties. Devinck et al. (2005) and Pinna et al. (2001) showed that the magnitude of the filling-in effect increases with increasing luminance contrast between the relevant contours. Our model predicts this effect of luminance contrast between the IC and OC. **Figure 7** presents the model predictions for a "switching" effect (non-assimilative: **Figure 7**I vs. assimilative: **Figure 7**II) whereby the luminance contrast determines whether the perceived effect will be assimilative or non-assimilative (Kimura and Kuroki, 2014a). Even though the IC color in both stars is reddish and the OC color bluish, the predicted colors are different (pale yellowish in the left star and pale reddish in the right star), **Figure 7**. It should be noted that in this case, the model's prediction is in agreement with the experimental results of Kimura and Kuroki (2014a), which showed that the luminance condition suitable for the non-assimilative color spreading is the reverse (in their Weber contrast) of that for the assimilative color spreading. We argue that these experimental findings (Kimura and Kuroki, 2014a) shed new light on the common assumption in the literature that assimilative and non-assimilative effects are different effects derived from different mechanisms (Kimura and Kuroki, 2014a,b). This topic will be discussed in more detail in the Discussion.

FIGURE 6 | Comparison between the predictions of the model and the psychophysical findings for the assimilative and non-assimilative effects. The predictions of the model (A) and the chromatic cancelation data (B) were obtained for 8 different colors of the OC, similarly to Figure 4 of Kimura and Kuroki (2014b). The top row (A,B) presents the experimental (B) and the predicted (A) results for stimuli with a red IC. The bottom row presents the experimental (B) and the predicted (A) results for stimuli with an achromatic IC and the 8 different colors for the OC. In the left column of each subfigure (A,B) the luminance of the IC is higher than the luminance of the OC. In the right column of each subfigure (A,B) the luminance of the IC is lower than the luminance of the OC, as in Kimura and Kuroki (2014b).

level (dark red), while the OC has a high level of intensity. The predicted color is yellowish (I right), thus the perceived effect is a non-assimilative effect. In stimulus II, the IC has a high luminance level, while the OC has a low intensity (dark blue). The predicted color is reddish (II right), thus the perceived result is due to an assimilative effect.

An additional important finding relates to the claim that only the assimilative type of watercolor effect is possible when the IC and the OC have the same level of luminance (Devinck et al., 2005). Accordingly, our model predicts that the assimilative effect should be perceived under such iso-luminance conditions and also predicts that the effect will be weaker than when the IC and the OC have different luminance values.

#### The Role of the Luminance Contrast Between the Background and the Contour

Several experimental studies that tested the role of background luminance on the perceived watercolor effect (Devinck et al., 2005; Cao et al., 2011; Kimura and Kuroki, 2014a) reported that the luminance contrast between the IC and the background, and between the OC and the background have a significant influence on the perceived effect.

**Figure 8A** presents the model's predictions for the same stimuli used by Kimura and Kuroki (2014a), indicating that when the background is white (high luminance), the perceived color is yellowish. In contrast, when the background is darker (low luminance, **Figures 8A,B**), there is a tendency to a more greenish perceived color. This is because a change in the luminance of the background produces a change in the contrast between the contours (IC and OC) and the background, which in turn influences the perceived effect. Importantly, the changes in perceived color predicted by the model were in accordance with the experimental results (Kimura and Kuroki, 2014a).

greenish when the background is darker.

**Figure 8B** demonstrates that there are three options for luminance contrast that play a role in the watercolor effect. The first one is the contrast between the IC and the OC, the second, the contrast between the IC and the background, and the third one is the contrast between the OC and the background. In **Figure 8B** the luminance of the IC is lower than in **Figure 8A**. As a result, the perceived filling-in color appears greenish in the stimulus with the white background (high background luminance). In contrast, the perceived filling-in color in **Figure 8A** appears yellowish. These perceived coloration effects were intensified in the model's simulation (**Figure 6** right) and support the suggestion that both the background and the luminance ratio between the IC and the OC contribute to the perceived effect. These predictions are in agreement with the psychophysical findings of Kimura and Kuroki (2014a).

### DISCUSSION

We present here a generic computational model that describes the mechanisms of the visual system that activate the creation of chromatic surfaces from chromatic and achromatic edges. Our hypothesis was that these mechanisms can be revealed through a study of visual phenomena and illusions, such as the assimilative and non-assimilative watercolor effect and the Craik–O'Brien–Cornsweet (COC) illusions. The suggested model can be divided into two stages (or components). The first component determines the dominancy of the edges that trigger a diffusive filling-in process. The second component performs the diffusive filling-in process, in which the diffusion is triggered by heat sources located at the edges. This process is modeled by the Poisson equation. The diffusion process is actually the same mechanism described for the afterimage effect (Cohen-Duwek and Spitzer, 2018).
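To make the idea of edge-triggered diffusion concrete, the following is a minimal one-dimensional sketch and not the authors' implementation. It assumes a single point "heat source" at a hypothetical edge location, zero values at both ends of the domain, and a plain Jacobi relaxation of the discrete Poisson equation; the function name `diffuse_from_sources` and all numerical values are illustrative assumptions.

```python
def diffuse_from_sources(sources, iterations=5000):
    """Jacobi relaxation of the discrete 1-D Poisson equation
    u[i-1] - 2*u[i] + u[i+1] = -sources[i], with u = 0 at both ends.

    `sources` plays the role of the "heat sources" triggered by edges
    (an illustrative stand-in, not the model's actual source term).
    """
    n = len(sources)
    u = [0.0] * n
    for _ in range(iterations):
        new_u = u[:]  # boundaries stay at zero
        for i in range(1, n - 1):
            new_u[i] = 0.5 * (u[i - 1] + u[i + 1] + sources[i])
        u = new_u
    return u


n = 41
sources = [0.0] * n
sources[20] = 1.0  # hypothetical single heat source at an edge location

filled = diffuse_from_sources(sources)
print(round(filled[20], 3), round(filled[10], 3))
```

In this toy setting the response peaks at the source and falls off linearly toward the boundaries, so a narrow edge signal influences the whole region rather than only its immediate neighborhood.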

In order to test the hypothesis, we developed a computational model that is able to predict both the assimilative and the non-assimilative watercolor effects. The model predictions, which are supported by psychophysical experiments (Pinna et al., 2001; Devinck et al., 2005, 2006; Pinna and Grossberg, 2005; Pinna and Reeves, 2006; Cao et al., 2011; Coia and Crognale, 2014; Kimura and Kuroki, 2014a,b), argue that both the assimilative and non-assimilative watercolor effects are derived from the same visual mechanism. In addition, the model can successfully predict quantitatively and qualitatively the psychophysical results reported by many researchers, such as the influence of the background luminance, contour intensities, contour saturations, and the relationship between them (Pinna et al., 2001; Devinck et al., 2005, 2006; Pinna and Grossberg, 2005; Pinna and Reeves, 2006; Cao et al., 2011; Coia and Crognale, 2014; Kimura and Kuroki, 2014a,b).

#### Comparison to Other Models

The only computational model in the literature that is relevant to the watercolor effects is the FACADE model (Pinna and Grossberg, 2005). In a more recent publication of Pinna and Grossberg (2005), the FACADE model was challenged by testing several stimulus parameters acting in the watercolor effect, such as the role of the contrast between the IC and the OC, the role of the background luminance, and different shape variations of the stimulus. While the FACADE model could predict the results of the stimuli on the assimilative watercolor effect, it was not designed to, and indeed was unable to, predict the non-assimilative watercolor effect and its properties.

The FACADE model comprises two components. The first component, the BCS, detects the borders that block the diffusion process. The second component, the FCS, spreads the color to all directions until it is blocked by edges. The FACADE model is unable to predict the non-assimilative effect first because the spread of color is derived from the chromatic surface itself, and there is no mechanism that creates complementary colors. A second reason is that the border, which is detected by the BCS, prevents the OC color of the watercolor effect from spreading inside the inner area of the stimulus.

The ability of the FACADE model to predict only the assimilative effects (Pinna and Grossberg, 2005; Pinna, 2006; Cao et al., 2011; Kimura and Kuroki, 2014a,b) has contributed significantly to the general consensus in the literature that the assimilative and non-assimilative effects are derived from different mechanisms. In contrast, Kimura and Kuroki (2014a) found strong psychophysical evidence that assimilative and non-assimilative effects both share the same Weber contrast rule under specific psychophysical constraints. However, despite these Weber rules, they concluded that the effects might still involve different mechanisms.

Unlike FACADE, two factors allow our model to predict the non-assimilative watercolor effect. First, each edge in the stimulus triggers a diffusion process. Therefore, each edge contributes to the achromatic areas, i.e., the inner area and the outer area. The color adjacent to the achromatic area contributes its own color to this area, i.e., it triggers a diffusion process of the same color, while the color on the other side of the edge contributes its complementary color to the same area. In other words, the color on the outer side of an edge triggers a diffusion process of its complementary color. The reason why the complementary color is obtained from the model is explained in the Model section. The exact colors that will be spread are calculated from the responses of the double-opponent RFs, Equations (8–10). The resultant colors are therefore not necessarily exactly the "same" as or "complementary" to the IC/OC, but rather a linear combination of the colors of the IC and the OC. Second, the model assumes that the main role of the contours is to trigger the diffusion process as "heat sources" (Equation 10), and not primarily to block the diffusion process.
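The linear-combination idea can be illustrated with a deliberately simplified sketch. It is not the model's double-opponent computation: it assumes normalized RGB triplets, takes the complement of a color as one minus each channel, and uses equal, hypothetical weights for the two contours. The names `complement` and `combined_fill` are illustrative.

```python
def complement(rgb):
    """Complementary color in normalized RGB (a simplifying assumption)."""
    return tuple(1.0 - c for c in rgb)


def combined_fill(ic_rgb, oc_rgb, w_ic=0.5, w_oc=0.5):
    """Weighted sum of the IC color and the complement of the OC color.

    The weights are hypothetical; in the paper the contributions come
    from the double-opponent receptive-field responses instead.
    """
    oc_comp = complement(oc_rgb)
    return tuple(w_ic * a + w_oc * b for a, b in zip(ic_rgb, oc_comp))


red_ic = (1.0, 0.0, 0.0)
magenta_oc = (1.0, 0.0, 1.0)

fill = combined_fill(red_ic, magenta_oc)
print(fill)  # equal red and green, no blue: a yellowish mixture
```

With a red IC and a magenta OC, the complement of the OC is green, and the mixture has equal red and green components, which matches the yellowish coloration described for **Figure 5B**.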

It could be claimed that additional computational models that have been suggested for edge integration should be regarded here as competitors, which can explain this filling-in mechanism of chromatic and achromatic surfaces. Rudd (2014) summarized and discussed several computational models designed to perform the edge integration function in the visual system. He argued against the idea that the filling-in effect results from the activation of a low visual spatial frequency channel, due to the fact that the spatial extent of the filling-in effect is far larger than the area or distance spanned by the lowest spatial frequency filters in human vision (about 0.5 cycle/degree) (Wilson and Gelb, 1984). It should be noted that the watercolor effect has been shown to spread over 45° (Pinna et al., 2001), a spatial range that is not consistent with a low spatial frequency channel of the visual system.

Although Rudd (2014) also argued against the diffusive filling-in mechanism, we believe that his justification was based on the specific diffusive FACADE model suggested by Grossberg and his colleagues (Grossberg and Mingolla, 1987; Grossberg, 1997; Pinna and Grossberg, 2005). According to FACADE, the chromatic edges function as borders to block the diffusive process. If the watercolor stimulus is open (unclosed boundaries), the FACADE model predicts that the color would leak from the open ends, which, in reality, does not occur. In contrast, our diffusive computational model does not fail in such a case. **Figure 9** demonstrates that our model successfully predicts this effect, because the edges in our model are used as triggers, Equation (10), rather than borders for diffusion.

Rudd (2014) suggested a qualitative "Edge integration" model, through long range receptive fields in area V4 (Roe et al., 2012). Rudd suggested that lightness and darkness "edge integration" cells in V4 could integrate the responses of V1 simple receptive fields with a light or dark direction toward the center of the V4 receptive field. An additional neuron in the higher level of the visual pathway hierarchy then integrates these receptive fields, and performs a subtraction operation between the lightness and the darkness "edge integration" receptive fields. This model qualitatively predicts specific induction effects [Figures 2, 9 in Rudd (2014)] but fails to predict classic filling-in effects, such as the watercolor illusion, that manifest filling-in in all directions and over very wide spatial regions.

Since Rudd (2014) related the induction effects to filling-in phenomena, he supplied an additional argument against the diffusive filling-in model, which is based on the model of Grossberg (Grossberg and Mingolla, 1987; Grossberg, 1997; Pinna and Grossberg, 2005). This argument is related to the FACADE model's failure to predict the specific induction effects, [Figure 2 in Rudd (2014)] and **Figure 9**.

There is currently a disagreement in the literature as to whether these specific induction effects are the result of a filling-in mechanism, an adaptation mechanism of the first order (Spitzer and Barkan, 2005), or a local (or remote) contrast mechanism (Blakeslee and McCourt, 1999, 2001, 2003, 2008). We argue that a visual effect may not necessarily be determined by a single dominant mechanism, and that several mechanisms could be involved. Different mechanisms could give rise to contradicting effects on one hand, or alternatively could work in synergy to enhance the perceived effect. An interesting question is whether this induction effect can also be predicted by our proposed model. **Figure 10** demonstrates that our filling-in model can predict the first order variation of the specific induction effect, [Figure 2 in (Rudd, 2014)]. Since this effect is predicted by our filling-in model, and also by an adaptation of the first order model (Spitzer and Barkan, 2005), we believe that the induction effect can be attributed to both mechanisms.

Experimental results show that the size of the inducer areas and the size of the induced area play a crucial role in the perceived induction effect (Shevell and Wei, 1998). The suggested filling-in model is based on edges that trigger a diffusion process, therefore the size of the induced area and the size of the inducer area do not play a role in our filling-in model. However, these two spatial factors do play a major role in the adaptation of the first order mechanism (Spitzer and Barkan, 2005).

We believe that there is a certain confusion in the literature regarding the source and the mechanisms of the induction and the filling-in effects. Kingdom (2011), for example, argued in his review that: ". . . 'filling-in' of uniform regions is mediated by neural spreading has been seriously challenged by two sets of findings: 1. That brightness induction is near-instantaneous and 2. That the Craik–Cornsweet–O'Brien illusion is dependent on the presence of residual low-frequency information and is not disrupted by the addition of luminance noise. 'Filling-in' should at best therefore be considered as a metaphor for the representation. . . ". We argue that these claims are problematic, based on different psychophysical results (Pinna et al., 2001), and also query the feasibility of a mechanism that is based on spatial filtering.

Kingdom (2011) assumed that these two effects of induction and other filling-in effects (the COC effect) derive from the same mechanism. For this reason, he argued against a diffusive filling-in mechanism, since a diffusive process requires more time. Kingdom (2011) also based his arguments on the findings reported by Blakeslee and McCourt (2008) that the temporal response of the induction effect (simultaneous contrast) lagged by <1 ms. In contrast, Pinna et al. (2001) found that the temporal response of the watercolor effect is about 100 ms. We believe that there is no contradiction between the two temporal results (Pinna et al., 2001; Blakeslee and McCourt, 2008), since they are associated with two different mechanisms, namely induction and the diffusive filling-in process. The first mechanism (induction of the first order) (Spitzer and Semo, 2002; Spitzer and Barkan, 2005; Tsofe et al., 2009; Kingdom, 2011) occurs in early visual areas, such as the retina, while the second mechanism (COC or watercolor, diffusive filling-in) occurs in a higher visual area. In addition, the spatial filling-in spread of 45° reported for the watercolor illusion cannot be explained by any receptive field or low-spatial frequency channel of the visual system (Rudd, 2014).

In this context, we contend that positive and negative aftereffects (such as in "color dove illusion" and the "stars" illusion) (van Lier et al., 2009; Barkan and Spitzer, 2017), are perceived as a result of a diffusive filling-in process that cannot be explained by any spatial filtering mechanism. The reasons for this are: (1) The perceived color is obtained in an area that has not been stimulated by any color, at the time that the color is perceived [aftereffect with filling-in as in the "color dove illusion" and Van Lier "stars" (van Lier et al., 2009; Barkan and Spitzer, 2017)]. (2) The location of the achromatic reminder contour determines and triggers the perceived color. The filling-in model proposed here shares the same diffusion component, Equation (10), as suggested for the positive and the negative aftereffects (Cohen-Duwek and Spitzer, 2018). Although Kingdom (2011) supported the description of the filling-in and induction events by the filter models of Blakeslee and McCourt (2008), their model cannot predict the assimilative and the non-assimilative watercolor effects, or the aftereffects.

#### Predictions for Watercolor Properties

Having discussed the options of various alternative models for the "filling-in" phenomena, we were interested to test our model's predictions with studies that define general properties and rules for the watercolor effect, although without a computational model (Kimura and Kuroki, 2014a). We have already described the success of our model in correctly predicting experimental results (Kimura and Kuroki, 2014a) demonstrating crucial properties regarding the strength of the watercolor effect and its relation to the assimilative and non-assimilative effects. We explain below how the basic structure of the suggested model can explain these findings, without requiring any additional components.

**Complementary colors**: Several studies have demonstrated that a maximal filling-in response is perceived when the IC and the OC have complementary colors (Pinna et al., 2001; Devinck et al., 2006) and it should be noted that the model correctly predicts this trend, **Figure 6**. This can be explained by the model equations (Equations 3–10), through solving the Poisson equation. The IC triggers an assimilative filling-in (of the same color as the IC) toward the inner area, while the OC triggers a non-assimilative filling-in, with the opposite color to the IC contour (**Figure 2**, i.e., its complementary color), toward the inner area. According to the model, if the color of the OC is complementary to the color of the IC, the combination of colors that diffuse to the inner area will be the same as the color of the IC (assimilative color) and complementary to the color of the OC, which makes it the same color as the IC again. Consequently, the perceived color is enhanced.

**Luminance contrast:** Several studies have reported that the magnitude of the filling-in effect increases with increasing luminance contrast between the IC and OC contours (Pinna et al., 2001; Devinck et al., 2005; Devinck and Knoblauch, 2012). This property of the luminance contrast is treated similarly to the chromatic channels. The weights of the modified gradients calculation, Equations (7–8), give greater dominancy to the gradients between the IC and the OC. It is therefore not surprising that the model correctly predicts the importance of the luminance contrast between the IC and the OC in the watercolor effect.

**Saturation:** Devinck et al. (2006) showed that increasing the saturation of the outer and inner contours increases the shift in chromaticity of the filling-in effect. This information is included in the model through the chromatic opponent channel, Equation (3). Higher color saturation is expressed as a higher response in the chromatic opponent channels. This property has been tested and the model predictions show good agreement with the results of experimental studies.

**Weber rule – IC contrast/OC contrast:** Kimura and Kuroki (2014a) reported that the ratio between the IC luminance contrast and the OC luminance contrast determines the perceived filling-in effect, **Figure 8**. The IC contrast is the Weber contrast of the chromatic IC luminance and the background luminance, while the OC contrast is the Weber contrast of the chromatic OC luminance and the background luminance, Equation (21). Note that since the background is achromatic, this Weber contrast is related only to the luminance domain. Kimura and Kuroki (2014a) argued that if the IC contrast is smaller than the OC contrast, an assimilative effect is perceived, Equation (21). In contrast, if the IC contrast is larger than the OC contrast, a non-assimilative effect is perceived, Equation (21).

$$\begin{aligned}
\frac{\left|L_{IC} - L_{Bkg}\right|}{L_{Bkg}} &< \frac{\left|L_{OC} - L_{Bkg}\right|}{L_{Bkg}} \to \text{assimilative effect} \\
\frac{\left|L_{IC} - L_{Bkg}\right|}{L_{Bkg}} &> \frac{\left|L_{OC} - L_{Bkg}\right|}{L_{Bkg}} \to \text{non-assimilative effect}
\end{aligned} \tag{21}$$

where $L_{IC}$, $L_{OC}$, and $L_{Bkg}$ are the luminances of the IC, OC, and the background, respectively.
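The rule in Equation (21) can be written as a short sketch. The function names and the luminance values below are hypothetical and chosen only for illustration; the case of equal contrasts is not covered by Equation (21), so the sketch labels it as undetermined rather than guessing.

```python
def weber_contrast(l_contour, l_background):
    """Weber contrast of a contour against the background luminance."""
    return abs(l_contour - l_background) / l_background


def predicted_effect(l_ic, l_oc, l_bkg):
    """Classify the effect according to the Weber contrast rule of Equation (21)."""
    c_ic = weber_contrast(l_ic, l_bkg)
    c_oc = weber_contrast(l_oc, l_bkg)
    if c_ic < c_oc:
        return "assimilative"
    if c_ic > c_oc:
        return "non-assimilative"
    return "undetermined"  # equality is not addressed by Equation (21)


# Hypothetical luminances (arbitrary units) on a bright background.
print(predicted_effect(l_ic=40.0, l_oc=10.0, l_bkg=80.0))  # IC contrast 0.5 < OC contrast 0.875
print(predicted_effect(l_ic=10.0, l_oc=40.0, l_bkg=80.0))  # IC contrast 0.875 > OC contrast 0.5
```

Swapping the IC and OC luminances reverses the classification, which is the "switching" behavior attributed to the rule in the text.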

Our model was tested with a variety of stimuli with different luminance backgrounds, different chromatic contours (**Figures 8A,B**), and different Weber ratios. **Figure 8** demonstrates the predictions of the Weber contrast rule with non-assimilative effect. Additional stimuli were tested, but showed a smaller perceived effect. Interestingly, the Weber contrast rule and the predictions of our model do not necessarily always yield the exact assimilative or non-assimilative colors, but rather a different color as found experimentally (Kimura and Kuroki, 2014a). For example, the stimuli in **Figures 8A,B** have the same colors (red and magenta), but because the IC in **Figure 8A** has a higher luminance than the IC in **Figure 8B**, this gives rise to a yellowish color in **Figure 8A** but a greenish color in **Figure 8B**. Note that despite the difference in luminance levels, both effects share the same trend of Weber contrast rule, and thus both appear as non-assimilative effects. The model's predictions are in agreement with the Weber contrast rules (Kimura and Kuroki, 2014a), **Figure 8**. This demonstrates that both the model and the Weber contrast rule can predict in which contrast configuration the perceived effect is assimilative or non-assimilative.

Let us explain how our model can predict this Weber contrast rule. If an IC has a high value of Weber contrast, the "heat source" located on the edge between the IC and the background has the highest value and the diffusion process from this edge has a strong influence on the perceived color. Accordingly, the color spreading from this "heat source" (the edge between the IC and the background) to the inner area has the same color as the color of the background (white, **Figure 8A**), and the complementary color of the IC (cyan, the complementary color of the red IC), **Figure 8A**. The cyan color, which is a combination of green and blue, contributes to this bluish-greenish perceived effect (**Figure 8B**).

We were interested in whether the Weber contrast rule is applicable to the achromatic watercolor stimuli. Cao et al. (2011) conducted a psychophysical study in order to investigate the influence of the luminances of the IC, OC, and the background on the perceived achromatic watercolor effect. They found that the filling-in effect disappeared when the luminance of the OC was between the luminances of the IC and the background. Kimura and Kuroki (2014a) reported that the findings of Cao et al. (2011) are consistent with their psychophysical findings, and also with their suggestion for the role of the Weber contrast rule. The prediction of our model (**Figure 11**) is also in agreement with the Kimura and Kuroki (2014a) findings. In **Figure 11**, the luminance of the OC lies between that of the IC and the background. In terms of the Weber contrast rule, the Weber contrast of the OC is smaller than that of the IC. Therefore, such a configuration should lead to a non-assimilative perceived effect. However, since the perceived color inside the star is darker than the background (**Figure 8**I), this might be seen as a diffusive effect of the IC color ("assimilative" effect), which is black. According to our model, the perceived color is a combination of the same color as the IC (black) and the complementary color of the OC (gray, which is the complement of gray); therefore, the model correctly predicts this effect. Accordingly, "assimilative" and "non-assimilative" are not precise terms for the perceived colors of the achromatic watercolor stimuli. It should be noted that there might be a dependency of the perceived effect on the stimulus size. This property should be further investigated experimentally.

Not all experimental studies agree about the perceived color in the non-assimilative watercolor effect (Pinna, 2006; Kimura and Kuroki, 2014b). Kimura and Kuroki (2014b), for example, claim that if the luminance of the IC is low (very dark IC), the perceived filling-in effect is predominantly yellow, regardless of the OC color. They reported this finding to be inconsistent with previous results reported by Pinna (2006), which described a complementary color filling-in effect with black IC and chromatic OC combinations. An additional experimental study supports the results of Pinna (2006) and the idea that complementary colors are perceived when the IC color is dark (Hazenberg and van Lier, 2013). The model results predict that the perceived colors are predominantly complementary to the OC colors when the IC is dark. Even though the predicted results, **Figure 6**, are predominantly complementary to the OC colors, when the IC color is dark red the predicted colors are slightly shifted to the red IC color. When the IC is achromatic, the predicted colors, **Figure 6**, are the complementary colors to the OC colors.

Our model, thus, supports the findings of Pinna (2006) and Hazenberg and van Lier (2013), **Figure 6**, and is not in agreement with Kimura and Kuroki (2014b) because the chromatic OC triggers a filling-in effect that is complementary to the inner area, and therefore the perceived color will be complementary to the OC (the IC is achromatic and so does not contribute any color to the effect).

#### Model's Predictions for the COC Effect

Although our model is mainly concerned with the predictions of the watercolor illusions, there are a number of other examples of filling-in effects, including the COC effect. We believe that the COC effect is driven solely by a diffusion mechanism, since the physical stimulus in this effect is only an edge. The model prediction for the COC effect, which is demonstrated in **Figure 12**, uses the same set of parameters as the watercolor illusions (**Figures 3**, **5**, **7**–**9**, **11**). Our suggestion that both phenomena (watercolor and COC) are related to the same visual mechanism is in agreement with Devinck et al. (2005), Todorović (2006), and Cao et al. (2011), who showed that the watercolor stimulus profile is a discrete version of the COC stimulus profile. The success of the model prediction of the COC effect supports the suggestion that both effects (which are physically built only from edges) share the same "heat sources" diffusion mechanism, which is triggered by edges. The COC effect can actually be considered as a simpler case of the diffusive filling-in effect than the watercolor effects.

There are three main classes of computational models that have been used to investigate the COC effect. The first class is called the "Diffusive models" (Grossberg and Mingolla, 1987). Grossberg and Mingolla (1987) showed that the FACADE model can correctly predict the COC effect. Nevertheless, the FACADE model, in this case, can predict the COC effect when the stimulus contains open boundaries, but only through using an additional component that detects illusory contours, **Figure 12**. The illusory contours component will detect the illusory edges around the COC stimulus (**Figure 12**), and will prevent the color from spreading. However, this component is not necessary for the watercolor illusion, which can contain open boundaries. **Figure 9** presents, for example, open boundaries, and it can be seen that there is no perceived effect of edge completion (illusory contour). It has to be noted that the suggested model does not include a component that detects illusory contours, and therefore our model does not predict filling-in effects that involve illusory contours e.g., "Neon Color Spreading." Our model suggests that the illusory contours components are not necessary for the watercolor mechanism.

The second class of models is termed the "Spatial filtering models"; these models use low-frequency spatial filters to predict the filling-in effects (Morrone et al., 1986; Burr, 1987; Morrone and Burr, 1988; Ross et al., 1989; Blakeslee and McCourt, 1999, 2001, 2003, 2004, 2005; Dakin and Bex, 2003; Blakeslee et al., 2005; Kingdom, 2011). We argue that the spatial filtering approach has limitations in predicting the COC effect because the filling-in can spread over distances that cannot be explained by the sizes of the receptive fields that exist in the LGN or in cortical areas V1–V2. In addition, the COC effect can be obtained from edges that are built only from ODOG (Oriented Difference of Gaussian) filters (Blakeslee and McCourt, 1999).

The third class of models is termed the "Empirical models." These models are designed to estimate the most likely reflectance values based on the pattern of the luminances observed in the image, together with learnt image statistics (Purves and Lotto, 2003; Brown and Friston, 2012). Typically, such an Empirical approach may explain why we perceive these visual effects, but cannot explain the neuronal mechanisms that lead to the perceived effects.

#### Neuronal Sources of the Filling-In Effect

Studies designed to identify the neuronal source of the filling-in effects that are triggered by edges, e.g., the watercolor and the COC effects, can shed additional light on the possible neuronal mechanisms. A recent fMRI study (Hong and Tong, 2017) compared the responses of the visual areas (V1–V4) to real colored surfaces and to illusory filled-in surfaces, such as occur in the afterimage effect of van Lier "stars" (van Lier et al., 2009). Hong and Tong (2017) found a high correlation between the responses to the two types of stimuli, the real and the illusory, only in areas V3 and V4. They therefore concluded that the perception of filled-in surface color occurs in the higher areas of the visual cortex.

Rudd (2014) suggested an "edge integration" model that works through long-range receptive fields in area V4 (Roe et al., 2012). Both the qualitative model (Rudd, 2014) and the experiments of Hong and Tong (2017) support the idea that the source of the filling-in mechanism is located in V4. It should be noted that our computational model can be regarded as such a diffusion process, but it also does not contradict a mechanism of edge integration that can be derived from long-range receptive fields (Rudd, 2014). This "edge integration" mechanism can also be symbolic and appear as a diffusion process.

As already discussed, we argue that both the watercolor effect and the COC effect share the same visual mechanism; therefore, we would expect to identify a similar neuronal source for both effects. A literature survey of experimental studies that investigated these sources revealed a lack of consensus regarding the neuronal source of the COC effect. A few studies reported that the effect occurs in low visual areas: the LGN, V1 and V2 (MacEvoy and Paradiso, 2001; Roe et al., 2005; Cornelissen et al., 2006; Huang and Paradiso, 2008), while other studies showed evidence that the effect occurs in higher areas of the visual system such as the V3 and caudal intraparietal sulcus (Perna et al., 2005). It is possible that there is no complete overlap between the cortical areas responsible for the COC effect and the watercolor effect, since the watercolor effect commonly involves color, while the COC effect involves achromatic stimuli.

Our model succeeds in predicting apparently conflicting filling-in phenomena that are triggered by edges, e.g., the assimilative and the non-assimilative watercolor effects. The suggested mechanism is a filling-in process based on the reconstruction of an image from its modified edges. The diffusion process is thus calculated by solving the heat equation with heat sources (Poisson equation). The edges of the trigger stimulus are modified by the model according to rules of dominancy and are used as the heat sources in the Poisson equation. We therefore suggest that this model can predict all the filling-in-triggered-by-edges effects in both the spatial and temporal domains (Cohen-Duwek and Spitzer, 2018).
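The idea of reconstructing a surface from edge-located heat sources can be illustrated with a minimal one-dimensional sketch. It is not the model's implementation: the function names `laplacian_1d` and `fill_in_1d`, the zero boundary values and the 1-D discretization are assumptions made only for illustration. The sketch solves the discrete Poisson equation, in which the source term is non-zero only at the edges, with the Thomas algorithm for tridiagonal systems.

```python
def laplacian_1d(u):
    """Discrete second derivative of u, assuming zero values outside the signal."""
    n = len(u)
    out = []
    for i in range(n):
        left = u[i - 1] if i > 0 else 0.0
        right = u[i + 1] if i < n - 1 else 0.0
        out.append(left - 2.0 * u[i] + right)
    return out


def fill_in_1d(sources):
    """Solve u[i-1] - 2*u[i] + u[i+1] = sources[i] with u = 0 outside the signal.

    Thomas algorithm for a tridiagonal system with sub-diagonal 1,
    diagonal -2 and super-diagonal 1 (hypothetical illustrative helper).
    """
    n = len(sources)
    c_prime = [0.0] * n
    d_prime = [0.0] * n
    c_prime[0] = 1.0 / -2.0
    d_prime[0] = sources[0] / -2.0
    for i in range(1, n):                      # forward elimination
        denom = -2.0 - c_prime[i - 1]
        c_prime[i] = 1.0 / denom
        d_prime[i] = (sources[i] - d_prime[i - 1]) / denom
    u = [0.0] * n
    u[-1] = d_prime[-1]
    for i in range(n - 2, -1, -1):             # back substitution
        u[i] = d_prime[i] - c_prime[i] * u[i + 1]
    return u


# A plateau produces sources only next to its two edges ...
plateau = [0.0, 0.0, 1.0, 1.0, 1.0, 0.0, 0.0]
edge_sources = laplacian_1d(plateau)
# ... and solving the Poisson equation fills the interior back in.
filled = fill_in_1d(edge_sources)
```

In this toy case the interior sample of the plateau carries no source at all, yet it is recovered from the edge sources alone, which is the sense in which the diffusion "fills in" the surface.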


The challenge of "The interaction of the mechanisms underlying boundary and surface perception is an essential problem for vision scientists" has been presented previously (Cao et al., 2011). Here we introduce a new computational model that describes and predicts how any boundary can "create" surfaces by a filling-in process.

#### DATA AVAILABILITY

All datasets generated for this study are included in the manuscript and/or the supplementary files.

#### AUTHOR CONTRIBUTIONS

All authors listed have made a substantial, direct and intellectual contribution to the work, and approved it for publication.



**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2019 Cohen-Duwek and Spitzer. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# A Cross-Recurrence Analysis of the Pupil Size Fluctuations in Steady Scotopic Conditions

#### Pietro Piu<sup>1</sup>, Valeria Serchi<sup>1</sup>, Francesca Rosini<sup>1,2</sup> and Alessandra Rufa<sup>1,2</sup>\*

<sup>1</sup> Eye Tracking and Visual Application Lab, Department of Medicine, Surgery and Neuroscience, University of Siena, Siena, Italy, <sup>2</sup> Neurology and Neurometabolic Unit, Department of Medicine, Surgery and Neuroscience, University of Siena, Siena, Italy

Pupil size fluctuations during stationary scotopic conditions may convey information about the cortical state activity at rest. An important link between neuronal network state modulation and pupil fluctuations is the cholinergic and noradrenergic neuromodulatory tone, which is active at cortical level and in the peripheral terminals of the autonomic nervous system (ANS). This work aimed at studying the low- and high-frequency coupled oscillators in the autonomic spectrum (0–0.45 Hz) which, reportedly, drive the spontaneous pupillary fluctuations. To assess the interaction between the oscillators, we focused on the patterns of their trajectories in the phase-space. Firstly, the frequency spectrum of the pupil signal was determined by empirical mode decomposition. Secondly, cross-recurrence quantification analysis was used to unfold the non-linear dynamics. The global and local patterns of recurrence of the trajectories were estimated by two parameters: determinism and entropy. An elliptic region in the entropy-determinism plane (95% prediction area) yielded health-related values of entropy and determinism. We hypothesize that the data points inside the ellipse would likely represent balanced activity in the ANS. Interestingly, the Epworth Sleepiness Scale scores scaled up along with the entropy and determinism parameters. Although other non-linear methods like Short Time Fourier Transform and wavelets are usually applied for analyzing the pupillary oscillations, they rely on strong assumptions like the stationarity of the signal or the a priori knowledge of the shape of the single basis wave. Instead, the cross-recurrence analysis of the non-linear dynamics of the pupil size oscillations is an adaptable diagnostic tool for identifying the different weight of the autonomic nervous system components in the modulation of pupil size changes at rest in non-luminance conditions.

Keywords: pupil diameter, cross-recurrence quantification analysis, empirical mode decomposition, Epworth Sleepiness Scale, Gaussian-copula

#### Edited by:

Xavier Otazu, Autonomous University of Barcelona, Spain

#### Reviewed by:

Pablo De Gracia, Midwestern University, United States Miriam Schwalm, Johannes Gutenberg University Mainz, Germany

> \*Correspondence: Alessandra Rufa rufa@unisi.it

#### Specialty section:

This article was submitted to Perception Science, a section of the journal Frontiers in Neuroscience

Received: 22 September 2018 Accepted: 10 April 2019 Published: 30 April 2019

#### Citation:

Piu P, Serchi V, Rosini F and Rufa A (2019) A Cross-Recurrence Analysis of the Pupil Size Fluctuations in Steady Scotopic Conditions. Front. Neurosci. 13:407. doi: 10.3389/fnins.2019.00407


**Abbreviations:** ANS, autonomic nervous system; CRQA, cross-recurrence quantification analysis; DET, percentage of determinism calculated from the cross-recurrence analysis; EmbDim, embedding dimension, a hyper-parameter to be estimated for the CRQA; EMD, empirical mode decomposition; ENT, entropy calculated from the cross-recurrence analysis; ESS, Epworth Sleepiness Scale; FAN, fixed amount of nearest neighbor; HF, high-frequency component in the ANS spectrum; IMFs, intrinsic mode functions extracted through the EMD; LF, low-frequency component in the ANS spectrum; MUSIC, multiple signal classification algorithm; R, neighborhood radius, a hyper-parameter to be estimated for the CRQA; RQA, recurrence quantification analysis; TD, time delay, a hyper-parameter to be estimated for the CRQA.

#### INTRODUCTION

fnins-13-00407 April 26, 2019 Time: 14:50 # 2

The pupil controls the amount of light reaching the retina by modulating its diameter through the interaction of two muscles under sympathetic-parasympathetic control. Pupil constriction is regulated by the contraction of the iris sphincter muscle, which receives parasympathetic innervation mainly through cholinergic fibers. Pupil dilatation is instead related to the contraction of the radial muscle of the iris, under sympathetic control (Loewenfeld and Lowenstein, 1993). Due to the well-known neuroanatomical substrate, the clinical examination of the pupillary light reflex is considered an indicator of optic nerve conduction, brainstem integrity, vigilance and coma. In recent years, studies in rodents and non-human primates found a tight coupling between pupil size and cortical state even during quiet wakefulness, suggesting a non-luminance-mediated system for pupil size variations, associated with neural network oscillations. Studies combining electrophysiology, optical imaging and neural network modeling indicated that the link between brain state activity and pupil size is related to the neuro-modulatory effect of the noradrenergic and cholinergic systems (Murphy et al., 2014; Costa and Rudebeck, 2016; Joshi et al., 2016; Eckstein et al., 2017). In this respect, a direct relationship between pupil size and moment-to-moment fluctuations in the activity of noradrenergic neurons of the brainstem locus coeruleus (LC) has been verified (Aston-Jones and Cohen, 2005; Nassar et al., 2012). Other forebrain nuclei and cortical areas connected to the LC are activated during spontaneous and event-driven pupil size changes (Wang and Munoz, 2015; Joshi et al., 2016), suggesting a circuit for pupil response linked to arousal, attention and perception systems (Jones, 2004; Naber et al., 2013; Wang and Munoz, 2015; Fazlali et al., 2016; Reimer et al., 2016; Larsen and Waters, 2018). Overall, these studies outline a new role for pupil size monitoring as a reliable and non-invasive peripheral marker of rapid brain state changes (Hartmann and Fischer, 2014; Schwalm and Jubal, 2017).

From a methodological point of view, a challenge in the analysis of pupil size variations is the identification of specific patterns that may be representative of changes in the cortical state activity. Different methods have been proposed to assess the spontaneous pupillary oscillations in isoluminant, non-accommodation-inducing conditions or in the dark (Lüdtke et al., 1998; Pong and Fuchs, 2000; Zénon et al., 2014; Zénon, 2017). According to the assumptions they make, we distinguish three groups: methods that assume stationarity and linearity, methods that assume non-linearity, and methods that assume both non-linearity and non-stationarity. Like other physiological non-stationary signals, under steady stimulation, the pupillary oscillatory signal is expected to show non-linear and chaotic patterns (Poon and Merrill, 1997; Morad et al., 2000; Wilhelm et al., 2001; Merritt et al., 2004; Muppidi et al., 2013; Regen et al., 2013). The non-linear methods assume that the dynamics of the pupil size follow the rules of deterministic chaos rather than a stochastic or linear process (Rosenberg and Kroll, 1999). Common non-linear methods for the analysis of pupillary oscillations include the Short Time Fourier Transform (Nowak et al., 2008) and wavelet transforms (Henson and Emuh, 2010; Nowak et al., 2013; Reiner and Gelfeld, 2014). These methods assume an underlying stationary signal or require a priori knowledge of the shape of the single basis wave, assumptions that do not reflect the pupillary dynamics well (Onorati et al., 2016). Among the most recently proposed non-linear and non-stationary methods for the analysis of pupil oscillations are the Hilbert-Huang Transform, the EMD (Ruiz-Pinales et al., 2016; Villalobos-Castaldi et al., 2016), and recurrence plots (Mesin et al., 2013, 2014; Monaco et al., 2014).
The Hilbert-Huang transform is a frequency domain transformation, with the advantage of maintaining a good temporal and frequency resolution. Through the EMD, the original signal is split into components with slowly varying amplitude and phase, also known as IMFs. By applying a Hilbert transform to the IMFs, instantaneous frequencies are generated as functions of time that give sharp identifications of embedded structures (Barnhart, 2011; Ruiz-Pinales et al., 2016; Villalobos-Castaldi et al., 2016). The RQA consists of taking single physiological measurements, projecting them into a multidimensional space by embedding procedures, and identifying correlations that are not apparent in the one-dimensional time series. This method provides quantitative indexes related to the number and duration of recurrences of the trajectory of a dynamical system in the phase space (Marwan, 2008; Webber and Marwan, 2015). Then, by applying the cross-recurrence analysis (CRQA), which is a bivariate extension of the RQA, we can investigate the dynamic interactions among the systems modulating pupil size oscillations. The use of CRQA has the advantage of better capturing the recurring properties of a dynamic system given by the interaction over time of streams of information (Marwan, 2008; Coco and Dale, 2014). For this purpose, the EMD and CRQA were applied in succession. The main goal of our analysis was the identification of specific frequency components of the oscillatory signal within the ANS range that could be quantified by pairs of DET and ENT values lying within the 95% prediction ellipse. Our results suggest that, in awake healthy subjects at rest, pupils oscillate in darkness with high frequency (HF) and low frequency (LF) components that are in the ANS range, suggesting a balance between noradrenergic and cholinergic tone. Moreover, the position of the points on the ENT-DET plane seems to be related to the ESS score and could therefore give insights into the sleepiness state.

#### MATERIALS AND METHODS

#### Participants

Twenty-six healthy subjects participated in the study (average age 36 ± 13 years old). The participants did not have neurological deficits or serious refractive problems. Moreover, the participants did not consume caffeine in the 2 h preceding the data collection (Wilhelm et al., 2014), and they reported having slept more than 6 h in the night before the recording (average sleep hours 7.2 ± 0.1). The data collection was always performed between 3 and 6 pm. All subjects gave their written informed consent; the study respected the Declaration of Helsinki and was approved by the local Ethics Committee (Comitato Etico Locale Azienda Ospedaliera Universitaria Senese, EVAlab protocol CEL no. 48/2010).

#### Experimental Setting


Pupil diameter recordings were performed monocularly with an ASL 504 eye-tracker device (Applied Science Laboratories, Bedford, MA, United States) sampling at a mean frequency of 240 Hz. The remote eye-tracker was placed 650 mm from the eye of the participant. The position of the subject's head relative to the eye-tracker was kept still by means of a chinrest.

#### Acquisition Protocol

Prior to the data collection, the subjects were administered the ESS test to investigate their vigilance state. ESS scores less than 11 are normally associated with a normal sleepiness state, while ESS scores greater than 11 suggest excessive daytime sleepiness (Parkes et al., 1998). The ESS is a commonly used self-assessment questionnaire for the evaluation of tiredness; hence it can be a bias-prone measurement.

All the recordings were performed in a quiet light-controlled environment. To avoid the stimulation of the pupillary light reflex, the subject was instructed to look straight ahead for 15 min in a completely dark room (0 lux), similarly to the procedure adopted by Lüdtke et al. (1998). To reduce mental activity and cognitive load, subjects were instructed to relax and to try not to think about anything.

#### Data Processing

The flow chart of the procedure employed to analyze the pupillary frequency balancing between the sympathetic and parasympathetic systems is shown in **Figure 1**. The pupil diameter data were exported as comma-separated values files and analyzed offline in Matlab (The Mathworks). The signal was first de-blinked: samples with a pupil diameter equal to zero were marked as blinks and removed from the signal, and the remaining signal was then linearly interpolated. Moreover, machine artifacts introduced by the eye-tracker device when it failed to detect the pupil were removed using Hampel filtering, and the signal was low-pass filtered with a cut-off frequency (f<sub>0</sub>) of 2 Hz. The Hampel function computes the median of the data within moving windows. The width of the filter window (w) was determined according to the ratio of the sampling frequency (f<sub>s</sub>) to the cut-off frequency f<sub>0</sub> (Equation 1):

$$w = 0.44 \cdot f_s / f_0 \tag{1}$$
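The pre-processing steps described above can be sketched as follows. This is a minimal illustration and not the Matlab code used in the study: the function names, the three-sigma threshold, the 1.4826 scaling of the median absolute deviation (a common convention for Hampel identifiers) and the conversion of w into a half-width in samples are all assumptions. The low-pass filtering step is not shown.

```python
from statistics import median


def hampel_window(fs_hz, f0_hz):
    """Filter window width from Equation 1: w = 0.44 * fs / f0."""
    return 0.44 * fs_hz / f0_hz


def interpolate_blinks(signal):
    """Replace zero-valued samples (marked as blinks) by linear interpolation."""
    out = list(signal)
    valid = [i for i, v in enumerate(signal) if v != 0]
    if not valid:
        return out
    for i, v in enumerate(signal):
        if v != 0:
            continue
        left = max((j for j in valid if j < i), default=None)
        right = min((j for j in valid if j > i), default=None)
        if left is None:            # blink at the very start: hold next valid value
            out[i] = signal[right]
        elif right is None:         # blink at the very end: hold last valid value
            out[i] = signal[left]
        else:
            frac = (i - left) / (right - left)
            out[i] = signal[left] + frac * (signal[right] - signal[left])
    return out


def hampel(signal, half_width, n_sigmas=3.0):
    """Replace outliers by the local median (assumed Hampel identifier rule)."""
    out = list(signal)
    for i in range(len(signal)):
        lo = max(0, i - half_width)
        hi = min(len(signal), i + half_width + 1)
        window = signal[lo:hi]
        med = median(window)
        mad = median(abs(v - med) for v in window)
        if abs(signal[i] - med) > n_sigmas * 1.4826 * mad:
            out[i] = med
    return out


# Values from the text: fs = 240 Hz and f0 = 2 Hz give w = 52.8.
w = hampel_window(240.0, 2.0)
half_width = round(w / 2)           # assumption: w is a full width in samples
```

Whether w should be read as a full width or a half-width, and in which units, is not stated in the text, so the last line is only one possible reading.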

FIGURE 1 | The flow chart presents the major procedures adopted for the analysis of the pupil size oscillation, from the data pre-treatment (deblink and artifact removal) and normalization, to the final drawing of the prediction ellipse. Data points of the prediction region in the entropy-determinism plane underwent a further classification analysis and a pairwise comparison of the identified clusters was also done.

FIGURE 3 | The panels represent the HF and LF components extracted from the pupil % change time series of a healthy subject. The IMFs obtained through the application of the EMD technique whose frequency content was inside the range [0.15 – 0.45 Hz] were aggregated and form the HF component of the ANS activity (lower panel), while the IMFs in the range [0 – 0.15 Hz] gave rise to the LF component (upper panel).

The variation of pupil size was computed with respect to a baseline value of the pupil estimated for each participant. Specifically, the baseline value of the pupil diameter signal was determined as the maximum value of the pupil size attained in the first 60 s of the signal in darkness (baseline), when the signal was expected to be more stable. The mean or the median of the pupil size were possible alternative reference values. However, taking the maximum value as reference enabled us to normalize the signal on the basis of an actually observed value and to preserve the dynamics of the phenomenon. A baseline-corrected pupil diameter time series was then calculated as the diameter percentage change with respect to the value gathered in the basal condition (Equation 2).

$$\%\,\mathrm{change} = \frac{X_t - \mathrm{Baseline}}{\mathrm{Baseline}} \cdot 100 \tag{2}$$

where X<sub>t</sub> is the pupil diameter recorded at time t. The baseline correction removed the inter-subject variability in pupil size from the percentage change of the pupil diameter signal (Lowenstein et al., 1963; **Figure 2**).
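Equation 2, with the baseline taken as the maximum diameter in the first 60 s, can be written as a short sketch. The function name `percent_change` and its arguments are hypothetical and chosen only for illustration.

```python
def percent_change(diameter, fs_hz, baseline_s=60.0):
    """Percentage change of pupil diameter relative to the baseline (Equation 2).

    The baseline is the maximum diameter observed in the first `baseline_s`
    seconds, as described in the text; the signature is an assumption.
    """
    n_base = int(round(baseline_s * fs_hz))
    baseline = max(diameter[:n_base])
    return [(x - baseline) / baseline * 100.0 for x in diameter]


# Toy example sampled at 1 Hz with a 2 s baseline window: the baseline is 5.0.
toy = percent_change([4.0, 5.0, 4.5, 2.5], fs_hz=1.0, baseline_s=2.0)
```

Because the reference is a maximum, values inside the baseline window are never positive, while later samples can exceed the baseline.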

#### Data Analysis

A cubic spline interpolation was used for compressing the percentage change time series to a resolution of five data points per second, which satisfied the Nyquist criterion (for the given 2 Hz cut-off frequency). The EMD was applied to the cubic spline interpolation of the percentage change time series in the autonomic frequency band ranging from 0 to 0.45 Hz (Huang et al., 1998). Since we were interested in a global spectral characterization of the IMFs derived from the EMD, the spectral content of the IMFs was estimated through the MUSIC algorithm (Schmidt, 1986). The IMFs having most of their power in the autonomic frequency band were retained. The IMFs were then aggregated according to the HF (0.15–0.45 Hz) and LF (0–0.15 Hz) ranges related to the activity of the parasympathetic and sympathetic systems, respectively (Cabrerizo et al., 2014; **Figure 3**). A CRQA was performed (Marwan and Kurths, 2002; Marwan, 2016) to assess the similarity between the dynamics of the parasympathetic and sympathetic processes by comparing the interaction between the LF and the HF components in the phase space. Three hyper-parameters must be set in the CRQA: EmbDim, TD, and the neighborhood radius (R). A symplectic geometry-based algorithm was used for estimating the EmbDim (Lei et al., 2002). The TD value was chosen as the one within the range (0: w/EmbDim) that maximized the sample entropy of the percentage change of pupil size. A FAN was taken as the neighborhood criterion, such that the cross-recurrence point density had a fixed predetermined value of 20%.
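The aggregation of IMFs into the LF and HF components can be sketched as below. The sketch assumes that the IMFs and one dominant frequency per IMF are already available (for example from an EMD routine and a spectral estimate); the function name `aggregate_bands` and the choice to assign the shared 0.15 Hz boundary to the HF band are assumptions.

```python
def aggregate_bands(imfs, dominant_freqs_hz, lf_max=0.15, hf_max=0.45):
    """Sum IMFs into LF [0, lf_max) and HF [lf_max, hf_max] components.

    `imfs` is a list of equally long sequences and `dominant_freqs_hz` holds
    one precomputed dominant frequency per IMF. IMFs outside the autonomic
    band are discarded.
    """
    n = len(imfs[0])
    lf = [0.0] * n
    hf = [0.0] * n
    for imf, freq in zip(imfs, dominant_freqs_hz):
        if 0.0 <= freq < lf_max:
            target = lf
        elif lf_max <= freq <= hf_max:
            target = hf
        else:
            continue                 # outside 0-0.45 Hz: not retained
        for i, value in enumerate(imf):
            target[i] += value
    return lf, hf


# Hypothetical IMFs with dominant frequencies of 0.05, 0.10, 0.30 and 1.0 Hz.
lf_part, hf_part = aggregate_bands(
    [[1.0, 1.0], [2.0, 2.0], [4.0, 4.0], [8.0, 8.0]],
    [0.05, 0.10, 0.30, 1.0],
)
```

The text retains IMFs "having most of the power" in the band, which is a softer criterion than a single dominant frequency; the sketch simplifies this on purpose.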

Two main parameters from CRQA were considered: the determinism (DET), which quantifies the fraction of periodic structures in the trajectories of the LF and HF dynamics in the phase space, and the entropy (ENT), which is the Shannon entropy of the diagonal line length distribution. Periodic signals are expected to yield high values of DET and small values of ENT (Marwan et al., 2007). To enlarge the sample size enough to apply clustering procedures on the DET-ENT plane, and to investigate more carefully what this method can reveal about the balance of the sympathetic and parasympathetic systems through the analysis of the oscillations of the pupil diameter, we employed the Gaussian-copula simulation approach. Firstly, the association among age, ESS, ENT, and DET was measured by the Pearson's correlation matrix. Then, a Gaussian-copula which maintained the dependence structure was used to generate one hundred correlated multivariate data points of those variables.
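A minimal sketch of how DET and ENT can be obtained from a cross-recurrence matrix is given below. It follows the usual definitions (DET as the fraction of recurrence points on diagonal lines of at least a minimum length, ENT as the Shannon entropy of the diagonal line lengths), but several choices are assumptions and differ from the study: a fixed radius replaces the FAN criterion, the minimum line length is set to 2, and the entropy is expressed in bits. All function names are hypothetical.

```python
from collections import Counter
from math import dist, log2


def embed(series, dim, delay):
    """Time-delay embedding of a scalar series into `dim` dimensions."""
    n = len(series) - (dim - 1) * delay
    return [tuple(series[i + k * delay] for k in range(dim)) for i in range(n)]


def cross_recurrence(x, y, dim, delay, radius):
    """Binary matrix: 1 where embedded points of x and y are within `radius`."""
    ex, ey = embed(x, dim, delay), embed(y, dim, delay)
    return [[1 if dist(a, b) <= radius else 0 for b in ey] for a in ex]


def diagonal_line_lengths(cr):
    """Lengths of all runs of consecutive 1s along every diagonal."""
    rows, cols = len(cr), len(cr[0])
    lengths = []
    for offset in range(-(rows - 1), cols):
        i, j, run = max(0, -offset), max(0, offset), 0
        while i < rows and j < cols:
            if cr[i][j]:
                run += 1
            elif run:
                lengths.append(run)
                run = 0
            i, j = i + 1, j + 1
        if run:
            lengths.append(run)
    return lengths


def det_ent(cr, l_min=2):
    """Determinism (fraction in [0, 1]) and line-length entropy (bits)."""
    lengths = diagonal_line_lengths(cr)
    total = sum(lengths)                      # number of recurrence points
    lines = [length for length in lengths if length >= l_min]
    det = sum(lines) / total if total else 0.0
    counts = Counter(lines)
    ent = 0.0
    for count in counts.values():
        p = count / len(lines)
        ent -= p * log2(p)
    return det, ent


# A strictly periodic toy signal compared with itself.
periodic = [0, 1, 2, 3] * 5
det, ent = det_ent(cross_recurrence(periodic, periodic, dim=2, delay=1, radius=0.1))
```

For the periodic toy signal every recurrence point lies on a long diagonal line, so DET reaches its maximum, while ENT reflects only the spread of line lengths caused by the finite matrix size.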

The percentage of pupil size change of each of the simulated data was then represented as a point on the DET-ENT plane.
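The Gaussian-copula simulation step can be sketched with the standard library alone. The sketch below is one possible reading and not the procedure used in the study: it draws correlated standard normal values through a Cholesky factor of the correlation matrix, converts them to uniform values with the normal cumulative distribution function, and maps them onto the observed values through empirical quantiles. The empirical-quantile marginals, the fixed seed and the function names are assumptions.

```python
import random
from math import sqrt
from statistics import NormalDist


def cholesky(a):
    """Lower-triangular Cholesky factor of a positive definite matrix."""
    n = len(a)
    low = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(low[i][k] * low[j][k] for k in range(j))
            if i == j:
                low[i][j] = sqrt(a[i][i] - s)
            else:
                low[i][j] = (a[i][j] - s) / low[j][j]
    return low


def gaussian_copula_sample(columns, corr, n_samples, seed=0):
    """Draw `n_samples` rows whose dependence follows `corr`.

    `columns` holds the observed values of each variable and `corr` is their
    correlation matrix (assumed positive definite).
    """
    rng = random.Random(seed)
    low = cholesky(corr)
    sorted_cols = [sorted(col) for col in columns]
    phi = NormalDist().cdf
    n_var = len(columns)
    samples = []
    for _ in range(n_samples):
        z = [rng.gauss(0.0, 1.0) for _ in range(n_var)]
        correlated = [sum(low[i][k] * z[k] for k in range(i + 1)) for i in range(n_var)]
        row = []
        for col, value in zip(sorted_cols, correlated):
            u = phi(value)                            # uniform in (0, 1)
            idx = min(int(u * len(col)), len(col) - 1)
            row.append(col[idx])                      # empirical quantile
        samples.append(row)
    return samples


# Hypothetical observed columns (not data from the study) and a correlation of 0.6.
observed = [[1.0, 2.0, 3.0, 4.0], [10.0, 20.0, 30.0, 40.0]]
simulated = gaussian_copula_sample(observed, [[1.0, 0.6], [0.6, 1.0]], 100)
```

With empirical-quantile marginals the simulated values are restricted to values already observed; a parametric marginal for each variable would be an alternative design choice.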

#### Statistical Analysis

TABLE 1 | Sampling distributions of age, Epworth Sleepiness Scale, entropy, determinism, and average pupil change.

On the simulated dataset the Doornik-Hansen multivariate normality test (Doornik and Hansen, 2008) was performed to verify the null hypothesis that the points in the DET-ENT plane were generated from a bivariate Gaussian distribution. The 95% prediction ellipse was calculated around the mean of observed points in the DET-ENT plane. Equations 3–4 indicate the formulas for determining the length of the two semi-axes:

$$a_x = 2 \cdot \sqrt{\lambda_1 \cdot \frac{(n_{obs} - 1) \cdot n_{var} \cdot f\left(1 - \alpha,\ n_{var},\ n_{obs} - n_{var}\right)}{n_{obs} - n_{var}}} \tag{3}$$

$$a_y = 2 \cdot \sqrt{\lambda_2 \cdot \frac{(n_{obs} - 1) \cdot n_{var} \cdot f\left(1 - \alpha,\ n_{var},\ n_{obs} - n_{var}\right)}{n_{obs} - n_{var}}} \tag{4}$$

where a<sub>x</sub> and a<sub>y</sub> are the major and minor semi-axes of the ellipse, n<sub>var</sub> is the number of variables (=2), n<sub>obs</sub> is the number of observations (=100), λ<sub>1</sub> and λ<sub>2</sub> are the eigenvalues (in descending order) obtained from the spectral decomposition of the covariance matrix of ENT and DET, and f is the quantile of the F distribution at probability 1 − α for the given significance level α and degrees of freedom (n<sub>var</sub>, n<sub>obs</sub> − n<sub>var</sub>). The orientation of the ellipse is given (in radians) by the direction of the eigenvector associated with the largest eigenvalue:

$$\theta = \operatorname{atan}\left(\frac{v_y}{v_x}\right) \tag{5}$$

where atan is the inverse tangent function, and v<sub>x</sub> and v<sub>y</sub> are the components of the eigenvector corresponding to the largest eigenvalue. The coordinates of the points [P<sub>x</sub>, P<sub>y</sub>] lying on the ellipse contour are calculated as follows:

$$P_x = x_c + \left[\frac{a_x}{2} \cdot \cos(t) \cdot \cos(\theta) - \frac{a_y}{2} \cdot \sin(t) \cdot \sin(\theta)\right] \tag{6}$$

$$P_y = y_c + \left[\frac{a_x}{2} \cdot \cos(t) \cdot \sin(\theta) + \frac{a_y}{2} \cdot \sin(t) \cdot \cos(\theta)\right] \tag{7}$$

where x<sub>c</sub> and y<sub>c</sub> are the coordinates of the center of the ellipse, and t ranges in the interval [0, 2π].
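Equations 3–7 can be combined into a short sketch that returns the ellipse parameters and a set of contour points. The function name and the returned dictionary are hypothetical. The quantile of the F distribution is passed in by the caller (`f_quantile`) because the standard library has no F distribution, and `atan2` is used instead of `atan`, which gives the same orientation up to a half turn and avoids a division by zero.

```python
from math import atan2, cos, pi, sin, sqrt


def prediction_ellipse(points, f_quantile, n_contour=100):
    """Ellipse parameters and contour points following Equations 3-7.

    `points` is a list of (x, y) pairs and `f_quantile` is the value of the
    F quantile at 1 - alpha with (2, n_obs - 2) degrees of freedom.
    """
    n_obs, n_var = len(points), 2
    xc = sum(p[0] for p in points) / n_obs
    yc = sum(p[1] for p in points) / n_obs
    sxx = sum((p[0] - xc) ** 2 for p in points) / (n_obs - 1)
    syy = sum((p[1] - yc) ** 2 for p in points) / (n_obs - 1)
    sxy = sum((p[0] - xc) * (p[1] - yc) for p in points) / (n_obs - 1)

    # Closed-form eigenvalues of the 2 x 2 covariance matrix (descending).
    half_trace = (sxx + syy) / 2.0
    root = sqrt(max(half_trace ** 2 - (sxx * syy - sxy ** 2), 0.0))
    lam1, lam2 = half_trace + root, max(half_trace - root, 0.0)

    scale = (n_obs - 1) * n_var * f_quantile / (n_obs - n_var)
    ax = 2.0 * sqrt(lam1 * scale)             # Equation 3
    ay = 2.0 * sqrt(lam2 * scale)             # Equation 4

    # Eigenvector of the largest eigenvalue and its direction (Equation 5).
    if sxy != 0.0:
        vx, vy = sxy, lam1 - sxx
    elif sxx >= syy:
        vx, vy = 1.0, 0.0
    else:
        vx, vy = 0.0, 1.0
    theta = atan2(vy, vx)

    contour = []
    for k in range(n_contour):                # Equations 6 and 7
        t = 2.0 * pi * k / n_contour
        px = xc + (ax / 2 * cos(t) * cos(theta) - ay / 2 * sin(t) * sin(theta))
        py = yc + (ax / 2 * cos(t) * sin(theta) + ay / 2 * sin(t) * cos(theta))
        contour.append((px, py))
    return {"center": (xc, yc), "ax": ax, "ay": ay, "theta": theta, "contour": contour}


# Toy data with a unit quantile, chosen so the numbers are easy to check by hand.
ellipse = prediction_ellipse([(-2.0, 0.0), (2.0, 0.0), (0.0, -1.0), (0.0, 1.0)], 1.0)
```

For the setting in the text (100 observations, two variables, α = 0.05) the required quantile is roughly 3.09; this is an approximate table value and should be recomputed with a statistics library before use.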

The prediction region can provide the regulatory reference points for assessing if the underlying slow oscillations in the autonomic band of a new observed pupil size time series have the characteristics of a normal pattern.

Afterward, unsupervised clustering through the K-means method with two clusters and an L1-norm distance function was applied within the elliptic prediction area. The two clusters were compared in terms of their covariance matrices and mean vectors. Accordingly, Box's M-test was used for verifying the homogeneity of the covariance matrices, and Hotelling's T<sup>2</sup> test was used for testing the means. The variables age, ESS and % change associated with each cluster were separately compared through the Mann-Whitney unpaired test.
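A two-cluster K-means with an L1 (city-block) distance can be sketched as follows. With this distance the cluster center that minimizes the within-cluster sum of distances is the component-wise median, so the update step uses medians. The deterministic initialization (the first point and the point farthest from it) and the function names are assumptions made to keep the example reproducible.

```python
from statistics import median


def l1(a, b):
    """City-block (L1) distance between two points."""
    return sum(abs(p - q) for p, q in zip(a, b))


def kmeans_l1(points, n_iter=100):
    """Two-cluster K-means with L1 distance and median center updates."""
    first = points[0]
    farthest = max(points, key=lambda p: l1(p, first))
    centers = [tuple(first), tuple(farthest)]   # assumed deterministic start
    labels = None
    for _ in range(n_iter):
        new_labels = [min(range(2), key=lambda k: l1(p, centers[k])) for p in points]
        if new_labels == labels:                # assignments stable: converged
            break
        labels = new_labels
        for k in range(2):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:
                centers[k] = tuple(median(dim) for dim in zip(*members))
    return labels, centers


# Two well separated toy groups.
labels, centers = kmeans_l1([(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)])
```

In practice several random initializations are usually compared, since K-means can converge to a local optimum.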

All statistical tests were two-sided and performed in Matlab with a 5% level of significance.

#### RESULTS

Self-organized adaptive systems like the brain generate complex signals which are inherently non-linear and non-stationary. Furthermore, unstable, weak, and state-dependent phase-locking characterizes the coupling between the biological oscillators (Shockley et al., 2002). Since the couplings between biological signals could also be predominantly transient, the canonical techniques of signal analysis, which basically rely on the assumption of stationary signals, are not appropriate. More importantly, the autonomic control of the spontaneous pupil fluctuations is expected to have non-linear/chaotic dynamics which can be well explored by recurrence analysis methods, whose domain is in the phase-space trajectories (Mesin et al., 2013). For these reasons, we chose the cross-recurrence method to analyze the spectral components of the ANS activity controlling the pupil fluctuations.

The EMD method was applied to the time series of pupil size variations to extract the low and high frequency components of the signal, which were found in the range of the ANS band. In fact, the EMD procedure, which is known to deal with non-linear and non-stationary signals like the pupil size oscillations, is a data-driven method that overcomes the limitation of basis function shape typical of the wavelet decomposition method (Gonçalves et al., 2007). The CRQA was then performed over the high and low frequency components and two parameters, i.e., entropy (ENT) and determinism (DET), were retained as the major features which quantified the non-linear dynamics of the HF and LF coupled oscillators in the autonomic band.

In **Table 1** the sampling distributions of age, ESS scores, ENT, DET and average pupil change are reported. The sample declared a normal level of diurnal drowsiness (ESS: mean = 7.3; SD = 3.7). Five subjects (four younger than 30 and one older than 60) reported relatively high ESS scores (>10). The cross-recurrence analysis returned low values both for ENT (mean = 0.92; SD = 0.09) and for DET (mean = 45.28%; SD = 7.46%). The percentage of pupil change in the sample (% change) (mean = −0.16%; SD = 1.91%) indicated an overall slight decrease of pupil size with respect to the baseline, but also a high variability of the fluctuations.

TABLE 2 | Values of age, Epworth Sleepiness Scale, entropy, determinism, and average pupil change generated from a Gaussian-copula.

FIGURE 4 | The contour of the ellipse envelops the region in the entropy-determinism space where newly measured combinations of entropy (ENT) and determinism (DET) will fall with a 95% probability. The null hypothesis that the two variables are generated from a bivariate normal distribution was verified through the Doornik-Hansen test. Values of DET on the y-axis have been rescaled to the range [0 – 1]. A K-means algorithm was used to characterize sub-areas within the prediction ellipse and two distinct clusters were found. The thick black crosses indicate the centers of the clusters. Cluster 2 (squares) exhibited a significantly higher combination of ENT and DET than cluster 1 (circles). In addition, the ESS was greater in cluster 2 than in cluster 1. This finding suggested that the ENT-DET bivariate distribution inherently conveys information about the drowsiness level of the subjects.

TABLE 3 | Parameters of the 95% prediction ellipse in the ENT-DET plane.

We firstly analyzed the possible association among the observed age, ESS, ENT, and DET. Based on the results, the DET and ENT variables were not significantly correlated to the age and the ESS score of the participants. Instead, a significant correlation between ENT and DET was found (r = 0.58, p = 0.002).

The bivariate distribution of ENT and DET obtained from Gaussian-copula simulated points (**Table 2**) is depicted in **Figure 4**, together with the 95% prediction ellipse. The simulated values of ENT (mean = 0.90; SD = 0.09) and DET (mean = 43.66%; SD = 7.43%) were consistent with the values observed in the sample.

The major parameters of the prediction ellipse are displayed in **Table 3**. The coordinates of the center of the ellipse are the means of the simulated ENT and DET vectors. The axes of the ellipse indicate the magnitude of the inertia along the directions of ENT and DET. The interval estimations of ENT and DET were obtained through Equations 6 and 7. **Table 4** displays the expected intervals of determinism for equally spaced intervals (0.05 bits) of entropy.

TABLE 4 | Normative intervals of determinism by ranges of entropy.

TABLE 5 | Descriptive statistics of the clusters identified within the prediction ellipse.

Through the K-means procedure, two clusters of points were identified within the prediction ellipse; their descriptive statistics are shown in **Table 5**.

The generated ENT-DET values underwent the Doornik-Hansen multinormality test. The hypothesis of bivariate normal distribution was not rejected (DH statistic = 6.97, p = 0.14).

The covariance matrices of the clusters were not significantly different (Box's M-test = 3.2; p = 0.37). The result of the Hotelling T<sup>2</sup> test indicated that the bivariate ENT-DET mean vectors of the two clusters were significantly different (T<sup>2</sup> = 200.8; p < 0.0001). The two clusters also exhibited significantly different ESS scores (U-test = 701.5; p = 0.002), whilst they did not differ in age (U-test = 1073; p = 0.64), nor in % change (U-test = 933.5; p = 0.14).

#### DISCUSSION

The analysis of pupil size oscillations is a promising diagnostic tool, enabling improvements in the identification of cortical state changes. Variations of cortical state activity during wakefulness have a strong influence on neural, perceptual and behavioral responses. Pupil diameter varies not only in response to variations of luminance and accommodation, but also during changes in alertness, attention, mental effort and decision making, suggesting a direct link between pupil size variation and cortical state changes (Preuschoff et al., 2011; Nassar et al., 2012; Naber et al., 2013; Alnæs et al., 2014; de Gee et al., 2014). Changes in the cortical state are associated with well-characterized variations of the cortical signal frequency. Specifically, in awake rodents the investigation of local field potentials demonstrated the prevalence of LF fluctuations during periods of quiet resting. In contrast, the initiation of locomotion or whisking was related to the suppression of low frequency components and increased high frequency oscillations (Poulet et al., 2012; Eggermann et al., 2014; McGinley et al., 2015b). This transition between slow and fast cortical activity was also observed across cortical regions (Poulet and Crochet, 2019). Electrophysiological studies have revealed that pupil constriction is associated with slow and synchronous cortical responses and inattentive behavior. Conversely, cortical activation during task engagement or locomotion shows persistent desynchronized neuronal activity associated with dilation of the pupil (Reimer et al., 2014, 2016; McGinley et al., 2015b; Schwalm and Jubal, 2017). Pupil size fluctuations and cortical state variations are modulated by the central noradrenergic and cholinergic pathways. Thus, monitoring pupil dynamics could be a reliable proxy for changes in cortical states (Reimer et al., 2014, 2016; McGinley et al., 2015a,b). More specifically, the release of acetylcholine (Ach) from the basal forebrain and noradrenaline (NA) from the LC has been shown to drive both the state of cortical connectivity and the pattern of the pupil size oscillations, even in resting conditions (Reimer et al., 2016; Schwalm and Jubal, 2017). At the peripheral level, both Ach and NA are neurotransmitters of the ANS (parasympathetic and sympathetic systems, respectively) that also control the pupil diameter. Overall, these premises encourage the exploration of new and reliable techniques for monitoring pupil dynamics that allow the identification of parameters attributable to the modulatory effects of NA and Ach in various cortical state changes.

We propose here a method that can be used as a quantitative measurement of the non-linear dynamics of pupil fluctuations. We applied a cross-recurrence technique to estimate determinism (DET) and entropy (ENT) features and their distribution, in order to quantify the degree of coupling between the oscillators of the low frequency (LF) and high frequency (HF) components of the pupillary signal. To the best of our knowledge, this is the first study on the use of the ENT-DET plane for analyzing the dynamical systems associated with pupil size fluctuation during stationary scotopic visual conditions.

In our cohort of subjects, we observed low levels of determinism (<60%) and entropy (<1). This is consistent with spontaneous physiological signals recorded from healthy subjects, which are expected to be highly complex. Indeed, low determinism can be associated with an increase in the uncertainty of the signals, and hence with an increase in the chaotic properties of the signal (i.e., complexity). In fact, complex systems are typically highly ordered. Therefore, they tend to preserve low entropy and counteract the second law of thermodynamics (free energy principle). A de-complexification process occurs when free-running physiological signals present a sustained loss of complexity. The loss of complexity leads to less ordered states with higher entropy and with stronger coupling of the oscillators controlling the expression of the signal. This degradation in complexity is typically observed in pathological conditions or advanced aging. Therefore, the major result of this study is the identification of a normative elliptical region in the ENT-DET plane for the pupillary oscillators that could be compared with data from groups of patients with neurodegenerative diseases. We hypothesize that the occurrence of points outside the defined elliptical prediction region may signal potential pathological conditions related to alterations in the ANS. As a secondary outcome, we observed that, within the elliptical prediction region, clusters of points with different ENT-DET characteristics also differed significantly in their ESS scores. This finding suggests that the location of the points in the ENT-DET plane can also reveal alterations in the sleepiness state.

Our results indicate that in resting wakefulness conditions, without the influence of light and accommodation, pupil size oscillations are under the effect of a balanced cholinergic/noradrenergic tone. We believe that the employed CRQA-based method may help to lay the groundwork for studying the LF and HF components of the pupil, which may be related to the neuronal network state of the brain at rest. Importantly, it is a non-invasive procedure that could be easily adopted in clinical contexts and for diagnostic assessment, for example of neurodegenerative conditions. Furthermore, this method is adaptable to different experimental conditions (e.g., variations of the visual stimulus, recording during cognitive tasks, etc.) provided that the appropriate frequency components are extracted from the signal. The joint recording of pupil size fluctuations along with other physiological signals (e.g., heart rate variability, EEG, etc.) would improve the method, since this integration would facilitate the study of possible time-dependent and/or frequency-related changes in autonomic functions.

# ETHICS STATEMENT

The study was approved by the local Ethical Committee Comitato Etico Locale Azienda Ospedaliera Universitaria Senese, EVAlab protocol CEL no. 48/2010.

# AUTHOR CONTRIBUTIONS

All authors conceived and designed the study, critically revised the manuscript, and approved the final version of the manuscript. FR and VS acquired the data. AR, PP, and VS were involved in the analysis and interpretation of data, and drafted the manuscript. AR revised the scientific content of the study.

### ACKNOWLEDGMENTS

We particularly thank Dr. Gemma Tumminelli for help with the recruitment of the participants and with the data collection.

# REFERENCES



variability in human cognitive neuroscience. eNeuro 4:ENEURO.0293-16.2017. doi: 10.1523/ENEURO.0293-16.2017


**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2019 Piu, Serchi, Rosini and Rufa. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.


# Bio-Inspired Presentation Attack Detection for Face Biometrics

Aristeidis Tsitiridis\*, Cristina Conde, Beatriz Gomez Ayllon and Enrique Cabello\*

Computer Science and Statistics, King Juan Carlos University, Móstoles, Spain

Today, face biometric systems are becoming widely accepted as a standard method for identity authentication in many security settings. For example, their deployment in automated border control gates plays a crucial role in accurate document authentication and reduced traveler flow rates in congested border zones. The proliferation of such systems is further spurred by the advent of portable devices. On the one hand, modern smartphone and tablet cameras have in-built user authentication applications, while on the other hand, their displays are being consistently exploited for face spoofing. Like biometric systems based on other physiological identifiers, face biometric systems have their own unique set of potential vulnerabilities. In this work, these vulnerabilities (presentation attacks) are explored via a biologically-inspired presentation attack detection model which is termed "BIOPAD." Our model employs Gabor features in a feedforward hierarchical structure of layers that progressively process and learn from visual information of people's faces, along with their presentation attacks, in the visible and near-infrared spectral regions. BIOPAD's performance is directly compared with other popular biologically-inspired layered models such as the "Hierarchical Model And X" (HMAX), which applies similar handcrafted features, and Convolutional Neural Networks (CNN), which discover low-level features through stochastic gradient descent training. BIOPAD shows superior performance to both HMAX and CNN in all three presentation attack databases examined, and these results were consistent across two different classifiers (Support Vector Machine and k-nearest neighbor). In certain cases, our findings have shown that BIOPAD can produce authentication rates with 99% accuracy. Finally, we introduce a new presentation attack database with visible and near-infrared information for direct comparisons. Overall, BIOPAD's operation, which is to fuse information from different spectral bands at both feature and score levels for the purpose of face presentation attack detection, has never been attempted before with a biologically-inspired algorithm. The detection rates obtained are promising and confirm that near-infrared visual information significantly assists in overcoming presentation attacks.

Keywords: face biometrics, presentation attack detection, anti-spoofing, multiple sensor fusion, biologically-inspired biometrics

Edited by:

Hagit Hel-Or, University of Haifa, Israel

#### Reviewed by:

Alejandro Linares-Barranco, University of Seville, Spain

Manuel Jesus Dominguez-Morales, University of Seville, Spain

\*Correspondence:

Aristeidis Tsitiridis aristeidis.tsitiridis@urjc.es

Enrique Cabello enrique.cabello@urjc.es

Received: 17 July 2018

Accepted: 09 May 2019

Published: 28 May 2019

#### Citation:

Tsitiridis A, Conde C, Gomez Ayllon B and Cabello E (2019) Bio-Inspired Presentation Attack Detection for Face Biometrics. Front. Comput. Neurosci. 13:34. doi: 10.3389/fncom.2019.00034

# INTRODUCTION

Biometrics have a long history of existence and usage in various security environments. Modern biometric systems utilize a variety of physiological characteristics also known as "biological identifiers." For example, non-intrusive biometric patterns extracted from a finger, palm, iris, voice, or gait (and their fusion in multimodal biometric systems) can provide a wealth of identity information about a person. Face biometrics, in particular, pose a challenging practical problem in computer vision due to dynamic changes in their settings such as fluctuations in illumination, pose, facial expressions, aging, clothing accessories, and other facial feature changes such as tattoos, scars, wrinkles and piercings. The main advantage of face biometric applications is that they can be deployed in diverse environments at low cost (in many cases, a simple RGB camera is sufficient) without necessitating substantial participation and inconvenience from the public. Public acceptance of face biometrics is also the highest among all biological identifiers. Modern day applications making extensive use of face biometric systems include mobile phone authentication, border or customs control, visual surveillance, police work, and human-computer interaction. Despite the numerous practical challenges in this field, face biometrics remain a heavily researched topic in security systems.

Face biometric systems are susceptible to intentional changes in facial appearance or falsification of photos in official documents, known as "presentation attacks." For example, impostors may acquire a high quality face image of an individual and present it printed on paper, on a mask or even on a smartphone display to deceive security camera checkpoints. The significant reduction in the size of high-definition portable cameras also means that impostors have easy access to tiny digital cameras that discreetly capture face images of unsuspecting individuals. Moreover, with the vast online availability of face images in public or social media, it is relatively easy to acquire and reproduce a person's image without their consent. "Presentation Attack Detection (PAD)," less formally known as "anti-spoofing," encompasses the detection of all spoofing attempts made on biometric systems. Therefore, accurate and fast PAD is an important problem for authentication systems across many platforms and applications (Galbally et al., 2015) in the fight against malicious security system attacks. Basic face presentation attacks are often: (a) A printed face on a paper sheet. Sometimes a printed face is shown with the eyes cropped out so that the impostor's eyes blink underneath. (b) A digital face displayed on the screen of a digital device such as a tablet, smartphone, or laptop. This kind of attack can be static or video. In video attacks, facial movements, eye blinking, mouth/lip movements or expressions are usually simulated through a short video sequence. (c) A 3D mask (paper, silicone, cast, rubber etc.) specifically molded for a targeted face. In addition, impostors may also try identity spoofing by using more sophisticated appearance alteration techniques or their combinations: (1) Glasses, corrective or otherwise, and/or contact lenses with possible color change. (2) Hairstyle: change in color, cut/trim, hair extensions etc. (3) Make-up or fake facial scars. (4) Real and/or fake facial hair. (5) Facial prosthetics and/or plastic surgery.

Presentation attacks in images can be detected from anomalies in image characteristics such as liveness, reflectance, texture, quality, and spectral information. Sensor-based approaches are considered efficient strategies to investigate such image characteristics and naturally involve the usage (and fusion) of various camera sensors that capture minute discrepancies. A sensor-based method that uses a light field camera sensor with 26 different focus measures together with image descriptors (Raghavendra et al., 2015) reported promising PAD scores. With the aid of infrared sensors, Prokoski and Riedel (2002) analyzed facial thermograms for rapid and varied illumination environments. Similar thermography methods were presented in Hermosilla et al. (2012) and Seal et al. (2013). Motion-based techniques are mostly employed in video sequences to detect motion anomalies between frames. Some representative PAD algorithms of this type used Eulerian Video Motion Magnification (Wu et al., 2012), Optical Flow (Anjos et al., 2014), and non-rigid motion with face-background fusion analysis (Yan et al., 2012). Liveness-based approaches extract image features that focus on the liveness phenomena of a particular subject. Using this approach, algorithms scan for liveness patterns in certain facial parts, such as facial expressions, mouth or head movements, eye blinking, and facial vein maps (Pan et al., 2008; Chakraborty and Das, 2014). Texture-based methods investigate the texture, structure and overall shape information of faces. Commonly used texture-based methods rely on Local Binary Patterns (Maatta et al., 2011; Chingovska et al., 2012; Kose et al., 2015), Difference of Gaussians (Zhang et al., 2012) and Fourier frequency analysis (Li et al., 2004). For quality characteristics, a notable image quality method in Galbally et al. (2014) proposed 25 different image quality metrics, extracted from real and fake images, in order to train classifiers which are then used for the detection of potential attacks.

In today's society, face perception is extremely important. In the distant past, our very survival in the wild depended on our ability to collaborate collectively as a species. As a consequence, the human brain has evolved over the millennia to perform facial perception in an effortless, rapid and efficient manner (Ramon et al., 2011). The ever increasing requirements in complexity, power and processing speed have motivated the biometric research community to explore new ways of optimizing facial biometric systems. Therefore, it should not come as a surprise that biology has recently become a valuable source of inspiration for fast, power efficient and alternative methods (Meyers and Wolf, 2008; Wang et al., 2013).

The fundamental biologically-motivated vision architecture consists of alternating hierarchical layers mimicking the early processing stages of the primary visual cortex (Hubel and Wiesel, 1967). It is established from past research that as visual stimuli are transmitted up the cortical layers (from V1 to V4), visual information progressively exhibits a combination of selectivity and invariance to object transformations such as size, position, rotation, depth etc. In the past, there have been many vision models and variants inspired by this approach, such as the "Neocognitron" (Fukushima et al., 1980), "Convolutional neural network" (LeCun et al., 1998), and "Hierarchical model and X" (Riesenhuber and Poggio, 2000). Over the years, these models have performed very well in many object perception tasks and today are recognized as equal alternatives to statistical techniques. In face perception, biologically-inspired methodologies have been applied successfully for some years and have proven reliable as well as accurate (Lyons et al., 1998; Wang and Chua, 2005; Perlibakas, 2006; Rose, 2006; Meyers and Wolf, 2008; Pisharady and Martin, 2012; Li et al., 2013; Slavkovic et al., 2013; Wang et al., 2013).

There are many common characteristics in biologically-motivated algorithms, and perhaps the most important aspect is the extensive use of texture-based features in either 2D or 3D images. Reasons for designing a biologically-inspired model include its projected efficiency, parallelization and speed in extremely demanding biometric situations. Contemporary state-of-the-art methods are efficient in selected environments with high availability of data, but sifting each frame with laborious and lengthy CNN training, sliding windows or pixel-by-pixel approaches requires a very large amount of resources such as storage capacity, processing speed and power. Nevertheless, biologically-inspired systems have almost entirely been expressed by deep learning CNN architectures. In Lakshminarayana et al. (2017), the extraction of spatio-temporal mappings of faces was followed by a CNN scheme, and discriminative features for liveness detection were subsequently acquired. This approach produced impressive results on the databases examined, but the setup relied solely on video sequences, which penalize processing speed and are not always available in the real world, especially in border control areas where a single image should suffice. Other CNN models (Alotaibi and Mahmood, 2017; Atoum et al., 2017; Wang et al., 2017) explored depth perception prior to the application of a CNN that distinguished original vs. impostor access attempts. In Alotaibi and Mahmood (2017), depth information was produced with a non-linear diffusion method based on an additive operator splitting scheme. Even though only a single image was required in this work, the use of only one database (and the high error rates in the Replay-Attack database) did not entirely reveal the potential of this approach. Another CNN approach was presented in Atoum et al. (2017), where a two-stream CNN setup for face anti-spoofing was employed by extracting local image features and holistic depth maps from face frames of video sequences. Experimentation with this CNN setup showed reliable results at a significant cost in practicality, i.e., training two separate CNNs along with all intermediate processing steps. In Wang et al. (2017), a representation joining together 2D texture information and depth information for face anti-spoofing was presented. Texture features were learned from facial image regions using a CNN, and the face depth representation was extracted from Kinect images. The high error rates and limited experimentation procedure made their findings rather questionable. Finally, in Liu et al. (2018) a CNN-RNN (Recurrent Neural Network) model was used to estimate face depth with pixel-wise supervision and remote photoplethysmography signals with sequence-wise supervision. The accuracy of this method relied heavily on the number of frames per video, which makes this approach computationally heavy.

Overall, Convolutional Neural Network approaches, and the manner in which they are executed or accelerated in hardware, are a much-debated subject today. They require large amounts of hardware, software and energy resources to be effectively trained. In this work, since end-users have different hardware/software configurations, no particular effort was devoted to hardware optimization or software acceleration. The investigation of a biologically-inspired secure PAD system was carried out as part of two funded projects, the European project ABC4EU and the Spanish national project BIOINPAD. End-users in both projects (i.e., the Spanish national police, Estonian police, Romanian Border Guard) were interested in a new approach to the PAD problem.

Over the years, bio-inspired systems have received significant interest from the computer vision community because their solutions can relate to real-world human experiences. Thus, the main research contribution of this work is the introduction of a system that handles video presentation attack detection from a biologically-inspired perspective: a system with a straightforward and simple architecture that is able to cope with visual information from a single frame at high precision rates. Our design focus has been the development of a bio-inspired system with a clear structure that requires relatively little effort. In addition, this paper summarizes the precision rate results obtained during our research and compares them against other known models to enhance the comparative scope and understanding. The system has been evaluated with different databases in the visible and near-infrared spectral regions (and their fusion). The article is organized as follows. In section Methodology and BIOPAD's structure, the definitions and methodology that have led us to the development of the BIOPAD model are discussed, followed by a detailed explanation of the model's structure. Furthermore, in that section, we demonstrate the biologically-inspired techniques used, the model's general layout, and individual layer functionality. Section Experiments describes all databases used (section Databases), explains our biometric evaluation procedures (section Presentation attack results) and analyses all experiments conducted for the BIOPAD, Hierarchical Model And X (HMAX) and CNN (AlexNet) models. Section Experiments is further divided into visible (section Visible spectrum experiments) and near-infrared (section Near-infrared experiments and cross-spectral fusion) experiments for a better comparison between the two approaches explored. Finally, the last section summarizes all of our conclusions from this research work.

# METHODOLOGY AND BIOPAD'S STRUCTURE

In the first part of this section, the overall layered structure is described, followed by the biologically-inspired concepts that have been used as core mechanisms in BIOPAD. In the last part, each layer is individually explored, along with a full explanation of its operation in pseudocode-like form.

#### Center-Surround and Infrared Channels

Mammals perceive incoming photons through the retina in their eyes. The number of individual photoreceptors in the retina of the human eye varies from person to person and in the same person from time to time, but on average each eye contains ∼5 million cones, 120 million rods and 100 thousand photosensitive retinal ganglion cells (Goldstein, 2010).

In the human retina, rod photoreceptors peak at ∼500 nm; they are slow-response receptors, come in large numbers, possess large receptive fields, and are suited to dark environments, i.e., night time. Cone receptive fields, in contrast, are narrower and are tuned to different wavelengths of light. Although cones are considerably fewer in number than rods, they are densely packed in the fovea and hence are responsible for visual acuity. Bipolar retinal cells bear the task of unifying incoming visual information from cones and rods (Engel et al., 1997). Furthermore, on-center and off-center bipolar cells operate in a center-surround process between red-green and blue-yellow wavelengths. For example, an on-center red-green (RG) bipolar cell responds maximally when red light hits only the center of its receptive field and is inhibited when green light falls on the surrounding region. This operation is reversed for an off-center RG bipolar cell, where excitation occurs only when the detectable green wavelength is incident on the surrounding region. As shown in **Figure 1**, this can be further applied to the blue-yellow and lightness channels. The color opponent space is defined by the following equations (Van De Sande et al., 2010):

$$O1 = (R - G) / \sqrt{2} \tag{1}$$

$$O2 = (R + G - 2B) / \sqrt{6} \tag{2}$$

$$O3 = (R + G + B) / \sqrt{3} \tag{3}$$

The O3 opponent channel is the intensity channel, while color information is conveyed by channels O1 and O2. In BIOPAD, when the input image is in RGB, all three opponent channels are processed simultaneously. In order to make use of the available infrared information, an additional NIR channel is appended as a fourth channel.
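The opponent transform of Equations 1 to 3 can be sketched in a few lines. The snippet below is a minimal illustration, not BIOPAD's implementation: the function names `opponent_pixel` and `opponent_image` and the nested-list image layout are assumptions made for the example, and the NIR value is simply passed through as a fourth channel.

```python
import math

SQRT2, SQRT3, SQRT6 = math.sqrt(2.0), math.sqrt(3.0), math.sqrt(6.0)


def opponent_pixel(r, g, b, nir=None):
    """Map one RGB (optionally RGB + NIR) pixel to opponent channels."""
    o1 = (r - g) / SQRT2            # red-green opponency, Equation 1
    o2 = (r + g - 2.0 * b) / SQRT6  # yellow-blue opponency, Equation 2
    o3 = (r + g + b) / SQRT3        # intensity, Equation 3
    if nir is None:
        return (o1, o2, o3)
    return (o1, o2, o3, nir)        # NIR appended unchanged as a fourth channel


def opponent_image(image):
    """Convert a nested list of (r, g, b) or (r, g, b, nir) pixels row by row."""
    return [[opponent_pixel(*pixel) for pixel in row] for row in image]
```

For an achromatic pixel (R = G = B), both O1 and O2 vanish and only the intensity channel O3 carries signal, which is the property that separates color from lightness in this space.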

The use of infrared or thermal imaging alongside the visible spectrum, has been the subject of investigation many times in the past (Kong et al., 2005) and Gabor filters with near-infrared data have been applied together with computer vision algorithms (Prokoski and Riedel, 2002; Singh et al., 2009; Zhang et al., 2010; Chen and Ross, 2013; Shoja Ghiass et al., 2014). However, the use of infrared spectra in presentation attack detection using a biologically-motivated model, to our knowledge, is a first with this research work.

The infrared range of wavelengths is very broad, spanning from 0.7 microns all the way up to 300 microns, and these bands are generally undetectable to the human eye. However, there is evidence that infrared wavelengths of around 1 micron are, under certain circumstances, detectable by humans as visible light (Palczewska et al., 2014). From a biological perspective, the exact mechanism of near-infrared perception in the visual cortex is unknown. In BIOPAD, at the low feature level, near-infrared is treated as an additional channel input from the retina, with a range of normalized pixel values as provided by the sensor (**Figure 2**). Infrared data acquisition and sensor information are described in section Presentation attack results.

#### Area V1—Edge Detection

As visual signals travel to the primary visual cortex through the lateral geniculate nucleus, orientation-selective simple cells in area V1 process incoming information from the retinae (Hubel and Wiesel, 1967) and perform basic edge detection operations for all subsequent visual tasks. They serve as the building block units of biological vision. It is well established in the literature that orientation selectivity in V1 simple cells can be closely matched by Gabor filters (Marcelja, 1980; Daugman, 1985; Webster and De Valois, 1985).

A Gabor filter is a linear filter defined as the product of a sinusoid with a 2D Gaussian envelope. For values in pixel coordinates (x, y), it is expressed as:

$$G(x, y) = \exp\left(-\frac{X^2 + \gamma^2 Y^2}{2\sigma^2}\right) \cos\left(\frac{2\pi}{\lambda} X\right) \tag{4}$$

$$X = x\cos\theta - y\sin\theta \tag{5}$$

$$Y = x\sin\theta + y\cos\theta \tag{6}$$

In Equation 4, γ is the aspect ratio, which in this work is set to 0.3. Parameter λ is the wavelength of the cosine factor and, together with the effective width σ, specifies the spatial tuning of the Gabor filter. Ideally, to optimize the extraction of contour features from V1 units for a particular set of objects, some form of learning is necessary to isolate an optimum range of filters. However, this process adds complexity and is time-consuming, since it requires a huge number of samples, as experiments on convolutional neural networks in the literature have shown. In order to avoid this step, Gabor filter parameters are hardcoded directly into our model, following parameterization sets that have been identified in past studies. Two different parameterization settings have been considered (Serre and Riesenhuber, 2004; Lei et al., 2007; Serrano et al., 2011). Our preliminary experiments have shown that the choice between these two Gabor filter parameterization ranges has no noticeable effect on PAD results. Thus, we chose the parameterization values given in Serrano et al. (2011).

Additionally, it is known that V1 cell receptive field sizes vary considerably (McAdams and Reid, 2005; Rust et al., 2005; Serre et al., 2007) to provide a range of fine to coarse spatial frequencies. Accordingly, four different receptive field sizes were used here, with pixel dimensions 3 × 3, 5 × 5, 7 × 7, and 9 × 9. Coarser features are handled by area V2, explained in the next section.
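A minimal sketch of how such a filter bank could be generated is given below. It uses the standard rotation X = x cos θ − y sin θ, Y = x sin θ + y cos θ and the aspect ratio γ = 0.3 stated in the text. The number of orientations and the ratios tying σ and λ to the kernel size are hypothetical placeholders, not the values of Serrano et al. (2011), and any normalization of the kernels is omitted.

```python
import math


def gabor_kernel(size, theta, sigma, lam, gamma=0.3):
    """Build a size x size Gabor kernel (Gaussian envelope times cosine carrier).

    theta: orientation in radians; sigma: effective width;
    lam: wavelength; gamma: aspect ratio (0.3 in the text).
    """
    half = size // 2
    kernel = []
    for row in range(size):
        y = row - half
        values = []
        for col in range(size):
            x = col - half
            # Rotate pixel coordinates into the filter's frame.
            x_rot = x * math.cos(theta) - y * math.sin(theta)
            y_rot = x * math.sin(theta) + y * math.cos(theta)
            envelope = math.exp(-(x_rot ** 2 + (gamma * y_rot) ** 2) / (2.0 * sigma ** 2))
            carrier = math.cos(2.0 * math.pi * x_rot / lam)
            values.append(envelope * carrier)
        kernel.append(values)
    return kernel


def gabor_bank(sizes=(3, 5, 7, 9), n_orientations=4, sigma_ratio=0.4, lam_ratio=0.5):
    """Hypothetical bank: sigma and lam scale with kernel size (assumed ratios)."""
    bank = {}
    for size in sizes:
        for k in range(n_orientations):
            theta = k * math.pi / n_orientations
            bank[(size, k)] = gabor_kernel(size, theta, sigma_ratio * size, lam_ratio * size)
    return bank
```

Each kernel would then be convolved with every input channel; with the four sizes above and four assumed orientations, the bank holds 16 kernels.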

#### Area V2—Texture Features

In general, the significance of textural information has sometimes been neglected or even downplayed in past biologically-inspired vision models. In face biometrics, as explained in the introductory section, there is a long list of texture-based presentation attack detection models, and texture information is considered a crucial feature against attacks.

The role of cortical area V2 in basic shape and texture perception is essential. V2 cells share many of the edge properties found in V1. Nevertheless, V2 cells have broader receptive fields and are attuned to more complex features than V1 cells (Hegdé and Van Essen, 2000; Schmid et al., 2014). In addition to broader spatial features, this layer processes textural information and is therefore capable of expressing the different nature of surfaces. This is a crucial advantage in face presentation attack detection, where there is a wealth of information hidden within the texture of faces, facial features or face attacks. For example, the texture of beards, skin, and glasses can prove a valuable feature against spoofing attacks mimicking their nature.

V2 cells are effectively expressed by a sinusoidal grating cell operator though other shape characteristics also correspond well (Hegdé and Van Essen, 2000). The grating cell operator has not only shown great biological plausibility with respect to actual V2 texture processes but has also proven superior to Gabor filters in texture related tasks (Grigorescu et al., 2002). Its response is relatively weak to single bars but in contrast, it responds maximally to periodic patterns.

The approach used here (Petkov and Kruizinga, 1997) consists of two stages. In the first stage grating subunits generate on-center and off-center cells responding to periodicity much like retina cells. In the following stage, grating cell responses of a particular orientation and periodicity are added together, a process also known in neurons as spatial summation (Movshon et al., 1978).

The response Gr of a grating subunit at position (x, y), with orientation θ and periodicity λ, is given by Petkov and Kruizinga (1997):

$$\operatorname{Gr}\left(\mathbf{x},\mathbf{y}\right)\_{\boldsymbol{\theta},\boldsymbol{\lambda}} = \begin{cases} 1, \; \operatorname{if} \; \forall \; n, \; \operatorname{M}\left(\mathbf{x},\mathbf{y}\right)\_{\boldsymbol{\theta},\boldsymbol{\lambda},\boldsymbol{n}} \ge \rho \operatorname{M}(\mathbf{x},\mathbf{y})\_{\boldsymbol{\theta},\boldsymbol{\lambda}}\\ 0, \; \operatorname{if} \; \exists \; n, \; \operatorname{M}\left(\mathbf{x},\mathbf{y}\right)\_{\boldsymbol{\theta},\boldsymbol{\lambda},\boldsymbol{n}} < \rho \operatorname{M}(\mathbf{x},\mathbf{y})\_{\boldsymbol{\theta},\boldsymbol{\lambda}} \end{cases} \tag{7}$$

where n ∈ {−3, . . . , 2} and ρ is a threshold parameter between 0 and 1 (typically 0.9). The maximum activity M at a given location (x, y), for a particular selection of θ, λ and n, is calculated as follows (Petkov and Kruizinga, 1997):

$$\operatorname{M}\left(x,y\right)\_{\theta,\lambda,n} = \max\left\{ \operatorname{s}\left(x',y'\right)\_{\theta,\lambda,\phi\_n} \;\middle|\; \begin{array}{l} n\frac{\lambda}{2}\cos\theta \le x' - x < (n+1)\frac{\lambda}{2}\cos\theta, \\ n\frac{\lambda}{2}\sin\theta \le y' - y < (n+1)\frac{\lambda}{2}\sin\theta \end{array} \right\} \tag{8}$$

$$\phi\_n = \begin{cases} 0, & n = -3, -1, 1 \\ \pi, & n = -2, 0, 2 \end{cases} \tag{9}$$

and

$$\operatorname{M}\left(x,y\right)\_{\theta,\lambda} = \max\left\{ \operatorname{M}\left(x,y\right)\_{\theta,\lambda,n} \;\middle|\; n = -3, \ldots, 2 \right\} \tag{10}$$

The quantities M(x, y)<sub>θ,λ,n</sub> in Equation 8 are maxima of simple cell responses with symmetric receptive fields along a line segment of length 3λ. Essentially this means that there are three peak responses for each grating subunit at point (x, y) at a given orientation θ. This line segment is split into intervals of length λ/2. The particular position of each interval defines the response of on-center and off-center cells. In other words, a grating cell subunit is maximally activated when on-center and off-center cells of the same orientation and spatial frequency are activated at point (x, y). In Equation 9, φ<sub>n</sub> is the phase offset; the values 0 and π correspond to symmetric center-on and center-off operations, respectively.
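The subunit computation can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the simple cell responses are assumed to be precomputed, non-negative 2D maps stored under the hypothetical keys `"0"` and `"pi"`, each λ/2 interval is sampled at unit steps along the direction θ, and positions outside the map count as zero. The guard for a zero maximum is also an assumption, added so that regions without any activity do not pass the threshold test of Equation 7.

```python
import math


def phase_key(n):
    """Phase offset for interval n: 0 for n = -3, -1, 1 and pi for n = -2, 0, 2."""
    return "0" if n % 2 != 0 else "pi"


def grating_subunit(s, x, y, theta, lam, rho=0.9):
    """Binary grating subunit response at column x, row y.

    s maps "0" / "pi" to 2D lists (rows of columns) of simple cell responses.
    """
    def value(key, col, row):
        grid = s[key]
        if 0 <= row < len(grid) and 0 <= col < len(grid[0]):
            return grid[row][col]
        return 0.0  # assumption: no response outside the map

    half = lam / 2.0
    maxima = []
    for n in range(-3, 3):
        key = phase_key(n)
        best = 0.0
        t = n * half
        # Sample the interval [n * lam / 2, (n + 1) * lam / 2) along theta.
        while t < (n + 1) * half:
            col = x + int(round(t * math.cos(theta)))
            row = y + int(round(t * math.sin(theta)))
            best = max(best, value(key, col, row))
            t += 1.0
        maxima.append(best)

    overall = max(maxima)  # maximum over all six intervals
    if overall == 0.0:
        return 0  # assumption: no activity means no grating response
    # Threshold test of Equation 7: every interval must be strongly active.
    return 1 if all(m >= rho * overall for m in maxima) else 0
```

With a periodic pattern of matching wavelength all six intervals contain a strong response and the subunit returns 1, whereas a single bar activates only one interval and the subunit returns 0.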

In the second part of the V2 grating cell design, the response w of a grating cell centered on (x, y), with orientation θ and periodicity λ, is the weighted summation of grating subunit responses with orientations θ and θ + π, as given below:

$$\operatorname{w}\left(x,y\right)\_{\lambda,\theta} = \iint \exp\left(-\frac{\left(x - x'\right)^2 + \left(y - y'\right)^2}{2\left(\beta\sigma\right)^2}\right) \left(\operatorname{Gr}\left(x',y'\right)\_{\theta,\lambda} + \operatorname{Gr}\left(x',y'\right)\_{\theta+\pi,\lambda}\right) dx'\, dy', \quad \theta \in \left[0, \pi\right) \tag{11}$$

Parameter β determines the size of the summation area and has a typical value of 5. In our experiments the number of simple cells was empirically set to 3, and all other parameters were left at the default values of Petkov and Kruizinga (1997).
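A discrete version of the weighted summation in Equation 11 might look like the sketch below. It assumes that the two subunit maps (orientations θ and θ + π) are given as equally sized 2D lists and that the integral is replaced by a sum over all pixel positions; the function and argument names are illustrative.

```python
import math


def grating_cell_response(gr_theta, gr_theta_pi, x, y, sigma, beta=5.0):
    """Gaussian-weighted sum of subunit outputs around column x, row y."""
    scale = 2.0 * (beta * sigma) ** 2
    total = 0.0
    for yp, (row_a, row_b) in enumerate(zip(gr_theta, gr_theta_pi)):
        for xp, (a, b) in enumerate(zip(row_a, row_b)):
            # Weight decays with squared distance from the cell centre.
            weight = math.exp(-((x - xp) ** 2 + (y - yp) ** 2) / scale)
            total += weight * (a + b)
    return total
```

A larger β widens the Gaussian, so subunits further from the cell centre contribute more to the response.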

# BIOPAD Structure

Light waves are being continuously perceived by our eyes and every generated electrical impulse passes via the lateral geniculate nucleus of our brain to arrive at the first neurons in the striate cortex (Hubel and Wiesel, 1967). Countless neurons organized in progressive layers then process this information through cascades of cerebral layer modules each intended for a specific operation. Broadly, visual areas in the human brain after visual area V2 follow the dorsal and ventral visual pathways, the "where" and "what" pathways (Schneider, 1969; Ungerleider and Mishkin, 1982). The two streams are layers along two distinct cerebral paths that localize and analyse meaningful information in constant neuronal communication.

BIOPAD's structure mimics the basic visual areas V1 and V2 in the primary visual cortex in a bottom-up fashion (**Figure 2**). Its operation relies on the early stages of biological visual cognition, without any external biases or influences. The design successively processes extracted biologically-inspired features, reducing their dimensionality to an extent that they can be used with classifiers that distinguish genuine from fake access attempts. Furthermore, through successive biologically-motivated filtering, BIOPAD's main strength lies in its ability to transform extracted features into higher dimensional vectors in a simple way that maximizes the separation between them. For example, an important difference between BIOPAD and HMAX is that the latter model's main focus is the view-invariant representation of objects irrespective of their size, position, rotation and illumination. Conversely, BIOPAD's purpose is the detection of face spoofing attempts and, to this end, invariance properties such as size and position could be valuable in future extensions. Even though invariance properties are generally meaningful in face recognition (Yokono and Poggio, 2004; Perlibakas, 2006; Rolls, 2012), in this particular scenario of face presentation attack detection they add unnecessary complexity or processing delays and are therefore not explored further. More specifically, BIOPAD's proposed structure is separated into the following layers (**Figure 2**):

**Input Layer:** The purpose of the input layer is to prepare image information by scaling down all input RGB images to a minimum of 300 pixels for the shortest edge in order to preserve the image's aspect ratio. This particular image size was chosen as a good compromise between speed/time and computational cost.

**Layer L1:** This layer plays the role of the lateral geniculate nucleus and separates visual stimuli into the appropriate double-opponency channels (bipolar cells) as given in section Area V1—Edge detection, while scaling all pixel values to the same range between 0 and 1.

**Layer L2a:** Gabor filter operations perform edge detection according to the parameterization values given in section Area V1—Edge detection, producing feature maps for each channel. It is important to note that after obtaining filtered outputs from all Gabor filters (192 in total) for each double-opponency channel, a maximum operator is applied so that a particular maximum response r of L2a vectors (x<sub>1</sub> . . . x<sub>m</sub>) in a neighborhood j is given by:

$$\mathbf{r} = \arg\max\_{\mathbf{j}} (\mathbf{x}\_{\mathbf{j}}) \tag{12}$$

The maximum operator is a well-known non-linear biological property exhibited by certain visual cells at low levels of visual cognition that assists in pooling visual inputs from previous layers (Riesenhuber and Poggio, 1999; Lampl et al., 2004) to greater receptive fields. This hierarchical process gradually projects meaningful visuospatial information to higher cortical layers in the mammalian brain (**Figures 3a,b**).
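The maximum operator can be illustrated with a short sketch. It assumes the filter outputs are stored as a list of equally sized feature maps and, following the description of a maximum response, returns the maximum value at each position rather than the index of the winning filter.

```python
def max_over_filters(responses):
    """Pool a stack of feature maps by keeping the largest value per position.

    responses: list of feature maps, each a list of rows with identical shape.
    """
    rows, cols = len(responses[0]), len(responses[0][0])
    return [
        [max(feature_map[r][c] for feature_map in responses) for c in range(cols)]
        for r in range(rows)
    ]
```

For two 2 × 2 maps `[[1, 5], [3, 0]]` and `[[2, 4], [1, 7]]` the pooled map is `[[2, 5], [3, 7]]`.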

**Layer L2b:** In this layer grating cell operations are performed according to the settings given in section Area V2—Texture features. Subsequently, grating outputs are spatially summed with outputs from L2a in order to form a single L2 output for each of the three double-opponency channels. Spatial summation is another property of the visual cortex and, like the maximum operator, it is intended to linearly combine presynaptic inputs into outputs for higher layers (Movshon et al., 1978). Spatial summation is used in this layer in order to preserve the spatial integrity and sensitive texture information in faces (**Figure 3c**).

**Layer L3:** The three double-opponency channels after spatial summation (**Figure 3d**) contain both edge and texture features. The information of these channels, along with the RG-BY spectral channels from L1 that contain the spectral differences of each image, is aggregated into spatial histograms with a window size of 20 units and bin size of 10. These values were empirically selected after experimentation as ideal for the particular layer dimensions. These spatial histograms have been used before in the context of face recognition but with lower level features at L1 (Zhang et al., 2005). Here, they are employed at an intermediate level of feature processing and with various types of biological-like features. It is further important to note here that since all these spatio-spectral channels carry different types of visual information, they are never mixed together.
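One possible reading of the spatial histogram step is sketched below. It assumes non-overlapping 20 × 20 windows, interprets the bin setting as 10 equal-width bins, and expects channel values in the range 0 to 1; all three points are assumptions, since the text does not spell them out.

```python
def spatial_histograms(channel, window=20, bins=10):
    """Return one intensity histogram per non-overlapping window.

    channel: 2D list of values assumed to lie in [0, 1].
    """
    rows, cols = len(channel), len(channel[0])
    histograms = []
    for top in range(0, rows, window):
        for left in range(0, cols, window):
            hist = [0] * bins
            for r in range(top, min(top + window, rows)):
                for c in range(left, min(left + window, cols)):
                    # Clamp so that out-of-range values fall in the edge bins.
                    index = int(channel[r][c] * bins)
                    index = max(0, min(index, bins - 1))
                    hist[index] += 1
            histograms.append(hist)
    return histograms
```

Each channel is processed separately, in line with the statement that the spatio-spectral channels are never mixed.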

**Layer L4:** In this layer all L3 information is simply concatenated and sorted into a multidimensional vector for either the training or testing phase, without any further processing. Vector dimensions vary according to the size of the dataset and the choice of parameters within the model. For example, if the L3 spatial histograms are computed over larger regions, or if the input layer image is set to smaller dimensions (for faster processing), then the total number of vectors extracted will be smaller. Moreover, if the total number of images in the dataset changes, so does the vector dimension size, i.e., md × np, where m are the vectors extracted from previous layers with length d and n are the columns of vectors per image p.

**Layer L5:** Supervised classification takes place in this layer and any classifiers used can be trained with the extracted feature vector from L4. Training data are selected by following the 10-fold cross-validation technique. The supervised classifiers chosen for this work were a Support Vector Machine (SVM) with a linear kernel and k-Nearest Neighbor (KNN) with Euclidean distance.
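The classification stage can be sketched with a k-nearest neighbor classifier and a k-fold split. This is an illustration only: the fold assignment, the fixed random seed and the tie-breaking rule (the label of the closest neighbor wins) are assumptions, and k = 2 neighbors is taken from the experimental setup reported later in the text.

```python
import math
import random
from collections import Counter


def k_fold_indices(n_samples, k=10, seed=0):
    """Shuffle sample indices and deal them into k folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def knn_predict(train_x, train_y, query, k=2):
    """Label a query vector by majority vote of its k nearest neighbors."""
    order = sorted(range(len(train_x)), key=lambda i: math.dist(train_x[i], query))
    votes = Counter(train_y[i] for i in order[:k])
    # On a tie, most_common keeps first-seen order, i.e. the closest neighbor.
    return votes.most_common(1)[0][0]


def cross_validate(x, y, k_folds=10, k_neighbors=2):
    """Return the mean accuracy over all held-out folds."""
    accuracies = []
    for held_out in k_fold_indices(len(x), k_folds):
        held = set(held_out)
        train_idx = [i for i in range(len(x)) if i not in held]
        train_x = [x[i] for i in train_idx]
        train_y = [y[i] for i in train_idx]
        correct = sum(
            knn_predict(train_x, train_y, x[i], k_neighbors) == y[i]
            for i in held_out
        )
        accuracies.append(correct / len(held_out))
    return sum(accuracies) / len(accuracies)
```

Here `x` would hold the L4 feature vectors and `y` the genuine or attack labels; the sketch needs at least as many samples as folds.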

BIOPAD's overall operation is further demonstrated with a pseudo-code approach below:

#### RGB Data Setup

Each PAD database consists of single **RGB** frame samples for a particular person's authentic video sequence and their presentation attacks. The PAD image database is then split into 70% training samples (Tr) and 30% testing samples (Ts), with cross-validation in 10 folds.

if RGB case train then,

for each random Tr sample of each fold do,

for each opponency channel O1 (red–green differences), O2 (blue–yellow) and O3 (lightness) do,

1. L1<sub>Tr</sub> = O<sub>r</sub> · G<sub>f</sub>, where L1<sub>Tr</sub> is a multidimensional array of m × n × 192 convolved versions of the Tr frame with V1 Gabor-like filters.
2. Extract the maximum response using Equation (12) at every position along the dimension of convolutions to obtain a new matrix L1<sub>M</sub>.
3. Normalize L1<sub>M</sub> to zero mean and unit variance.
4. L2<sub>Tr</sub> = O<sub>r</sub> · G<sub>r</sub>, where L2<sub>Tr</sub> is a multidimensional array of m × n × θ convolved versions of the Tr frame with V2 grating filters.
5. Extract the maximum response using Equations (10–12) at every position along the dimension of convolutions to obtain a new matrix L2<sub>M</sub>.
6. Normalize L2<sub>M</sub> to zero mean and unit variance.

for each random Ts sample of each fold do,

repeat steps (1–6) as above and use 5920 column vectors of Ts to extract predictions from the trained classifier
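The normalization to zero mean and unit variance used in the steps above can be sketched as follows. The statistics are assumed to be computed over the whole map, and a constant map is mapped to zeros to avoid division by zero; both choices are assumptions.

```python
import math


def normalize_map(feature_map):
    """Rescale a 2D map to zero mean and unit variance."""
    values = [v for row in feature_map for v in row]
    mean = sum(values) / len(values)
    variance = sum((v - mean) ** 2 for v in values) / len(values)
    std = math.sqrt(variance)
    if std == 0.0:
        # Assumption: a constant map carries no contrast, so return zeros.
        return [[0.0 for _ in row] for row in feature_map]
    return [[(v - mean) / std for v in row] for row in feature_map]
```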

#### RGB and NIR Data Setup

The FRAV database consists of **RGB** and **NIR** single samples for a particular person's authentic video sequence and their presentation attacks. The PAD image database is then split into 70% training samples (Tr) and 30% testing samples (Ts), with cross-validation in 10 folds, maintaining the original RGB and NIR sample ratios.

if RGB and NIR case train then,

for each random Tr sample of each fold do,

repeat steps (1–2) and (3–6). At L1, for each opponency channel O1 (red–green differences), O2 (blue–yellow), O3 (lightness) and NIR (near-infrared), extract 7100 L4 column vectors for each Tr sample during classifier training.

else if RGB and NIR case test then,

for each random Ts sample of each fold do,

repeat steps (1–2) and (3–6). At L1, for each opponency channel O1 (red–green differences), O2 (blue–yellow), O3 (lightness) and NIR (near-infrared), extract 7100 L4 column vectors of Ts for predictions obtained from the trained classifier.

# EXPERIMENTS

It is important to note that in all experiments, for both genuine access and impostor attacks, only one photo per person was used from the entire video sequences. The databases employed in this work and their different spoofing attacks are explained in section Databases. Section Presentation attack results presents the obtained results in conventional biometric evaluation measures. The remaining part of this section is divided into experiments in the visible and near-infrared spectrum, in which the different spectra are examined individually and subsequently through their cross-spectral fusion at feature and score levels. Since our model currently does not perform any liveness detection, successive video frames are not considered. For the purpose of homogeneity and statistical accuracy between datasets, train and test data were divided with the cross-validation technique, bypassing the original train/test data split of some databases, as explained in more detail in the previous section.

#### Databases

The Facial Recognition and Artificial Vision (FRAV) group's "attack" database addresses several critical issues compared to other available face PAD databases. The number and type of attacks can vary significantly between facial presentation attack databases and, by and large, past databases never included a large sample of known threats. In addition to the sample of individuals examined being relatively small, little attention was paid to the multitude of characteristics occurring within human populations, e.g., beards, glasses, eye color, haircuts etc. At the same time, sensor equipment was often limited and outdated compared with contemporary products found in the market today. These shortcomings necessitated the creation of an up-to-date PAD facial database according to ISO/IEC and ICAO standards with a larger statistical sample, multi-sensor information and inclusion of all basic attacks. This database serves as a stepping stone for simulating real-world situations in future experiments and supplements the list of existing publicly available databases. The introduction of this new database from our group offers the following main characteristics and contributions:

• The largest PAD-ready facial database to date with 185 different individuals of both genders and various age groups.

• The largest collection of sensor data aimed at PAD algorithms. Four different types of sensors were used, namely Intel's Realsense F200, a FLIR ONE mobile phone thermal sensor, a Sony A6000 ILCE-A6000 and a HIKVISION surveillance camera, covering a range of spectral bands in the visible, near-infrared (at 860 nm) and infrared (800–1500 nm).

FIGURE 3 | A genuine access attempt vs. a photo-print attack. Top row shows the progressive process of a genuine photo attempt. Bottom row shows the printed photo attack. Column (A) shows the input layer images. Column (B) the L2a layer as processed from edge detection Gabor filters, column (C) the L2b layer processed from texture grating cells and column (D) the combined layers L2a and L2b after spatial summation. The richness and depth of edge-texture information in the original image (top row) is apparent. All participants gave written informed consent for the publication of this manuscript.

• Various spoofing attack scenarios examined, which include the following types of spoofing attacks:


Lastly, particular attention was paid to uniformly illuminating all faces using artificial lighting. Two T4 fluorescent tubes, each operating at 6,000 K and 12 W, evenly distributed multi-directional light over all subjects. **Figure 4** illustrates all of the presentation attack types explored in the FRAV "attack" database for a given subject using RGB and NIR sensor information.

The CASIA Face Anti-Spoofing database (Zhang et al., 2012) is a database from the Chinese Academy of Sciences (CASIA) Center for Biometrics and Security Research (CASIA-CBSR). This database contains 10 s videos of real-access and spoofing attacks of 50 different subjects, divided into train and test sets with no overlap. All samples were captured with three devices at different resolutions: (a) low resolution with an old 640 × 480 webcam, (b) normal resolution with a more up-to-date 640 × 480 webcam and (c) high resolution with a 1920 × 1080 Sony NEX-5 camera. Three different attacks were considered: (a) warped, where spoofing attacks are performed with curved copper paper hardcopies of high-resolution digital photographs of genuine users, (b) cut, where attacks are performed using hardcopies of high-resolution digital photographs of genuine users with the eye areas cut out to simulate eye blinking, and (c) video, where genuine user videos are replayed in front of the capturing device using a tablet.

The MSU Mobile Face Spoofing Database or MFSD (Wen et al., 2015) for face spoof attacks consists of 280 video clips of photo and video attack attempts of 35 different users. This database was produced at the Michigan State University Pattern Recognition and Image Processing (PRIP) Lab, in East Lansing, US. The MSU database has the following properties: (a) mobile phones were used to acquire both genuine faces and spoofing attacks, (b) printed photos were generated as high-definition prints and their authors claim that these have much better quality than printed photos in other databases of this kind. Two types of cameras were used in this database: (a) the built-in camera of a MacBook Air at a resolution of 640 × 480, and (b) the front-facing camera of the Google Nexus 5 Android phone at a resolution of 720 × 480. Spoofing attacks were generated using a Canon SLR camera, recording 18.0 Mpixel photographs and 1,080p high-definition video clips, and an iPhone 5S back-facing camera, recording 1,080p video clips.

FIGURE 4 | An example of a subject from the FRAV "attack" database. Top row left to right: Genuine access RGB photo, RGB printed photo attack, RGB printed mask attack, RGB printed mask with eyes exposed attack, RGB tablet attack. Bottom row left to right: Genuine access NIR photo, NIR printed photo attack, NIR printed mask attack, NIR printed mask with eyes exposed attack, NIR tablet attack. All participants gave written informed consent for the publication of this manuscript.

#### Presentation Attack Results

BIOPAD was evaluated with three different databases: FRAV "attack", CASIA, and MFSD. The main concern of our experiments was the detection success rate of spoofing attacks made by potential impostors. In simple terms, the system was required to effectively differentiate between fake and genuine access attempts. This was treated as a two-class classification problem. The applied biometric evaluation measures, the spoofing False Acceptance Rate (sFAR) and the False Rejection Rate (FRR), are defined as:

$$\text{sFAR} = \frac{\text{Impostor attacks seen as genuine}}{\text{Total number of attacks}} \tag{13}$$

$$FRR = \frac{\text{Rejected genuine access attempts}}{\text{Total number of genuine access attempts}} \tag{14}$$

Moreover, presentation attack detection is further presented according to SC37ISO/IEC JTC1 Biometrics (2014) with an additional measure, Average Classification Error Rate (ACER). The average of impostor attacks incorrectly classified as genuine attempts and normal presentation incorrectly classified as impostor attacks is given by:

$$ACER = \frac{sFAR + FRR}{2} \tag{15}$$
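Equations 13 to 15 translate directly into code. The sketch below assumes that ground truth and predictions are given as sequences of the strings `"genuine"` and `"attack"`, a labeling convention chosen here for illustration, and that both classes are present.

```python
def pad_error_rates(labels, predictions):
    """Return (sFAR, FRR, ACER) for a two-class PAD decision."""
    attacks = sum(1 for label in labels if label == "attack")
    genuine = sum(1 for label in labels if label == "genuine")
    # Attacks wrongly accepted as genuine (numerator of Equation 13).
    accepted_attacks = sum(
        1 for label, pred in zip(labels, predictions)
        if label == "attack" and pred == "genuine"
    )
    # Genuine attempts wrongly rejected (numerator of Equation 14).
    rejected_genuine = sum(
        1 for label, pred in zip(labels, predictions)
        if label == "genuine" and pred == "attack"
    )
    sfar = accepted_attacks / attacks
    frr = rejected_genuine / genuine
    return sfar, frr, (sfar + frr) / 2.0
```

For example, with four attacks of which one is accepted and four genuine attempts of which two are rejected, sFAR = 0.25, FRR = 0.5 and ACER = 0.375.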

Train and test data were partitioned using the k-fold cross-validation technique. All scores were obtained using 10 folds and, in order to further verify performance scores, L4 feature vectors were classified using two different schemes: a Support Vector Machine (SVM) classifier with two different kernels, linear and Radial Basis Function (RBF), and a k-nearest neighbor (KNN) classifier with n = 2 nearest neighbors and Euclidean distance as the distance measure. In practice, the best number of neighbors varies with the dataset, but for the two-class problem here, out of all n values examined, n = 2 produced the best average over all datasets, as found through cross-validation. BIOPAD was first examined only on the RGB images of all three databases and then on both RGB and near-infrared (NIR) images at feature and score levels for the FRAV "attack" database only, since infrared data are unavailable for the other databases.

#### Visible Spectrum Experiments

Accuracy rates are defined as the number of images for each database correctly classified as genuine or fake, i.e., true positives and true negatives. The average classification accuracy scores and standard deviation values from all trials in **Tables 1**, **2**, respectively, highlight the large differences between datasets and classifiers. From **Table 1** it can be deduced that BIOPAD analyses presentation threats better than HMAX under all of the examined databases. Depending on the choice of training and testing data as provided by cross-validation, significant deviations in results may occur. This is largely due to the relatively small sample sizes in databases, especially in CASIA and MFSD, leading to significant statistical variance. This has an obvious effect on the KNN classifier which portrays an unstable and low performance with respect to SVM. The CASIA presentation attack database produced the worst overall results in terms of PAD.

TABLE 1 | The average detection percentages (%) of 10 trials with cross-validation.

TABLE 2 | The average standard deviation values (σ<sup>2</sup>) of 10 trials with cross-validation.

The highest performance was achieved with the FRAV "attack" database, closely followed by the performance achieved with the MFSD database. This is not entirely surprising since both datasets consist of good quality images and high resolution print attacks. The worst performance was observed when operating with CASIA photos. The total average performance over all datasets is 96.24% in the BIOPAD SVM linear case and 92.27% for HMAX. HMAX is not a dedicated PAD algorithm, nor was it ever designed for such a purpose. Nevertheless, it can be seen from **Table 1** that HMAX performed remarkably well, which illustrates the adaptability and capacity of bio-inspired computer vision models.

In **Table 2**, standard deviation values further paint a picture of relationships between models and datasets. The highest performance was observed in BIOPAD with SVM using the FRAV database and the worst in HMAX KNN using CASIA. Between them there is a sizeable difference of 16% indicating the impact of choosing a particular scenario and classifier in PAD performance. It is further noticeable from this table that BIOPAD provides a more consistent set of results with SVM linear being the overall winner in performance. The detection accuracy rates in **Table 1** provide an insight into the overall ability of the PAD model to detect spoofing attacks. From these results it is seen that the model can achieve a high detection rate at almost 99% with a consistent standard deviation value of 1.14 for the SVM linear kernel case in the FRAV database. Overall, the KNN classifier with the CASIA database has shown the worst performance. While conclusions from **Tables 1**, **2** are useful, biometric evaluation becomes more meaningful when measured in terms of sFAR and FRR which can effectively capture the nature of error.

In addition to HMAX, and for a more complete comparison with BIOPAD, the selected databases were analyzed using a Convolutional Neural Network (CNN). Multiple lines of research on CNN architectures have been explored in the last two decades and a large number of different methods have been proposed (Canziani et al., 2016; Ramachandram and Taylor, 2017). In this part of the experiments, the objective is to compare the proposed bio-inspired method with a baseline CNN model. The architecture selected was based on the well-known LeNet method (LeCun et al., 1998) with the improvements implemented in AlexNet (Krizhevsky et al., 2012). AlexNet has been tested for detecting face presentation attacks (Yang et al., 2014; Xu et al., 2016; Lucena et al., 2017). The network consists of eight layers, five convolutional and three fully connected. All results provided in **Table 3** are the average of 10 trials.

**Table 3** shows that error percentages are relatively small and comparable with those of a state-of-the-art algorithm such as the CNN, which has been used for this task in the past. The sFAR percentages for the CASIA and MFSD databases are comparable, but there is a significant difference between the two databases in their FRR percentages. Naturally, this is also reflected in the ACER percentages. The significant difference in FRR percentages indicates the difficulty of distinguishing attacks from genuine access attempts in the CASIA database. The error percentages for the best classifier choice (SVM linear) appear particularly improved for the FRAV "attack" database. In effect, this indicates the importance of image quality in both verification and presentation attack cases. Image quality depends on various factors and is also reflected in the PAD results seen in **Table 1**. We further wanted to investigate the impact that V1 and V2 edge and texture operations have on the overall performance of presentation attack detection. These tests were only performed for the SVM linear kernel case. The separate and combined effects of V1 and V2 operations can be seen in **Table 4** in terms of classification percentages. PAD scores rise when V1 and V2 feature vectors are combined, and standard deviation values across all trials indicate more consistent performance. While these values are indicative in these early stages of experimentation, a separate study on optimum parameterization for each layer may yet reveal a more important relationship between edge and texture features in presentation attack detection.

TABLE 3 | AlexNet and BIOPAD average sFAR and FRR scores over 10 trials.

TABLE 4 | The average classification percentages (%) and standard deviation values of 10 trials with cross-validation for V1 and V2 operations.

In order to better understand the intrinsic quality difference of the databases used in this work, various metrics were explored. Numerous image quality metrics have been developed over the years, such as mean square error, maximum difference, normalized cross-correlation and peak signal-to-noise ratio amongst many others. Some of these metrics have in fact been successfully used as a separate PAD algorithm (Galbally et al., 2014). The majority of quality metrics require the examined image to be subtracted from a reference image. This produces accurate error results only when the image content is identical. However, in practice face databases are a collection of images from various sensors at different angles. In this particular case, sharpness metrics capable of measuring the content quality from a single image are therefore more suitable and useful. As with quality metrics, there is a long list of sharpness metrics in use in the literature today, e.g., absolute central moment, image contrast and curvature, histogram entropy, steerable filters, energy gradients etc. An in-depth database quality analysis is beyond the scope of this work, and we have experimented with several sharpness metrics, noting similar responses from all. **Table 5** shows indicative sharpness results using the spatial frequency quality metric (Eskicioglu and Fisher, 1995), which was chosen as a representative.
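For reference, the spatial frequency measure is commonly computed from the squared differences between horizontally and vertically adjacent pixels, SF = sqrt(RF² + CF²), where RF and CF are the row and column frequencies. The sketch below follows that common definition and assumes a grayscale image given as a 2D list; it is not taken from the experimental code of this work.

```python
import math


def spatial_frequency(image):
    """Return sqrt(RF^2 + CF^2) for a grayscale image (list of rows)."""
    rows, cols = len(image), len(image[0])
    # Row frequency term: differences between horizontal neighbors.
    row_sum = sum(
        (image[r][c] - image[r][c - 1]) ** 2
        for r in range(rows) for c in range(1, cols)
    )
    # Column frequency term: differences between vertical neighbors.
    col_sum = sum(
        (image[r][c] - image[r - 1][c]) ** 2
        for r in range(1, rows) for c in range(cols)
    )
    return math.sqrt((row_sum + col_sum) / (rows * cols))
```

A flat image gives a value of zero, while sharper images with more abrupt local changes give larger values.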

It is evident from the mean values (µ) in **Table 5** that the CASIA dataset on average does not possess the high quality of spatial features seen in the MFSD and FRAV databases. Furthermore, the MFSD dataset has produced the best scores; however, it should be highlighted that it has neither the variety of presentation attacks found in the FRAV "attack" database nor its abundance of test subjects. The "Smartphone" and "Tablet" attacks are a similar type of electronic device attack and there is no provision of mask attack data. To better understand the importance of these differences, we employ the t-Distributed Stochastic Neighbor Embedding (t-SNE) technique (Van Der Maaten and Hinton, 2008) to visualize and compare presentation attacks in each dataset. L4 vectors as extracted from BIOPAD are used with the t-SNE technique at "default" settings, i.e., 30 dimensions for its principal component analysis part and 30 for the Gaussian kernel perplexity factor, and are shown in **Figure 5**.

TABLE 5 | Direct comparison of spatial frequency quality index values for three datasets and for each of their presentation attacks.

FIGURE 5 | L4 vectors visualized with t-SNE for the three datasets. (A) real vs. impostors, CASIA database, (B) presentation attacks, CASIA database, (C) real vs. impostors, MFSD database, (D) presentation attacks, MFSD database, (E) real vs. impostors, FRAV "attack" database, and (F) presentation attacks, FRAV "attack" database.

In **Figures 5A,C,E**, real access attempts vs. impostor attacks are visualized within the same space. These illustrations help to understand how far genuine users lie from their attacks. It can be easily observed in **Figure 5A** that for the CASIA dataset real access attempts are scattered across the same space as presentation attacks, making the classification process complex and difficult to achieve. This is also confirmed by its reduced detection rates. Different patterns are exhibited in **Figure 5C**, where real access attempts occupy a denser area within the impostor attack zone, and finally in **Figure 5E**, in which real access attempts fall within a separate space. Looking closely at the presentation attack images in all datasets, it is not surprising that these patterns occur. In **Figure 5B**, mainly due to the low image sharpness in CASIA (**Table 5**) and the nature of attack experiments, L4 vectors cover almost the same range of values and dimensional space. As the separation of presentation attacks and real access attempts improves in **Figures 5D,F**, so do the results in **Table 1**. Finally, in **Figure 5F**, some real access attempts exhibit a noticeable overlap with their respective presentation attacks, particularly within the printed photo space, which is the main source of sFAR and FRR errors for the FRAV database. Arguably, the presentation attack that, in general, best matches genuine user information is the "printed photo" attack, which can be efficiently countered in the NIR spectrum (section Near-infrared experiments and cross-spectral fusion).

Finally, comparing BIOPAD L4 vectors with HMAX vectors using t-SNE (**Figure 6**), it can be noted that HMAX vectors do not display the same amount of consistency in distinct areas; rather, vectors from all attacks appear merged and scattered across the same area. HMAX's lack of bio-inspired features capable of processing texture and color information leads to hardly distinguishable classes. In effect, this takes a toll on presentation attack detection results (**Table 1**).

#### Near-Infrared Experiments and Cross-Spectral Fusion

BIOPAD experiments in the previous section centered on the visible spectral bands and showed great promise. Nonetheless, there were noticeable overlaps with certain presentation attacks, so we wanted to further expand BIOPAD's capacity to cope with these attacks and minimize the contribution of errors arising either directly from the subjects or from their surroundings. For this reason, our experiments in this section present a direct comparison between the performance for each spectral band, and then their fusion at feature and score levels, i.e., fusion between the visible and NIR bands. At feature level, NIR is treated like an additional channel (**Figure 2**) and L4 vectors from all bands are processed equally in the model. Conversely, at score level the visible and NIR bands are processed and classified separately. After classification, however, vectors for each subject are examined over all trials using the weighted sum score level fusion technique in order to make a decision on whether the subject is genuine or not.
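As an illustration of weighted sum score level fusion, the following minimal sketch assumes that each band's classifier returns one score per trial in the range [0, 1], with higher values indicating a genuine access attempt. The function names, the equal default weights, and the 0.5 decision threshold are hypothetical choices made for this example and are not values reported for BIOPAD.

```python
def fuse_scores(visible_scores, nir_scores, w_visible=0.5, w_nir=0.5):
    """Weighted sum of per-trial scores from two bands (weights are assumed)."""
    if len(visible_scores) != len(nir_scores):
        raise ValueError("both bands need one score per trial")
    total = w_visible + w_nir
    # Normalize by the weight total so fused scores stay in the input range.
    return [(w_visible * v + w_nir * n) / total
            for v, n in zip(visible_scores, nir_scores)]


def decide_subject(visible_scores, nir_scores, threshold=0.5, **weights):
    """Average the fused scores over all trials of one subject, then threshold."""
    fused = fuse_scores(visible_scores, nir_scores, **weights)
    mean_score = sum(fused) / len(fused)
    return mean_score >= threshold, mean_score


# Hypothetical scores for two trials of one subject.
is_genuine, score = decide_subject([0.9, 0.8], [0.95, 0.85])
```

Giving the NIR band a larger weight (for example `w_nir=0.75`) would let the band with the better class separation dominate the decision; how the weights should be set is left open here.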

For this round of experiments, we only process the FRAV "attack" dataset since NIR data is unavailable in other datasets; to our knowledge, the FRAV "attack" database is the only face presentation attack dataset in the literature that includes NIR data. The FRAV "attack" dataset originally consists of 185 different subjects, and the experiments in the previous section were conducted on this sample. In these experiments, the number of subjects with available data is reduced to 157 since there were failure-to-acquire instances during database acquisition. All other setup parameters remain unchanged.

In **Table 6**, the best results with the lowest standard deviation values for BIOPAD across all classifiers were obtained by using NIR images. The drop in performance in the visible spectrum is nearly 1.5% for the SVM linear classifier case and this trend is consistent with other classifier settings. NIR superiority in this type of presentation attack experiment can be further seen in the t-SNE results in **Figures 7A,B**, where it is apparent that classes are well-separated. These representations can be directly compared with the visible spectrum case (**Figures 5E,F**), where there was a clear overlap between genuine and impostor attacks, leading to errors being introduced in sFAR and FRR. The overlap between genuine access attempts and printed photo attacks does not exist in the NIR case and the "tablet" attack is completely neutralized since no useful attack information is projected at NIR. Fusing visual information between the visible and NIR at feature level caused BIOPAD to lose slightly in detection rate performance with respect to NIR only, by ∼1.5%, which is also noticeable in the standard deviation values. Moreover, when visualized at feature level and with the visible spectrum analyzed (**Figures 7C,D**), attack patterns appear slightly improved relative to **Figures 5E,F** but otherwise similar patterns are noticeable.

Furthermore, the performance of the different types of visual information can be viewed from the Detection Error Tradeoff (DET) curve shown in **Figure 8**. The DET curve for the FRAV "attack" database illustrates the relationship between sFAR and FRR. Naturally, sFAR and FRR confirm the same behavior seen in the percentages presented in **Table 6**. As expected, the best curve is obtained by BIOPAD with NIR, followed by RGB + NIR (feature level) and RGB. The equal error rate, or Attack Presentation Equal Error Rate (APEER), is a biometric security system indicator defined at the threshold where sFAR and FRR are equal; their common value at that threshold is known as the "equal error rate." At this operating point, the proportion of false acceptances equals the proportion of false rejections. A lower equal error rate means higher accuracy. In **Figure 8**, the difference between APEERs in BIOPAD's case is 4.15%, which shows that for the types of attacks present in the FRAV "attack" database, the best acquisition method for PAD is the use of a NIR sensor.
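The way an equal error rate is read off a set of classifier scores can be sketched as below. This is a minimal illustration under stated assumptions rather than the procedure used to produce **Figure 8**: scores at or above a threshold are taken to be accepted as genuine, every observed score is tried as a threshold, and the function names and example scores are hypothetical.

```python
def error_rates(genuine, attacks, threshold):
    """Return (sFAR, FRR) when scores >= threshold are accepted as genuine."""
    sfar = sum(s >= threshold for s in attacks) / len(attacks)   # attacks accepted
    frr = sum(s < threshold for s in genuine) / len(genuine)     # genuine rejected
    return sfar, frr


def equal_error_rate(genuine, attacks):
    """Approximate the EER by scanning every observed score as a threshold."""
    best = None
    for t in sorted(set(genuine) | set(attacks)):
        sfar, frr = error_rates(genuine, attacks, t)
        gap = abs(sfar - frr)
        if best is None or gap < best[0]:
            # Keep the threshold where the two error rates are closest.
            best = (gap, (sfar + frr) / 2, t)
    return best[1], best[2]


# Hypothetical scores: four genuine attempts and four presentation attacks.
eer, threshold = equal_error_rate([0.9, 0.8, 0.7, 0.4], [0.6, 0.3, 0.2, 0.1])
```

With finite data the two error rates rarely cross exactly, so the sketch reports the mean of sFAR and FRR at the threshold where they are closest.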

#### CONCLUSIONS

In this article we presented a novel presentation attack detection algorithm that relies on the extraction of biologically-inspired edge and texture features, by mimicking biological processes found in areas V1 and V2 of the human visual cortex. This model, termed "BIOPAD," achieved presentation attack detection rates of up to 99% in certain cases by utilizing only one photo per person, for all attacks examined in the three datasets that were investigated. The main contributions of this research work were to (a) present a novel biologically-inspired PAD algorithm which behaves comparably to other state-of-the-art algorithms, (b) introduce a new PAD database called FRAV "attack," and (c) introduce near-infrared band information for PAD experimentation at feature and score levels.

BIOPAD has been successful in surpassing other standard biological-like techniques such as HMAX and CNN which are considered state-of-the-art and benchmark models in biologically-inspired vision research. In addition, the creation, introduction and implementation of a new face presentation attack database by our group termed as "FRAV attack," extended our investigation conclusions with high definition samples and diverse scenarios for the most commonly used spoofing attacks. The "FRAV attack" dataset which encompasses visual data that span from visible to infrared, is expected to set future standards for all new databases in face biometrics.

For the first time in the literature, a biologically-inspired algorithm has been directly applied to near-infrared information, specifically for the purposes of face presentation attack detection. As observed from the experimental analysis in section Presentation attack results, BIOPAD features maximize the separation between attacks and as a consequence increase attack detection performance. The sFAR and FRR indicate that BIOPAD error performance falls within acceptable limits, and it was further evident from our experiments that the data were better separated by a linear SVM classifier. However, future research in classification might reveal classification schemes more effective in dealing with incoming data from multiple sensors.

Our results have also shown that near-infrared sensor information is of great value for presentation attack detection, significantly outperforming visible spectrum data. In our case, an increase in detection rate of almost 6% was observed between the near-infrared and visible scenarios. While the usefulness of near-infrared information appears indisputable, we have proposed data fusion from multiple sensors to minimize errors from future elaborate attack methods that have not yet been investigated. To this end, data fusion at feature and score level indicates enhanced detection rates with respect to rates obtained from the visible spectrum.

Overall, results were promising and BIOPAD can serve as a foundation for further enhancements. Future work will include refinement of the biological-like operations to significantly increase performance and speed, optimization of presentation attack detection for video, and real time processes by incorporating biologically-inspired liveness detection algorithms, experimentation with multiple sensors, different types of novel and sophisticated presentation attacks, and experimentation in dynamic—real world situations.

#### ETHICS STATEMENT

This study was carried out in accordance with the recommendations of the European Union, Spanish police, Spanish government, and University of Rey Juan Carlos with written informed consent from all subjects. All subjects gave written informed consent in accordance with the Declaration of Helsinki. The protocol was approved by the University of Rey Juan Carlos in Spain.

#### AUTHOR CONTRIBUTIONS

AT is the principal author, main contributor, and researcher of this work. CC helped in the following sections: original research, experiments, and text revision. BG helped during experiments. EC supervised this work and helped in the following sections: original research, during experiments, and text revision.

#### FUNDING

This research work has been partly funded by ABC4EU project (European Union's Seventh Framework Programme for research, technological development and demonstration under grant agreement No 312797) and by BIOinPAD project (funded by Spanish national research agency with reference TIN2016-80644-P).

#### ACKNOWLEDGMENTS

Preliminary stages of this work were presented in our work titled Face Presentation Attack Detection using Biologically-inspired Features (Tsitiridis et al., 2017). The authors would like to specially thank David Ortega del Campo for his significant contribution and effort in acquiring the new FRAV attack database.




**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2019 Tsitiridis, Conde, Gomez Ayllon and Cabello. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# Scene Regularity Interacts With Individual Biases to Modulate Perceptual Stability

#### Qinglin Li<sup>1,2,3,4</sup>\*<sup>†</sup>, Andrew Isaac Meso<sup>5†</sup>, Nikos K. Logothetis<sup>1,6</sup> and Georgios A. Keliris<sup>1,3,4</sup>\*

<sup>1</sup> Department of Physiology of Cognitive Processes, Max Planck Institute for Biological Cybernetics, Tübingen, Germany, <sup>2</sup> IMPRS for Cognitive and Systems Neuroscience, University Tuebingen, Tübingen, Germany, <sup>3</sup> Bernstein Center for Computational Neuroscience, Tübingen, Germany, <sup>4</sup> Department of Biomedical Sciences, University of Antwerp, Wilrijk, Belgium, <sup>5</sup> Psychology and Interdisciplinary Neurosciences Research Group, Faculty of Science and Technology, Bournemouth University, Poole, United Kingdom, <sup>6</sup> Division of Imaging Science and Biomedical Engineering, University of Manchester, Manchester, United Kingdom

#### Edited by:

Hedva Spitzer, Tel Aviv University, Israel

#### Reviewed by:

Szonya Durant, Royal Holloway, University of London, United Kingdom Huseyin Boyaci, Bilkent University, Turkey

#### \*Correspondence:

Qinglin Li qinglin.li@tuebingen.mpg.de Georgios A. Keliris georgios.keliris@uantwerpen.be

†These authors have contributed equally to this work

#### Specialty section:

This article was submitted to Perception Science, a section of the journal Frontiers in Neuroscience

Received: 16 October 2018 Accepted: 06 May 2019 Published: 28 May 2019

#### Citation:

Li Q, Meso AI, Logothetis NK and Keliris GA (2019) Scene Regularity Interacts With Individual Biases to Modulate Perceptual Stability. Front. Neurosci. 13:523. doi: 10.3389/fnins.2019.00523

Sensory input is inherently ambiguous but our brains achieve remarkable perceptual stability. Prior experience and knowledge of the statistical properties of the world are thought to play a key role in the stabilization process. Individual differences in responses to ambiguous input and biases toward one or the other interpretation could modulate the decision mechanism for perception. However, the role of perceptual bias and its interaction with stimulus spatial properties such as regularity and element density remain to be understood. To this end, we developed novel bi-stable moving visual stimuli in which perception could be parametrically manipulated between two possible mutually exclusive interpretations: transparently or coherently moving. We probed perceptual stability across three composite stimulus element density levels with normal or degraded regularity using a factorial design. We found that increased density led to the amplification of individual biases and consequently to a stabilization of one interpretation over the alternative. This effect was reduced for degraded regularity, demonstrating an interaction between density and regularity. To understand how prior knowledge could be used by the brain in this task, we compared the data with simulations coming from four different hierarchical models of causal inference. These models made different assumptions about the use of prior information by including conditional priors that either facilitated or inhibited motion direction integration. An architecture that included a prior inhibiting motion direction integration consistently outperformed the others. Our results support the hypothesis that direction integration based on sensory likelihoods may be the default processing mode, with conditional priors inhibiting integration employed in order to help motion segmentation and transparency perception.

Keywords: visual perception, bias, bayesian, computational modeling, regularity, psychophysics, human perception, motion perception

Our brains are subjected to ambiguous sensory inputs from a variety of sources, yet the world that we perceive appears stable and coherent. To constantly maintain such a percept, dynamic sensory inputs are thought to be combined with our prior knowledge and experience to form what should be consistent neural representations (Knill and Richards, 1996; Rao et al., 2002). Alternative percepts compete dynamically, continuously resulting in changes to the dominant representation driven by interactions taking place at several stages of the cortical hierarchy. Perception can thus vary between multiple outcomes by a myriad of possible mechanisms (Desimone and Duncan, 1995; Beck and Kastner, 2009; Meso et al., 2016b). Biased competition theory suggested that objects simultaneously presented in the visual field compete for neural representation and attention can bias this competition (Desimone and Duncan, 1995; Desimone, 1998; Beck and Kastner, 2009). When stimuli are inherently more ambiguous, such internal processes become more critical in perceptual selection and could govern the outcome of the competition. However, the role of observer bias and how that might interact with key visual stimulus properties which may often control signal strength, remains unexplored. Questions arise following evidence recently found that the human visual system possesses internal templates for regular patterns, indicating that regularity is a coded feature in human vision (Morgan et al., 2012; Ouhnana et al., 2013).

Here, we developed novel bi-stable visual stimuli (**Figure 1**) that exploited the significant role of plaid local elements such as intersections (Stoner et al., 1990) to parametrically manipulate perception between two possible interpretations, coherently and transparently moving. We then probed perceptual stability during the resulting ambiguous motion perception across three stimulus density levels with normal or degraded regularity using a factorial design. Further, a set of Bayesian observer models based on the causal inference framework (Shams and Beierholm, 2010) was developed to perform a perceptual task analogous to the experiments, in order to support the investigation of the underlying mechanism. Causal inference has been demonstrated to model perceptual judgments of multisensory integration (Körding et al., 2007; Sato et al., 2007) and fine motion direction judgments made using discrimination (Stocker and Simoncelli, 2007). The approach tackles the problem of having to decide whether two sensory signals come from the same source (in which case they should be integrated) or come from different sources (in which case they should be segregated). These models typically have just four parameters: one corresponding to the observer's individual bias toward one or the other of the alternatives, two capturing the sensory noise associated with the representation of each competing alternative, and finally a prior width parameter which defines the extent of the influence the prior has across the measurement space when it is applied. We implement the models in the current experimental context to explore whether performance changes across the density and regularity conditions measured during the tasks are better explained by shifts in one or both sensory likelihood parameters or in prior parameters.

# MATERIALS AND METHODS

#### Participants and Apparatus

Five subjects (college students, four females) participated in all the experiments, four of whom were naïve to the aims of the study. All had normal or corrected-to-normal vision. The study was approved by the ethical committee of the University of Tuebingen. Before data collection, a written participant informed consent was obtained from each subject.

The experiments were performed in a dimly lit room. The stimuli were programmed using Matlab Psychophysics toolbox (Brainard, 1997) and presented on a 17-inch CRT monitor (iiyama, 21sd017) with a resolution of 1,280 × 1,024 and a refresh rate of 100 Hz. The monitor was gamma corrected with a mean luminance of 15.6 cd/m<sup>2</sup> . The distance from the eyes of the subject to the monitor was 43 cm. Responses from subjects were acquired by using a bespoke 2-button response box (see Procedures). Eye movements were monitored continuously using an infrared video eye tracker (iView XTM Hi-speed, SMI).

## Stimuli

The novel plaid stimuli in this study were designed to mimic and manipulate the local elements (lines and intersections) that carry the motion signals within the square line plaid stimuli that have been used extensively in psychophysics (Stoner et al., 1990). To achieve this, we decomposed the original plaids into two different types of stimulus patches (see **Figures 1A–C**; **Supplementary Movies 1, 2**): separated lines (SL) and line intersections (LI). Although in what follows we refer to these patches as apertures, it should be noted that their dynamic content always remained the same (SL or LI) independent of the position at which they were plotted. This allowed us to manipulate the locations of these motion signals to be either consistent with an underlying plaid or jittered in space. The mimicked plaid from which these apertures were created consisted of two identical superimposed asymmetric line gratings (Hupé and Rubin, 2003; Takahashi, 2004; Moreno-Bote et al., 2010) with a directional difference of 120° (±60° with respect to vertical). Stimulus directions were fixed with respect to the vertical rather than being randomized during the task to avoid previously reported idiosyncratic anisotropies in participant representations of direction (Rauber and Treue, 1999) and to simplify simulated categorical perceptual decisions during the modeling. The spatial frequency of each narrow line grating was 1 cycle per degree, with a duty cycle of 1 pixel or 0.03° and a speed of 2° per second. In order to minimize the luminance effect of the intersection for plaid stimuli (Stoner et al., 1990; Thiele and Stoner, 2003), the luminance of the small intersections remained the same as that of the line. The color of the lines was black (0.9 cd/m<sup>2</sup>) and the background was gray (15.6 cd/m<sup>2</sup>).
In Experiment 1 (Regular; **Figure 1B**) the aperture positions were selected based on a regular grid of locations where either intersections or single lines would be expected in the classic plaid (see positions of red and green dotted circles in **Figure 1A**). In Experiment 2 (Irregular; **Figure 1C**), the possible positions of apertures were dynamically jittered vertically from the grid locations (±0.025° of visual angle) and SL and LI could be located in any of the locations on the underlying grid, abolishing the regularity of Experiment 1. The diameter of each aperture was 0.2° of visual angle and 720 potential locations were used with no overlap over a stimulus area with a 23° diameter. A rhombus-shaped mask was applied upon each aperture so that no terminators leading to the perception of circular apertures would be seen (Pack et al., 2003). The vertical and horizontal distances between the centers of adjacent apertures were 0.5° and 0.28° of visual angle, respectively. A red fixation cross (0.2° of visual angle) was shown at the center of the stimuli. No apertures were located within a circular area (2° of visual angle in diameter) centered on the fixation cross. The stimuli shared some similarities with previously used multi-aperture stimuli but also had some critical differences (Amano et al., 2009, 2012): (a) within the apertures we used moving lines instead of drifting Gabors, (b) in the regular condition aperture locations for lines and intersections were selected according to the underlying plaid pattern (Experiment 1), (c) the number of apertures was systematically manipulated, and (d) the proportion of different aperture types was used to parametrically change perception.

The total number of apertures was chosen based on three density conditions: low, medium, and high, with 180, 340, and 680 apertures, respectively. New random positions were selected according to these numbers for each trial. In addition, we parametrically manipulated the ratio between SL and LI along 11 homogeneously spaced proportions within the range of 0% to 100%.
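The per-trial selection of apertures can be sketched as follows. The number of potential locations, the three density levels and the 11 proportions are taken from the text; everything else is an assumption made for illustration: the names are hypothetical, the proportion is expressed as the fraction of LI apertures, and aperture types are assigned to locations without regard to the plaid grid, which corresponds to the irregular condition only.

```python
import random

N_LOCATIONS = 720                                   # potential aperture locations
DENSITY = {"low": 180, "medium": 340, "high": 680}  # apertures per density level
LI_PROPORTIONS = [i / 10 for i in range(11)]        # 0%, 10%, ..., 100%


def sample_trial(density, li_proportion, rng=random):
    """Pick aperture locations for one trial and label each as 'LI' or 'SL'."""
    n = DENSITY[density]
    # Sampling without replacement guarantees that no two apertures overlap.
    locations = rng.sample(range(N_LOCATIONS), n)
    n_li = round(n * li_proportion)
    # Assumed irregular-style assignment: the first n_li sampled locations get LI.
    return [(loc, "LI" if i < n_li else "SL") for i, loc in enumerate(locations)]


# One medium-density trial with 30% line intersections (seeded for repeatability).
trial = sample_trial("medium", 0.3, random.Random(0))
```

For the regular condition, the LI and SL labels would instead have to be restricted to the grid locations where intersections and single lines occur in the underlying plaid.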

#### Procedures

For both Experiments 1 and 2, subjects were instructed to press a key on the response box to start a trial (see **Figure 1D**). After that, a red fixation cross was shown at the center of the monitor for 1 s. Before trial onset, background luminance was slightly adjusted to the mean luminance depending on the density condition to have a homogeneous mean luminance across conditions and trials. First, a static image was presented for 0.5 s to control for transitional eye movements. Then, the stimulus started moving for 1 s, and subjects had to report their perception (either coherent or transparent) during this period by pressing one of two keys. They were instructed to do so as fast as possible and according to their first impression. In order to avoid potential adaptation effects, each trial was followed by a 0.5 s full field Gaussian noise pattern with mean luminance equal to the average of all trials. A method of constant stimuli was used: for each subject, each of the 11 points along the parametric manipulation of the ratio of the different types of apertures was measured 30 times. All conditions were presented in a pseudo-randomized fashion.

At the beginning of each block, a standard nine-point eye tracking calibration was performed. Subjects took a break after each block. For training, subjects performed 4 blocks of 15 trials before each experiment. They were instructed to fixate the center of the screen and use a chin-rest to avoid head movements.

# Theory and Models

Modeling transparent motion perception presents a challenge of separating unlabeled signals which can come from one source or from multiple sources, posing a computational problem similar to that previously studied with vowel sounds (Sato et al., 2007; Feldman et al., 2009). Here, we used the causal inference framework which originates in multisensory perception and considered the problem to be solved as an explicit two-step hierarchical process with an initial unity vs. separation choice and subsequent direction perception made subject to the influence of the initial decision as a conditional estimate (Stocker and Simoncelli, 2007; Zamboni et al., 2016). This class of models typically has four parameters (Körding et al., 2007; Stocker and Simoncelli, 2007): a participant bias parameter—which we did not use in the current work for reasons explained later, two sensory likelihood parameters corresponding to each alternative sensory representation and a prior width parameter which determines the extent to which the likelihoods can be shifted along the measurement space.

An optimal Bayesian model would average over the probability of both hypotheses (Körding et al., 2007; Sato et al., 2007), which in this case would be transparent, dominated by the components and given by H = h<sub>c</sub>, and coherent, dominated by the plaid pattern and given by H = h<sub>p</sub>, making a decision by reading out from the averaged probability distribution. For a difficult categorical perceptual decision associated with a global percept with mutually exclusive alternatives like ambiguous global motion, we followed previous work (Sato et al., 2007; Stocker and Simoncelli, 2007; Zamboni et al., 2016), and used an implementation in which the optimality of averaging was sacrificed for a quick and self-consistent decision. In other words, a categorical decision is made and this adjusts the shape of the prior probabilities to influence the refined estimate of the second stage. The visual stimulus contains a superimposed distribution of multiple directions of components, θ<sub>s</sub>, from which the visual system makes a sensory measurement of the perceived direction distribution, θ<sub>m</sub>, an estimate contaminated by Gaussian noise. Given the task at hand in which the alternatives, h<sub>c</sub> (components dominate) and h<sub>p</sub> (single pattern dominates), cannot mutually exist, we impose an assumption that ambiguity resolution forces the system to commit to one alternative, and its corresponding posterior distribution only, which is either P(θ|h<sub>c</sub>) or P(θ|h<sub>p</sub>), illustrated in **Figure 2** (Sato et al., 2007).

Three model variants made the following assumptions about the prior. M1 assumed no additional hypothesis about the direction space, i.e., a flat prior with all directions equally likely, followed by maximum likelihood estimation from P(θ<sub>m</sub>) and then categorization of direction. M2 selectively applied a prior on trials where an initial hierarchical step suggested that motion integration of the input was needed, consistent with the use of a slow speed prior which has been shown to explain some cases of motion perception (Weiss et al., 2002); the categorical decision in the second step was based on the estimated maximum posterior direction after multiplication with the excitatory prior (h<sub>p</sub>). M3 similarly computed a categorical decision from the maximum posterior after multiplication with an inhibitory prior (h<sub>c</sub>), but did so on the trials which could not be selected by M2, i.e., those where early noisy computations suggested component separation, which supports motion segregation. This novel configuration implements a prior distribution centered diametrically opposite to the average stimulus direction in the circular direction space so that the average direction is inhibited. This is a viable probability distribution configuration in a circular space. Note that for simulations of configuration M2, no segregation priors (i.e., M3) were applied on trials where integration was chosen and similarly, for the separate simulations under the M3 prior, no integration prior (i.e., M2) was applied to any trials. M4 is a control condition which uses either prior (h<sub>c</sub> or h<sub>p</sub>) on each individual trial following the initial estimate, a biologically implausible architecture which we used to allow us to contrast conditions.

The probability of the alternative categorical hypotheses H, is given by Equation (1) which includes all the respective likelihoods and priors,

$$P(H|\theta\_m) = P(\theta\_m|H)P(H)/P(\theta\_m) \tag{1}$$

Applying model averaging over the posterior distribution (Stocker and Simoncelli, 2007) of each model results in Equation (2):

$$\int P\left(\theta\_s|\theta\_m\right)d\theta = 1, \tag{2}$$

$$P\left(\theta\_s|\theta\_m\right) = P\left(\theta\_s|\theta\_m, H = h\_c\right)P\left(H = h\_c|\theta\_m\right) + P\left(\theta\_s|\theta\_m, H = h\_p\right)P\left(H = h\_p|\theta\_m\right), \tag{3}$$

where the composite posterior in Equation (3) is obtained by adding both alternative posterior probabilities corresponding to each perceptual alternative. We simplify Equation (3), which includes the two separate posterior terms, by using model selection to propose an initial fast computation of a binary variable χ (1 or 2; see simulations), corresponding to hypotheses H = h<sub>p</sub> and H = h<sub>c</sub>, respectively, to hierarchically separate the early discrimination and the estimation tasks (Luu and Stocker, 2018). In each case, one alternative is selected and the remaining term is set to a probability of zero (Stocker and Simoncelli, 2007). We do not seek an optimal solution to Equation (3) and instead, following the lead of previous work, sacrifice optimality for consistency (Stocker and Simoncelli, 2007; Luu and Stocker, 2018). During simulations, we assign a decision value of χ = 1 if the MLE is closer to the average (pattern direction) than the component direction, and χ = 2 if the MLE is closer to the transparent component direction (see **Figure 4**). This heuristic crudely solves the "one vs. two" component problem and reduces the number of free parameters used in this type of experiment from four to three by avoiding the inclusion of a parameter for bias. While individual differences in participant biases have been previously found and modeled (Odegaard and Shams, 2016), in the current work we expected there might be differences within participants across our scene structure conditions and so focused on the interaction between the role of sensory representations and the strength of prior biases. Our heuristic computation of χ similarly constrained all the participants' categorical estimation.

variance σ<sub>p</sub>, while SL is similarly modeled as two Gaussian probability density functions (S<sub>SL</sub>) centered on µ + 60° and µ − 60°, respectively, with the same variance σ<sub>c</sub>. The likelihood P(θ<sub>m</sub>|θ) contains S<sub>LI</sub> and S<sub>SL</sub>, combining with the respective prior term P(θ|h) (c) to get the posterior distribution of P(θ<sub>s</sub>|θ<sub>m</sub>) (d). Prior settings are different for M1–M4, see text for details. The prior terms P(θ|h<sub>p</sub>) and P(θ|h<sub>c</sub>) are also both Gaussian terms centered on the VA direction which either enhance (h<sub>p</sub>) or inhibit (h<sub>c</sub>) the pattern to support integration or segregation, respectively. (e) Decision is made based on a final direction using MAP estimation leading to categorical perception (f).
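The heuristic for χ described in the text can be sketched as a nearest-direction rule. This is a minimal illustration, assuming directions are expressed in degrees relative to the pattern direction (0°) with the components at ±60°; the function name and the handling of exact ties (assigned to χ = 2) are hypothetical choices.

```python
THETA_S = 60.0   # component directions lie at +/-60 deg from the pattern direction


def chi_from_estimate(theta_mle, theta_s=THETA_S):
    """Return 1 if the estimate is nearer the pattern direction, otherwise 2."""
    to_pattern = abs(theta_mle)
    to_component = min(abs(theta_mle - theta_s), abs(theta_mle + theta_s))
    # chi = 1 selects integration (pattern), chi = 2 selects segregation (components).
    return 1 if to_pattern < to_component else 2


# Four hypothetical maximum likelihood estimates, in degrees.
decisions = [chi_from_estimate(t) for t in (-70.0, -20.0, 5.0, 31.0)]
```

With components at ±60°, the rule reduces to comparing the absolute estimate with 30°, which is why no bias parameter is needed at this stage.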

The conditional inference is therefore computed on a given trial according to either,

$$P\left(\theta|\theta\_m,\chi=1\right) = P(\theta\_m|\theta)P\left(\theta|h\_p\right)/P\left(\theta\_m\right),\tag{4}$$

in the coherent case where pattern motion is reported or,

$$P\left(\theta|\theta\_m,\chi=2\right) = P(\theta\_m|\theta)P\left(\theta|h\_c\right)/P\left(\theta\_m\right),\tag{5}$$

in the case of the transparent choice where the two components are simultaneously perceived. In both Equations (4) and (5), the likelihood term P(θ<sub>m</sub>|θ) is identical and contains Gaussian functions of two components and one pattern term whose width captures the sensory noise, and these are shown together as Equation (6).

$$\begin{split} P(\theta\_m|\theta) &= \frac{A\_{\text{S}}}{\sqrt{2\pi}} \exp\left(-\frac{(\theta-\theta\_{\text{S}})^2}{2\sigma\_{\text{S}}^2}\right) \\ &+ \frac{A\_{\text{S}}}{\sqrt{2\pi}} \exp\left(-\frac{(\theta+\theta\_{\text{S}})^2}{2\sigma\_{\text{S}}^2}\right) + \frac{A\_L}{3\sqrt{2\pi}} \exp\left(-\frac{(\theta)^2}{2\sigma\_L^2}\right) \end{split} \tag{6}$$

The average direction of the distribution in Equation (6) is also the pattern direction, θ<sub>L</sub> = 0. The relative scaling of the Gaussian terms corresponding to the alternative percepts is related by A<sub>S</sub> = 1 − A<sub>L</sub>. The respective prior terms P(θ|hp) and P(θ|hc) are both Gaussian terms centered on the average direction θ = 0 which either enhance (hp) or inhibit (hc) the pattern to support integration or segregation, respectively. These are given by Equations (7) and (8) and illustrated in **Figure 2**.

$$P\left(\theta \middle| h\_p\right) = \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{\theta^{2}}{2\sigma\_P^{2}}\right) \tag{7}$$

$$P\left(\theta \middle| h\_c\right) = 1 - \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{\theta^{2}}{2\sigma\_C^{2}}\right) \tag{8}$$

FIGURE 3 | Estimation of bias and stability for regular (A–C) and irregular (D–F) experiments. (A,D) Cartoons of the stimuli across density conditions for the regular (A) and irregular (D) experiments. (B,E) Fitted psychometric functions for each subject across density conditions for the regular (B) and irregular (E) experiments. The error bar on each psychometric function is the standard error of the mean, estimated by bootstrap resampling (400 resamples). The confidence area (in gray) was defined as the region where the probability of coherent or transparent perception was higher than 75%. (C,F) The direction and amplitudes of bias for each subject corresponding to the conditions of (B) and (E), respectively.

The prior which acts to enhance the vector average direction of Equation (7) is consistent with a previously proposed slow speed prior which has been demonstrated to explain illusory perception for a range of ambiguous motion stimuli (Weiss et al., 2002). The prior inhibiting the part of the direction space where the average lies is a novel contribution in the current work and is consistent with observations of motion repulsion effects which push direction estimates away from the averages of transparent component directions (Mahani et al., 2005; Meso et al., 2016a). Simulated trials are used to generate psychometric data to study the interaction of sensory motion representations and prior distributions that is most consistent with each participant's performance.
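As an illustration of how Equations (4)–(8) combine, the following minimal sketch evaluates the likelihood, the two priors and the resulting posterior on a discrete direction grid. It follows the equations as printed above; the function names (`bump`, `likelihood`, `prior`, `posterior`), the grid, and all parameter values are assumptions chosen for illustration and are not the fitted values or the authors' implementation. Normalizing by the sum over the grid stands in for the division by P(θm).

```python
import numpy as np

THETA_S = 60.0  # component direction offset in degrees (theta_S in the text)

def bump(theta, mu, sigma, amp=1.0):
    """Gaussian term with the 1/sqrt(2*pi) prefactor used in Eqs. (6)-(8)."""
    return amp / np.sqrt(2.0 * np.pi) * np.exp(-((theta - mu) ** 2) / (2.0 * sigma ** 2))

def likelihood(theta, a_l, sigma_s, sigma_l):
    """Eq. (6) as printed: two component terms at +/-theta_S and one pattern term at 0."""
    a_s = 1.0 - a_l  # relative scaling, A_S = 1 - A_L
    return (bump(theta, THETA_S, sigma_s, a_s)
            + bump(theta, -THETA_S, sigma_s, a_s)
            + bump(theta, 0.0, sigma_l, a_l / 3.0))

def prior(theta, chi, sigma_prior):
    """Eq. (7) for chi == 1 (enhance the average), Eq. (8) for chi == 2 (inhibit it)."""
    g = bump(theta, 0.0, sigma_prior)
    return g if chi == 1 else 1.0 - g

def posterior(theta, chi, a_l, sigma_s, sigma_l, sigma_prior):
    """Eqs. (4)/(5) on a grid; dividing by the sum stands in for P(theta_m)."""
    unnormalised = likelihood(theta, a_l, sigma_s, sigma_l) * prior(theta, chi, sigma_prior)
    return unnormalised / unnormalised.sum()

# Illustrative (not fitted) parameter values.
theta = np.arange(-180.0, 180.5, 0.5)  # direction grid in degrees
post_coherent = posterior(theta, 1, 0.5, 10.0, 10.0, 20.0)
post_transparent = posterior(theta, 2, 0.5, 10.0, 10.0, 20.0)
print(theta[np.argmax(post_coherent)], abs(theta[np.argmax(post_transparent)]))
```

With these illustrative values, the enhancing prior places the posterior peak at the pattern direction, whereas the inhibiting prior leaves the peak at one of the component directions.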

#### Simulations

In each trial, assuming a two-step hierarchical process, a maximum likelihood estimate (MLE) based on a reduced number of draws of direction samples from Equation (6) (i.e., 20% of the 5,000 used for the full simulation) was used to compute χ based on the distance between the peak of the direction distribution θMAX and the pattern/zero direction. We note that we adopted the convention of making the vertical direction the zero direction so that the component directions flanked this on either side as ±60°. Having fixed directions rather than fully randomizing stimulus directions across space over trials simplifies the process of computing the thresholds of Equation (10). The initial estimation of χ varied with a logistic type non-linear probability as the percentage of LI apertures went from 0 to 100. Slope depended on the likelihood parameters and the PSE (P = 0.5) was influenced by the relative widths of the pair of likelihoods. This step captures an implicit categorical decision taken when the stimulus is interpreted at onset using the formulation

$$
\theta\_{MAX} = \operatorname\*{argmax}(P(\theta\_m))\tag{9}
$$

$$\chi = \begin{cases} 1, \text{ if } -\frac{\theta\_S}{2} < \theta\_{MAX} < \frac{\theta\_S}{2} \\ 2, \text{ if } |\theta\_{MAX}| > \frac{\theta\_S}{2} \end{cases} \tag{10}$$

With χ determined, the posterior of Equation (3) is then simulated using the model selection estimates of Equation (4) or (5), which eliminate the redundant term. Five thousand draws of direction samples are then used for each trial, binned into a discrete probability distribution with a 0.5° bin resolution. A MAP estimation computes a direction θ<sub>i</sub> for each single trial i, from which a second forced choice decision for the simulated trial is made. Transparent or coherent is selected based on the maximum direction (T: θ<sub>S</sub>/2 < |θ<sub>i</sub>| or C: θ<sub>S</sub>/2 > |θ<sub>i</sub>|) in a similar way to Equation (10). The estimates used to make the categorical decisions assume symmetry across the direction space for simplicity and therefore search for one peak which could be near the pattern direction or within either transparent component, both left and right.
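The two-step procedure described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: the sampling scheme (drawing grid indices with `rng.choice` and taking the modal 0.5° bin), the helper names (`bump`, `likelihood`, `peak_of_samples`, `categorize`, `simulate_trial`), the treatment of the boundary case |θ| = θS/2 (assigned to χ = 2 here), and all parameter values are hypothetical. The sketch applies Equation (4) or (5) according to χ; restricting it to a single prior (as in M2 or M3) would require a small change.

```python
import numpy as np

THETA_S = 60.0                         # component directions at +/-60 deg (from the text)
GRID = np.arange(-180.0, 180.0, 0.5)   # 0.5 deg bins (from the text)

def bump(mu, sigma, amp=1.0):
    """Gaussian term on GRID with the 1/sqrt(2*pi) prefactor of Eqs. (6)-(8)."""
    return amp / np.sqrt(2.0 * np.pi) * np.exp(-((GRID - mu) ** 2) / (2.0 * sigma ** 2))

def likelihood(a_l, sigma_s, sigma_l):
    """Eq. (6) as printed, evaluated on GRID."""
    a_s = 1.0 - a_l
    return (bump(THETA_S, sigma_s, a_s) + bump(-THETA_S, sigma_s, a_s)
            + bump(0.0, sigma_l, a_l / 3.0))

def peak_of_samples(density, n_draws, rng):
    """Draw directions from a gridded density, bin them, and return the modal bin."""
    p = density / density.sum()
    idx = rng.choice(GRID.size, size=n_draws, p=p)
    counts = np.bincount(idx, minlength=GRID.size)
    return GRID[np.argmax(counts)]

def categorize(theta_peak):
    """Eq. (10): 1 = coherent (near 0), 2 = transparent (near +/-theta_S)."""
    return 1 if abs(theta_peak) < THETA_S / 2.0 else 2

def simulate_trial(a_l, sigma_s, sigma_l, sigma_prior, rng, n_full=5000):
    """Return (chi, final categorical decision) for one simulated trial."""
    like = likelihood(a_l, sigma_s, sigma_l)
    # Step 1: fast estimate from 20% of the draws determines chi.
    chi = categorize(peak_of_samples(like, n_full // 5, rng))
    # Step 2: apply the enhancing (Eq. 7) or inhibiting (Eq. 8) prior, then take the MAP.
    g = bump(0.0, sigma_prior)
    prior = g if chi == 1 else 1.0 - g
    theta_i = peak_of_samples(like * prior, n_full, rng)
    return chi, categorize(theta_i)

rng = np.random.default_rng(0)
print(simulate_trial(1.0, 10.0, 10.0, 20.0, rng))  # only the pattern term contributes
print(simulate_trial(0.0, 10.0, 10.0, 20.0, rng))  # only the component terms contribute
```

Repeating `simulate_trial` over many trials and over a range of `a_l` values would yield a simulated psychometric function of the kind compared with the empirical data.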

Each simulated trial had a fixed set of stimulus parameters, θ<sub>S</sub> = 60° and θ<sub>L</sub> = 0°. The two sensory likelihood parameters σ<sub>S</sub> and σ<sub>L</sub> along with the relevant prior parameters σ<sub>P</sub> or σ<sub>C</sub> [for M2 or M3] were used to generate psychometric functions for comparison to the empirical psychometric functions for each participant under all six conditions. The best fitting parameters [σ<sub>S</sub>, σ<sub>L</sub> and σ<sub>P</sub>/σ<sub>C</sub>] were obtained using an iterative Kullback-Leibler minimization to search the simulated parameter space. Fits to the data were compared across models using the Akaike information criterion (Akaike, 1981).

#### RESULTS

Human psychophysics experiments were performed using novel bi-stable line-plaid stimuli (**Figures 1B,C**). Subjects were instructed to report their perception of either a coherent pattern moving upward, or two transparent surfaces sliding over each other in leftward and rightward oblique directions (see Methods). Inspired by the geometric properties of typically used moving line-plaids (**Figure 1A**) (Adelson and Movshon, 1982; Pack et al., 2003) and the architecture of the visual system with very small receptive fields (RFs) in early visual areas, we developed this novel stimulus by decomposing the plaid into two types of local stimulus elements we refer to as apertures: separated lines (SL) and line intersections (LI). In this way, the stimuli could mimic two basic inputs that the visual system could experience locally: 1D- or 2D-motion (green/red apertures, respectively, in **Figure 1A**) based on the dimensions of the features within the aperture. We performed two experiments with the only difference being the positioning of apertures: in Experiment 1 (regular, R) the structure of the mimicked plaid was maintained (**Figure 1B**), whereas in Experiment 2 (irregular, I) the element apertures were spatially jittered (**Figure 1C**). All subjects could consistently fixate within a circular window with radius 0.4 degrees of visual angle (**Figure S1**). For each subject, we first estimated the relative bias toward one of the two possible percepts (transparent or coherent), by calculating the difference between the 50% coherence threshold taken from that subject's fitted psychometric function and the same threshold calculated from the low-density population trend that was used as a reference (**Figure 3**). Interestingly, for higher stimulus densities we observed gradual increases in the bias and this effect was more pronounced in Experiment 1 (Regular) in comparison to Experiment 2 (Irregular).
Statistical analysis was performed using a linear mixed effects model approach with the bias as the dependent variable and density and regularity as fixed effects. Subjects were considered as a random effect, thus allowing for different intercepts in the model (**Figure 4A**). Statistical significance was evaluated after parameter estimation using an F-test for the fixed effects, with density being significant (F(22) = 11.83, P = 0.0023) while the interaction between density and regularity remained a trend (F(22) = 3.32, P = 0.0822). Regularity as a main effect was not significant (F(22) = 1.11, P = 0.3), indicating that on average the two experiments showed comparable biases.

To obtain a quantitative estimate of the stability of the two percepts for each condition, a perceptual stability index (PSI, **Figure 4B**) was calculated for each subject as follows: first, we defined as perceptually stable the stimuli that resulted in either coherent or transparent perception with probability over 75% (i.e., see the shaded areas on either side of the psychometric curve with Pcoherent < 25% or Pcoherent > 75% in **Figure 3**). Then, the PSI was calculated as the fraction of fitted data points within the side of the confidence area corresponding to the dominant percept, and the rest of the points (**Figure 4B**). A linear mixed effects modeling analysis similar to that for the bias was then performed with the PSI as the dependent variable. The results showed a significant main effect of density (F(22) = 6.38, P = 0.0193) as well as a significant interaction between density and regularity (F(22) = 5.55, P = 0.0278). Regularity as a main effect was not significant (F(22) = 1.88, P = 0.18).

To study the relative contribution of prior experience and sensory representation to the processing of the ambiguous motion direction, we modeled the underlying motion perception task using a Bayesian causal inference framework (Sato et al., 2007; Stocker and Simoncelli, 2007; Shams and Beierholm, 2010). To this end, we used models of increasing complexity (no prior, a transparent prior or a coherent prior, and as a control a model with the use of both priors). In the simplest model architecture (M1, no prior), the maximum likelihood was estimated and categorized depending on whether it was closer to the coherent or transparent direction. For models M2 and M3, a hierarchical sequential computation was assumed and on each simulated trial an initial noisy direction estimate χ, was used to determine whether to apply an excitatory (M2, run as a separate independent simulation from M3) or an inhibitory (M3, run separate from M2) prior, each of which required a single additional Gaussian width parameter centered on the average direction. These would have an effect of shifting posterior probabilities to bias perception either toward coherent (M2) or transparent (M3).

Last, in a control condition, a model M4 was simulated by using the best fitting M2/3 parameters and therefore included separate optimal priors for separation and integration. Motion direction was represented as a linear combination of Gaussian probability density functions representing the LI and SL aperture direction and variance (**Figure 2**; also see Methods). The set of models, M1–M4 were tasked with a forced choice decision on whether each simulated trial corresponded to transparent or coherent, over a number of conditions recreating Experiments 1 and 2.

Example model-fitting results for a representative subject are shown in **Figure 5A** (results for all subjects in **Figure S2**). We then performed model comparison based on the Akaike criterion measures (AIC, Akaike, 1981) to identify the optimal model architecture. The AIC measurements use likelihoods from the fitting residuals to determine which model provides the best explanation for the data, giving a lower score for better fits but penalizing models with more parameters. M3 (transparent prior) was found to be the most appropriate model for the data set based on AIC scores (**Figure 5B**). This suggests a general tendency within the visual system toward separating motion components unless there is strong sensory evidence for integration into a single object (here provided by the line intersections (LI) apertures).
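To make the trade-off between goodness of fit and parameter count concrete, the sketch below computes an AIC score from fitting residuals. The exact expression used for the reported scores is not given in the text; the least-squares form n·ln(RSS/n) + 2k used here is a common variant that assumes Gaussian residuals, and the function name `aic_from_residuals` and the example data are hypothetical.

```python
import numpy as np

def aic_from_residuals(observed, predicted, n_params):
    """Least-squares AIC, n * ln(RSS / n) + 2k, assuming Gaussian residuals.

    Lower scores indicate a better trade-off between fit and complexity.
    """
    observed = np.asarray(observed, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    n = observed.size
    rss = float(np.sum((observed - predicted) ** 2))  # residual sum of squares
    return n * np.log(rss / n) + 2 * n_params

# Hypothetical proportions of "coherent" reports across five stimulus levels.
observed = [0.10, 0.30, 0.50, 0.70, 0.90]
fit_two_params = [0.20, 0.35, 0.50, 0.65, 0.80]    # e.g., a model without a prior
fit_three_params = [0.12, 0.28, 0.51, 0.71, 0.88]  # e.g., a model with one prior width

aic_two = aic_from_residuals(observed, fit_two_params, n_params=2)
aic_three = aic_from_residuals(observed, fit_three_params, n_params=3)
print(aic_two, aic_three)
```

In this made-up example the three-parameter model is preferred because its improvement in fit outweighs the penalty of 2 per additional parameter; with identical residuals, the model with fewer parameters would obtain the lower score.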

Further, we analyzed the relationship between the best model parameters of M3 and perceptual bias from empirical data to investigate the potential insights into sensory mechanisms of subjective biases. We found a significant linear correlation between the bias and the variability of sensory representation (Gaussian likelihoods) for SL apertures (r<sup>2</sup> = 0.272, p < 0.05, **Figure 5C**) only for the regular experiment, suggesting that regularity influences the effectiveness of the sensory representation by decreasing variance. There were no similar trends in the fitted parameters for LI sensory likelihoods and the prior (**Figure S3**).

### DISCUSSION

In this study, we used bi-stable motion perception as a tool to understand processes of perceptual stabilization in the human brain. We used a Bayesian causal inference framework (Sato et al., 2007; Stocker and Simoncelli, 2007; Shams and Beierholm, 2010) to model the internal decision process leading to one of the two alternative interpretations with the aim to understand the relative role of priors and sensory evidence in the selection process. We found, counter-intuitively, that adding more motion information by increasing the number of apertures increased response biases in the task. Individuals' tendencies to either one or the other of the percepts were amplified substantially when we increased the density of stimulus apertures. This led to an increased inter-subject variability, with each subject diverging from the population trend with a magnitude and direction that was related to their original bias (**Figure 4A**). Interestingly, this effect was largely abolished in the irregular condition when the position of elements was jittered with respect to their original location, indicating that this form of contextual organization created by spatial regularity played a major role in the amplification of the bias. As a measure of the effect of bias amplification, we computed a perceptual stability index and found that it linearly increased for higher element density.

To further understand the brain processes leading to this result, we adapted hierarchical motion perception models that posit sequential stages of brain processing including local motion detection, global combination of these local signals and then an interpretation of the representation to support categorical/qualitative decisions. This broad mechanistic view is widely supported by evidence in the literature for both psychophysics and physiology (Burr and Thompson, 2011; Nishida, 2011). In the context of our work, the representation of the local motion information can be reflected directly in the neural responses in directionally selective areas such as MT/MST; however, one of the classic difficulties of motion transparency perception is how such a local representation can be transformed into the qualitative percept (e.g., see Qian et al., 1994; Treue et al., 2000; Meso and Zanker, 2009). To this end, and in particular with respect to prior information encoded in the brain of each participant, we built a battery of Bayesian models (M1–M4; see Methods) tasked with probabilistically selecting one of the two percepts on a trial-by-trial basis, simulating the experiments. These modeled the sensory representations of the 1D- and 2D-motion input signals as Gaussian processes each with separate sigma likelihood parameters and, in addition, one of four different prior probability configurations. M3 (which included a segregation prior) provided the best model, suggesting that the visual system selectively applies an inhibition within the direction space to help separate components. Importantly, it should be noted that M3 was the better model even in subjects that were biased toward coherent percepts.
We conjecture that the brain, when faced with such tasks, applies a conditional implementation of separating priors on some critical trials (Zamboni et al., 2016) and not an integrating one because integration might arise naturally from overlapping signal distributions (Mahani et al., 2005). The proposed hierarchical computation extends recent findings in which participants performed an orientation discrimination followed by an orientation estimation task, with the discrimination found to influence the estimation task (Luu and Stocker, 2018). A similar effect had been found for motion stimuli (Zamboni et al., 2016) with a need for self-consistency proposed as an explanation. We argue that this hierarchical two-step computation might occur during our task, with an implicit early categorical decision needed for the ambiguity resolution known to occur early for motion stimuli (Meso et al., 2016a). In the implemented model, for simplicity, fixed directions were explicitly associated with the categorical decisions. Similar models could be implemented in the future in which the decision need not be based on the absolute directions but reached based on the distribution of global motion directions after pooling (i.e., a bimodal distribution would signify transparency and a unimodal one coherence). In that case, the future tested priors could be adjusted and made independent of direction, for example by acting broadly as an attractor or repellent of nearby directions.

Bias stands at the core of signal detection theory (SDT) when applied to both living organisms and machines. In fact, Green and Swets (1966), being the first to develop SDT approaches in psychophysics, directly criticized previously used methods for not being able to separate the sensitivity of subjects from their potential biases. In addition to the principal problem of detecting signal within noise, our brains also face the problem of inherently ambiguous sensory inputs. Thus, to make veridical interpretations of the outside world, the brain needs to employ additional mechanisms such as attention and prior experience (Knill and Richards, 1996; Desimone, 1998; Rao et al., 2002; Beck and Kastner, 2009; Meso et al., 2016a). One theory suggested that objects simultaneously presented in the visual field compete and attention can bias the outcome of this competition (Desimone and Duncan, 1995; Desimone, 1998; Beck and Kastner, 2009). Our results are consistent with the general framework of the biased competition hypothesis; however, attention does not seem to be the primary source of the observed biases as there is no reason to expect attention to vary systematically across the different density or regularity conditions. The subjects had to continuously perform the task of reporting their percepts in randomized trials within blocks so attention should have remained largely constant. Moreover, individual bias directions were independent of the stimulus configuration (which was the same for all subjects), precluding bottom-up stimulus driven attention effects. The subject specific results suggested a strong influence of prior experience or assumptions and thus we expected our modeling results might reveal that some subjects would use a "coherence" prior (M2) while others a "transparency" prior (M3). To our surprise, M3 (in comparison to M2; **Figure 5B**) was a better model for all our subjects, including those with biases toward coherence.
This suggests that the sensitivity of the visual system of each participant to the two motion signals (sensory σ) was more important for determining bias direction in comparison to the integration prior. We conjecture that motion direction integration based on sensory likelihoods may be the default processing mode, with conditional priors inhibiting integration employed in order to help motion segmentation and transparency perception.

Furthermore, bias in our experiments increased with stimulus element density. This was also an unexpected finding, as previous studies have shown that increases in the density of random-dot-kinematograms (RDKs) result in coherence thresholds either decreasing (Barlow and Tripathy, 1997) or being unaffected (Eagle and Rogers, 1997; Talcott et al., 2000; Welchman and Harris, 2000). We note, however, that RDK experiments are closer to the foundations of SDT (i.e., detecting signal within noise). We propose that in our scenario, competition between the two motion representations may be enhanced by density increments resulting in the observed increase of the bias toward a preferred representation which would act like a perceptual attractor, an area within the direction space where probability increases at higher densities. This is consistent with reports in previous literature where contrast-based motion signal increases resulted in stronger 2D motion attractors compared to 1D directions in a tri-stable ambiguous motion stimulus (Meso et al., 2016b). In addition, research with RDKs demonstrated that coherence thresholds in 5–6 year olds were (a) much higher, and (b) decreased with dot density in comparison to adults (Narasimhan and Giaschi, 2012). In our view, this provides evidence for coherent perception or integration as the earliest unelaborated default computation, with the connectivity of the underlying neural circuitry perhaps prone to changes by experience during development. This could explain the different directions of the biases in different subjects.

Interestingly, the bias-amplification and the increases in the perceptual stability index with density were largely abolished in the irregular stimuli with jittered aperture positions. This is consistent with previous work demonstrating the importance of regularity (Morgan et al., 2012; Ouhnana et al., 2013) which appears to play a role in the selection of stable neural representations. Another interpretation is that reduction of regularity eliminates in parallel the correspondence of the single stimulus elements to the underlying patterns or "objects," interfering with their spatial integration. This is consistent with studies that have demonstrated a precedence of global features in visual perception (Beck and Kastner, 2005; Phillips et al., 2015; Ding et al., 2017). Moreover, the profound influence of position jitter on the bias indicates that the scale of the integration cannot be completely local nor global as in that case the regular/irregular conditions should not elicit an effect. These results directly indicate that the motion integration mechanisms contributing to individual biases are of "meso-scale" i.e., go beyond single-neuron receptive fields (RFs) in V1 to scales more typical for area V5/MT but not the very large RFs found in size-invariant object selective areas like inferotemporal cortex (IT).

Previous research has found strong evidence for active perceptual stabilization mechanisms in the visual system, such as reorganization of sensory representation during intermittent viewing (Leopold et al., 2002); top-down modulation of beta-band synchronization (Kloosterman et al., 2015); feedforward inhibition (Bollimunta and Ditterich, 2012); arousal (Mather and Sutherland, 2011; de Gee et al., 2014); and memory (Wimmer and Shohamy, 2012). Our study suggests that bias serves as an additional factor our brains actively use to stabilize our perception of the world.

#### ETHICS STATEMENT

This study was carried out in accordance with the recommendations of the ethical guidelines of the University of Tuebingen, with written informed consent from all subjects. All subjects gave written informed consent in accordance with the Declaration of Helsinki. The protocol was approved by the ethical committee of the University of Tuebingen.

#### AUTHOR CONTRIBUTIONS

QL and GK conceived and designed the psychophysics experiments. QL performed psychophysics experiments and analyzed all data. AM developed the models. QL, AM, and GK ran model simulations and wrote the manuscript. NL supported the study and provided experimental equipment. GK supervised the study. All authors interpreted the experimental results, contributed to the final manuscript, and gave final approval for publication.

#### FUNDING

This work was supported by the Max Planck Society, the German Federal Ministry of Education and Research (BMBF; FKZ: 01GQ1002), and a BOF DOCPRO1 (FFB150293) to GK and QL from the University of Antwerp.

#### SUPPLEMENTARY MATERIAL

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fnins.2019.00523/full#supplementary-material

Supplementary Figure S1 | Eye movement results. Each subplot shows the averaged eye movement results of each subject from regular/irregular conditions.

Supplementary Figure S2 | Psychometric functions. Empirical and simulated psychometric functions were plotted for each experiment and condition.

Supplementary Figure S3 | The amount of bias of transparent perception is not correlated with sensory representation of LI aperture (regular condition: r: −0.43, p: 0.10; irregular condition: r: 0.10, p: 0.70), nor with prior (regular condition: r: −0.14, p: 0.60; irregular condition: r: 0.06, p: 0.81)

Supplementary Movie 1 | Regular stimuli with representative three density conditions (100, 50 and 0% of LI from in total 500 apertures each, the same as Movie 2).

Supplementary Movie 2 | Irregular stimuli. Note that the demo movies were not used for real experiments.


**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2019 Li, Meso, Logothetis and Keliris. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# Reconciling Color Vision Models With Midget Ganglion Cell Receptive Fields

Sara S. Patterson<sup>1,2</sup>, Maureen Neitz<sup>1</sup> and Jay Neitz<sup>1</sup>\*

<sup>1</sup> Department of Ophthalmology, University of Washington, Seattle, WA, United States, <sup>2</sup> Neuroscience Graduate Program, University of Washington, Seattle, WA, United States

Midget retinal ganglion cells (RGCs) make up the majority of foveal RGCs in the primate retina. The receptive fields of midget RGCs exhibit both spectral and spatial opponency and are implicated in both color and achromatic form vision, yet the exact mechanisms linking their responses to visual perception remain unclear. Efforts to develop color vision models that accurately predict all the features of human color and form vision based on midget RGCs provide a case study connecting experimental and theoretical neuroscience, drawing on diverse research areas such as anatomy, physiology, psychophysics, and computer vision. Recent technological advances have allowed researchers to test some predictions of color vision models in new and precise ways, producing results that challenge traditional views. Here, we review the progress in developing models of color-coding receptive fields that are consistent with human psychophysics, the biology of the primate visual system and the response properties of midget RGCs.

#### Edited by:

Misha Vorobyev, The University of Auckland, New Zealand

#### Reviewed by:

Pablo De Gracia, Midwestern University, United States Jihyun Yeonan-Kim, National Institutes of Health (NIH), United States Andrew B. Metha, The University of Melbourne, Australia

\*Correspondence: Jay Neitz jneitz@uw.edu

#### Specialty section:

This article was submitted to Perception Science, a section of the journal Frontiers in Neuroscience

Received: 01 January 2019 Accepted: 02 August 2019 Published: 16 August 2019

#### Citation:

Patterson SS, Neitz M and Neitz J (2019) Reconciling Color Vision Models With Midget Ganglion Cell Receptive Fields. Front. Neurosci. 13:865. doi: 10.3389/fnins.2019.00865

Keywords: primate retina, color vision, color perception, computational vision, linking hypotheses, cone photoreceptor, retinal ganglion cells

# INTRODUCTION

The first stage of visual processing occurs in the retina, an outpost of the brain located at the back of the eye. Under photopic conditions, photons of light are absorbed by three types of cone photoreceptor (**Figure 1A**), processed by five main classes of retinal neuron, then visual signals are conveyed to the brain by the axons of retinal ganglion cells (RGCs; Wässle, 2004). Midget RGCs make up a large majority of all RGCs in the central retina, where each L- and M-cone provides the sole direct input to an ON and OFF midget RGC circuit (**Figure 1C**; Wässle et al., 1990, 1998; Kolb and Marshak, 2003).

The midget RGC receptive field has a center-surround organization (Kuffler, 1953). In the central retina, this receptive field compares the photon catch in the single L- or M-cone center to the photon catch in neighboring L/M-cones in the surround (**Figure 1C**). Since this configuration compares the activity of cones that differ in both spatial location and spectral sensitivity, midget RGCs have been implicated in both color and spatial vision (Schiller et al., 1990; Martin et al., 2011). Mammalian RGCs have been described as acting as feature detectors, with different types showing specificity for motion, form or color conferred by the spatial, spectral, and temporal characteristics of their receptive field (Field and Chichilnisky, 2007; Gollisch and Meister, 2010; Baden et al., 2016). Here, we review evidence for the role of midget RGC receptive fields as the first step for detection of two elementary visual features: (1) hue detectors, which encode information about spectral reflectances of surfaces as red, green, blue and yellow percepts, and (2) high acuity edge detectors, which encode the boundaries of objects as required for form vision.

Because their receptive fields exhibit both spectral and spatial opponency, midget RGCs respond to both chromatic and achromatic edges and thus confound the two (Wiesel and Hubel, 1966). Like all RGCs, midget RGCs encode and transmit information to the brain in binary, as all-or-nothing action potentials. A downstream neuron has no way of knowing, from an individual midget RGC's response, whether the midget RGC responses represent the chromatic or spatial structure of a stimulus. At the level of perception, however, we can distinguish between achromatic and equiluminant chromatic edges, even though individual midget RGCs cannot. How and where the spectral and spatial information encoded by midget RGCs is extracted remains one of the most important unanswered questions of primate vision.

Midget RGCs provide, arguably, the best model for linking low-level receptive fields to perception. Understanding how color and spatial information are encoded may provide insight into general organizational principles employed by neural circuits to parse specific features of a stimulus. Furthermore, restoration of color and spatial vision is an important goal for retinal prosthetics, some of which must replace the upstream circuitry that defines the midget RGC receptive field (Yue et al., 2016). Efforts to restore these fundamental aspects of visual perception may benefit from a better understanding of how they are computed in normal vision.

#### RECEPTIVE FIELDS

All receptive fields are built from the photoreceptor outputs (**Figure 1A**). The photoreceptors' output encodes a single variable: the number of photons absorbed (Rushton, 1972; Baylor et al., 1987). An important implication is that wavelength and intensity are interchangeable and, under the right conditions, any two lights differing in wavelength can be "substituted silently" for each other (Estevez and Spekreijse, 1982). For example, the probability of photon absorption by an M-cone is the same for 467 and 582 nm lights, thus the response of the M-cone shown in **Figure 1B** to the two lights will be indistinguishable. Meanwhile, a 535 nm light with twice the probability of photon absorption can be matched by doubling the intensity of the 467 nm light.
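The interchangeability of wavelength and intensity can be illustrated with a short numerical sketch. The wavelengths and the stated relationships (equal absorption probability at 467 and 582 nm, twice that probability at 535 nm) come from the example above; the absolute sensitivity values and the names `M_CONE_SENSITIVITY`, `photon_catch` and `matching_intensity` are placeholders assumed for illustration, not measured data.

```python
# Relative probability of photon absorption by an M-cone.
# Placeholder values: only their ratios follow the example in the text.
M_CONE_SENSITIVITY = {467: 0.25, 535: 0.5, 582: 0.25}

def photon_catch(wavelength_nm, intensity):
    """Univariance: the response depends only on the number of photons absorbed."""
    return intensity * M_CONE_SENSITIVITY[wavelength_nm]

def matching_intensity(test_nm, test_intensity, match_nm):
    """Intensity at match_nm that yields the same photon catch as the test light."""
    return test_intensity * M_CONE_SENSITIVITY[test_nm] / M_CONE_SENSITIVITY[match_nm]

# Equal-intensity 467 and 582 nm lights produce identical photon catches.
print(photon_catch(467, 100.0) == photon_catch(582, 100.0))

# A 535 nm light is matched by a 467 nm light of twice the intensity.
print(matching_intensity(535, 100.0, 467))
```

A single cone therefore cannot distinguish the two lights in either comparison, which is why extracting wavelength information requires comparing the outputs of cones with different spectral sensitivities.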

The visual system extracts information about wavelength and spatial contrast by virtue of receptive fields that compare the outputs of multiple cones. The basic computation for extracting wavelength is a comparison between cones of different spectral types, while spatial contrast requires comparing neighboring cones at different spatial locations, regardless of type (Calkins and Sterling, 1999). The characteristics of receptive fields form the foundation of each color vision model discussed here.

## WHAT IS THE OPTIMAL RECEPTIVE FIELD FOR SPATIAL VISION?

Because midget RGCs are implicated in high acuity form vision, any discussion of their color-coding role must also include their role in spatial-coding. The first step of spatial vision requires delineating the boundaries of objects, essentially performing an edge detection task.

### Spatial Opponency

By comparing the relative activity of cones at different locations, spatially opponent receptive fields signal spatial contrast rather than raw quantal catch (Srinivasan et al., 1982). For low-level edge detectors, circularly symmetric center and surround receptive fields are optimal and will provide sensitivity to all edges, regardless of their orientation (Marr and Hildreth, 1980).

### Spectral Opponency

While spatial vision is sometimes assumed to operate only on light intensity (Marr, 1982; Billock et al., 1996), equiluminant edges are also common in natural scenes (Hansen and Gegenfurtner, 2009). Accordingly, an optimal edge detector would be sensitive to all edges regardless of whether the edge is defined by a change in wavelength or intensity. Thus, an optimal edge-detecting receptive field might not just be spatially opponent, but also spectrally opponent. In this case, the purpose of spectral opponency is not to signal the hue of a surface but rather an edge defined by spectral contrast.

## WHAT IS THE OPTIMAL COLOR-CODING RECEPTIVE FIELD FOR HUE PERCEPTION?

In the natural world, most colors we perceive are from lights reflected from objects. The purpose of hue perception is to provide information about the surface reflectance of objects, which, in turn, tells us about their internal contents or state. For example, we know the ripeness of fruit and when children are getting sunburned from their surface colors. However, there are significant challenges to this task. Individual cones themselves are not selective for the distribution of wavelengths reflected from a surface. If L-cones are active, light could be coming from a red surface reflecting only long wavelengths, a yellow surface reflecting both middle and long wavelengths, a violet surface reflecting both short and long wavelengths or a white surface reflecting all wavelengths. In addition, information from any individual cone will be further confounded by the spectral characteristics of the illuminant. For example, the amount of illumination from blue sky light relative to direct sunlight changes throughout the day. As a result, the illuminant color can vary from blue to yellow (Foster, 2011; Pauers et al., 2012; Spitschan et al., 2016; Woelders et al., 2018). The ideal receptive fields for serving hue perception would be designed to help extract surface spectral reflectance independent of the illuminant. Here we discuss the features of theoretical receptive fields optimized to overcome the challenges associated with consistently signaling hue, independent of any underlying neural substrates.

### Spectrally Opponent

Color vision is the ability to discriminate between different wavelengths, independent of intensity (Jacobs, 2018). Receptive fields with spectrally opponent interactions can extract wavelength information and thus carry color information (Paulus and Kroger-Paulus, 1983; Neitz and Neitz, 2011; Chang et al., 2013). However, cone opponent receptive fields are not necessarily optimized for hue perception.

### Spatially Coextensive

The first receptive field proposed to create a "pure color cell," was the single opponent receptive field, which exhibits spectral opponency without any spatial opponency (**Figure 2A**). Also called spatially co-extensive or Type II (Wiesel and Hubel, 1966; Crook et al., 2009), this receptive field provides color selectivity, the ability to extract spectral information unconfounded by spatial information. Spatially co-extensive, spectrally opponent receptive fields like **Figure 2A** would be theoretically color selective in that they respond to chromatic stimuli, but not achromatic patterns. However, these receptive fields act as simple wavelength detectors and cannot compensate for the changes in illuminant discussed above.

### Double Opponency

To consistently signal hue, an optimal color-coding receptive field must compensate for the changes in illuminant discussed above. Double-opponent receptive fields, superimposing two opposing, spectrally and spatially opponent receptive fields (**Figure 2A**), have been proposed to help provide this color constancy (Daw, 1973; D'Zmura and Lennie, 1986). Double-opponent receptive fields exploit the fact that, in the natural world, hue typically changes abruptly at object boundaries while illumination changes slowly across a visual scene. When the center receives light from the edge of an object surface, some light falling in the surround is reflected from other objects in the scene under the same illuminant. If the illuminant changes to have more long-wavelength light, the increased L-cone stimulation in the center is opposed by greater L-cone stimulation in the surround, and ideally, the change in illumination is removed from the visual signal. Thus, double-opponent receptive fields confer sensitivity to chromatic contrast at the edges of objects while remaining relatively insensitive to global changes in illumination.
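A minimal numerical sketch of this argument follows. It compares a spatially coextensive (single-opponent) L vs. M unit with a double-opponent unit before and after a long-wavelength shift in the illuminant. The cone catches are hypothetical, and the illuminant change is modeled as the same additive increase in L-cone catch in center and surround, which is a simplifying assumption rather than a model of real illuminants.

```python
def single_opponent(region):
    """Spatially coextensive L vs. M unit: L minus M from one region."""
    l_catch, m_catch = region
    return l_catch - m_catch


def double_opponent(center, surround):
    """L vs. M opponency in the center minus the same opponency in the surround."""
    return single_opponent(center) - single_opponent(surround)


def shift_illuminant(region, l_shift):
    """Assumed illuminant change: extra long-wavelength light adds to L-cone catch."""
    l_catch, m_catch = region
    return (l_catch + l_shift, m_catch)


# Hypothetical (L, M) catches: a reddish object in the center, a neutral
# background in the surround, both lit by the same illuminant.
center = (0.75, 0.25)
surround = (0.5, 0.5)
l_shift = 0.25

single_before = single_opponent(center)
single_after = single_opponent(shift_illuminant(center, l_shift))

double_before = double_opponent(center, surround)
double_after = double_opponent(
    shift_illuminant(center, l_shift), shift_illuminant(surround, l_shift)
)
```

Under these assumptions the single-opponent response changes with the illuminant, while the double-opponent response does not, because the shift is common to center and surround and cancels in the subtraction.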

### Trichromatic

Normal humans are trichromats and a special requirement of optimal color coding for trichromats is that the receptive fields must compare all three cone types. This is because for neurons comparing only two out of the three cone types, a change in activity in the unsampled cone will not change the hue signaled by that neuron. For example, an L vs. M opponent neuron without S-cone input, as in **Figure 2A**, cannot discriminate between a red surface reflecting only long wavelengths and a violet surface reflecting both long and short wavelengths (Fuld et al., 1981).
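The red vs. violet example can be sketched with hypothetical cone catches. The values and the unit weights below are assumptions for illustration only; the three-cone unit uses the L + S vs. M combination discussed later in this article.

```python
# Hypothetical relative cone catches (L, M, S) for two surfaces.
RED_SURFACE = (1.0, 0.0, 0.0)     # reflects only long wavelengths
VIOLET_SURFACE = (1.0, 0.0, 1.0)  # reflects long and short wavelengths


def l_vs_m(cones):
    """Two-cone opponent unit with no S-cone input."""
    l_catch, m_catch, _s_catch = cones
    return l_catch - m_catch


def l_plus_s_vs_m(cones):
    """Three-cone opponent unit (unit weights are an assumption)."""
    l_catch, m_catch, s_catch = cones
    return l_catch + s_catch - m_catch


# The two-cone unit gives identical outputs for the two surfaces.
two_cone_red = l_vs_m(RED_SURFACE)
two_cone_violet = l_vs_m(VIOLET_SURFACE)

# Adding S-cone input separates them.
three_cone_red = l_plus_s_vs_m(RED_SURFACE)
three_cone_violet = l_plus_s_vs_m(VIOLET_SURFACE)
```

The point of the sketch is only that a change confined to the unsampled cone type is invisible to a unit that does not receive it.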

### Low Spatial Resolution

If the ideal retina is composed of multiple types of feature detectors, spatial constraints must be considered, and the relative density of any one type should be no higher than required to serve its specific function. The color of a surface tends to be consistent all across it. Thus, in contrast to spatial vision, which requires a high density of detectors to capture the fine details of the shape of objects, hue detectors can accurately capture surface colors using a much lower resolution array of detectors. In summary, the ideal trichromatic hue-encoding system is a relatively sparse array of receptive fields with structures that are double-opponent and receive input from all three types of cones.

## INTERPRETING MIDGET RGC RECEPTIVE FIELDS

Early models linking L vs. M midget RGCs to visual perception focused on either spatial or spectral opponency in isolation. Models focusing on their spectral opponency emphasized their potential role in encoding red and green hues. In contrast, models accounting only for achromatic spatial opponency led to the perspective that spectral opponency is an unintended consequence of trichromacy and may be considered "poor engineering" (Marr, 1982).

### Are Midget RGC Receptive Fields Optimal for Hue Perception?

The earliest models followed the first parvocellular LGN (P cell) recordings (De Valois et al., 1966; Wiesel and Hubel, 1966); P cells have receptive field properties similar to those of their L vs. M midget RGC inputs. At the time, opponent process theory was still highly controversial (Hurvich and Jameson, 1957) and the discovery of color-opponent neurons in the visual system was groundbreaking. The resulting hypothesis that the parvocellular LGN projections of midget RGCs are responsible for red-green hue perception arguably played a large role in shaping later research. Further, spatial opponency and the resulting responses to achromatic and spatially structured stimuli were overlooked in many accounts of the physiological basis of hue perception.

In emphasizing the proposed role of midget RGCs in mediating red-green hue percepts, it was argued that the optimal color-coding receptive field was one in which an L-cone is surrounded entirely by M-cones, or vice versa. This receptive field, which would seem to require some cone-specific selective wiring, maximizes the spectral difference between the center and surround to maximally decorrelate the outputs of L- and M-cones' overlapping spectral sensitivities (**Figure 1A**; Buchsbaum and Gottschalk, 1983; Párraga et al., 2002; Sun et al., 2006). The "selective-wiring" model in **Figure 2B** was challenged by theoretical studies demonstrating that mixed L/M-cone receptive fields could generate sufficient spectral opponency (Paulus and Kroger-Paulus, 1983; Lennie et al., 1991). Though still debated by some (Lee, 1996; Wool et al., 2018), there is, at most, only a slight functional bias toward selective wiring (Buzás et al., 2006; Field et al., 2010).

A lack of selective wiring may be one argument against the idea that midget RGCs are optimized for hue perception. However, more importantly, from above, the ideal trichromatic hue-encoding system is a relatively sparse array of receptive fields with structures that are double-opponent and receive input from all three types of cones. The common L vs. M midget RGCs do not conform to any of these theoretical features of hue-encoding neurons. While our theoretical discussion cannot rule out a contribution to hue, we can conclude L vs. M midget RGCs, by themselves, are "non-optimal" for hue perception.

### Are Midget RGC Receptive Fields Optimal for Spatial Vision?

Near the fovea, the midget RGC's receptive field center represents the cone providing direct input to the midget bipolar cell, while the surround is formed by feedback from horizontal cells contacting neighboring cones (**Figure 1C**; Verweij et al., 2003). This feedback weights each cone's response by the quantal catch in neighboring cones, essentially subtracting out the mean light level and allowing each individual cone feeding the center of midget RGCs to encode spatial contrast (Jadzinsky and Baccus, 2013). In the central retina, midget RGCs set the limits of human visual acuity (Rossi and Roorda, 2010).

Indeed, theoretical attempts to derive an optimal receptive field for the first step of spatial vision have all converged on the same circularly symmetric center-surround organization (Marr and Hildreth, 1980; Srinivasan et al., 1982; Atick et al., 1992), often modeled as a Difference of Gaussians (Enroth-Cugell and Robson, 1966; Croner and Kaplan, 1995; Dacey et al., 2000). As **Figure 2C** demonstrates, center-surround receptive fields are ideal edge detectors for encoding spatial contrast.
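A one-dimensional Difference of Gaussians illustrates why a balanced center-surround field acts as an edge detector. The sketch below is a toy model under stated assumptions: the kernel half-width, the two sigmas, and the stimulus lengths are arbitrary illustrative choices, and center and surround are each normalized to unit sum so that they cancel under uniform light.

```python
import math


def gaussian(x, sigma):
    """Unnormalized-by-sum Gaussian profile sampled at integer offsets."""
    return math.exp(-0.5 * (x / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))


def dog_kernel(half_width=12, sigma_center=1.0, sigma_surround=3.0):
    """ON-center Difference of Gaussians with balanced center and surround."""
    offsets = range(-half_width, half_width + 1)
    center = [gaussian(x, sigma_center) for x in offsets]
    surround = [gaussian(x, sigma_surround) for x in offsets]
    c_sum, s_sum = sum(center), sum(surround)
    # Normalizing each lobe to unit sum makes the kernel sum to zero.
    return [c / c_sum - s / s_sum for c, s in zip(center, surround)]


def respond(kernel, signal, position):
    """Weighted sum of the signal under a kernel centered at `position`."""
    half = len(kernel) // 2
    return sum(w * signal[position + i - half] for i, w in enumerate(kernel))


kernel = dog_kernel()
uniform = [1.0] * 60                 # spatially uniform light
edge = [0.0] * 30 + [1.0] * 30       # dark-to-light step at index 30

positions = range(12, 48)            # positions where the kernel fits fully
uniform_response = respond(kernel, uniform, 30)
edge_responses = [respond(kernel, edge, p) for p in positions]
peak_position = max(positions, key=lambda p: abs(respond(kernel, edge, p)))
```

The response to uniform light is essentially zero, and the response to the step is confined to the neighborhood of the edge, which is the sense in which the mean light level is subtracted out and spatial contrast is signaled.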

In contrast to early ideas emphasizing their putative role in color perception, more recent research into the evolution of the primate visual system provides a useful context for a modern understanding of L vs. M midget RGC function. Though sometimes compared to the X-cells of the mammalian retina, there is no true homolog to the midget circuit prior to prosimians (Peng et al., 2019). The midget RGC circuitry evolved before uniform trichromacy (Nathans, 1999). In dichromats, for example, with only S- and L-cones, the midget RGC's antagonistic center-surround receptive field functions as an achromatic edge detector by comparing the input of a single L-cone to surrounding L-cones (**Figure 1D**).

### Interim Conclusions

The receptive field structure of L vs. M midget RGCs is consistent with a role in edge detection. Their ability to respond to equiluminant edges defined only by wavelength differences makes visible forms that would be otherwise invisible. Spectral opponency can also increase the signal-to-noise ratio for edges defined by both intensity and wavelength. The idea that spectral opponency in L vs. M midget RGCs could enhance edge detection rather than contribute to color perception raises an important point. A response to wavelength changes does not imply a causal role in hue perception. As introduced above, hue perception requires detectors that will not respond to black-white edges.

In conclusion, while it may be arguable whether or not midget L vs. M RGCs are ideal achromatic encoders, it is indisputable that they are far from ideal for red-green hue encoding. This leaves two major unanswered questions: what is the physiological basis for hue perception and what role do midget RGCs play? Several different theories involving both the spectral and spatial aspects of midget RGC receptive fields have been proposed as tentative answers to these questions. We next review the two main classes of explanation: multiplexing and parallel processing.

## MULTIPLEXING MODELS

The first class of models shares the idea that each individual midget RGC does "double duty," carrying information for both color vision and achromatic spatial vision, which are extracted by circuitry at higher levels of processing in the geniculostriate pathway. It has been said that red-green and black-white percepts are "de-multiplexed" by downstream circuits (Boycott and Wässle, 1999; Lennie and Movshon, 2005). The idea of multiplexing originated as an analog to attempts to efficiently compress chromatic and spatial information for color televisions (Ingling and Martinez-Uriegas, 1983; Derrico and Buchsbaum, 1991).

The most common models, summarized in **Figure 2D**, combine the outputs of midget RGCs to perform two main transforms: one to extract spectral information by removing spatial correlations and another to extract achromatic spatial information by removing spectral information. The achromatic channels (L + M) sum L- and M-center midget RGC signals to serve as intensity contrast detectors. The putative chromatic channels (L vs. M) difference L-ON center with M-ON center receptive fields to produce spatially coincident spectrally opponent receptive fields, as discussed above (**Figure 2A**). Accordingly, achromatic spatial structure will be absent in the chromatic channel, resulting in a low-pass chromatic filter, while the achromatic channel will retain the band-pass spatial tuning necessary for spatial vision.
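The sum and difference transforms can be sketched with two idealized cells. For simplicity the sketch assumes pure cone-type-selective centers and surrounds (the contested "selective-wiring" case) with unit weights; the stimulus values are hypothetical and the function names are illustrative, not taken from any published model.

```python
def l_center_cell(center, surround):
    """Idealized L-ON center midget RGC with a pure M surround."""
    return center["L"] - surround["M"]


def m_center_cell(center, surround):
    """Idealized M-ON center midget RGC with a pure L surround."""
    return center["M"] - surround["L"]


def achromatic_channel(center, surround):
    """Sum of the two cells: (L_c + M_c) - (L_s + M_s), spatially opponent."""
    return l_center_cell(center, surround) + m_center_cell(center, surround)


def chromatic_channel(center, surround):
    """Difference of the two cells: (L_c + L_s) - (M_c + M_s), spatially coextensive."""
    return l_center_cell(center, surround) - m_center_cell(center, surround)


# Hypothetical stimuli as cone catches in center and surround.
white = {"L": 1.0, "M": 1.0}
dark = {"L": 0.0, "M": 0.0}
red = {"L": 1.0, "M": 0.0}

# A single L-center cell gives the same output to a white spot on a dark
# background and to a uniform red field, so it confounds the two.
single_cell_edge = l_center_cell(white, dark)
single_cell_red = l_center_cell(red, red)

# The two derived channels separate them.
edge_achromatic = achromatic_channel(white, dark)
edge_chromatic = chromatic_channel(white, dark)
red_achromatic = achromatic_channel(red, red)
red_chromatic = chromatic_channel(red, red)
```

In this idealization the achromatic channel responds only to the luminance edge and the chromatic channel only to the uniform red field. With mixed surrounds the separation would be less clean, which is one reason the implementation details matter.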

A separate aspect of one of the best-known versions, the De Valois and De Valois (1993) multi-stage color model, was the need to reconcile the difference in cone inputs measured for L vs. M cone-opponent neurons and the opponent receptive fields required to account for hue perception, illustrated in **Figure 3A**. The four fundamental hue sensations are often assumed to represent the responses of four groups of hue-encoding neurons. Over the last 50 years, there have been different ideas about the exact nature of the cone inputs to the four fundamental hues. However, a convergence of modern evidence from experiments directly measuring hue perception indicates that all three cone types contribute to each hue in the following combinations: L + S vs. M for red-green and M + S vs. L for blue-yellow, respectively (**Figure 3A**; Wooten and Werner, 1979; Drum, 1989; Webster et al., 2000a; Schmidt et al., 2016).

One of the great insights of the De Valois and De Valois model was that hue perception requires S-cone inputs to L vs. M opponent pathways (Wooten and Werner, 1979; Drum, 1989; Webster et al., 2000b). As an ad hoc solution to the discrepancy between L vs. M midget RGCs and the receptive fields required for hue perception, their multi-stage color model proposed that the necessary S-cone input to an L vs. M channel is accomplished by mixing in the outputs of S-cone opponent neurons (**Figure 2E**).

### Evaluating the Double Duty Hypothesis

The De Valois and De Valois model was firmly based on the most recent anatomical, psychophysical and physiological results of the time, yet a number of assumptions were necessary where open questions remained. We can now revisit these assumptions in light of the research published in the 25 years since the multistage model was first proposed. One example is their explanation of how the required S-cone inputs from small bistratified RGCs are added in the process of building cortical receptive fields for hue perception. More recently, the classification of small bistratified RGCs as single opponent "pure color cells" has been called into question [compare **Figures 1E**, **2A** (Field et al., 2007; Tailby et al., 2010); but see Crook et al. (2009)]. Thus, small bistratified RGCs and their S-ON projections may also confound spatial and spectral information. Moreover, the S-cone ON neurons were later identified as a part of the functionally distinct koniocellular pathway (Martin et al., 1997) and there is no direct evidence for specific circuits combining signals from the koniocellular and parvocellular pathways.

While the theoretical L-M and L + M channels would decorrelate the outputs of midget RGCs, it has been argued that not all decorrelations are created equal (Pitkow and Meister, 2012) and the benefits depend on how these channels are implemented by neural circuitry. In general, however, asking a neuron to perform two jobs simultaneously has been said to ensure that both are done poorly (Sterling and Laughlin, 2017). Moreover, there do not appear to be any true modern examples of multiplexing RGCs involving two functions performed simultaneously. Perhaps the closest parallel is the fact that the same RGCs serve both photopic and scotopic vision; however, these functions are primarily performed separately under different conditions (Field et al., 2009; Grimes et al., 2014). Other examples of multiplexing RGCs involve one stimulus dimension modulating the encoding of another (Deny et al., 2017); however, this is different from two functions being encoded simultaneously.

The "de-multiplexing" multi-stage models are the result of speculation about the type of computation that would be required to produce selective detectors for wavelength and spatial contrast from combinations of spectrally opponent center-surround neurons; however, they lack firm experimental evidence from cortical physiology (Lee, 2008). They have also been criticized from an image compression standpoint, with the argument that decorrelation of chromatic and spatial information is best done early, ideally before transmission through the optic nerve (Derrico and Buchsbaum, 1991). In contrast, an effort to test demultiplexing models concluded the two dimensions cannot be disentangled in the early visual system (Kingdom and Mullen, 1995). Moreover, the most successful models based on the "double duty" hypothesis do not make predictions about both spatial and spectral responses (Rider et al., 2018).

The assumption that different aspects of color vision are all based on the same underlying neural substrates (e.g., L vs. M midget RGCs) has resulted in a tendency to expect the visual system to somehow extract hue information from the midget RGCs' receptive field output. However, the computational complexity required to separate chromatic from spatial information at subsequent stages of visual processing should not be underestimated. One higher stage is proposed to decorrelate spatial and spectral information, a second higher stage to add the required S-cone input (**Figure 2E**) and yet an additional stage, that has not been incorporated into current demultiplexing models, to generate the double opponent receptive field structure required to create neurons that are able to contribute to invariant hue-encoding of spectral reflectance.

### Multiplexing in the Light of Information Coding in the Retina

The need to compress RGC axons down to a 2 mm cable is often referred to as an "information bottleneck" within the visual system. Proponents of multiplexing models might claim superiority on this account: combining color and spatial information into one RGC could reduce the number of axons in the optic nerve without reducing the transmission of information. Indeed, there are about six to seven million cones in a human eye and only about a million optic nerve fibers (Sterling and Laughlin, 2017). However, this represents the situation in the peripheral retina where convergent input from a large number of cones to each RGC results in a huge reduction in visual acuity relative to what could be supported by the cone mosaic. The loss of spatial information from this convergence is never recovered at higher levels in the visual pathway.

At the time multiplexing models were first proposed, a dominant view on the purpose of retinal function was to reduce redundancy and compress visual information to fit through the optic nerve, with the computations defining visual perception occurring in the cortex (Barlow, 1961). However, contrary to the idea of information compression, in the fovea there is a divergence from cones to RGCs such that the ratio is about 3:1 (RGCs:cones). Recent work in non-primate animal models has contributed to a growing appreciation for the diversity of RGC types (Wässle, 2004; Baden et al., 2016) and the sophisticated computations occurring within the retina (Gollisch and Meister, 2010; Wienbar and Schwartz, 2018). Even near the primate fovea, many of the at least 20 different RGC types are represented (Percival et al., 2013; Peng et al., 2019). What failed to be appreciated in the early work on the primate retina is that, with the exception of the midget RGCs, for which there are two for every cone (one ON and one OFF), each of the twenty or more RGC types represents a small percentage of the total. Thus, the retina is a massively parallel processing machine with many different types of RGCs carrying out diverse functions, most of which operate at low spatial acuity and require only sparse representations. As discussed below, it therefore seems plausible that, consistent with the current understanding of the plan of the retina, hue perception could be mediated by a relatively sparse set of RGCs that serve as hue detectors.

**Figure 3** (caption; beginning missing in source): … units from De Valois (2004). (B) Percepts associated with stimulating individual L- and M-cones in isolation may represent the responses of two types of individual midget RGCs, a larger group of achromatic contrast detectors and a smaller group that function as hue detectors. Adapted from Sabesan et al. (2016).

Recent considerations of the metabolic cost of information transmission have also questioned the efficiency of compressing information into a smaller set of RGCs, and revealed a more nuanced set of constraints defined not by the number of axons, but by their diameter. RGC axon diameters scale linearly with average firing rate (Perge et al., 2009, 2012). This relationship forms the basis of a law of diminishing returns – metabolic cost increases supralinearly with axon diameter while the information per spike falls as spike rate increases (Rieke et al., 1997; Koch et al., 2006).

A population of parallel neurons, each carrying as much information as possible, is the most efficient coding scheme (Laughlin, 2001). The midget RGC circuit, acting as an edge contrast detector, is already a model of energy-efficient parallel processing – each cone in the central retina contacts a single ON and OFF midget bipolar cell (**Figure 1C**). This allows baseline activity to remain low while the response ranges of each ON and OFF cell are devoted to signaling increments or decrements, respectively, in parallel (Berry et al., 1997). Theoretically, multiplexing increments and decrements would double the information per axon, thus halving the number of axons while increasing axon diameter (and thus energetic cost) fourfold (Sterling and Laughlin, 2017). Taking these costs into account creates a strong pressure for more types of RGCs with thinner axons and lower spike rates, consistent with a parallel processing model.
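The cost comparison in this paragraph reduces to simple arithmetic, sketched below in relative units. The sketch reads the fourfold figure as the energetic cost of the single multiplexed axon relative to one thin axon; that reading, and the unit cost assigned to a thin axon, are assumptions made for illustration.

```python
def total_cost(n_axons, relative_cost_per_axon):
    """Total energetic cost in arbitrary relative units."""
    return n_axons * relative_cost_per_axon


# Parallel scheme: separate ON and OFF axons, each at unit cost (assumed).
parallel_cost = total_cost(n_axons=2, relative_cost_per_axon=1.0)

# Multiplexed scheme: one axon carrying both signals at fourfold cost,
# following the figure quoted in the text.
multiplexed_cost = total_cost(n_axons=1, relative_cost_per_axon=4.0)

cost_ratio = multiplexed_cost / parallel_cost
```

On this reading, halving the number of axons does not pay for itself: the multiplexed scheme costs twice as much in total as the two thin axons it replaces.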

## PARALLEL PROCESSING MODELS

L vs. M midget RGC receptive fields are near optimal for high acuity spatial vision and are poorly suited for encoding hue. These facts, plus the computational complexity required to separate hue from spatial information from L vs. M midget RGCs and a newer understanding of information processing in the retina, have led to the suggestion of an alternative hypothesis: that the L vs. M midget RGCs serve only spatial vision – the function for which they are optimized – and do not contribute to red-green hue perception. According to this idea, the front-end computations for hue perception are served, in parallel, by a second population of RGCs that have receptive field properties that are specifically optimized as hue detectors (Rodieck, 1991; Calkins and Sterling, 1999; Schmidt et al., 2014; Neitz and Neitz, 2016). The "pixel density" of the L vs. M midget RGCs is high to serve high spatial acuity but, as introduced above, the proposed parallel set of hue detectors needs to be only relatively sparse to recover surface reflectance with much lower spatial acuity.

### Separate Subtypes of Midget RGCs for Hue and Spatial Vision

If L vs. M midget RGCs mediate spatial vision, which RGCs encode color? To match the acuity of our hue perception, an undiscovered RGC type would need roughly the sampling density of the S-cone mosaic (Mullen, 1985; Calkins and Sterling, 1999). The lack of alternative hue encoders makes midget RGCs an obvious candidate. We have proposed that the four fundamental hues are encoded by a small subset of L vs. M midget RGCs receiving input from neighboring S-cones (**Figure 2F**; Schmidt et al., 2014). The resulting L + S vs. M and M + S vs. L midget RGCs match the cone inputs for the four fundamental hues, as well as a population of rare RGCs (De Monasterio and Gouras, 1975; De Monasterio et al., 1975) and LGN neurons (Derrington et al., 1984; Tailby et al., 2008). These rare RGCs should not be ignored, as a potential hue-encoding RGC type needs to be only ∼5–10% of foveal RGCs to match color acuity (Calkins and Sterling, 1999).

Each S-cone has a surround created by S-cone-preferring HII horizontal cells. Hue-encoding receptive fields are proposed to arise from the superposition of the S-cone center-surround receptive field with the L vs. M cone center-surround. These two are predicted to be combined by feedforward synapses (Puller et al., 2014) from HII horizontal cells to L vs. M midget bipolar cells. The result simultaneously creates the S-cone input to L vs. M opponent cells and double opponency required to create nearly ideal hue-encoding RGCs (discussed in detail in Neitz and Neitz, 2016). Indeed, computational models of such color-coding midget RGCs can account for previously unexplained color phenomena, such as unique hues and variations in hue perception with L/M-cone ratios (Schmidt et al., 2016).

Key strengths of this parallel processing hypothesis are its simplicity and specificity. All the key features of ideal hue-encoding neurons are proposed to be created in the retina in a single step, simply by feed-forward from HII horizontal cells at the level of the bipolar cells, as opposed to the idea of multiple stages at unspecified higher levels. The predicted mechanism for a parallel set of double opponent neurons includes specific cell types, neurotransmitters, and biophysical mechanisms (Puller et al., 2014). While this level of detail may invite additional criticism, it also generates testable predictions that can be addressed by experiment. In contrast, the De Valois and De Valois model specified the computations for their "de-multiplexing" neurons, but not the underlying neural substrates.

### Recent Research Supporting Parallel Processing Models

The parallel processing approach draws from the idea that each RGC's receptive field acts as a feature detector, tuned to extract a specific type of visual information, such as direction, defocus, edges or hue. From this perspective, L vs. M midget RGCs that respond equally to red–green and black–white edges are not multiplexing, nor even confounding, red–green and black–white signals. Rather, they are reliably signaling a particular feature – the presence or absence of an edge. Accordingly, hue-encoding RGCs are signaling a different feature – the detection of a specific spectral reflectance distribution (**Figure 3A**). Importantly, these RGCs would not be directly responsible for percepts of hue and edges, but instead we are proposing that they serve as front-end mechanisms for making these computations.

A particularly influential line of evidence has been provided by high-precision psychophysics experiments enabled by the development of adaptive optics systems capable of delivering small spots of light while simultaneously imaging the underlying mosaic of cones (Harmening et al., 2014). Early experiments investigating spatial acuity found individual midget RGCs set the limit for spatial resolution (Rossi and Roorda, 2010). These results are inconsistent with models proposing midget RGC outputs are combined to "de-multiplex" color and spatial information. The loss of spatial information from the convergence in **Figure 2D** can never be recovered at higher levels in the visual pathway.

The unprecedented precision provided by adaptive optics imaging systems combined with recent advances in eye tracking and cone type classification (Sabesan et al., 2015) has enabled highly precise psychophysics experiments investigating the percepts resulting from single cones (reviewed by Kling et al., 2019). The responses are highly consistent and reflect activity in the midget RGCs with single cone centers (Schmidt et al., 2019). Consistent with parallel processing of hue and spatial information by separate types of midget RGCs, stimulation of most L/M-cones in the central retina results in percepts of white, with only a small subset eliciting color percepts (**Figure 3B**; Sabesan et al., 2016; Schmidt et al., 2018a,b). Further, the homogeneity of the surrounding cone type had no effect on which cones were associated with a perceived color, arguing against the idea that midget RGCs with strong L vs. M opponency serve hue perception. These experiments were the first to target stimuli to single cones of a known type and represent a major advance in linking perception to underlying neural substrates in awake, behaving humans. The results will undoubtedly continue to challenge long-held assumptions.

## HOW DOES THE CORTEX USE WAVELENGTH INFORMATION?

Hue perception is just one of many functions that use wavelength information. For example, the retina contains photopigments such as melanopsin and neuropsin, which carry additional wavelength information, but have no impact on the dimensionality of color vision (Horiguchi et al., 2013; Buhr et al., 2015). There are many examples of neurons carrying temporal, spatial or spectral information that is not extracted for visual perception, including color-opponent V1 neurons responding to chromatic stimuli that are not perceived (Gur and Snodderly, 1997; Jiang et al., 2007).

In fact, many RGCs do not contribute to conscious perception at all, but instead mediate functions such as visually guided movements or circadian photoentrainment (for review, see Neitz and Neitz, 2016). Wavelength information is extracted by several types of spectrally opponent RGCs for many functions other than color vision. For example, circadian rhythm photoentrainment and the pupillary light reflex are mediated by intrinsically photosensitive RGCs (reviewed in Do and Yau, 2010). Their receptive fields match the wavelength-encoding, single opponent receptive fields discussed above (**Figure 2A**; Dacey et al., 2005) – ideal for measuring the changes in chromaticity of ambient light throughout the day (Pauers et al., 2012; Spitschan et al., 2017) but they do not contribute to hue perception.

Several lines of evidence indicate that the ability to detect red-green edges is a distinct feature encoded separately from the ability to classify the appearance of lights as red or green. For example, patients with cerebral achromatopsia suffer a total loss of hue perception, but can still detect chromatic borders, perceive shape from color and discriminate the direction in which colored patterns move (Cowey and Heywood, 1997). The existence of multiple mechanisms and uses for wavelength information also seems evident when comparing the cone inputs mediating color detection and color appearance. The studies identifying L + S vs. M and M + S vs. L as the cone inputs to hue perception measured color appearance (Wooten and Werner, 1979; Drum, 1989; Webster et al., 2000a; Schmidt et al., 2016). However, the classic psychophysical experiments that identified L vs. M and S vs. L + M as the "cardinal directions of color space" (Krauskopf et al., 1982) measured detection. Krauskopf et al. (1982) noted the disparity between their cardinal directions and the red-green (L + S vs. M) and blue-yellow (M + S vs. L) hue axes of color appearance, and Krauskopf (1997) later questioned the evidence for cardinal mechanisms.
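The difference between the two sets of axes can be made concrete with a minimal sketch. It assumes unit cone weights and a simple linear sum, which are illustrative simplifications rather than measured values, and the names `CARDINAL_AXES`, `HUE_AXES` and `project` are hypothetical. It shows that an S-cone-isolating contrast is invisible to the L vs. M cardinal axis yet contributes to both hue axes.

```python
# Illustrative sketch only: unit weights and linear summation are assumptions,
# not measured cone weights. Each axis is a (L, M, S) weight triplet.
CARDINAL_AXES = {
    "L vs. M": (1.0, -1.0, 0.0),
    "S vs. L + M": (-1.0, -1.0, 1.0),
}
HUE_AXES = {
    "red-green (L + S vs. M)": (1.0, -1.0, 1.0),
    "blue-yellow (M + S vs. L)": (-1.0, 1.0, 1.0),
}


def project(cone_contrast, axes):
    """Return the signed response of each opponent axis to an (L, M, S) contrast."""
    return {
        name: sum(w * c for w, c in zip(weights, cone_contrast))
        for name, weights in axes.items()
    }


# A stimulus that modulates only the S-cones.
s_isolating = (0.0, 0.0, 1.0)

cardinal = project(s_isolating, CARDINAL_AXES)
hue = project(s_isolating, HUE_AXES)

for name, value in {**cardinal, **hue}.items():
    print(f"{name}: {value:+.1f}")
```

Under these assumed weights, the S-cone stimulus produces no response on the L vs. M detection axis but a response on both appearance axes, which is the disparity between the cardinal directions and the hue axes described above.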

There is common ground between multiplexing and parallel processing models. In discussing the abundance of chromatic cortical neurons, De Valois and De Valois argued that only a few are responsible for the specification of color, while the majority instead use color information to specify the spatial (or other) characteristics of stimuli. A problem was a lack of agreement on which cells were relevant for hue perception. Though their proposed color transformations were not consistent with the majority of published cortical color tuning studies, De Valois and De Valois pointed out inconsistencies in the literature and claimed one could "cite some cortical study in support of (or against) almost any suggestion about cortical color processing" (De Valois and De Valois, 1993). We argue a similar situation exists today in the retina, where different studies can be cited in support of or against the existence of S-cone inputs to midget RGCs [for example, compare the cone opponency reported by De Monasterio and Gouras (1975), Sun et al. (2006), and Field et al. (2010)].

### FUTURE DIRECTIONS

Both the parallel processing and multiplexing models would benefit from experiments linking the theories to their underlying neural substrates. However, an overarching difficulty for resolving the controversy over parallel vs. multiplexing theories is that each point of view reflects a deep-seated theoretical conviction. For those preferring the multiplexing view of L vs. M midget RGCs, "If the color signal is extractable, it makes little sense not to use it" (Billock et al., 1996). From a parallel processing standpoint, encoding color and spatial vision, two of the most fundamental aspects of visual perception, in a single binary channel makes little sense (Calkins and Sterling, 1999) and the information gained must outweigh the cost of extracting a color signal (Laughlin et al., 1998).

Thus, further experiments to characterize the response properties of visual neurons alone are not going to settle the controversy. Initial surveys of cone inputs to neurons in the retina and LGN reported S-cone input to a subset of L vs. M neurons (De Monasterio and Gouras, 1975; De Monasterio et al., 1975; Derrington et al., 1984) and later surveys confirmed these findings (Tailby et al., 2008; Field et al., 2010). However, skeptics of the parallel processing models favor a study by Sun et al. (2006), in which the authors recorded from a large population of midget RGCs and concluded S-cone input was unlikely. An underlying problem is that the answers depend on how you ask the question. Results from receptive field measurements are a function of stimulus choice. For example, a full-field stimulus (Lee et al., 1998) may have reduced S-cone responses by driving the antagonistic S-cone surround receptive field mediated by HII horizontal cell feedback (Dacey et al., 1996). Indeed, the Sun et al. (2006) experiments did not detect S-OFF midget RGCs, despite a growing consensus that these neurons make up 5–10% of OFF midget RGCs in the macaque central retina (Klug et al., 2003; Field et al., 2010; Tsukamoto and Omi, 2015; Patterson et al., 2019). Taken together, these results further demonstrate the need to account for both the spatial and spectral dimensions of midget RGC receptive fields.
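The dependence on stimulus choice can be illustrated with a toy center-surround calculation. This is a minimal sketch under stated assumptions: the function `s_cone_response`, the linear subtraction, and the gain values are hypothetical and chosen only for illustration; they are not taken from any of the cited recordings.

```python
def s_cone_response(center_drive, surround_drive,
                    center_gain=1.0, surround_gain=0.8):
    """Toy linear receptive field: S-cone center minus antagonistic surround.

    The gains are hypothetical placeholders, not measured values.
    """
    return center_gain * center_drive - surround_gain * surround_drive


# A small spot drives only the receptive field center.
spot = s_cone_response(center_drive=1.0, surround_drive=0.0)

# A full-field stimulus drives center and surround together.
full_field = s_cone_response(center_drive=1.0, surround_drive=1.0)

print(f"spot response:       {spot:.2f}")
print(f"full-field response: {full_field:.2f}")
```

With these assumed gains, the same S-cone modulation yields a much smaller net response when the surround is co-stimulated, which is one way a full-field protocol could underestimate S-cone input.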

Consideration of underlying theoretical perspectives and stimulus biases will be essential for designing future experiments linking color vision models to their underlying neural substrates. Also, a broader perspective may help answer the larger questions about how our eye and brain process visual information. Hopefully, future research using cutting-edge technologies will provide satisfying explanations for long unanswered mysteries of vision.

# AUTHOR CONTRIBUTIONS

SP wrote the manuscript. MN and JN edited the final version of the manuscript.

# FUNDING

This work was supported by NIH grants R01EY027859 (JN), T32EY07031 (SP), T32NS099578 (SP), P30EY001730 (Core Grant for Vision Research), and Research to Prevent Blindness. The National Institutes of Health contributed to the salaries of the authors and the institutional facilities. An unrestricted grant from Research to Prevent Blindness supports the authors' research efforts in the Department of Ophthalmology.

# ACKNOWLEDGMENTS

We thank Steve Buck and Ram Sabesan for helpful discussions.

## REFERENCES


**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2019 Patterson, Neitz and Neitz. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.
