<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<!DOCTYPE article PUBLIC "-//NLM//DTD Journal Publishing DTD v2.3 20070202//EN" "journalpublishing.dtd">
<article xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink" article-type="research-article">
<front>
<journal-meta>
<journal-id journal-id-type="publisher-id">Front. Comput. Neurosci.</journal-id>
<journal-title>Frontiers in Computational Neuroscience</journal-title>
<abbrev-journal-title abbrev-type="pubmed">Front. Comput. Neurosci.</abbrev-journal-title>
<issn pub-type="epub">1662-5188</issn>
<publisher>
<publisher-name>Frontiers Media S.A.</publisher-name>
</publisher>
</journal-meta>
<article-meta>
<article-id pub-id-type="doi">10.3389/fncom.2014.00085</article-id>
<article-categories>
<subj-group subj-group-type="heading">
<subject>Neuroscience</subject>
<subj-group>
<subject>Original Research Article</subject>
</subj-group>
</subj-group>
</article-categories>
<title-group>
<article-title>Finding and recognizing objects in natural scenes: complementary computations in the dorsal and ventral visual systems</article-title>
</title-group>
<contrib-group>
<contrib contrib-type="author" corresp="yes">
<name><surname>Rolls</surname> <given-names>Edmund T.</given-names></name>
<xref ref-type="aff" rid="aff1"><sup>1</sup></xref>
<xref ref-type="aff" rid="aff2"><sup>2</sup></xref>
<xref ref-type="author-notes" rid="fn001"><sup>&#x0002A;</sup></xref>
<uri xlink:href="http://community.frontiersin.org/people/u/16170"/>
</contrib>
<contrib contrib-type="author">
<name><surname>Webb</surname> <given-names>Tristan J.</given-names></name>
<xref ref-type="aff" rid="aff1"><sup>1</sup></xref>
<uri xlink:href="http://community.frontiersin.org/people/u/133027"/>
</contrib>
</contrib-group>
<aff id="aff1"><sup>1</sup><institution>Department of Computer Science, University of Warwick</institution> <country>Coventry, UK</country></aff>
<aff id="aff2"><sup>2</sup><institution>Oxford Centre for Computational Neuroscience</institution> <country>Oxford, UK</country></aff>
<author-notes>
<fn fn-type="edited-by"><p>Edited by: Hans P. Op De Beeck, University of Leuven (KU Leuven), Belgium</p></fn>
<fn fn-type="edited-by"><p>Reviewed by: Hans P. Op De Beeck, University of Leuven (KU Leuven), Belgium; Da-Hui Wang, Beijing Normal University, China</p></fn>
<fn fn-type="corresp" id="fn001"><p>&#x0002A;Correspondence: Edmund T. Rolls, Department of Computer Science, University of Warwick, Coventry CV4 7AL, UK e-mail: <email>edmund.rolls&#x00040;oxcns.org</email></p></fn>
<fn fn-type="other" id="fn002"><p>This article was submitted to the journal Frontiers in Computational Neuroscience.</p></fn>
</author-notes>
<pub-date pub-type="epub">
<day>12</day>
<month>08</month>
<year>2014</year>
</pub-date>
<pub-date pub-type="collection">
<year>2014</year>
</pub-date>
<volume>8</volume>
<elocation-id>85</elocation-id>
<history>
<date date-type="received">
<day>21</day>
<month>05</month>
<year>2014</year>
</date>
<date date-type="accepted">
<day>16</day>
<month>07</month>
<year>2014</year>
</date>
</history>
<permissions>
<copyright-statement>Copyright &#x000A9; 2014 Rolls and Webb.</copyright-statement>
<copyright-year>2014</copyright-year>
<license license-type="open-access" xlink:href="http://creativecommons.org/licenses/by/3.0/"><p>This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.</p>
</license>
</permissions>
<abstract><p>Searching for and recognizing objects in complex natural scenes is implemented by multiple saccades until the eyes reach within the reduced receptive field sizes of inferior temporal cortex (IT) neurons. We analyze and model how the dorsal and ventral visual streams both contribute to this. Saliency detection in the dorsal visual system including area LIP is modeled by graph-based visual saliency, and allows the eyes to fixate potential objects within several degrees. Visual information at the fixated location, subtending approximately 9&#x000B0; corresponding to the receptive fields of IT neurons, is then passed through a four-layer hierarchical model of the ventral cortical visual system, VisNet. We show that VisNet can be trained using a synaptic modification rule with a short-term memory trace of recent neuronal activity to capture both the required view and translation invariances, to allow approximately 90% correct object recognition in the model for 4 objects shown in any view across a range of 135&#x000B0; anywhere in a scene. The model was able to generalize correctly within the four trained views and the 25 trained translations. This approach analyzes the principles by which complementary computations in the dorsal and ventral visual cortical streams enable objects to be located and recognized in complex natural scenes.</p></abstract>
<kwd-group>
<kwd>object recognition</kwd>
<kwd>invariance</kwd>
<kwd>saliency</kwd>
<kwd>inferior temporal visual cortex</kwd>
<kwd>trace learning rule</kwd>
<kwd>VisNet</kwd>
</kwd-group>
<counts>
<fig-count count="6"/>
<table-count count="4"/>
<equation-count count="13"/>
<ref-count count="138"/>
<page-count count="19"/>
<word-count count="17384"/>
</counts>
</article-meta>
</front>
<body>
<sec sec-type="introduction" id="s1">
<title>1. Introduction</title>
<p>One of the major problems that is solved by the visual system in the cerebral cortex is the building of a representation of visual information that allows object and face recognition to occur relatively independently of size, contrast, spatial frequency, position on the retina, angle of view, lighting, etc. These invariant representations of objects, provided by the inferior temporal visual cortex (Rolls, <xref ref-type="bibr" rid="B84">2008</xref>, <xref ref-type="bibr" rid="B85">2012</xref>), are extremely important for the operation of many other systems in the brain, for if there is an invariant representation, it is possible to learn on a single trial about reward/punishment associations of the object, the place where that object is located, and whether the object has been seen recently, and then to correctly generalize to other views etc. of the same object (Rolls, <xref ref-type="bibr" rid="B84">2008</xref>, <xref ref-type="bibr" rid="B86">2014</xref>). Here we consider how the cerebral cortex solves the major computational task of view-invariant recognition of objects in complex natural scenes, still a major challenge for computer vision approaches, as described in the Discussion.</p>
<p>One mechanism that the brain uses to simplify the task of recognizing objects in complex natural scenes is that the receptive fields of inferior temporal cortex neurons change from approximately 70&#x000B0; in diameter when tested under classical neurophysiology conditions with a single stimulus on a blank screen to as little as a radius of 8&#x000B0; (for a 5&#x000B0; stimulus) when tested in a complex natural scene (Rolls et al., <xref ref-type="bibr" rid="B88">2003</xref>; Aggelopoulos and Rolls, <xref ref-type="bibr" rid="B4">2005</xref>) (with consistent findings described by Sheinberg and Logothetis, <xref ref-type="bibr" rid="B110">2001</xref>). This greatly simplifies the task for the object recognition system, for instead of dealing with the whole scene as in traditional computer vision approaches, the brain processes just a small fixated region of a complex natural scene at any one time, and then the eyes are moved to another part of the screen. During visual search for an object in a complex natural scene, the primate visual system, with its high resolution fovea, therefore keeps moving the eyes until they fall within approximately 8&#x000B0; of the target, and then inferior temporal cortex neurons respond to the target object, and an action can be initiated toward the target, for example to obtain a reward (Rolls et al., <xref ref-type="bibr" rid="B88">2003</xref>). The inferior temporal cortex neurons then respond to the object being fixated with view, size, and rotation invariance (Rolls, <xref ref-type="bibr" rid="B85">2012</xref>), and also need some translation invariance, for the eyes may not be fixating the center of the object when the inferior temporal cortex neurons respond (Rolls et al., <xref ref-type="bibr" rid="B88">2003</xref>).</p>
<p>The questions then arise of how the eyes are guided in a complex natural scene to fixate close to what may be an object; and how close the fixation is to the center of typical objects for this determines how much translation invariance needs to be built into the ventral visual system. It turns out that the dorsal visual system (Ungerleider and Mishkin, <xref ref-type="bibr" rid="B127">1982</xref>; Ungerleider and Haxby, <xref ref-type="bibr" rid="B126">1994</xref>) implements bottom-up saliency mechanisms by guiding saccades to salient stimuli, using properties of the stimulus such as high contrast, color, and visual motion (Miller and Buschman, <xref ref-type="bibr" rid="B63">2013</xref>). (Bottom-up refers to inputs reaching the visual system from the retina). One particular region, the lateral intraparietal cortex (LIP), which is an area in the dorsal visual system, seems to contain saliency maps sensitive to strong sensory inputs (Arcizet et al., <xref ref-type="bibr" rid="B5">2011</xref>). Highly salient, briefly flashed, stimuli capture both behavior and the response of LIP neurons (Bisley and Goldberg, <xref ref-type="bibr" rid="B10">2003</xref>, <xref ref-type="bibr" rid="B11">2006</xref>; Goldberg et al., <xref ref-type="bibr" rid="B37">2006</xref>). Inputs reach LIP via dorsal visual stream areas including area MT, and via V4 in the ventral stream (Soltani and Koch, <xref ref-type="bibr" rid="B111">2010</xref>; Miller and Buschman, <xref ref-type="bibr" rid="B63">2013</xref>). Although top-down attention using biased competition can facilitate the operation of attentional mechanisms, and is a subject of great interest (Desimone and Duncan, <xref ref-type="bibr" rid="B21">1995</xref>; Rolls and Deco, <xref ref-type="bibr" rid="B93">2002</xref>; Deco and Rolls, <xref ref-type="bibr" rid="B17c">2005a</xref>; Miller and Buschman, <xref ref-type="bibr" rid="B63">2013</xref>), top-down object-based attention makes only a small contribution to visual search for an object in a complex natural unstructured scene (such as leaves on a tree), increasing the receptive field size from a radius of approximately 7.8 to approximately 9.6&#x000B0; (Rolls et al., <xref ref-type="bibr" rid="B88">2003</xref>), and is not considered further here. Indeed, in these investigations, multiple saccades were required round the scene to find a target object (Rolls et al., <xref ref-type="bibr" rid="B88">2003</xref>).</p>
<p>In the research described here we investigate computationally how a bottom-up saliency mechanism in the dorsal visual stream reaching for example area LIP could operate in conjunction with invariant object recognition performed by the ventral visual stream reaching the inferior temporal visual cortex to provide for invariant object recognition in natural scenes. The hypothesis is that the dorsal visual stream, in conjunction with structures such as the superior colliculus (Knudsen, <xref ref-type="bibr" rid="B53">2011</xref>), uses saliency to guide saccadic eye movements to salient stimuli in large parts of the visual field, and that once a stimulus has been fixated, the ventral visual stream performs invariant object recognition on the region being fixated. The dorsal visual stream in this process knows little about invariant object recognition, so cannot identify objects in natural scenes. Similarly, the ventral visual stream cannot perform the whole process, for it cannot efficiently find possible objects in a large natural scene, because its receptive fields are only approximately 9&#x000B0; in radius in complex natural scenes. It is how the dorsal and ventral streams work together to implement invariant object recognition in natural scenes that we investigate here. By investigating this computationally, we are able to test whether the dorsal visual stream can find objects with sufficient accuracy to enable the ventral visual stream to perform the invariant object recognition. The issue here is that the ventral visual stream has in practice some translation invariance in natural scenes, but this is limited to approximately 9&#x000B0; (Rolls et al., <xref ref-type="bibr" rid="B88">2003</xref>; Aggelopoulos and Rolls, <xref ref-type="bibr" rid="B4">2005</xref>). The computational reason why the ventral visual stream does not compute translation invariant representations over the whole visual field as well as view, size and rotation invariance, is that the computation is too complex. Indeed, it is a problem that has not been fully solved in computer vision systems when they try to perform invariant object recognition over a large natural scene. The brain takes a different approach, of simplifying the problem by fixating on one part of the scene at a time, and solving the somewhat easier problem of invariant representations within a region of approximately 9&#x000B0;.</p>
<p>For this scenario to operate, the ventral visual stream needs then to implement view invariant recognition, but to combine it with some translation invariance, as the fixation position produced by bottom up saliency will not be at the center of an object, and indeed may be considerably displaced from the center of an object. In the model of invariant visual object recognition that we have developed, VisNet, which models the hierarchy of visual areas in the ventral visual stream by using competitive learning to develop feature conjunctions supplemented by a temporal trace or by spatial continuity or both, all previous investigations have explored either view or translation invariance learning, but not both (Rolls, <xref ref-type="bibr" rid="B85">2012</xref>). Combining translation and view invariance learning is a considerable challenge, for the number of transforms becomes the product of the numbers of each transform type, and it is not known how VisNet (or any other biologically plausible approach to invariant object recognition) will perform with the large number, and with the two types of transform combined. Indeed, an important part of the research described here was to investigate how well architectures of the VisNet type generalize between both trained locations and trained views. This is important for setting the numbers of different views and translations of each object that must be trained.</p>
<p>The specific goals of the research and simulations described here were as follows. (1) To demonstrate with a biologically plausible model of the ventral visual system how it could operate to implement view invariant object/person identity recognition with a generic model of the dorsal visual system that produced fixations on parts of scenes that were salient. How would the combined cortical visual areas operate with the dorsal visual system not encoding object identity but only saliency; and the ventral visual system being unable to find objects efficiently in large natural scenes, but able to perform view invariant object recognition once fixation was close to an object? (2) How closely and effectively would a simple, generic, bottom-up saliency system modeling part of the functions of the dorsal visual system find objects in a complex scene, and how accurately would the center of the object be fixated? The accuracy with which the center of the object is fixated is crucial to understand, for this defines how much translation invariance must be incorporated into the ventral visual system for the whole system to work. (3) Can VisNet be trained for both view and translation invariance? This has not been attempted previously with VisNet, and for that matter view invariant object recognition is not a property of most computer vision models (see Discussion). (4) If VisNet can be trained on both view and translation invariant object identification, can it be trained with sufficient translation invariance to cover the visual angle needed given the inaccuracies of the saliency-based fixation mechanism in finding the center of an object, and yet be trained with sufficient views to provide for view-invariant object identification? (5) How well does VisNet generalize from trained views to untrained views of an object? This is important, for it influences how much training of different views is required, which could have an impact on the capacity of the system, that is on the number of objects or people that it can correctly identify with the required translation invariance. (6) How well does VisNet perform in object identification when the objects appear in natural scenes with fixation not necessarily at the trained location, and when views intermediate to those at which VisNet has been trained are presented? That is, how well under the natural scene conditions can VisNet ignore the background and identify a trained object despite it being presented in a view and position that were not trained?</p>
</sec>
<sec sec-type="methods" id="s2">
<title>2. Methods</title>
<sec>
<title>2.1. Saliency</title>
<p>We chose a standard bottom-up saliency algorithm, which adopts the Itti and Koch (<xref ref-type="bibr" rid="B48">2000</xref>) approach to visual saliency and implements it with graph-based visual saliency (GBVS) algorithms (Harel et al., <xref ref-type="bibr" rid="B40">2006a</xref>,<xref ref-type="bibr" rid="B41">b</xref>). This system performs well, that is, similarly to humans, in many bottom-up saliency tasks. The particular algorithm used for the bottom-up saliency was not crucial to the present research, so we chose a generically representative algorithm<xref ref-type="fn" rid="fn0001"><sup>1</sup></xref>. We used static images, so motion was not used to detect saliency. Of course, in the human brain, and in a computer application, performance could be made better than described here by using many different cues that influence saliency, including color, which was disabled in the current algorithm because VisNet works with grayscale images to help ensure that object shape, and not a simple feature such as color, is being processed (Rolls, <xref ref-type="bibr" rid="B85">2012</xref>).</p>
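<p>For illustration, a minimal Python sketch of the general form of such a bottom-up saliency computation is given below: a center-surround intensity-contrast map of the kind on which this class of model is based. It is an illustrative stand-in for, and not a reproduction of, the GBVS code used here; the function names and parameter values are ours.</p>
<preformat preformat-type="code">
# A minimal center-surround saliency sketch (intensity contrast only, since the
# simulations described here use grayscale images and no motion).  This is an
# illustrative stand-in, not the GBVS implementation used in the paper; all
# names and parameter values here are our own.
import numpy as np
from scipy.ndimage import gaussian_filter

def saliency_map(image, center_sigmas=(1, 2), surround_scale=4, out_sigma=8):
    """Return a normalized bottom-up saliency map for a grayscale image."""
    img = image.astype(float)
    sal = np.zeros_like(img)
    for sc in center_sigmas:
        center = gaussian_filter(img, sc)                     # fine-scale response
        surround = gaussian_filter(img, sc * surround_scale)  # coarse-scale response
        sal += np.abs(center - surround)                      # center-surround contrast
    sal = gaussian_filter(sal, out_sigma)                     # smooth the conspicuity map
    sal -= sal.min()
    if sal.max() > 0:
        sal /= sal.max()
    return sal
</preformat>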
</sec>
<sec>
<title>2.2. Architecture of the ventral visual stream model, VisNet</title>
<p>The architecture of VisNet has been described previously (Rolls, <xref ref-type="bibr" rid="B84">2008</xref>, <xref ref-type="bibr" rid="B85">2012</xref>), and is summarized briefly next, with a full description provided in the Appendix. Extensions important for the present research included training in both view and translation invariance, together with careful specification of the learning rate during the presentation of each transform, as there were typically 100 or more transforms of every object to be learned.</p>
<p>Fundamental elements of Rolls&#x00027; <xref ref-type="bibr" rid="B80">1992</xref> theory for how cortical networks might implement invariant object recognition are described in detail elsewhere (Rolls, <xref ref-type="bibr" rid="B84">2008</xref>, <xref ref-type="bibr" rid="B85">2012</xref>). They provide the basis for the design of VisNet, which can be summarized as:</p>
<list list-type="bullet">
<list-item><p>A series of competitive networks, organized in hierarchical layers, exhibiting mutual inhibition over a short range within each layer. These networks allow combinations of features or inputs occurring in a given spatial arrangement to be learned by neurons using competitive learning (Rolls, <xref ref-type="bibr" rid="B84">2008</xref>), ensuring that higher order spatial properties of the input stimuli are represented in the network. In VisNet, layer 1 corresponds to V2, layer 2 to V4, layer 3 to posterior inferior temporal visual cortex, and layer 4 to anterior inferior temporal cortex. Layer one is preceded by a simulation of the Gabor-like receptive fields of V1 neurons produced by each image presented to VisNet (Rolls, <xref ref-type="bibr" rid="B85">2012</xref>).</p></list-item>
<list-item><p>A convergent series of connections from a localized population of neurons in the preceding layer to each neuron of the following layer, thus allowing the receptive field size of neurons to increase through the visual processing areas or layers, as illustrated in Figure <xref ref-type="fig" rid="F1">1</xref>.</p></list-item>
<list-item><p>A modified associative (Hebb-like) learning rule incorporating a temporal trace of each neuron&#x00027;s previous activity, which, it has been shown (F&#x000F6;ldi&#x000E1;k, <xref ref-type="bibr" rid="B29">1991</xref>; Rolls, <xref ref-type="bibr" rid="B80">1992</xref>; Wallis et al., <xref ref-type="bibr" rid="B131">1993</xref>; Wallis and Rolls, <xref ref-type="bibr" rid="B130">1997</xref>; Rolls and Milward, <xref ref-type="bibr" rid="B95">2000</xref>; Rolls, <xref ref-type="bibr" rid="B85">2012</xref>), enables the neurons to learn transform invariances.</p></list-item>
</list>
<p>The learning rates for the four layers were 0.05, 0.03, 0.005, and 0.005, as these rates were shown to produce convergence of the synaptic weights after 15&#x02013;50 training epochs. Fifty training epochs were run.</p>
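<p>A minimal sketch of a trace-rule weight update of the general form used in VisNet is given below, with the per-layer learning rates stated above. The trace parameter and the weight normalization details are specified in the Appendix; the value and names used here are illustrative only.</p>
<preformat preformat-type="code">
# Sketch of a trace-rule update of the general form used in VisNet
# (a Hebb-like rule using a temporal trace of postsynaptic activity).
# The trace parameter eta and the normalization details are given in the
# Appendix; the values and names below are illustrative only.
import numpy as np

LEARNING_RATES = [0.05, 0.03, 0.005, 0.005]     # layers 1-4, as stated in the text

def trace_update(weights, x, y, y_trace, alpha, eta=0.8):
    """One trace-rule update for a layer of competitive neurons.

    weights : (n_post, n_pre) synaptic matrix
    x       : (n_pre,) presynaptic firing rates for the current transform
    y       : (n_post,) postsynaptic firing rates after competition
    y_trace : (n_post,) trace of recent postsynaptic activity
    """
    y_trace = (1.0 - eta) * y + eta * y_trace          # temporal trace of activity
    weights += alpha * np.outer(y_trace, x)            # Hebb-like trace update
    # Renormalize each neuron's weight vector (competitive learning)
    norms = np.linalg.norm(weights, axis=1, keepdims=True)
    weights /= np.maximum(norms, 1e-12)
    return weights, y_trace

# e.g. for layer 1:
# weights, y_trace = trace_update(weights, x, y, y_trace, alpha=LEARNING_RATES[0])
</preformat>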
<fig id="F1" position="float">
<label>Figure 1</label>
<caption><p><bold>Convergence in the visual system. Right:</bold> As it occurs in the brain. V1, visual cortex area V1; TEO, posterior inferior temporal cortex; TE, inferior temporal cortex (IT). <bold>Left:</bold> As implemented in VisNet. Convergence through the network is designed to provide fourth layer neurons with information from across the entire input retina.</p></caption>
<graphic xlink:href="fncom-08-00085-g0001.tif"/>
</fig>
<p>The developments to VisNet that facilitated this principled approach to the learning rate, combined view and translation invariance learning, etc, and the parameters used, are described in the Appendix.</p>
</sec>
<sec>
<title>2.3. Information measures of performance</title>
<p>The performance of VisNet was measured by Shannon information-theoretic measures that are essentially identical to those used to quantify the specificity and selectivity of the representations provided by neurons in the brain (Rolls and Milward, <xref ref-type="bibr" rid="B95">2000</xref>; Rolls and Treves, <xref ref-type="bibr" rid="B103">2011</xref>; Rolls, <xref ref-type="bibr" rid="B85">2012</xref>). A single cell information measure indicated how much information was conveyed by a single neuron about the most effective stimulus. A multiple cell information measure indicated how much information about every stimulus was conveyed by small populations of neurons, and was used to ensure that all stimuli had some neurons conveying information about them. Details are provided in the Appendix.</p>
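<p>For illustration, the sketch below computes a single cell information measure of the form used here: the stimulus-specific information I(s, R) = &#x02211;<sub>r</sub> P(r|s) log<sub>2</sub>[P(r|s)/P(r)], reported for the most effective stimulus. It assumes that the responses have already been binned into discrete firing-rate bins, and omits the smoothing and bias-correction details given in the Appendix.</p>
<preformat preformat-type="code">
# Sketch of the single-cell information measure: the stimulus-specific
# information I(s, R), reported for the most effective stimulus.  Responses
# are assumed to have been binned into discrete firing-rate bins; smoothing
# and bias-correction details are omitted here (see Appendix).
import numpy as np

def single_cell_information(binned_responses):
    """binned_responses[s, t]: integer rate bin for stimulus s on trial t.
    Returns the maximum over stimuli of I(s, R), in bits."""
    responses = np.asarray(binned_responses, dtype=int)
    n_stim, n_trials = responses.shape
    n_bins = responses.max() + 1
    p_r_given_s = np.zeros((n_stim, n_bins))
    for s in range(n_stim):
        counts = np.bincount(responses[s], minlength=n_bins)
        p_r_given_s[s] = counts / n_trials                 # P(r|s) for each stimulus
    p_r = p_r_given_s.mean(axis=0)                         # P(r), equiprobable stimuli
    info = np.zeros(n_stim)
    for s in range(n_stim):
        nz = p_r_given_s[s] > 0                            # avoid log(0) terms
        info[s] = np.sum(p_r_given_s[s, nz] * np.log2(p_r_given_s[s, nz] / p_r[nz]))
    return info.max()
</preformat>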
</sec>
<sec>
<title>2.4. Training</title>
<p>VisNet was trained on four views, spaced 45&#x000B0; apart, of each of the 4 objects, as illustrated in Figure <xref ref-type="fig" rid="F2">2</xref>. The images of each object were generated from a 3D model using Blender (The Blender Foundation, <ext-link ext-link-type="uri" xlink:href="http://www.blender.org">www.blender.org</ext-link>) so that lighting could be carefully controlled. Each grayscale image of an object was 256 &#x000D7; 256 pixels, with the intensity scaled to be in the range 0&#x02013;255, and the background approximately 127. The object images were pasted into a 512 &#x000D7; 512 gray image to prevent wrap-around effects, prior to the spatial frequency filtering to produce neurons with Gabor-like receptive fields in an emulation of V1 neurons that provided the input to the first layer of VisNet (see Appendix). [We have previously shown that the training need not be on a blank background, provided that the background is not constant across transforms and objects, as will be the case in the natural world (Stringer et al., <xref ref-type="bibr" rid="B117">2007</xref>; Stringer and Rolls, <xref ref-type="bibr" rid="B116">2008</xref>)].</p>
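<p>A minimal sketch of this image preparation step is given below, assuming a 256 &#x000D7; 256 grayscale object image; the V1-like Gabor filtering that follows is specified in the Appendix and is not reproduced here.</p>
<preformat preformat-type="code">
# Minimal sketch of the image preparation step: a 256 x 256 grayscale object
# image is scaled to the range 0-255 and pasted into the center of a
# 512 x 512 background of intensity 127, before the V1-like Gabor filtering
# (specified in the Appendix and not reproduced here).
import numpy as np

def prepare_training_image(obj_image, canvas_size=512, background=127):
    img = obj_image.astype(float)
    img = 255.0 * (img - img.min()) / max(img.max() - img.min(), 1e-12)
    canvas = np.full((canvas_size, canvas_size), float(background))
    r0 = (canvas_size - img.shape[0]) // 2          # top-left corner for centering
    c0 = (canvas_size - img.shape[1]) // 2
    canvas[r0:r0 + img.shape[0], c0:c0 + img.shape[1]] = img
    return canvas
</preformat>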
<fig id="F2" position="float">
<label>Figure 2</label>
<caption><p><bold>Training images: 4 views of each of 4 objects</bold>. Each image was 256 &#x000D7; 256 pixels.</p></caption>
<graphic xlink:href="fncom-08-00085-g0002.tif"/>
</fig>
<p>Each training image was trained in 25 locations set out in a 5 &#x000D7; 5 rectangular grid with these locations separated by 16 pixels in the training image. To provide an indication of the range of this translation invariance training, the grid extended between the centers of the headlights in the front view of the jeep shown in Figure <xref ref-type="fig" rid="F2">2</xref>. This resulted in 100 transforms of each object to be learned. To enable VisNet to learn invariant representations with the trace synaptic learning rule, all the transforms of one object were shown in a random permuted sequence, the trace was reset, and the procedure was repeated with each of the other objects. Fifty training epochs were run, as this was sufficient to produce gradual convergence of the synaptic weights over 15&#x02013;50 epochs, as described in the Appendix.</p>
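<p>The presentation order can be summarized by the sketch below: for each object, its 100 transforms (4 views &#x000D7; 25 grid locations) are shown in a randomly permuted sequence, the activity trace is reset, and the next object is then trained, for 50 epochs. The function <monospace>present_to_visnet</monospace> is a hypothetical stand-in for the filtering and layer-by-layer trace-rule updates described in the Appendix.</p>
<preformat preformat-type="code">
# Sketch of the training protocol: for each object, all 100 transforms
# (4 views x 25 grid locations) are shown in a randomly permuted sequence,
# the activity trace is reset, and the next object is then trained.
import itertools
import random

def present_to_visnet(obj, view, dx, dy, reset_trace):
    """Hypothetical placeholder for filtering the transform and applying the
    trace-rule updates to each layer in turn (see Appendix)."""
    pass

views = [270, 315, 0, 45]                            # views of each object (degrees)
offsets = [-32, -16, 0, 16, 32]                      # 5 x 5 grid spanning 64 x 64 pixels
grid = list(itertools.product(offsets, offsets))     # 25 training locations
random.seed(0)

for epoch in range(50):                              # 50 training epochs
    for obj in range(4):                             # 4 objects
        transforms = list(itertools.product(views, grid))   # 100 transforms per object
        random.shuffle(transforms)                   # random permuted sequence
        reset_trace = True                           # trace reset between objects
        for view, (dx, dy) in transforms:
            present_to_visnet(obj, view, dx, dy, reset_trace=reset_trace)
            reset_trace = False
</preformat>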
</sec>
<sec>
<title>2.5. Testing invariant object recognition in natural scenes</title>
<p>Eight of the 12 test scenes are illustrated in Figure <xref ref-type="fig" rid="F3">3A</xref>. Each scene had each of the objects in one of its four views. The aim of the combined visual processing was for the dorsal visual stream to detect the salient regions in these 12 scenes, and then for the salient regions to be passed to VisNet to perform the view (and translation) invariant object recognition for every object in the scene. VisNet had been trained on the 4 objects in each of the 4 views, but not on the background scenes, and it was part of the task of VisNet to identify each of the four objects in every scene without being affected by the background clutter of each scene (Stringer and Rolls, <xref ref-type="bibr" rid="B114">2000</xref>). The objects used in this investigation were common types of object with which the human visual system performs good view-invariant identification: people and vehicles. Two people and two vehicles were chosen to provide evidence on how the system might operate with typical stimuli for which view-invariant identification is necessary and is performed by the human visual system.</p>
<fig id="F3" position="float">
<label>Figure 3</label>
<caption><p><bold>(A)</bold> Eight of the 12 test scenes. Each scene has 4 objects, each in one of its four views. <bold>(B)</bold> The bottom up saliency map generated by the GBVS code for one of the scenes. The highest levels in the saliency map are red, and the lowest blue. <bold>(C)</bold> Rectangles (384 &#x000D7; 384 pixels) placed around each peak in the scene for which the bottom-up saliency map is illustrated in <bold>(B)</bold>.</p></caption>
<graphic xlink:href="fncom-08-00085-g0003.tif"/>
</fig>
</sec>
</sec>
<sec sec-type="results" id="s3">
<title>3. Results</title>
<sec>
<title>3.1. The operation of the saliency processing</title>
<p>The bottom-up saliency map generated by the GBVS code (acting as a surrogate for the dorsal visual system) for one of the scenes is illustrated in Figure <xref ref-type="fig" rid="F3">3B</xref>. The saliency map provides, of course, no indication of whether a peak corresponds to a trained object, nor of which object it might be.</p>
<p>The saliency maps generated by GBVS correspond closely to the saccades and resulting fixations of humans (Itti and Koch, <xref ref-type="bibr" rid="B48">2000</xref>; Harel et al., <xref ref-type="bibr" rid="B40">2006a</xref>,<xref ref-type="bibr" rid="B41">b</xref>). We therefore extracted images from the scene that were centered on each peak of the saliency map. A weighted centroid was used, as implemented in MATLAB. Each extracted image centered on a peak in the saliency map was 384 &#x000D7; 384 pixels (not the originally trained 256 &#x000D7; 256 size of a training image), because sometimes a saliency peak was not well centered on an object, and we wished to be sure that the whole object was in the image presented to VisNet. Figure <xref ref-type="fig" rid="F3">3C</xref> shows rectangles produced in this way round the 6 most salient regions in the test scene for which the saliency map is shown in Figure <xref ref-type="fig" rid="F3">3B</xref>. Four of the saliency peaks, and therefore the rectangles, contained trained objects, and two of the extracted images contained just salient parts of the background scene in which the trained objects appeared.</p>
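<p>A minimal sketch of this extraction step is given below. The labeling of saliency peaks and the intensity-weighted center of mass stand in for the MATLAB weighted-centroid computation used here, and the threshold value is an illustrative assumption.</p>
<preformat preformat-type="code">
# Sketch of extracting 384 x 384 image patches centered on the weighted
# centroid of each peak in the saliency map.  scipy's labelling and
# intensity-weighted center of mass stand in for the MATLAB weighted-centroid
# call used in the paper; the threshold of 0.5 is an illustrative assumption.
import numpy as np
from scipy import ndimage

def extract_salient_patches(scene, sal_map, patch=384, threshold=0.5):
    """Return a list of (patch_image, (row, col)) for each saliency peak."""
    labels, n_regions = ndimage.label(sal_map > threshold)
    centroids = ndimage.center_of_mass(sal_map, labels, list(range(1, n_regions + 1)))
    patches = []
    half = patch // 2
    padded = np.pad(scene, half, mode='edge')   # so patches near borders stay full size
    for (r, c) in centroids:
        r, c = int(round(r)), int(round(c))
        patches.append((padded[r:r + patch, c:c + patch], (r, c)))
    return patches
</preformat>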
<p>The extracted (&#x0201C;foveated&#x0201D;) images of the objects to be presented to VisNet based on saliency are not always well-centered in the 384 &#x000D7; 384 extracted image, and this is clear for one of the objects, the man, as shown in Figure <xref ref-type="fig" rid="F3">3C</xref>.</p>
<p>To provide evidence on the degree of translation invariance that would be required of VisNet, given that the peak of the saliency map, on which each extracted image was centered, was not always at the center of an object, so that the object would be offset from the central trained location, the offsets of the saliency peaks from the center of each object image are shown in Figure <xref ref-type="fig" rid="F4">4</xref>. While the majority of the offsets of the saliency peak from the center of the object were in the range 0&#x02013;32 pixels, some were beyond this. For this reason, we do not necessarily expect that VisNet, trained on a grid with offsets up to 32 pixels, would achieve 100% correct object recognition. The evidence shown in Figure <xref ref-type="fig" rid="F4">4</xref> does, though, provide the useful indication that training to allow for offsets up to 64 pixels for a 256 &#x000D7; 256 image might improve performance.</p>
<fig id="F4" position="float">
<label>Figure 4</label>
<caption><p><bold>Distribution of the offsets of the saliency peaks from the center of each object</bold>. The data were obtained for 48 images (different views of the different objects) presented in 3 backgrounds. An example of one of the backgrounds containing one view of each of four objects is illustrated in Figure <xref ref-type="fig" rid="F3">3C</xref>.</p></caption>
<graphic xlink:href="fncom-08-00085-g0004.tif"/>
</fig>
</sec>
<sec>
<title>3.2. Tests of VisNet on view and translation invariance</title>
<p>Although VisNet had been trained on a 25-location grid with size 64 &#x000D7; 64 with spacing of 16 pixels, and with 4 different views of each object, we did not know how well VisNet would perform on this task, as this had not been tested before, nor whether performance would generalize to intermediate locations in the 64 &#x000D7; 64 grid, given that there were only 25 training locations spaced 16 pixels apart. An analysis is shown in Figure <xref ref-type="fig" rid="F5">5A</xref>, which covers the 4096 locations in the 64 &#x000D7; 64 grid. This indicates that the performance (on the view invariant object recognition) peaks at the trained locations (0, 16, and 32 in this figure), but also that there is reasonable performance at intermediate locations between the training locations. (The chance performance with 4 objects is 25% correct.) This is an important new result, which adds to previous evidence that smaller versions of VisNet with 32 &#x000D7; 32 neurons in each of 4 layers can generalize reasonably across intermediate untrained locations in scenes with blank backgrounds (Wallis and Rolls, <xref ref-type="bibr" rid="B130">1997</xref>). The performance was measured with a pattern associator trained on layer 4 of VisNet, with four output neurons (one for each object), and the 25 most selective cells for each object identified using the single cell information measure (see Appendix). The best cells were quite selective for one of the objects, and quite invariant in their response over the 100 transforms (4 views and 25 locations), as illustrated in Figure <xref ref-type="fig" rid="F5">5B</xref>.</p>
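<p>A minimal sketch of this readout is given below: the selected layer 4 cells are associated with four output neurons by a Hebb-like rule, and a test transform is assigned to the object whose output neuron receives the largest dot-product activation. The prior selection of the 25 most selective cells per object by the single cell information measure, and the exact procedure, are described in the Appendix.</p>
<preformat preformat-type="code">
# Minimal sketch of the pattern-associator readout: four output neurons (one
# per object) are trained by an associative (Hebb-like) rule on the firing of
# the selected layer-4 cells, and a test transform is assigned to the object
# whose output neuron has the largest dot-product activation.  Cell selection
# by the single-cell information measure is assumed to have been done already.
import numpy as np

def train_pattern_associator(rates, labels, n_objects=4):
    """rates: (n_presentations, n_cells) layer-4 firing; labels: object index."""
    n_cells = rates.shape[1]
    weights = np.zeros((n_objects, n_cells))
    for r, obj in zip(rates, labels):
        weights[obj] += r                  # Hebbian association with the object unit
    return weights

def decode(weights, rate_vector):
    """Return the object whose output neuron receives the largest activation."""
    return int(np.argmax(weights @ rate_vector))
</preformat>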
<fig id="F5" position="float">
<label>Figure 5</label>
<caption><p><bold>(A)</bold> The performance on the view invariant object recognition tested with images at the 15 trained locations on the 64 &#x000D7; 64 training grid, and at intermediate locations. The ordinate shows the distance from the central line in the training grid, and trained locations thus correspond to offsets of 0, 16, and 32. The mean and standard deviation are shown for each data point. The standard deviation was measured by performing the training ten times each with a different random seed to generate the connectivity of VisNet. Performance decreases beyond an offset of 32, because there was no translation invariant training beyond this. <bold>(B)</bold> A neuron in layer 4 of VisNet that responded to almost all transforms of one object (4), and to no transform of any other object (1&#x02013;3). There were 25 location transforms on a grid of size 64 with a spacing of 16, and 4 views of each object at each location. The stimulus-specific information or surprise was 2 bits, as there were 4 objects.</p></caption>
<graphic xlink:href="fncom-08-00085-g0005.tif"/>
</fig>
</sec>
<sec>
<title>3.3. Tests of the whole saliency plus view invariance system</title>
<p>With 48 images extracted from the 12 test scenes (8 illustrated in Figure <xref ref-type="fig" rid="F3">3A</xref>), performance was 90% correct (43/48 correct), where chance with the four objects is 25% (Fisher test <italic>p</italic> &#x0226A; 0.0001).</p>
<p>It is important that this good performance on the identification task was found when the images extracted for presentation to VisNet had background parts of the scene included (e.g., Figure <xref ref-type="fig" rid="F3">3C</xref>). These background features did not produce large decreases in the performance of VisNet, given that VisNet had been trained on the objects but not on the backgrounds (Stringer and Rolls, <xref ref-type="bibr" rid="B114">2000</xref>). This is important for the processes of invariant visual object identification in novel complex natural scenes described here. Further, if there was a low-amplitude saliency peak containing only part of the background scene and not an object, then VisNet did not respond to this as a trained object. When errors were made by VisNet on the object identification, the confusions were as frequent between the classes of people and vehicles as within these classes.</p>
</sec>
<sec>
<title>3.4. Tests of view plus translation invariance at intermediate views</title>
<p>The training images had four views of each object separated by 45&#x000B0;, as illustrated in Figure <xref ref-type="fig" rid="F2">2</xref>. To assess whether these views were sufficiently close to allow for generalization between the trained views, we tested VisNet with 6 intermediate views (presented on plain backgrounds) between each trained view. As shown in Figure <xref ref-type="fig" rid="F6">6</xref>, performance was reasonable at the untrained intermediate views. The important implication is that VisNet does not need to be trained on a large set of closely spaced views, and this helps the rapid learning of new objects, and also may help to increase the capacity of VisNet, as only a few views of each new object need to be learned.</p>
<fig id="F6" position="float">
<label>Figure 6</label>
<caption><p><bold>Performance of VisNet at views intermediate to the trained views of 270, 315, 0, and 45&#x000B0;, which are indicated by T</bold>. Performance was tested at 6 intermediate views between each trained view, and then for illustrative purposes the results for the 6 intermediate views were averaged using adjacent views. Each data point shown is the average of 12 observations. The chance level of performance, 25%, is indicated.</p></caption>
<graphic xlink:href="fncom-08-00085-g0006.tif"/>
</fig>
</sec>
</sec>
<sec sec-type="discussion" id="s4">
<title>4. Discussion</title>
<p>By combining in a simulation the operation of the dorsal and ventral visual systems in the identification of objects in complex natural scenes, we believe that important progress has been made, in a biologically inspired approach not attempted in other approaches, including computer-based ones. The models simulated show how the brain may solve this major computational problem by moving the eyes to fixate close to objects in a natural scene using bottom-up saliency implemented in the dorsal visual system, and then performing object recognition successively for each of the fixated regions using the ventral visual system. The research described here emphasizes that, because saliency-based fixation does not locate the center of objects, translation invariance as well as view, size, etc., invariance needs to be implemented in the ventral visual system. We show how a model of invariant object recognition in the ventral visual system, VisNet, can perform the required combination of translation and view invariant recognition, and moreover can generalize between views of objects that are 45&#x000B0; apart during training, and can also generalize to intermediate locations when trained on a coarse training grid with the spacing between trained locations equivalent to 1&#x02013;3&#x000B0;.</p>
<p>We emphasize that the model is closely linked to neurophysiological research on visual object recognition in natural scenes, and explicitly models how the system could operate computationally to achieve the degree of translation invariance shown in complex natural scenes by inferior temporal cortex neurons (Rolls et al., <xref ref-type="bibr" rid="B88">2003</xref>; Aggelopoulos and Rolls, <xref ref-type="bibr" rid="B4">2005</xref>) as well as the view invariance that is combined with this (Hasselmo et al., <xref ref-type="bibr" rid="B42">1989</xref>; Booth and Rolls, <xref ref-type="bibr" rid="B12">1998</xref>). Moreover, the deformation or pose invariance that can be shown by inferior temporal cortex neurons is also a property that can be learned by this computational model of the functional architecture of object recognition in the ventral visual system, VisNet (Webb and Rolls, <xref ref-type="bibr" rid="B133">2014</xref>).</p>
<p>We note that in the underlying neurophysiological experiments, the objects were small and were presented in an unstructured scene, which was the leaves of trees (Rolls et al., <xref ref-type="bibr" rid="B88">2003</xref>). In this type of scene, objects can only be found by repeated saccades round the scene until the eyes become sufficiently close for the object to fall within the inferior temporal visual cortex neuronal receptive fields, which become dynamically reduced to a few degrees in such scenes (Rolls et al., <xref ref-type="bibr" rid="B88">2003</xref>). The receptive fields of inferior temporal cortex neurons are thus small, a few degrees, in complex natural scenes (Rolls et al., <xref ref-type="bibr" rid="B88">2003</xref>; Aggelopoulos and Rolls, <xref ref-type="bibr" rid="B4">2005</xref>). In previous research, sometimes large receptive fields have been reported (Gross et al., <xref ref-type="bibr" rid="B39">1969</xref>), and sometimes small, a few degrees (Op de Beeck and Vogels, <xref ref-type="bibr" rid="B69">2000</xref>; DiCarlo and Maunsell, <xref ref-type="bibr" rid="B24">2003</xref>). We showed that an important factor in the receptive field size is the background. If the receptive fields are measured as in traditional visual neurophysiology against a blank background, then the receptive fields can be as large as 70&#x000B0;, whereas in a complex cluttered natural scene the receptive fields can be as small as a few degrees (Rolls et al., <xref ref-type="bibr" rid="B88">2003</xref>). Moreover, we went on to show that the underlying dynamical mechanism for receptive field size adjustment is probably competition between neurons, in which neurons that receive more input from objects close to the fovea have an advantage (Trappenberg et al., <xref ref-type="bibr" rid="B124">2002</xref>). If objects can be recognized by humans rapidly without the need for multiple fixations round the scene (Thorpe, <xref ref-type="bibr" rid="B119">2009</xref>), then one has to assume that the scene has properties, including probably some structure or contrast or color or other low-level feature (Crouzet and Thorpe, <xref ref-type="bibr" rid="B15">2011</xref>), that enable the object to pop out using lower-level processing that does not engage the invariant representations provided by inferior temporal cortex neurons (Rolls, <xref ref-type="bibr" rid="B85">2012</xref>).</p>
<p>The operation of VisNet, coupled with the saliency model of the dorsal visual system described here, for the identification of multiple objects at different positions in a natural scene with view invariance, is now compared with that of other systems and approaches. First, VisNet provides a theory and model of how object identification with view (Stringer and Rolls, <xref ref-type="bibr" rid="B115">2002</xref>), size (Wallis and Rolls, <xref ref-type="bibr" rid="B130">1997</xref>), isomorphic rotation, translation (Stringer and Rolls, <xref ref-type="bibr" rid="B114">2000</xref>; Perry et al., <xref ref-type="bibr" rid="B73">2010</xref>), contrast, illumination (Rolls and Stringer, <xref ref-type="bibr" rid="B97">2006</xref>), and spatial frequency invariance is performed in the cerebral cortex (Rolls, <xref ref-type="bibr" rid="B85">2012</xref>). The approach addresses fundamental issues about how the cerebral cortex functions. VisNet models four stages of visual processing beyond V1, and simulates V1; it uses local, biologically plausible, synaptic learning rules; it produces neurons in its layer 4 that are comparable to neurons recorded in the inferior temporal visual cortex (IT) (Rolls and Treves, <xref ref-type="bibr" rid="B103">2011</xref>; Rolls, <xref ref-type="bibr" rid="B85">2012</xref>) in terms of their receptive fields and how they are influenced by multiple items in a scene and by top-down attention (Trappenberg et al., <xref ref-type="bibr" rid="B124">2002</xref>; Rolls et al., <xref ref-type="bibr" rid="B88">2003</xref>); in terms of the neuronal tuning to different objects (though VisNet has somewhat more binary neurons than IT neurons) (Rolls, <xref ref-type="bibr" rid="B84">2008</xref>, <xref ref-type="bibr" rid="B85">2012</xref>; Rolls and Treves, <xref ref-type="bibr" rid="B103">2011</xref>); and in terms of size, view, translation, spatial frequency, and contrast invariance (Rolls, <xref ref-type="bibr" rid="B85">2012</xref>). We know of no other biologically plausible model that performs view invariant as well as other types of transform invariant object identification, and that can do this with multiple different objects in complex natural scenes, as demonstrated here.</p>
<p>We now provide (following a suggestion) an account of how VisNet is able to solve the type of invariant object recognition problem described here when an image is presented to it, with more detailed accounts available elsewhere (Wallis and Rolls, <xref ref-type="bibr" rid="B130">1997</xref>; Rolls, <xref ref-type="bibr" rid="B84">2008</xref>, <xref ref-type="bibr" rid="B85">2012</xref>). VisNet is a 4-layer network with feedforward convergence from stage to stage that enables the small receptive fields present in its V1-like Gabor filter inputs, of approximately 1&#x000B0;, to increase in size so that by the fourth layer a single neuron can potentially receive input from all parts of the input space (Figure <xref ref-type="fig" rid="F1">1</xref>). The feedforward connections between layers are trained by competitive learning, an unsupervised form of learning (Rolls, <xref ref-type="bibr" rid="B84">2008</xref>) that allows neurons to learn to respond to feature combinations. As one proceeds up through the hierarchy, the feature combinations become combinations of feature combinations (see Rolls, <xref ref-type="bibr" rid="B84">2008</xref>, Figure 4.20, and Elliffe et al., <xref ref-type="bibr" rid="B28">2002</xref>). Local lateral inhibition within each layer allows each local area within a layer to respond to and learn whatever is present in that local region, independently of how much information and contrast there may be in other parts of the layer, and this, together with the non-linear activation function of the neurons, enables a sparse distributed representation to be produced. In the sparse distributed representation, a small proportion of neurons is active at a high rate for the input being presented, and most of the neurons are close to their spontaneous rate, and this makes the neurons of VisNet (Rolls, <xref ref-type="bibr" rid="B84">2008</xref>, <xref ref-type="bibr" rid="B85">2012</xref>) very similar to those recorded in the visual system (Rolls, <xref ref-type="bibr" rid="B84">2008</xref>; Rolls and Treves, <xref ref-type="bibr" rid="B103">2011</xref>). A key property of VisNet is the way that it learns, at every stage of the network, whatever can be learned that is invariant as an image transforms in the natural world, using the temporal trace learning rule. This learning rule enables the firing produced by the preceding few inputs to be maintained, and given the temporal statistics of visual inputs, these inputs are likely to be from the same object. (Typically, primates, including humans, look at one object for a short period, during which it may transform by translation, size, isomorphic rotation, and/or view, and all these types of transform can therefore be learned by VisNet.) Effectively, VisNet uses as a teacher the temporal and spatial continuity of objects as they transform in the world to learn invariant representations. (An interesting example is that representations of individual people or objects invariant with respect to pose (e.g., standing, sitting, walking) can be learned by VisNet, or representations of pose invariant with respect to the individual person or object can be learned by VisNet, depending on the order in which the identical images are presented during training; Webb and Rolls, <xref ref-type="bibr" rid="B133">2014</xref>.)
Indeed, we developed these hypotheses (Rolls, <xref ref-type="bibr" rid="B80">1992</xref>, <xref ref-type="bibr" rid="B81">1995</xref>, <xref ref-type="bibr" rid="B85">2012</xref>; Wallis et al., <xref ref-type="bibr" rid="B131">1993</xref>) into a model of the ventral visual system that can account for translation, size, view, lighting, and rotation invariance (Wallis and Rolls, <xref ref-type="bibr" rid="B130">1997</xref>; Rolls and Milward, <xref ref-type="bibr" rid="B95">2000</xref>; Stringer and Rolls, <xref ref-type="bibr" rid="B114">2000</xref>, <xref ref-type="bibr" rid="B115">2002</xref>, <xref ref-type="bibr" rid="B116">2008</xref>; Rolls and Stringer, <xref ref-type="bibr" rid="B96">2001</xref>, <xref ref-type="bibr" rid="B97">2006</xref>, <xref ref-type="bibr" rid="B98">2007</xref>; Elliffe et al., <xref ref-type="bibr" rid="B28">2002</xref>; Perry et al., <xref ref-type="bibr" rid="B72">2006</xref>, <xref ref-type="bibr" rid="B73">2010</xref>; Stringer et al., <xref ref-type="bibr" rid="B113">2006</xref>, <xref ref-type="bibr" rid="B117">2007</xref>; Rolls, <xref ref-type="bibr" rid="B84">2008</xref>, <xref ref-type="bibr" rid="B85">2012</xref>). Consistent with the hypothesis, we have demonstrated these types of invariance (and spatial frequency invariance) in the responses of neurons in the macaque inferior temporal visual cortex (Rolls et al., <xref ref-type="bibr" rid="B92">1985</xref>, <xref ref-type="bibr" rid="B91">1987</xref>, <xref ref-type="bibr" rid="B88">2003</xref>; Rolls and Baylis, <xref ref-type="bibr" rid="B89">1986</xref>; Hasselmo et al., <xref ref-type="bibr" rid="B42">1989</xref>; Tovee et al., <xref ref-type="bibr" rid="B122">1994</xref>; Booth and Rolls, <xref ref-type="bibr" rid="B12">1998</xref>). Moreover, we have tested the hypothesis by placing small 3D objects in the macaque&#x00027;s home environment, and showing that in the absence of any specific rewards being delivered, this type of visual experience in which objects can be seen from different views as they transform continuously in time to reveal different views leads to single neurons in the inferior temporal visual cortex that respond to individual objects from any one of several different views, demonstrating the development of view-invariance learning (Booth and Rolls, <xref ref-type="bibr" rid="B12">1998</xref>). (In control experiments, view invariant representations were not found for objects that had not been viewed in this way.) The learning shown by neurons in the inferior temporal visual cortex can take just a small number of trials (Rolls et al., <xref ref-type="bibr" rid="B90">1989</xref>). The finding that temporal contiguity in the absence of reward is sufficient to lead to view invariant object representations in the inferior temporal visual cortex has been confirmed (Li and DiCarlo, <xref ref-type="bibr" rid="B58">2008</xref>, <xref ref-type="bibr" rid="B59">2010</xref>, <xref ref-type="bibr" rid="B60">2012</xref>). The importance of temporal continuity in learning invariant representations has also been demonstrated in human psychophysics experiments (Perry et al., <xref ref-type="bibr" rid="B72">2006</xref>; Wallis, <xref ref-type="bibr" rid="B129">2013</xref>). 
Some other simulation models also adopt temporal continuity as a guiding principle for developing invariant representations by learning (Wiskott and Sejnowski, <xref ref-type="bibr" rid="B135">2002</xref>; Wiskott, <xref ref-type="bibr" rid="B134">2003</xref>; Wyss et al., <xref ref-type="bibr" rid="B136">2006</xref>; Franzius et al., <xref ref-type="bibr" rid="B33">2007</xref>), and the temporal trace learning principle has also been applied recently (Isik et al., <xref ref-type="bibr" rid="B47">2012</xref>) to HMAX (Riesenhuber and Poggio, <xref ref-type="bibr" rid="B78">2000</xref>; Serre et al., <xref ref-type="bibr" rid="B109">2007c</xref>).</p>
<p>We now compare this VisNet approach to invariant object recognition to some other approaches that seek to be biologically plausible. One such approach is HMAX (Riesenhuber and Poggio, <xref ref-type="bibr" rid="B78">2000</xref>; Serre et al., <xref ref-type="bibr" rid="B107">2007a</xref>,<xref ref-type="bibr" rid="B108">b</xref>,<xref ref-type="bibr" rid="B109">c</xref>; Mutch and Lowe, <xref ref-type="bibr" rid="B66">2008</xref>), which is a hierarchical feedforward network with alternating simple cell-like (S) and complex cell-like (C) layers. The simple cell-like layers respond to a similarity function between the firing rates of the input neurons and the synaptic weights of the receiving neuron (used as an alternative to the more usual dot product), and the complex cells respond to the maximum input that they receive from a particular class of simple cell in the preceding layer. The classes of simple cell are set to respond maximally to a random patch of a training image (by presenting the image, and setting the synaptic weights of the S cells to be the firing rates of the cells from which they receive input), and are propagated laterally, that is, there are exact copies throughout a layer, which is of course a non-local operation and not biologically plausible. The hierarchy receives inputs from Gabor-like filters (as in VisNet). The result of this in HMAX is that in the hierarchy there is no learning of invariant representations of objects; and that the output firing in the final C layer (for example the second C layer in a four-layer S1-C1-S2-C2 hierarchy) is high for almost all neurons to most stimuli, with almost no invariance represented in the output layer of the hierarchy, in that two different views of the same object may be as different as a view of another object, measured using the responses of a single neuron or of all the neurons (Robinson and Rolls, <xref ref-type="bibr" rid="B79">2014</xref>). The neurons in the output C layer are thus quite unlike those in VisNet or in the inferior temporal cortex, where there is a sparse distributed representation, and where single cells convey much information in their firing rates, and populations of single cells convey much information that can be decoded by biologically plausible dot product decoding, such as might be performed by a pattern association network in the areas that receive from the inferior temporal visual cortex, such as the orbitofrontal cortex and amygdala (Rolls, <xref ref-type="bibr" rid="B84">2008</xref>, <xref ref-type="bibr" rid="B85">2012</xref>; Rolls and Treves, <xref ref-type="bibr" rid="B103">2011</xref>). HMAX therefore must resort to a very powerful classification algorithm, in practice typically a Support Vector Machine (SVM), which is not biologically plausible, to learn to classify all the outputs of the final layer that are produced by the different transforms of one object as being of the same object, and different from those of other objects. Thus HMAX does not learn invariant representations by the output layer of its S&#x02013;C hierarchy, but instead uses an SVM to perform the classification that the SVM is taught. This is completely unlike the output of VisNet and of inferior temporal cortex neuron firing, which by responding very similarly in terms of firing rate to the different transforms of an object show that the invariance has been learned in the hierarchy (Rolls, <xref ref-type="bibr" rid="B84">2008</xref>, <xref ref-type="bibr" rid="B85">2012</xref>).
Another way that the output of HMAX may be assessed is by the use of View-Tuned Units (VTUs), each of which is set to respond to one view of a class or object by setting its synaptic weights from each C unit to the value of the firing of the C unit to one view or exemplar of the object or class (Serre et al., <xref ref-type="bibr" rid="B108">2007b</xref>). Because there is little invariance in the C units, many different VTUs are needed, with one for each training view or exemplar. Because the VTUs are different to each other for the different views of the same object or class, a further stage of training is then needed to classify the VTUs into object classes, and the type of learning is least squares error minimization (Serre et al., <xref ref-type="bibr" rid="B108">2007b</xref>), equivalent to a delta-rule one-layer perceptron which again is not biologically plausible for neocortex (Rolls, <xref ref-type="bibr" rid="B84">2008</xref>). Thus HMAX does not generate invariant representations in its S&#x02013;C hierarchy, and in the VTU approach uses two layers of learning after the S&#x02013;C hierarchy, the second involving least squares learning, to produce classification. This is unlike VisNet, which learns invariant representations in its hierarchy, and produces view invariant neurons (similar to those for faces (Hasselmo et al., <xref ref-type="bibr" rid="B42">1989</xref>) and objects (Booth and Rolls, <xref ref-type="bibr" rid="B12">1998</xref>) in the inferior temporal visual cortex) that can be read by a biologically plausible pattern associator (Rolls, <xref ref-type="bibr" rid="B84">2008</xref>, <xref ref-type="bibr" rid="B85">2012</xref>).</p>
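<p>To make the comparison concrete, the sketch below illustrates one S&#x02013;C stage of the scheme just described: S units compute a similarity (here a Gaussian function of Euclidean distance) between a local image patch and a stored prototype patch sampled from a training image, and a C unit takes the maximum S response over positions. This is a simplification for illustration, with parameter values and names of our own, and is not the published HMAX code.</p>
<preformat preformat-type="code">
# Minimal sketch of one S-C stage of an HMAX-style hierarchy: S units respond
# with a Gaussian similarity between a local image patch and a stored prototype
# (a patch sampled from a training image), and a C unit takes the maximum S
# response over positions.  Illustrative simplification; names and parameter
# values are ours, not those of the published HMAX code.
import numpy as np

def s_layer(image, prototype, sigma=1.0):
    """S map: Gaussian similarity of each local patch to the prototype."""
    p = prototype.shape[0]
    rows = image.shape[0] - p + 1
    cols = image.shape[1] - p + 1
    s = np.zeros((rows, cols))
    for i in range(rows):
        for j in range(cols):
            d2 = np.sum((image[i:i + p, j:j + p] - prototype) ** 2)
            s[i, j] = np.exp(-d2 / (2.0 * sigma ** 2))
    return s

def c_layer(s_map):
    """C unit: maximum S response over all positions (position tolerance)."""
    return s_map.max()
</preformat>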
<p>Another difference of HMAX from VisNet is in the way that VisNet is trained, which is a fundamental aspect of the VisNet approach. HMAX has traditionally been tested with benchmarking databases such as the Caltech-101 and Caltech-256 (Griffin et al., <xref ref-type="bibr" rid="B38">2007</xref>), in which sets of images from different categories are to be classified. The Caltech-256 dataset comprises 256 object classes made up of images that have many aspect ratios and sizes and differ quite significantly in quality (having been manually collated from web searches). The objects within the images show significant intra-class variation and have a variety of poses, illuminations, scales, and occlusions, as expected for natural images. A network is supposed to classify these correctly into classes such as hats and bears (Rolls, <xref ref-type="bibr" rid="B85">2012</xref>; Robinson and Rolls, <xref ref-type="bibr" rid="B79">2014</xref>). The problem is that examples of each class of object transforming continuously through different positions on the retina, size, isomorphic rotation, and view are not provided to help the system learn about how a given type of object transforms in the world. The system just has to try to classify based on a set of often quite different exemplars that are not transforms of each other. Thus a system trained in this way is greatly hindered in generating transform invariant representations by the end of the hierarchy, and such a system has to rely on a powerful classifier such as an SVM to perform a classification that is not based on transform invariance learned in the hierarchical network. In contrast, VisNet is provided during training with systematic transforms of objects of the type that would be seen as objects transform in the world, and has a well-posed basis for learning invariant representations. It is important that with VisNet, the early layers may learn what types of transform can be produced in small parts of the visual field by different classes of object, so that when a new class of object is introduced, rapid learning in the last layer and generalization to untrained views can occur without the need for further training of the early layers (Stringer and Rolls, <xref ref-type="bibr" rid="B115">2002</xref>).</p>
<p>Some other approaches to biologically plausible invariant object recognition are being developed with hierarchies that may employ unsupervised learning (Pinto et al., <xref ref-type="bibr" rid="B74">2009</xref>; DiCarlo et al., <xref ref-type="bibr" rid="B25">2012</xref>; Yamins et al., <xref ref-type="bibr" rid="B137">2014</xref>). For example, a hierarchical network has been trained with unsupervised learning, and with many transforms of each object to help the system learn invariant representations in a way analogous to that in which VisNet is trained, but the details of the network architecture are selected by finding parameter values for the specification of the network structure that produce good results on a benchmark classification task (Pinto et al., <xref ref-type="bibr" rid="B74">2009</xref>). However, formally these are convolutional networks, so that the neuronal filters for one local region are replicated over the whole of visual space, which is computationally efficient but biologically implausible. Further, a general linear model is used to decode the firing in the output level of the model to assess performance, so it is not clear whether the firing rate representations of objects in the output layer of the model are similar to those of the inferior temporal visual cortex. In contrast, with VisNet (Rolls and Milward, <xref ref-type="bibr" rid="B95">2000</xref>; Rolls, <xref ref-type="bibr" rid="B85">2012</xref>) the information measurement procedures that we use (Rolls et al., <xref ref-type="bibr" rid="B104">1997a</xref>,<xref ref-type="bibr" rid="B105">b</xref>) are the same as those used to measure the representation that is present in the inferior temporal visual cortex (Tovee et al., <xref ref-type="bibr" rid="B123">1993</xref>; Rolls and Tovee, <xref ref-type="bibr" rid="B100">1995</xref>; Tovee and Rolls, <xref ref-type="bibr" rid="B121">1995</xref>; Abbott et al., <xref ref-type="bibr" rid="B1">1996</xref>; Baddeley et al., <xref ref-type="bibr" rid="B6">1997</xref>; Rolls et al., <xref ref-type="bibr" rid="B104">1997a</xref>,<xref ref-type="bibr" rid="B105">b</xref>, <xref ref-type="bibr" rid="B87">2004</xref>, <xref ref-type="bibr" rid="B94">2006</xref>; Panzeri et al., <xref ref-type="bibr" rid="B70">1999</xref>; Treves et al., <xref ref-type="bibr" rid="B125">1999</xref>; Franco et al., <xref ref-type="bibr" rid="B32">2004</xref>, <xref ref-type="bibr" rid="B31">2007</xref>; Aggelopoulos et al., <xref ref-type="bibr" rid="B3">2005</xref>; Rolls and Treves, <xref ref-type="bibr" rid="B103">2011</xref>).</p>
<p>We turn next to compare the operation of VisNet, as a model of cerebral cortical mechanisms involved in view-invariant object identification, with artificial, computer vision, approaches to object identification. However, we do emphasize that our aim in the present research is to investigate how the cerebral cortex operates in vision, not how computer vision attempts to solve similar problems. Within computer vision, we note that many approaches start by using independent component analysis (ICA) (Kanan, <xref ref-type="bibr" rid="B50">2013</xref>), sparse coding (Kanan and Cottrell, <xref ref-type="bibr" rid="B51">2010</xref>), and other mathematical approaches (Larochelle and Hinton, <xref ref-type="bibr" rid="B55">2010</xref>) to derive what may be suitable &#x0201C;feature analyzers,&#x0201D; which are frequently compared to the responses of V1 neurons. Computer vision approaches to object identification then may take combinations of these feature analyzers, and perform statistical analyses using computer-based algorithms that are not biologically plausible, such as Restricted Boltzmann Machines (RBMs), on these primitives to statistically discriminate different objects (Larochelle and Hinton, <xref ref-type="bibr" rid="B55">2010</xref>). Such a system does not learn view-invariant object recognition, for the different views of an object may have completely different statistics of the visual primitives, yet they are views of the same object. (Examples might include frontal and profile views of faces, which are well tolerated for individual recognition by some inferior temporal cortex neurons (Hasselmo et al., <xref ref-type="bibr" rid="B42">1989</xref>); very different views of a 3D object that are identified correctly as the same object by IT neurons after visual experience with the objects to allow view-invariant learning (Booth and Rolls, <xref ref-type="bibr" rid="B12">1998</xref>); and many man-made tools and objects that may appear quite different in 2D image properties from different views.) Part of the difficulty of computer vision has lain in attempts to parse a whole scene at one time (Marr, <xref ref-type="bibr" rid="B62">1982</xref>). However, the biological approach is to place the fovea on one part of a scene, perform image analysis/object identification there, and then move the eyes to fixate a different location in the scene (Trappenberg et al., <xref ref-type="bibr" rid="B124">2002</xref>; Rolls et al., <xref ref-type="bibr" rid="B88">2003</xref>). This is a divide-and-conquer strategy used by the real visual system, to break the computational problem into smaller parts performed successively, to simplify the representation of multiple objects in a scene, and to facilitate passing the coordinates of a target object for action by using the coordinates of the object being fixated (Ballard, <xref ref-type="bibr" rid="B7">1990</xref>; Rolls and Deco, <xref ref-type="bibr" rid="B93">2002</xref>; Rolls et al., <xref ref-type="bibr" rid="B88">2003</xref>; Aggelopoulos and Rolls, <xref ref-type="bibr" rid="B4">2005</xref>; Rolls, <xref ref-type="bibr" rid="B84">2008</xref>, <xref ref-type="bibr" rid="B85">2012</xref>). This approach has now been adopted by some computer vision approaches (Denil et al., <xref ref-type="bibr" rid="B20">2012</xref>).</p>
<p>Important issues are raised for future research.</p>
<p>First, how well does this approach scale up? At present there are 128 &#x000D7; 128 neurons in each of the 4 layers of VisNet, that is, 65,536 neurons in total. This is small compared to the number of neurons in the ventral visual stream, which number in the tens of millions (Rolls, <xref ref-type="bibr" rid="B84">2008</xref>). If this is indeed a good model of the processing in the ventral visual system, as we hypothesize and as is the basis of the VisNet design (Rolls, <xref ref-type="bibr" rid="B85">2012</xref>), then the system should scale up appropriately, that is, probably linearly. There are a number of different aspects that need to scale up. One is the number of objects that can be trained. A second is the number of views that can be trained. A third is the number of locations in which the system is trained, both because saliency mechanisms are not as accurate as the range of 32 pixels from the fovea over which we trained here (Figure <xref ref-type="fig" rid="F4">4</xref>), and because it may be advantageous to train at intermediate locations (Figure <xref ref-type="fig" rid="F5">5</xref>). We propose to scale up VisNet by 16 times, from 128 &#x000D7; 128 neurons per layer to 512 &#x000D7; 512 neurons per layer, and to address all these issues simultaneously.</p>
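<p>The arithmetic behind this proposal can be made explicit with a simple back-of-the-envelope calculation; the fan-in per neuron used below is an assumed constant for illustration only, not a parameter of the simulations reported here.</p>
<preformat>
# Back-of-envelope scaling of VisNet layer sizes (illustrative only).
layers = 4
fan_in = 100   # assumed number of feedforward synapses per neuron

for side in (128, 512):
    neurons_per_layer = side * side
    total_neurons = layers * neurons_per_layer
    total_synapses = total_neurons * fan_in
    print(side, neurons_per_layer, total_neurons, total_synapses)

# 128: 16,384 neurons per layer, 65,536 in total; 512: 262,144 per layer,
# 1,048,576 in total, i.e., a 16-fold increase in neurons (and, with a fixed
# fan-in, a 16-fold increase in synapses).
</preformat>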
<p>Second, we have used a generically sound and well-known approach to bottom-up saliency, an approach developed by Koch, Itti, Harel and colleagues (Itti and Koch, <xref ref-type="bibr" rid="B48">2000</xref>; Harel et al., <xref ref-type="bibr" rid="B40">2006a</xref>,<xref ref-type="bibr" rid="B41">b</xref>). However, it is possible to tune saliency algorithms so that they are more likely to detect objects of certain classes, such as faces or cars. This may greatly increase the capability of the approach described here, and we plan to test how much improvement in performance for the detection and then identification of certain classes of objects can be obtained by incorporating more specialized saliency algorithms. Many saliency approaches and algorithms that are of interest for future research are available (Bruce and Tsotsos, <xref ref-type="bibr" rid="B13">2006</xref>; Achanta et al., <xref ref-type="bibr" rid="B2">2008</xref>; Zhang et al., <xref ref-type="bibr" rid="B138">2008</xref>; Kootstra et al., <xref ref-type="bibr" rid="B54">2010</xref>; Goferman et al., <xref ref-type="bibr" rid="B36">2012</xref>; Riche et al., <xref ref-type="bibr" rid="B77">2012</xref>; Jia et al., <xref ref-type="bibr" rid="B49">2013</xref>; Li et al., <xref ref-type="bibr" rid="B57">2013</xref>). For example, contextual information may be useful, such as the fact that sofas are not usually found in the sky, and that people are usually tall, skinny objects on the ground (though see Webb and Rolls, <xref ref-type="bibr" rid="B133">2014</xref>), and contextual guidance models have been combined with bottom-up saliency models (Oliva and Torralba, <xref ref-type="bibr" rid="B68">2006</xref>; Torralba et al., <xref ref-type="bibr" rid="B120">2006</xref>; Ehinger et al., <xref ref-type="bibr" rid="B26">2009</xref>; Kanan et al., <xref ref-type="bibr" rid="B52">2009</xref>). We emphasize that in the system described here, only one fixation is assumed for each object in a scene, consistent with the fact that single neurons in the inferior temporal visual cortex provide sufficient information for object and face identification during a single fixation and in only 20&#x02013;50 ms of neuronal firing, as shown by information theoretic analyses of neuronal activity and by backward masking (Rolls et al., <xref ref-type="bibr" rid="B101">1994</xref>; Rolls and Tovee, <xref ref-type="bibr" rid="B99">1994</xref>; Tovee and Rolls, <xref ref-type="bibr" rid="B121">1995</xref>). [More detailed information may become available with repeated fixations on different parts of an object, and this has been investigated in computer vision (Barrington et al., <xref ref-type="bibr" rid="B8">2008</xref>; Kanan and Cottrell, <xref ref-type="bibr" rid="B51">2010</xref>; Larochelle and Hinton, <xref ref-type="bibr" rid="B55">2010</xref>).]</p>
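<p>To illustrate the role played by bottom-up saliency in the present approach, a much-simplified difference-of-Gaussians conspicuity map can be used to select a candidate fixation point. The sketch below is not the GBVS or Itti-Koch implementation used in this work, and the filter widths and toy scene are arbitrary assumptions for exposition.</p>
<preformat>
# Minimal center-surround (difference-of-Gaussians) saliency sketch, used only
# to pick one candidate fixation point; a simplification, not the GBVS algorithm.
import numpy as np
from scipy.ndimage import gaussian_filter

def fixation_from_saliency(image_gray, centre_sigma=2.0, surround_sigma=16.0):
    """Return (row, col) of the saliency peak of a 2D grayscale image."""
    centre = gaussian_filter(image_gray, centre_sigma)
    surround = gaussian_filter(image_gray, surround_sigma)
    saliency = np.abs(centre - surround)      # simple contrast-based conspicuity
    return np.unravel_index(np.argmax(saliency), saliency.shape)

# Example: a small bright patch in an otherwise empty scene attracts the fixation
scene = np.zeros((256, 256))
scene[40:60, 180:200] = 1.0                   # the bright object
print(fixation_from_saliency(scene))          # the peak lies inside the bright patch
</preformat>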
<p>Third, we have not utilized top-down attention in the developments described here. Top-down attention, whereby an object or set of objects held active in a short-term memory biases the competitive networks in VisNet, can in principle improve performance considerably (Rolls and Deco, <xref ref-type="bibr" rid="B93">2002</xref>; Deco and Rolls, <xref ref-type="bibr" rid="B17d">2005b</xref>; Rolls, <xref ref-type="bibr" rid="B84">2008</xref>). Indeed, we have developed and successfully tested a reduced version of VisNet in which top-down attention does facilitate processing (Deco and Rolls, <xref ref-type="bibr" rid="B17">2004</xref>), and this approach has also been used in computer vision (Walther et al., <xref ref-type="bibr" rid="B132">2002</xref>). Another type of top-down effect is that task requirements can influence fixations in a scene (Hayhoe and Ballard, <xref ref-type="bibr" rid="B44">2005</xref>). We plan in future work to incorporate top-down attention into the full, current version of VisNet, to investigate how much this improves performance, especially for certain selected classes of object.</p>
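<p>The principle of top-down biased competition can be conveyed with a toy example in which a short-term-memory bias for the attended object is added to the bottom-up input before a soft competition. This sketch is illustrative only and does not reproduce the dynamics of the neurodynamical models cited above; all values are assumptions for exposition.</p>
<preformat>
# Toy sketch of top-down biased competition: a bias for the attended object is
# added to the bottom-up input before a soft winner-take-all competition.
import numpy as np

def compete(bottom_up, top_down_bias, gain=5.0):
    """Soft competition: normalized exponential of the biased inputs."""
    x = gain * (bottom_up + top_down_bias)
    e = np.exp(x - x.max())                  # subtract the max for numerical stability
    return e / e.sum()

bottom_up = np.array([0.50, 0.55, 0.48])     # similar bottom-up evidence for 3 objects
no_bias = np.zeros(3)
attend_0 = np.array([0.20, 0.0, 0.0])        # top-down bias favouring object 0

print(compete(bottom_up, no_bias))           # object 1 wins only narrowly
print(compete(bottom_up, attend_0))          # the attended object 0 now wins clearly
</preformat>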
<p>Fourth, it will be useful in future to investigate the incorporation of more powerful synaptic learning rules when training with the large number of transforms needed to learn invariance over both view and translation. With VisNet, we have so far used an associative (Hebbian) synaptic modification rule (with a trace of previous firing in the postsynaptic term), for biological plausibility (Rolls, <xref ref-type="bibr" rid="B85">2012</xref>). However, to explore further the potential of the overall architecture of VisNet, it will be of interest to investigate how much performance improves when error correction of the postsynaptic firing with respect to the trace of previous neuronal activity is incorporated to implement gradient descent. Gradient descent (Einhauser et al., <xref ref-type="bibr" rid="B27">2005</xref>; Wyss et al., <xref ref-type="bibr" rid="B136">2006</xref>) and optimized slow learning (Wiskott and Sejnowski, <xref ref-type="bibr" rid="B135">2002</xref>; Wiskott, <xref ref-type="bibr" rid="B134">2003</xref>) have been found useful with different architectures.</p>
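<p>For concreteness, the Hebbian trace rule referred to above, together with one possible error-correction variant of the kind discussed, can be sketched as follows. The particular form of the error-correction rule shown, and the parameter values, are illustrative assumptions rather than the published rules.</p>
<preformat>
# Sketch of a Hebbian trace rule of the kind used in VisNet, and one possible
# error-correction variant (the exact error-correction form here is an
# illustrative assumption, not the published rule).
import numpy as np

alpha = 0.01   # learning rate (assumed)
eta = 0.8      # trace parameter: weighting of the previous trace (assumed)

def update_trace(y_now, trace_prev):
    """Exponential trace of postsynaptic firing over successive transforms."""
    return (1.0 - eta) * y_now + eta * trace_prev

def hebbian_trace_update(w, x, y_now, trace_prev):
    """Associative rule: presynaptic firing x paired with the postsynaptic trace."""
    trace = update_trace(y_now, trace_prev)
    return w + alpha * trace * x, trace

def error_correction_update(w, x, y_now, trace_prev):
    """Illustrative variant: move the response toward the trace of earlier transforms."""
    w_new = w + alpha * (trace_prev - y_now) * x
    return w_new, update_trace(y_now, trace_prev)

# One neuron seeing two successive transforms of the same object (assumed inputs)
rng = np.random.default_rng(1)
w = rng.random(50) * 0.1
trace = 0.0
for x in rng.random((2, 50)):
    y = float(w @ x)                   # linear activation, for simplicity
    w, trace = hebbian_trace_update(w, x, y, trace)
</preformat>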
<p>Fifth, if a strong saliency peak occurs due to something in the background scene that is close to an object, or due to another trained object, how will the system respond? We suggest that the general answer lies in the asymmetry that is present in the receptive fields of inferior temporal cortex neurons in cluttered scenes (Aggelopoulos and Rolls, <xref ref-type="bibr" rid="B4">2005</xref>). That asymmetry is related to the asymmetries caused by the sparse probabilistic forward connections of each neuron (Rolls et al., <xref ref-type="bibr" rid="B106">2008</xref>), and it enables two instances of the same object close together to be correctly identified in terms of both object and position (Rolls et al., <xref ref-type="bibr" rid="B106">2008</xref>); it will nevertheless be of interest to investigate this issue in detail.</p>
<p>Part of the value of the research described here is that it tests, and investigates the operation of, a theory of how view-invariant object identification could be implemented by the cerebral cortex. Some predictions of the simulations are (1) that learning will need to be part of the process involved in view-invariant object identification, as the views of an object can be very different; (2) that, at least for views of people, a few well-spaced views (we used 45&#x000B0;) should suffice; (3) that translation invariance in complex, unstructured, crowded scenes may need to extend over only a few degrees, because fixation guided by bottom-up saliency has precision of that order, at least for the types of object considered here, and repeated saccades are necessary to bring fixation sufficiently close to an object in a large scene for the available invariance to operate in object identification (Rolls et al., <xref ref-type="bibr" rid="B88">2003</xref>; Aggelopoulos and Rolls, <xref ref-type="bibr" rid="B4">2005</xref>); and (4) that just a single fixation of each object will in general suffice for object/person identification, because of the speed of cortical processing (Rolls and Treves, <xref ref-type="bibr" rid="B103">2011</xref>; Rolls, <xref ref-type="bibr" rid="B85">2012</xref>).</p>
<sec>
<title>Conflict of interest statement</title>
<p>The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.</p></sec>
</sec>
</body>
<back>
<ack>
<p>The authors acknowledge with thanks the use of the GBVS software (Harel et al., <xref ref-type="bibr" rid="B40">2006a</xref>,<xref ref-type="bibr" rid="B41">b</xref>) (<ext-link ext-link-type="uri" xlink:href="http://www.vision.caltech.edu/~harel/share/gbvs.php">http://www.vision.caltech.edu/~harel/share/gbvs.php</ext-link>). The images shown in Figure <xref ref-type="fig" rid="F2">2</xref> were created with Blender from models available at <ext-link ext-link-type="uri" xlink:href="http://www.blendswap.com">www.blendswap.com</ext-link>, and acknowledged as follows: truck&#x02014;Opel Blitz by orokrhus; jeep by Jay-Artist; woman by Gerardus. The man was generated using MakeHuman available at <ext-link ext-link-type="uri" xlink:href="http://www.makehuman.org">www.makehuman.org</ext-link>.</p>
</ack>
<ref-list>
<title>References</title>
<ref id="B1">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Abbott</surname> <given-names>L. F.</given-names></name> <name><surname>Rolls</surname> <given-names>E. T.</given-names></name> <name><surname>Tovee</surname> <given-names>M. J.</given-names></name></person-group> (<year>1996</year>). <article-title>Representational capacity of face coding in monkeys</article-title>. <source>Cereb. Cortex</source> <volume>6</volume>, <fpage>498</fpage>&#x02013;<lpage>505</lpage>. <pub-id pub-id-type="doi">10.1093/cercor/6.3.498</pub-id><pub-id pub-id-type="pmid">8670675</pub-id></citation>
</ref>
<ref id="B2">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Achanta</surname> <given-names>R.</given-names></name> <name><surname>Estrada</surname> <given-names>F.</given-names></name> <name><surname>Wils</surname> <given-names>P.</given-names></name> <name><surname>S&#x000FC;sstrunk</surname> <given-names>S.</given-names></name></person-group> (<year>2008</year>). <article-title>Salient region detection and segmentation</article-title>. <source>Comput. Vis. Syst</source>. <volume>5008</volume>, <fpage>66</fpage>&#x02013;<lpage>75</lpage>. <pub-id pub-id-type="doi">10.1007/978-3-540-79547-6_7</pub-id></citation>
</ref>
<ref id="B3">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Aggelopoulos</surname> <given-names>N. C.</given-names></name> <name><surname>Franco</surname> <given-names>L.</given-names></name> <name><surname>Rolls</surname> <given-names>E. T.</given-names></name></person-group> (<year>2005</year>). <article-title>Object perception in natural scenes: encoding by inferior temporal cortex simultaneously recorded neurons</article-title>. <source>J. Neurophysiol</source>. <volume>93</volume>, <fpage>1342</fpage>&#x02013;<lpage>1357</lpage>. <pub-id pub-id-type="doi">10.1152/jn.00553.2004</pub-id><pub-id pub-id-type="pmid">15496489</pub-id></citation>
</ref>
<ref id="B4">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Aggelopoulos</surname> <given-names>N. C.</given-names></name> <name><surname>Rolls</surname> <given-names>E. T.</given-names></name></person-group> (<year>2005</year>). <article-title>Natural scene perception: inferior temporal cortex neurons encode the positions of different objects in the scene</article-title>. <source>Eur. J. Neurosci</source>. <volume>22</volume>, <fpage>2903</fpage>&#x02013;<lpage>2916</lpage>. <pub-id pub-id-type="doi">10.1111/j.1460-9568.2005.04487.x</pub-id><pub-id pub-id-type="pmid">16324125</pub-id></citation>
</ref>
<ref id="B5">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Arcizet</surname> <given-names>F.</given-names></name> <name><surname>Mirpour</surname> <given-names>K.</given-names></name> <name><surname>Bisley</surname> <given-names>J. W.</given-names></name></person-group> (<year>2011</year>). <article-title>A pure salience response in posterior parietal cortex</article-title>. <source>Cereb. Cortex</source> <volume>21</volume>, <fpage>2498</fpage>&#x02013;<lpage>2506</lpage>. <pub-id pub-id-type="doi">10.1093/cercor/bhr035</pub-id><pub-id pub-id-type="pmid">21422270</pub-id></citation>
</ref>
<ref id="B6">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Baddeley</surname> <given-names>R. J.</given-names></name> <name><surname>Abbott</surname> <given-names>L. F.</given-names></name> <name><surname>Booth</surname> <given-names>M. J. A.</given-names></name> <name><surname>Sengpiel</surname> <given-names>F.</given-names></name> <name><surname>Freeman</surname> <given-names>T.</given-names></name> <name><surname>Wakeman</surname> <given-names>E. A.</given-names></name> <etal/></person-group>. (<year>1997</year>). <article-title>Responses of neurons in primary and inferior temporal visual cortices to natural scenes</article-title>. <source>Proc. R. Soc. B</source> <volume>264</volume>, <fpage>1775</fpage>&#x02013;<lpage>1783</lpage>. <pub-id pub-id-type="doi">10.1098/rspb.1997.0246</pub-id><pub-id pub-id-type="pmid">9447735</pub-id></citation>
</ref>
<ref id="B7">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Ballard</surname> <given-names>D. H.</given-names></name></person-group> (<year>1990</year>). <article-title>Animate vision uses object-centred reference frames</article-title>, in <source>Advanced Neural Computers</source>, ed <person-group person-group-type="editor"><name><surname>Eckmiller</surname> <given-names>R.</given-names></name></person-group> (<publisher-loc>North-Holland, Amsterdam</publisher-loc>: <publisher-name>Elsevier</publisher-name>), <fpage>229</fpage>&#x02013;<lpage>236</lpage>.</citation>
</ref>
<ref id="B8">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Barrington</surname> <given-names>L.</given-names></name> <name><surname>Marks</surname> <given-names>T. K.</given-names></name> <name><surname>Hsiao</surname> <given-names>J. H.</given-names></name> <name><surname>Cottrell</surname> <given-names>G. W.</given-names></name></person-group> (<year>2008</year>). <article-title>NIMBLE: a kernel density model of saccade-based visual memory</article-title>. <source>J. Vis</source>. <volume>8</volume>:<fpage>17</fpage>. <pub-id pub-id-type="doi">10.1167/8.14.17</pub-id><pub-id pub-id-type="pmid">19146318</pub-id></citation>
</ref>
<ref id="B9">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Baylis</surname> <given-names>G. C.</given-names></name> <name><surname>Rolls</surname> <given-names>E. T.</given-names></name> <name><surname>Leonard</surname> <given-names>C. M.</given-names></name></person-group> (<year>1985</year>). <article-title>Selectivity between faces in the responses of a population of neurons in the cortex in the superior temporal sulcus of the monkey</article-title>. <source>Brain Res</source>. <volume>342</volume>, <fpage>91</fpage>&#x02013;<lpage>102</lpage>. <pub-id pub-id-type="doi">10.1016/0006-8993(85)91356-3</pub-id><pub-id pub-id-type="pmid">4041820</pub-id></citation>
</ref>
<ref id="B10">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Bisley</surname> <given-names>J. W.</given-names></name> <name><surname>Goldberg</surname> <given-names>M. E.</given-names></name></person-group> (<year>2003</year>). <article-title>Neuronal activity in the lateral intraparietal area and spatial attention</article-title>. <source>Science</source> <volume>299</volume>, <fpage>81</fpage>&#x02013;<lpage>86</lpage>. <pub-id pub-id-type="doi">10.1126/science.1077395</pub-id><pub-id pub-id-type="pmid">12511644</pub-id></citation>
</ref>
<ref id="B11">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Bisley</surname> <given-names>J. W.</given-names></name> <name><surname>Goldberg</surname> <given-names>M. E.</given-names></name></person-group> (<year>2006</year>). <article-title>Neural correlates of attention and distractibility in the lateral intraparietal area</article-title>. <source>J. Neurophysiol</source>. <volume>95</volume>, <fpage>1696</fpage>&#x02013;<lpage>1717</lpage>. <pub-id pub-id-type="doi">10.1152/jn.00848.2005</pub-id><pub-id pub-id-type="pmid">16339000</pub-id></citation>
</ref>
<ref id="B12">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Booth</surname> <given-names>M. C. A.</given-names></name> <name><surname>Rolls</surname> <given-names>E. T.</given-names></name></person-group> (<year>1998</year>). <article-title>View-invariant representations of familiar objects by neurons in the inferior temporal visual cortex</article-title>. <source>Cereb. Cortex</source> <volume>8</volume>, <fpage>510</fpage>&#x02013;<lpage>523</lpage>. <pub-id pub-id-type="doi">10.1093/cercor/8.6.510</pub-id><pub-id pub-id-type="pmid">9758214</pub-id></citation>
</ref>
<ref id="B13">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Bruce</surname> <given-names>N. D. B.</given-names></name> <name><surname>Tsotsos</surname> <given-names>J. K.</given-names></name></person-group> (<year>2006</year>). <article-title>Saliency based on information maximization</article-title>, in <source>Advances in Neural Information Processing Systems 18: Proceedings of the 2005 Conference</source>, <volume>Vol. 18</volume> (<publisher-loc>Cambridge, MA</publisher-loc>: <publisher-name>MIT Press</publisher-name>), 155.</citation>
</ref>
<ref id="B14">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Buhmann</surname> <given-names>J.</given-names></name> <name><surname>Lange</surname> <given-names>J.</given-names></name> <name><surname>von der Malsburg</surname> <given-names>C.</given-names></name> <name><surname>Vorbr&#x000FC;ggen</surname> <given-names>J. C.</given-names></name> <name><surname>W&#x000FC;rtz</surname> <given-names>R. P.</given-names></name></person-group> (<year>1991</year>). <article-title>Object recognition in the dynamic link architecture: parallel implementation of a transputer network</article-title>, in <source>Neural Networks for Signal Processing</source>, ed <person-group person-group-type="editor"><name><surname>Kosko</surname> <given-names>B.</given-names></name></person-group> (<publisher-loc>Englewood Cliffs, NJ</publisher-loc>: <publisher-name>Prentice Hall</publisher-name>), <fpage>121</fpage>&#x02013;<lpage>159</lpage>.</citation>
</ref>
<ref id="B15">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Crouzet</surname> <given-names>S. M.</given-names></name> <name><surname>Thorpe</surname> <given-names>S. J.</given-names></name></person-group> (<year>2011</year>). <article-title>Low-level cues and ultra-fast face detection</article-title>. <source>Front. Psychol</source>. <volume>2</volume>:<issue>342</issue>. <pub-id pub-id-type="doi">10.3389/fpsyg.2011.00342</pub-id><pub-id pub-id-type="pmid">22125544</pub-id></citation>
</ref>
<ref id="B16">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Daugman</surname> <given-names>J.</given-names></name></person-group> (<year>1988</year>). <article-title>Complete discrete 2D-Gabor transforms by neural networks for image analysis and compression</article-title>. <source>IEEE Trans. Acoust. Speech Signal Process</source>. <volume>36</volume>, <fpage>1169</fpage>&#x02013;<lpage>1179</lpage>. <pub-id pub-id-type="doi">10.1109/29.1644</pub-id></citation>
</ref>
<ref id="B17">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Deco</surname> <given-names>G.</given-names></name> <name><surname>Rolls</surname> <given-names>E. T.</given-names></name></person-group> (<year>2004</year>). <article-title>A neurodynamical cortical model of visual attention and invariant object recognition</article-title>. <source>Vision Res</source>. <volume>44</volume>, <fpage>621</fpage>&#x02013;<lpage>644</lpage>. <pub-id pub-id-type="doi">10.1016/j.visres.2003.09.037</pub-id><pub-id pub-id-type="pmid">14693189</pub-id></citation>
</ref>
<ref id="B17c">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Deco</surname> <given-names>G.</given-names></name> <name><surname>Rolls</surname> <given-names>E. T.</given-names></name></person-group> (<year>2005a</year>). <article-title>Attention, short term memory, and action selection: a unifying theory</article-title>. <source>Prog. Neurobiol</source>. <volume>76</volume>, <fpage>236</fpage>&#x02013;<lpage>256</lpage>. <pub-id pub-id-type="doi">10.1016/j.pneurobio.2005.08.00</pub-id><pub-id pub-id-type="pmid">16257103</pub-id></citation>
</ref>
<ref id="B17d">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Deco</surname> <given-names>G.</given-names></name> <name><surname>Rolls</surname> <given-names>E. T.</given-names></name></person-group> (<year>2005b</year>). <article-title>Neurodynamics of biased competition and cooperation for attention: a model with spiking neurons</article-title>. <source>J. Neurophysiol</source>. <volume>94</volume>, <fpage>295</fpage>&#x02013;<lpage>313</lpage>. <pub-id pub-id-type="doi">10.1152/jn.01095.2004</pub-id><pub-id pub-id-type="pmid">15703227</pub-id></citation>
</ref>
<ref id="B20">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Denil</surname> <given-names>M.</given-names></name> <name><surname>Bazzani</surname> <given-names>L.</given-names></name> <name><surname>Larochelle</surname> <given-names>H.</given-names></name> <name><surname>de Freitas</surname> <given-names>N.</given-names></name></person-group> (<year>2012</year>). <article-title>Learning where to attend with deep architectures for image tracking</article-title>. <source>Neural Comput</source>. <volume>24</volume>, <fpage>2151</fpage>&#x02013;<lpage>2184</lpage>. <pub-id pub-id-type="pmid">22509964</pub-id></citation>
</ref>
<ref id="B21">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Desimone</surname> <given-names>R.</given-names></name> <name><surname>Duncan</surname> <given-names>J.</given-names></name></person-group> (<year>1995</year>). <article-title>Neural mechanisms of selective visual attention</article-title>. <source>Annu. Rev. Neurosci</source>. <volume>18</volume>, <fpage>193</fpage>&#x02013;<lpage>222</lpage>. <pub-id pub-id-type="doi">10.1146/annurev.ne.18.030195.001205</pub-id><pub-id pub-id-type="pmid">7605061</pub-id></citation>
</ref>
<ref id="B22">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>De Valois</surname> <given-names>R. L.</given-names></name> <name><surname>De Valois</surname> <given-names>K. K.</given-names></name></person-group> (<year>1988</year>). <source>Spatial Vision</source>. <publisher-loc>New York, NY</publisher-loc>: <publisher-name>Oxford University Press</publisher-name>.</citation>
</ref>
<ref id="B23">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>DeWeese</surname> <given-names>M. R.</given-names></name> <name><surname>Meister</surname> <given-names>M.</given-names></name></person-group> (<year>1999</year>). <article-title>How to measure the information gained from one symbol</article-title>. <source>Network</source> <volume>10</volume>, <fpage>325</fpage>&#x02013;<lpage>340</lpage>. <pub-id pub-id-type="doi">10.1088/0954-898X/10/4/303</pub-id><pub-id pub-id-type="pmid">10695762</pub-id></citation>
</ref>
<ref id="B24">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>DiCarlo</surname> <given-names>J. J.</given-names></name> <name><surname>Maunsell</surname> <given-names>J. H. R.</given-names></name></person-group> (<year>2003</year>). <article-title>Anterior inferotemporal neurons of monkeys engaged in object recognition can be highly sensitive to object retinal position</article-title>. <source>J. Neurophysiol</source>. <volume>89</volume>, <fpage>3264</fpage>&#x02013;<lpage>3278</lpage>. <pub-id pub-id-type="doi">10.1152/jn.00358.2002</pub-id><pub-id pub-id-type="pmid">12783959</pub-id></citation>
</ref>
<ref id="B25">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>DiCarlo</surname> <given-names>J. J.</given-names></name> <name><surname>Zoccolan</surname> <given-names>D.</given-names></name> <name><surname>Rust</surname> <given-names>N. C.</given-names></name></person-group> (<year>2012</year>). <article-title>How does the brain solve visual object recognition?</article-title> <source>Neuron</source> <volume>73</volume>, <fpage>415</fpage>&#x02013;<lpage>434</lpage>. <pub-id pub-id-type="doi">10.1016/j.neuron.2012.01.010</pub-id><pub-id pub-id-type="pmid">22325196</pub-id></citation>
</ref>
<ref id="B26">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Ehinger</surname> <given-names>K. A.</given-names></name> <name><surname>Hidalgo-Sotelo</surname> <given-names>B.</given-names></name> <name><surname>Torralba</surname> <given-names>A.</given-names></name> <name><surname>Oliva</surname> <given-names>A.</given-names></name></person-group> (<year>2009</year>). <article-title>Modeling search for people in 900 scenes: a combined source model of eye guidance</article-title>. <source>Vis. Cogn</source>. <volume>17</volume>, <fpage>945</fpage>&#x02013;<lpage>978</lpage>. <pub-id pub-id-type="doi">10.1080/13506280902834720</pub-id><pub-id pub-id-type="pmid">20011676</pub-id></citation>
</ref>
<ref id="B27">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Einhauser</surname> <given-names>W.</given-names></name> <name><surname>Eggert</surname> <given-names>J.</given-names></name> <name><surname>Korner</surname> <given-names>E.</given-names></name> <name><surname>Konig</surname> <given-names>P.</given-names></name></person-group> (<year>2005</year>). <article-title>Learning viewpoint invariant object representations using a temporal coherence principle</article-title>. <source>Biol. Cybern</source>. <volume>93</volume>, <fpage>79</fpage>&#x02013;<lpage>90</lpage>. <pub-id pub-id-type="doi">10.1007/s00422-005-0585-8</pub-id><pub-id pub-id-type="pmid">16021516</pub-id></citation>
</ref>
<ref id="B28">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Elliffe</surname> <given-names>M. C. M.</given-names></name> <name><surname>Rolls</surname> <given-names>E. T.</given-names></name> <name><surname>Stringer</surname> <given-names>S. M.</given-names></name></person-group> (<year>2002</year>). <article-title>Invariant recognition of feature combinations in the visual system, <italic>Biol</italic></article-title>. <source>Cybern</source>. <volume>86</volume>, <fpage>59</fpage>&#x02013;<lpage>71</lpage>. <pub-id pub-id-type="doi">10.1007/s004220100284</pub-id><pub-id pub-id-type="pmid">11924570</pub-id></citation>
</ref>
<ref id="B29">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>F&#x000F6;ldi&#x000E1;k</surname> <given-names>P.</given-names></name></person-group> (<year>1991</year>). <article-title>Learning invariance from transformation sequences</article-title>. <source>Neural Comput</source>. <volume>3</volume>, <fpage>193</fpage>&#x02013;<lpage>199</lpage>. <pub-id pub-id-type="doi">10.1162/neco.1991.3.2.194</pub-id><pub-id pub-id-type="pmid">17716007</pub-id></citation>
</ref>
<ref id="B30">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>F&#x000F6;ldi&#x000E1;k</surname> <given-names>P.</given-names></name></person-group> (<year>1992</year>). <source>Models of Sensory Coding</source>. Technical Report CUED/F&#x02013;INFENG/TR 91, <publisher-loc>Cambridge</publisher-loc>: <publisher-name>University of Cambridge</publisher-name>.</citation>
</ref>
<ref id="B31">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Franco</surname> <given-names>L.</given-names></name> <name><surname>Rolls</surname> <given-names>E. T.</given-names></name> <name><surname>Aggelopoulos</surname> <given-names>N. C.</given-names></name> <name><surname>Jerez</surname> <given-names>J. M.</given-names></name></person-group> (<year>2007</year>). <article-title>Neuronal selectivity, population sparseness, and ergodicity in the inferior temporal visual cortex</article-title>. <source>Biol. Cybernet</source>. <volume>96</volume>, <fpage>547</fpage>&#x02013;<lpage>560</lpage>. <pub-id pub-id-type="doi">10.1007/s00422-007-0149-1</pub-id><pub-id pub-id-type="pmid">17410377</pub-id></citation>
</ref>
<ref id="B32">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Franco</surname> <given-names>L.</given-names></name> <name><surname>Rolls</surname> <given-names>E. T.</given-names></name> <name><surname>Aggelopoulos</surname> <given-names>N. C.</given-names></name> <name><surname>Treves</surname> <given-names>A.</given-names></name></person-group> (<year>2004</year>). <article-title>The use of decoding to analyze the contribution to the information of the correlations between the firing of simultaneously recorded neurons</article-title>. <source>Exp. Brain Res</source>. <volume>155</volume>, <fpage>370</fpage>&#x02013;<lpage>384</lpage>. <pub-id pub-id-type="doi">10.1007/s00221-003-1737-5</pub-id><pub-id pub-id-type="pmid">14722699</pub-id></citation>
</ref>
<ref id="B33">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Franzius</surname> <given-names>M.</given-names></name> <name><surname>Sprekeler</surname> <given-names>H.</given-names></name> <name><surname>Wiskott</surname> <given-names>L.</given-names></name></person-group> (<year>2007</year>). <article-title>Slowness and sparseness lead to place, head-direction, and spatial-view cells</article-title>. <source>PLoS Comput. Biol</source>. <volume>3</volume>:<fpage>e166</fpage>. <pub-id pub-id-type="doi">10.1371/journal.pcbi.0030166</pub-id><pub-id pub-id-type="pmid">17784780</pub-id></citation>
</ref>
<ref id="B34">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Fukushima</surname> <given-names>K.</given-names></name></person-group> (<year>1980</year>). <article-title>Neocognitron: a self-organizing neural network model for a mechanism of pattern recognition unaffected by shift in position</article-title>. <source>Biol. Cybernet</source>. <volume>36</volume>, <fpage>193</fpage>&#x02013;<lpage>202</lpage>. <pub-id pub-id-type="doi">10.1007/BF00344251</pub-id><pub-id pub-id-type="pmid">7370364</pub-id></citation>
</ref>
<ref id="B35">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Garthwaite</surname> <given-names>J.</given-names></name></person-group> (<year>2008</year>). <article-title>Concepts of neural nitric oxide-mediated transmission</article-title>. <source>Eur. J. Neurosci</source>. <volume>27</volume>, <fpage>2783</fpage>&#x02013;<lpage>3802</lpage>. <pub-id pub-id-type="doi">10.1111/j.1460-9568.2008.06285.x</pub-id><pub-id pub-id-type="pmid">18588525</pub-id></citation>
</ref>
<ref id="B36">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Goferman</surname> <given-names>S.</given-names></name> <name><surname>Zelnik-Manor</surname> <given-names>L.</given-names></name> <name><surname>Tal</surname> <given-names>A.</given-names></name></person-group> (<year>2012</year>). <article-title>Context-aware saliency detection</article-title>. <source>Pattern Anal. Mach. Intel. IEEE Trans</source>. <volume>34</volume>, <fpage>1915</fpage>&#x02013;<lpage>1926</lpage>. <pub-id pub-id-type="doi">10.1109/TPAMI.2011.272</pub-id><pub-id pub-id-type="pmid">22201056</pub-id></citation>
</ref>
<ref id="B37">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Goldberg</surname> <given-names>M. E.</given-names></name> <name><surname>Bisley</surname> <given-names>J. W.</given-names></name> <name><surname>Powell</surname> <given-names>K. D.</given-names></name> <name><surname>Gottlieb</surname> <given-names>J.</given-names></name></person-group> (<year>2006</year>). <article-title>Saccades, salience and attention: the role of the lateral intraparietal area in visual behavior</article-title>. <source>Prog. Brain Res</source>. <volume>155</volume>, <fpage>157</fpage>&#x02013;<lpage>175</lpage>. <pub-id pub-id-type="doi">10.1016/S0079-6123(06)55010-1</pub-id><pub-id pub-id-type="pmid">17027387</pub-id></citation>
</ref>
<ref id="B38">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Griffin</surname> <given-names>G.</given-names></name> <name><surname>Holub</surname> <given-names>A.</given-names></name> <name><surname>Perona</surname> <given-names>P.</given-names></name></person-group> (<year>2007</year>). <source>The Caltech-256. Caltech Technical Report</source>. <publisher-loc>Los Angeles, CA</publisher-loc>: <publisher-name>California Institute of Technology</publisher-name>.</citation>
</ref>
<ref id="B39">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Gross</surname> <given-names>C.</given-names></name> <name><surname>Bender</surname> <given-names>D.</given-names></name> <name><surname>Rocha-Miranda</surname> <given-names>C.</given-names></name></person-group> (<year>1969</year>). <article-title>Visual receptive fields of neurons in inferotemporal cortex of the monkey</article-title>. <source>Science</source> <volume>166</volume>, <fpage>1303</fpage>&#x02013;<lpage>1306</lpage>. <pub-id pub-id-type="doi">10.1126/science.166.3910.1303</pub-id><pub-id pub-id-type="pmid">4982685</pub-id></citation>
</ref>
<ref id="B40">
<citation citation-type="web"><person-group person-group-type="author"><name><surname>Harel</surname> <given-names>J.</given-names></name> <name><surname>Koch</surname> <given-names>C.</given-names></name> <name><surname>Perona</surname> <given-names>P.</given-names></name></person-group> (<year>2006a</year>). <article-title>A Saliency Implementation in MATLAB</article-title>. Available online at: <ext-link ext-link-type="uri" xlink:href="http://www.vision.caltech.edu/~harel/share/gbvs.php">http://www.vision.caltech.edu/~harel/share/gbvs.php</ext-link></citation>
</ref>
<ref id="B41">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Harel</surname> <given-names>J.</given-names></name> <name><surname>Koch</surname> <given-names>C.</given-names></name> <name><surname>Perona</surname> <given-names>P.</given-names></name></person-group> (<year>2006b</year>). <article-title>Graph-based visual saliency</article-title>. <source>Adv. Neural Inf. Process. Syst</source>. <fpage>545</fpage>&#x02013;<lpage>552</lpage>. <pub-id pub-id-type="pmid">24427198</pub-id></citation>
</ref>
<ref id="B42">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Hasselmo</surname> <given-names>M. E.</given-names></name> <name><surname>Rolls</surname> <given-names>E. T.</given-names></name> <name><surname>Baylis</surname> <given-names>G. C.</given-names></name> <name><surname>Nalwa</surname> <given-names>V.</given-names></name></person-group> (<year>1989</year>). <article-title>Object-centered encoding by face-selective neurons in the cortex in the superior temporal sulcus of the monkey</article-title>. <source>Exp. Brain Res</source>. <volume>75</volume>, <fpage>417</fpage>&#x02013;<lpage>429</lpage>. <pub-id pub-id-type="doi">10.1007/BF00247948</pub-id><pub-id pub-id-type="pmid">2721619</pub-id></citation>
</ref>
<ref id="B43">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Hawken</surname> <given-names>M. J.</given-names></name> <name><surname>Parker</surname> <given-names>A. J.</given-names></name></person-group> (<year>1987</year>). <article-title>Spatial properties of the monkey striate cortex</article-title>. <source>Proc. R. Soc. Lond. B</source> <volume>231</volume>, <fpage>251</fpage>&#x02013;<lpage>288</lpage>. <pub-id pub-id-type="doi">10.1098/rspb.1987.0044</pub-id><pub-id pub-id-type="pmid">2889214</pub-id></citation>
</ref>
<ref id="B44">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Hayhoe</surname> <given-names>M.</given-names></name> <name><surname>Ballard</surname> <given-names>D.</given-names></name></person-group> (<year>2005</year>). <article-title>Eye movements in natural behavior</article-title>. <source>Trends Cogn. Sci</source>. <volume>9</volume>, <fpage>188</fpage>&#x02013;<lpage>194</lpage>. <pub-id pub-id-type="doi">10.1016/j.tics.2005.02.009</pub-id><pub-id pub-id-type="pmid">15808501</pub-id></citation>
</ref>
<ref id="B45">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Hestrin</surname> <given-names>S.</given-names></name> <name><surname>Sah</surname> <given-names>P.</given-names></name> <name><surname>Nicoll</surname> <given-names>R.</given-names></name></person-group> (<year>1990</year>). <article-title>Mechanisms generating the time course of dual component excitatory synaptic currents recorded in hippocampal slices</article-title>. <source>Neuron</source> <volume>5</volume>, <fpage>247</fpage>&#x02013;<lpage>253</lpage>. <pub-id pub-id-type="doi">10.1016/0896-6273(90)90162-9</pub-id><pub-id pub-id-type="pmid">1976014</pub-id></citation>
</ref>
<ref id="B46">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Hummel</surname> <given-names>J. E.</given-names></name> <name><surname>Biederman</surname> <given-names>I.</given-names></name></person-group> (<year>1992</year>). <article-title>Dynamic binding in a neural network for shape recognition</article-title>. <source>Psychol. Rev</source>. <volume>99</volume>, <fpage>480</fpage>&#x02013;<lpage>517</lpage>. <pub-id pub-id-type="doi">10.1037/0033-295X.99.3.480</pub-id><pub-id pub-id-type="pmid">1502274</pub-id></citation>
</ref>
<ref id="B47">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Isik</surname> <given-names>L.</given-names></name> <name><surname>Leibo</surname> <given-names>J. Z.</given-names></name> <name><surname>Poggio</surname> <given-names>T.</given-names></name></person-group> (<year>2012</year>). <article-title>Learning and disrupting invariance in visual recognition with a temporal association rule</article-title>. <source>Front. Comput. Neurosci</source>. <volume>6</volume>:<issue>37</issue>. <pub-id pub-id-type="doi">10.3389/fncom.2012.00037</pub-id><pub-id pub-id-type="pmid">22754523</pub-id></citation>
</ref>
<ref id="B48">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Itti</surname> <given-names>L.</given-names></name> <name><surname>Koch</surname> <given-names>C.</given-names></name></person-group> (<year>2000</year>). <article-title>A saliency-based search mechanism for overt and covert shifts of visual attention</article-title>. <source>Vis. Res</source>. <volume>40</volume>, <fpage>1489</fpage>&#x02013;<lpage>1506</lpage>. <pub-id pub-id-type="doi">10.1016/S0042-6989(99)00163-7</pub-id><pub-id pub-id-type="pmid">10788654</pub-id></citation>
</ref>
<ref id="B49">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Jia</surname> <given-names>C.</given-names></name> <name><surname>Hou</surname> <given-names>F.</given-names></name> <name><surname>Duan</surname> <given-names>L.</given-names></name></person-group> (<year>2013</year>). <article-title>Visual saliency based on local and global features in the spatial domain</article-title>. <source>Int. J. Comput. Sci</source>. <volume>10</volume>, <fpage>3</fpage>, 713&#x02013;719.</citation>
</ref>
<ref id="B50">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Kanan</surname> <given-names>C.</given-names></name></person-group> (<year>2013</year>). <article-title>Active object recognition with a space-variant retina</article-title>. <source>ISRN Mach. Vis</source>. <volume>2013</volume>:<fpage>138057</fpage>. <pub-id pub-id-type="doi">10.1155/2013/138057</pub-id></citation>
</ref>
<ref id="B51">
<citation citation-type="other"><person-group person-group-type="author"><name><surname>Kanan</surname> <given-names>C.</given-names></name> <name><surname>Cottrell</surname> <given-names>G. W.</given-names></name></person-group> (<year>2010</year>). <article-title>Robust classification of objects, faces, and flowers using natural image statistics</article-title>, in <source>Computer Vision and Pattern Recognition (CVPR), 2010 IEEE Conference on (IEEE)</source>, <fpage>2472</fpage>&#x02013;<lpage>2479</lpage>. <pub-id pub-id-type="doi">10.1109/CVPR.2010.5539947</pub-id></citation>
</ref>
<ref id="B52">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Kanan</surname> <given-names>C.</given-names></name> <name><surname>Tong</surname> <given-names>M. H.</given-names></name> <name><surname>Zhang</surname> <given-names>L.</given-names></name> <name><surname>Cottrell</surname> <given-names>G. W.</given-names></name></person-group> (<year>2009</year>). <article-title>SUN: Top-down saliency using natural statistics</article-title>. <source>Vis. Cognit</source>. <volume>17</volume>, <fpage>979</fpage>&#x02013;<lpage>1003</lpage>. <pub-id pub-id-type="doi">10.1080/13506280902771138</pub-id><pub-id pub-id-type="pmid">21052485</pub-id></citation>
</ref>
<ref id="B53">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Knudsen</surname> <given-names>E. I.</given-names></name></person-group> (<year>2011</year>). <article-title>Control from below: the role of a midbrain network in spatial attention</article-title>. <source>Eur. J. Neurosci</source>. <volume>33</volume>, <fpage>1961</fpage>&#x02013;<lpage>1972</lpage>. <pub-id pub-id-type="doi">10.1111/j.1460-9568.2011.07696.x</pub-id><pub-id pub-id-type="pmid">21645092</pub-id></citation>
</ref>
<ref id="B54">
<citation citation-type="other"><person-group person-group-type="author"><name><surname>Kootstra</surname> <given-names>G.</given-names></name> <name><surname>Bergstrom</surname> <given-names>N.</given-names></name> <name><surname>Kragic</surname> <given-names>D.</given-names></name></person-group> (<year>2010</year>). <article-title>Fast and automatic detection and segmentation of unknown objects</article-title>, in <source>Humanoid Robots (Humanoids), 2010 10th IEEE-RAS International Conference (IEEE)</source>, <fpage>442</fpage>&#x02013;<lpage>447</lpage>. <pub-id pub-id-type="doi">10.1109/ICHR.2010.5686837</pub-id></citation>
</ref>
<ref id="B55">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Larochelle</surname> <given-names>H.</given-names></name> <name><surname>Hinton</surname> <given-names>G. E.</given-names></name></person-group> (<year>2010</year>). <article-title>Learning to combine foveal glimpses with a third-order Boltzmann machine</article-title>. <source>Adv. Neural Inf. Process. Syst</source>. <volume>1</volume>, <fpage>1243</fpage>&#x02013;<lpage>1251</lpage>.</citation>
</ref>
<ref id="B56">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Lee</surname> <given-names>T. S.</given-names></name></person-group> (<year>1996</year>). <article-title>Image representation using 2D Gabor wavelets</article-title>. <source>IEEE Trans. Patt. Anal. Mach. Intell</source>. <volume>18</volume>, <fpage>959</fpage>&#x02013;<lpage>971</lpage>. <pub-id pub-id-type="doi">10.1109/34.541406</pub-id></citation>
</ref>
<ref id="B57">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Li</surname> <given-names>J.</given-names></name> <name><surname>Levine</surname> <given-names>M. D.</given-names></name> <name><surname>An</surname> <given-names>X.</given-names></name> <name><surname>Xu</surname> <given-names>X.</given-names></name> <name><surname>He</surname> <given-names>H.</given-names></name></person-group> (<year>2013</year>). <article-title>Visual saliency based on scale-space analysis in the frequency domain</article-title>. <source>IEEE Trans. Patt. Anal. Mach. Intell</source>. <volume>35</volume>, <fpage>996</fpage>&#x02013;<lpage>1010</lpage>. <pub-id pub-id-type="doi">10.1109/TPAMI.2012.147</pub-id><pub-id pub-id-type="pmid">22802112</pub-id></citation>
</ref>
<ref id="B58">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Li</surname> <given-names>N.</given-names></name> <name><surname>DiCarlo</surname> <given-names>J. J.</given-names></name></person-group> (<year>2008</year>). <article-title>Unsupervised natural experience rapidly alters invariant object representation in visual cortex</article-title>. <source>Science</source> <volume>321</volume>, <fpage>1502</fpage>&#x02013;<lpage>1507</lpage>. <pub-id pub-id-type="doi">10.1126/science.1160028</pub-id><pub-id pub-id-type="pmid">18787171</pub-id></citation>
</ref>
<ref id="B59">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Li</surname> <given-names>N.</given-names></name> <name><surname>DiCarlo</surname> <given-names>J. J.</given-names></name></person-group> (<year>2010</year>). <article-title>Unsupervised natural visual experience rapidly reshapes size-invariant object representation in inferior temporal cortex</article-title>. <source>Neuron</source> <volume>67</volume>, <fpage>1062</fpage>&#x02013;<lpage>1075</lpage>. <pub-id pub-id-type="doi">10.1016/j.neuron.2010.08.029</pub-id><pub-id pub-id-type="pmid">20869601</pub-id></citation>
</ref>
<ref id="B60">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Li</surname> <given-names>N.</given-names></name> <name><surname>DiCarlo</surname> <given-names>J. J.</given-names></name></person-group> (<year>2012</year>). <article-title>Neuronal learning of invariant object representation in the ventral visual stream is not dependent on reward</article-title>. <source>J. Neurosci</source>. <volume>32</volume>, <fpage>6611</fpage>&#x02013;<lpage>6620</lpage>. <pub-id pub-id-type="doi">10.1523/JNEUROSCI.3786-11.2012</pub-id><pub-id pub-id-type="pmid">22573683</pub-id></citation>
</ref>
<ref id="B61">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Malsburg</surname> <given-names>C. V. D.</given-names></name></person-group> (<year>1973</year>). <article-title>Self-organization of orientation-sensitive columns in the striate cortex</article-title>. <source>Kybernetik</source> <volume>14</volume>, <fpage>85</fpage>&#x02013;<lpage>100</lpage>. <pub-id pub-id-type="doi">10.1007/BF00288907</pub-id><pub-id pub-id-type="pmid">4786750</pub-id></citation>
</ref>
<ref id="B62">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Marr</surname> <given-names>D.</given-names></name></person-group> (<year>1982</year>). <source>Vision</source>. (<publisher-loc>San Francisco, CA</publisher-loc>: <publisher-name>Freeman</publisher-name>).</citation>
</ref>
<ref id="B63">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Miller</surname> <given-names>E. K.</given-names></name> <name><surname>Buschman</surname> <given-names>T. J.</given-names></name></person-group> (<year>2013</year>). <article-title>Cortical circuits for the control of attention</article-title>. <source>Curr. Opin. Neurobiol</source>. <volume>23</volume>, <fpage>216</fpage>&#x02013;<lpage>222</lpage>. <pub-id pub-id-type="doi">10.1016/j.conb.2012.11.011</pub-id><pub-id pub-id-type="pmid">23265963</pub-id></citation>
</ref>
<ref id="B64">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Miyashita</surname> <given-names>Y.</given-names></name></person-group> (<year>1988</year>). <article-title>Neuronal correlate of visual associative long-term memory in the primate temporal cortex</article-title>. <source>Nature</source> <volume>335</volume>, <fpage>817</fpage>&#x02013;<lpage>820</lpage>. <pub-id pub-id-type="doi">10.1038/335817a0</pub-id><pub-id pub-id-type="pmid">3185711</pub-id></citation>
</ref>
<ref id="B65">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Montague</surname> <given-names>P. R.</given-names></name> <name><surname>Gally</surname> <given-names>J. A.</given-names></name> <name><surname>Edelman</surname> <given-names>G. M.</given-names></name></person-group> (<year>1991</year>). <article-title>Spatial signalling in the development and function of neural connections</article-title>. <source>Cereb. Cortex</source> <volume>1</volume>, <fpage>199</fpage>&#x02013;<lpage>220</lpage>. <pub-id pub-id-type="doi">10.1093/cercor/1.3.199</pub-id><pub-id pub-id-type="pmid">1822733</pub-id></citation>
</ref>
<ref id="B66">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Mutch</surname> <given-names>J.</given-names></name> <name><surname>Lowe</surname> <given-names>D. G.</given-names></name></person-group> (<year>2008</year>). <article-title>Object class recognition and localization using sparse features with limited receptive fields</article-title>. <source>Int. J. Comput. Vis</source>. <volume>80</volume>, <fpage>45</fpage>&#x02013;<lpage>57</lpage>. <pub-id pub-id-type="doi">10.1007/s11263-007-0118-0</pub-id></citation>
</ref>
<ref id="B67">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Oja</surname> <given-names>E.</given-names></name></person-group> (<year>1982</year>). <article-title>A simplified neuron model as a principal component analyzer</article-title>. <source>J. Math. Biol</source>. <volume>15</volume>, <fpage>267</fpage>&#x02013;<lpage>273</lpage>. <pub-id pub-id-type="doi">10.1007/BF00275687</pub-id><pub-id pub-id-type="pmid">7153672</pub-id></citation>
</ref>
<ref id="B68">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Oliva</surname> <given-names>A.</given-names></name> <name><surname>Torralba</surname> <given-names>A.</given-names></name></person-group> (<year>2006</year>). <article-title>Building the gist of a scene: the role of global image features in recognition</article-title>. <source>Prog. Brain Res</source>. <volume>155</volume>, <fpage>23</fpage>&#x02013;<lpage>36</lpage>. <pub-id pub-id-type="doi">10.1016/S0079-6123(06)55002-2</pub-id><pub-id pub-id-type="pmid">17027377</pub-id></citation>
</ref>
<ref id="B69">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Op de Beeck</surname> <given-names>H.</given-names></name> <name><surname>Vogels</surname> <given-names>R.</given-names></name></person-group> (<year>2000</year>). <article-title>Spatial sensitivity of macaque inferior temporal neurons</article-title>. <source>J. Comp. Neurol</source>. <volume>426</volume>, <fpage>505</fpage>&#x02013;<lpage>518</lpage>. <pub-id pub-id-type="doi">10.1002/1096-9861(20001030)426:4&#x0003C;505::AID-CNE1&#x0003E;3.0.CO;2-M</pub-id><pub-id pub-id-type="pmid">11027395</pub-id></citation>
</ref>
<ref id="B70">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Panzeri</surname> <given-names>S.</given-names></name> <name><surname>Treves</surname> <given-names>A.</given-names></name> <name><surname>Schultz</surname> <given-names>S.</given-names></name> <name><surname>Rolls</surname> <given-names>E. T.</given-names></name></person-group> (<year>1999</year>). <article-title>On decoding the responses of a population of neurons from short time epochs</article-title>. <source>Neural Comput</source>. <volume>11</volume>, <fpage>1553</fpage>&#x02013;<lpage>1577</lpage>. <pub-id pub-id-type="doi">10.1162/089976699300016142</pub-id><pub-id pub-id-type="pmid">10490938</pub-id></citation>
</ref>
<ref id="B71">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Perrett</surname> <given-names>D. I.</given-names></name> <name><surname>Oram</surname> <given-names>M. W.</given-names></name> <name><surname>Harries</surname> <given-names>M. H.</given-names></name> <name><surname>Bevan</surname> <given-names>R.</given-names></name> <name><surname>Hietanen</surname> <given-names>J. K.</given-names></name> <name><surname>Benson</surname> <given-names>P. J.</given-names></name></person-group> (<year>1991</year>). <article-title>Viewer-centered and object-centered coding of heads in the macaque temporal cortex</article-title>. <source>Exp. Brain Res</source>. <volume>86</volume>, <fpage>159</fpage>&#x02013;<lpage>173</lpage>. <pub-id pub-id-type="doi">10.1007/BF00231050</pub-id><pub-id pub-id-type="pmid">1756786</pub-id></citation>
</ref>
<ref id="B72">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Perry</surname> <given-names>G.</given-names></name> <name><surname>Rolls</surname> <given-names>E. T.</given-names></name> <name><surname>Stringer</surname> <given-names>S. M.</given-names></name></person-group> (<year>2006</year>). <article-title>Spatial vs temporal continuity in view invariant visual object recognition learning</article-title>. <source>Vis. Res</source>. <volume>46</volume>, <fpage>3994</fpage>&#x02013;<lpage>4006</lpage>. <pub-id pub-id-type="doi">10.1016/j.visres.2006.07.025</pub-id><pub-id pub-id-type="pmid">16996556</pub-id></citation>
</ref>
<ref id="B73">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Perry</surname> <given-names>G.</given-names></name> <name><surname>Rolls</surname> <given-names>E. T.</given-names></name> <name><surname>Stringer</surname> <given-names>S. M.</given-names></name></person-group> (<year>2010</year>). <article-title>Continuous transformation learning of translation invariant representations</article-title>. <source>Exp. Brain Res</source>. <volume>204</volume>, <fpage>255</fpage>&#x02013;<lpage>270</lpage>. <pub-id pub-id-type="doi">10.1007/s00221-010-2309-0</pub-id><pub-id pub-id-type="pmid">20544186</pub-id></citation>
</ref>
<ref id="B74">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Pinto</surname> <given-names>N.</given-names></name> <name><surname>Doukhan</surname> <given-names>D.</given-names></name> <name><surname>DiCarlo</surname> <given-names>J. J.</given-names></name> <name><surname>Cox</surname> <given-names>D. D.</given-names></name></person-group> (<year>2009</year>). <article-title>A high-throughput screening approach to discovering good forms of biologically inspired visual representation</article-title>. <source>PLoS Comput. Biol</source>. <volume>5</volume>:<fpage>e1000579</fpage>. <pub-id pub-id-type="doi">10.1371/journal.pcbi.1000579</pub-id><pub-id pub-id-type="pmid">19956750</pub-id></citation>
</ref>
<ref id="B75">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Pollen</surname> <given-names>D.</given-names></name> <name><surname>Ronner</surname> <given-names>S.</given-names></name></person-group> (<year>1981</year>). <article-title>Phase relationship between adjacent simple cells in the visual cortex</article-title>. <source>Science</source> <volume>212</volume>, <fpage>1409</fpage>&#x02013;<lpage>1411</lpage>. <pub-id pub-id-type="doi">10.1126/science.7233231</pub-id><pub-id pub-id-type="pmid">7233231</pub-id></citation>
</ref>
<ref id="B76">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Rhodes</surname> <given-names>P.</given-names></name></person-group> (<year>1992</year>). <article-title>The open time of the NMDA channel facilitates the self-organisation of invariant object responses in cortex</article-title>. <source>Soc. Neurosci. Abstr</source>. <volume>18</volume>, <fpage>740</fpage>.</citation>
</ref>
<ref id="B77">
<citation citation-type="other"><person-group person-group-type="author"><name><surname>Riche</surname> <given-names>N.</given-names></name> <name><surname>Mancas</surname> <given-names>M.</given-names></name> <name><surname>Gosselin</surname> <given-names>B.</given-names></name> <name><surname>Dutoit</surname> <given-names>T.</given-names></name></person-group> (<year>2012</year>). <article-title>Rare: a new bottom-up saliency model</article-title>, in <source>2012 19th IEEE International Conference on Image Processing (ICIP)</source>, <fpage>641</fpage>&#x02013;<lpage>644</lpage>. <pub-id pub-id-type="doi">10.1109/ICIP.2012.6466941</pub-id></citation>
</ref>
<ref id="B78">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Riesenhuber</surname> <given-names>M.</given-names></name> <name><surname>Poggio</surname> <given-names>T.</given-names></name></person-group> (<year>2000</year>). <article-title>Models of object recognition</article-title>. <source>Nat. Neurosci. Suppl</source>. <volume>3</volume>, <fpage>1199</fpage>&#x02013;<lpage>1204</lpage>. <pub-id pub-id-type="doi">10.1038/81479</pub-id><pub-id pub-id-type="pmid">11127838</pub-id></citation>
</ref>
<ref id="B79">
<citation citation-type="other"><person-group person-group-type="author"><name><surname>Robinson</surname> <given-names>L.</given-names></name> <name><surname>Rolls</surname> <given-names>E. T.</given-names></name></person-group> (<year>2014</year>). <article-title>Invariant visual object recognition: the biological plausibility of two approaches</article-title>.</citation>
</ref>
<ref id="B80">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Rolls</surname> <given-names>E. T.</given-names></name></person-group> (<year>1992</year>). <article-title>Neurophysiological mechanisms underlying face processing within and beyond the temporal cortical visual areas</article-title>. <source>Philos. Trans. R. Soc</source>. <volume>335</volume>, <fpage>11</fpage>&#x02013;<lpage>21</lpage>. <pub-id pub-id-type="doi">10.1098/rstb.1992.0002</pub-id><pub-id pub-id-type="pmid">1348130</pub-id></citation>
</ref>
<ref id="B81">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Rolls</surname> <given-names>E. T.</given-names></name></person-group> (<year>1995</year>). <article-title>Learning mechanisms in the temporal lobe visual cortex</article-title>. <source>Behav. Brain Res</source>. <volume>66</volume>, <fpage>177</fpage>&#x02013;<lpage>185</lpage>. <pub-id pub-id-type="doi">10.1016/0166-4328(94)00138-6</pub-id><pub-id pub-id-type="pmid">7755888</pub-id></citation>
</ref>
<ref id="B82">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Rolls</surname> <given-names>E. T.</given-names></name></person-group> (<year>2000</year>). <article-title>Functions of the primate temporal lobe cortical visual areas in invariant visual object and face recognition</article-title>. <source>Neuron</source> <volume>27</volume>, <fpage>205</fpage>&#x02013;<lpage>218</lpage>. <pub-id pub-id-type="doi">10.1016/S0896-6273(00)00030-1</pub-id><pub-id pub-id-type="pmid">10985342</pub-id></citation>
</ref>
<ref id="B83">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Rolls</surname> <given-names>E. T.</given-names></name></person-group> (<year>2007</year>). <article-title>The representation of information about faces in the temporal and frontal lobes of primates including humans</article-title>. <source>Neuropsychologia</source> <volume>45</volume>, <fpage>124</fpage>&#x02013;<lpage>143</lpage>. <pub-id pub-id-type="doi">10.1016/j.neuropsychologia.2006.04.019</pub-id><pub-id pub-id-type="pmid">16797609</pub-id></citation>
</ref>
<ref id="B84">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Rolls</surname> <given-names>E. T.</given-names></name></person-group> (<year>2008</year>). <source>Memory, Attention, and Decision-Making. A Unifying Computational Neuroscience Approach</source>. <publisher-loc>Oxford</publisher-loc>: <publisher-name>Oxford University Press</publisher-name>.</citation>
</ref>
<ref id="B85">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Rolls</surname> <given-names>E. T.</given-names></name></person-group> (<year>2012</year>). <article-title>Invariant visual object and face recognition: neural and computational bases, and a model, VisNet</article-title>. <source>Front. Comput. Neurosci</source>. <volume>6</volume>:<issue>35</issue>. <pub-id pub-id-type="doi">10.3389/fncom.2012.00035</pub-id><pub-id pub-id-type="pmid">22723777</pub-id></citation>
</ref>
<ref id="B86">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Rolls</surname> <given-names>E. T.</given-names></name></person-group> (<year>2014</year>). <source>Emotion and Decision-Making Explained</source>. <publisher-loc>Oxford</publisher-loc>: <publisher-name>Oxford University Press</publisher-name>.</citation>
</ref>
<ref id="B87">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Rolls</surname> <given-names>E. T.</given-names></name> <name><surname>Aggelopoulos</surname> <given-names>N. C.</given-names></name> <name><surname>Franco</surname> <given-names>L.</given-names></name> <name><surname>Treves</surname> <given-names>A.</given-names></name></person-group> (<year>2004</year>). <article-title>Information encoding in the inferior temporal visual cortex: contributions of the firing rates and the correlations between the firing of neurons</article-title>. <source>Biol. Cybern</source>. <volume>90</volume>, <fpage>19</fpage>&#x02013;<lpage>32</lpage>. <pub-id pub-id-type="doi">10.1007/s00422-003-0451-5</pub-id><pub-id pub-id-type="pmid">14762721</pub-id></citation>
</ref>
<ref id="B88">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Rolls</surname> <given-names>E. T.</given-names></name> <name><surname>Aggelopoulos</surname> <given-names>N. C.</given-names></name> <name><surname>Zheng</surname> <given-names>F.</given-names></name></person-group> (<year>2003</year>). <article-title>The receptive fields of inferior temporal cortex neurons in natural scenes</article-title>. <source>J. Neurosci</source>. <volume>23</volume>, <fpage>339</fpage>&#x02013;<lpage>348</lpage>. <pub-id pub-id-type="pmid">12514233</pub-id></citation>
</ref>
<ref id="B89">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Rolls</surname> <given-names>E. T.</given-names></name> <name><surname>Baylis</surname> <given-names>G. C.</given-names></name></person-group> (<year>1986</year>). <article-title>Size and contrast have only small effects on the responses to faces of neurons in the cortex of the superior temporal sulcus of the monkey</article-title>. <source>Exp. Brain Res</source>. <volume>65</volume>, <fpage>38</fpage>&#x02013;<lpage>48</lpage>. <pub-id pub-id-type="doi">10.1007/BF00243828</pub-id><pub-id pub-id-type="pmid">3803509</pub-id></citation>
</ref>
<ref id="B90">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Rolls</surname> <given-names>E. T.</given-names></name> <name><surname>Baylis</surname> <given-names>G. C.</given-names></name> <name><surname>Hasselmo</surname> <given-names>M.</given-names></name> <name><surname>Nalwa</surname> <given-names>V.</given-names></name></person-group> (<year>1989</year>). <article-title>The representation of information in the temporal lobe visual cortical areas of macaque monkeys</article-title>, in <source>Seeing Contour and Colour</source>, eds <person-group person-group-type="editor"><name><surname>Kulikowski</surname> <given-names>J.</given-names></name> <name><surname>Dickinson</surname> <given-names>C.</given-names></name> <name><surname>Murray</surname> <given-names>I.</given-names></name></person-group> (<publisher-loc>Oxford</publisher-loc>: <publisher-name>Pergamon</publisher-name>).</citation>
</ref>
<ref id="B91">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Rolls</surname> <given-names>E. T.</given-names></name> <name><surname>Baylis</surname> <given-names>G. C.</given-names></name> <name><surname>Hasselmo</surname> <given-names>M. E.</given-names></name></person-group> (<year>1987</year>). <article-title>The responses of neurons in the cortex in the superior temporal sulcus of the monkey to band-pass spatial frequency filtered faces</article-title>. <source>Vision Res</source>. <volume>27</volume>, <fpage>311</fpage>&#x02013;<lpage>326</lpage>. <pub-id pub-id-type="doi">10.1016/0042-6989(87)90081-2</pub-id><pub-id pub-id-type="pmid">3660594</pub-id></citation>
</ref>
<ref id="B92">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Rolls</surname> <given-names>E. T.</given-names></name> <name><surname>Baylis</surname> <given-names>G. C.</given-names></name> <name><surname>Leonard</surname> <given-names>C. M.</given-names></name></person-group> (<year>1985</year>). <article-title>Role of low and high spatial frequencies in the face-selective responses of neurons in the cortex in the superior temporal sulcus</article-title>. <source>Vision Res</source>. <volume>25</volume>, <fpage>1021</fpage>&#x02013;<lpage>1035</lpage>. <pub-id pub-id-type="doi">10.1016/0042-6989(85)90091-4</pub-id><pub-id pub-id-type="pmid">4071982</pub-id></citation>
</ref>
<ref id="B93">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Rolls</surname> <given-names>E. T.</given-names></name> <name><surname>Deco</surname> <given-names>G.</given-names></name></person-group> (<year>2002</year>). <source>Computational Neuroscience of Vision</source>. <publisher-loc>Oxford</publisher-loc>: <publisher-name>Oxford University Press</publisher-name>.</citation>
</ref>
<ref id="B94">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Rolls</surname> <given-names>E. T.</given-names></name> <name><surname>Franco</surname> <given-names>L.</given-names></name> <name><surname>Aggelopoulos</surname> <given-names>N. C.</given-names></name> <name><surname>Jerez</surname> <given-names>J. M.</given-names></name></person-group> (<year>2006</year>). <article-title>Information in the first spike, the order of spikes, and the number of spikes provided by neurons in the inferior temporal visual cortex</article-title>. <source>Vision Res</source>. <volume>46</volume>, <fpage>4193</fpage>&#x02013;<lpage>4205</lpage>. <pub-id pub-id-type="doi">10.1016/j.visres.2006.07.026</pub-id><pub-id pub-id-type="pmid">17011607</pub-id></citation>
</ref>
<ref id="B95">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Rolls</surname> <given-names>E. T.</given-names></name> <name><surname>Milward</surname> <given-names>T.</given-names></name></person-group> (<year>2000</year>). <article-title>A model of invariant object recognition in the visual system: learning rules, activation functions, lateral inhibition, and information-based performance measures</article-title>. <source>Neural Comput</source>. <volume>12</volume>, <fpage>2547</fpage>&#x02013;<lpage>2572</lpage>. <pub-id pub-id-type="doi">10.1162/089976600300014845</pub-id><pub-id pub-id-type="pmid">11110127</pub-id></citation>
</ref>
<ref id="B96">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Rolls</surname> <given-names>E. T.</given-names></name> <name><surname>Stringer</surname> <given-names>S. M.</given-names></name></person-group> (<year>2001</year>). <article-title>Invariant object recognition in the visual system with error correction and temporal difference learning</article-title>. <source>Network</source> <volume>12</volume>, <fpage>111</fpage>&#x02013;<lpage>129</lpage>. <pub-id pub-id-type="doi">10.1080/net.12.2.111.129</pub-id><pub-id pub-id-type="pmid">11405418</pub-id></citation>
</ref>
<ref id="B97">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Rolls</surname> <given-names>E. T.</given-names></name> <name><surname>Stringer</surname> <given-names>S. M.</given-names></name></person-group> (<year>2006</year>). <article-title>Invariant visual object recognition: a model, with lighting invariance</article-title>. <source>J. Physiol. Paris</source> <volume>100</volume>, <fpage>43</fpage>&#x02013;<lpage>62</lpage>. <pub-id pub-id-type="doi">10.1016/j.jphysparis.2006.09.004</pub-id><pub-id pub-id-type="pmid">17071062</pub-id></citation>
</ref>
<ref id="B98">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Rolls</surname> <given-names>E. T.</given-names></name> <name><surname>Stringer</surname> <given-names>S. M.</given-names></name></person-group> (<year>2007</year>). <article-title>Invariant global motion recognition in the dorsal visual system: a unifying theory</article-title>. <source>Neural Comput</source>. <volume>19</volume>, <fpage>139</fpage>&#x02013;<lpage>169</lpage>. <pub-id pub-id-type="doi">10.1162/neco.2007.19.1.139</pub-id><pub-id pub-id-type="pmid">17134320</pub-id></citation>
</ref>
<ref id="B99">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Rolls</surname> <given-names>E. T.</given-names></name> <name><surname>Tovee</surname> <given-names>M. J.</given-names></name></person-group> (<year>1994</year>). <article-title>Processing speed in the cerebral cortex and the neurophysiology of visual masking</article-title>. <source>Proc. R. Soc. B</source> <volume>257</volume>, <fpage>9</fpage>&#x02013;<lpage>15</lpage>. <pub-id pub-id-type="doi">10.1098/rspb.1994.0087</pub-id><pub-id pub-id-type="pmid">8090795</pub-id></citation>
</ref>
<ref id="B100">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Rolls</surname> <given-names>E. T.</given-names></name> <name><surname>Tovee</surname> <given-names>M. J.</given-names></name></person-group> (<year>1995</year>). <article-title>Sparseness of the neuronal representation of stimuli in the primate temporal visual cortex</article-title>. <source>J. Neurophysiol</source>. <volume>73</volume>, <fpage>713</fpage>&#x02013;<lpage>726</lpage>. <pub-id pub-id-type="pmid">7760130</pub-id></citation>
</ref>
<ref id="B101">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Rolls</surname> <given-names>E. T.</given-names></name> <name><surname>Tovee</surname> <given-names>M. J.</given-names></name> <name><surname>Purcell</surname> <given-names>D. G.</given-names></name> <name><surname>Stewart</surname> <given-names>A. L.</given-names></name> <name><surname>Azzopardi</surname> <given-names>P.</given-names></name></person-group> (<year>1994</year>). <article-title>The responses of neurons in the temporal cortex of primates, and face identification and detection</article-title>. <source>Exp. Brain Res</source>. <volume>101</volume>, <fpage>474</fpage>&#x02013;<lpage>484</lpage>. <pub-id pub-id-type="doi">10.1007/BF00227340</pub-id><pub-id pub-id-type="pmid">7851514</pub-id></citation>
</ref>
<ref id="B102">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Rolls</surname> <given-names>E. T.</given-names></name> <name><surname>Treves</surname> <given-names>A.</given-names></name></person-group> (<year>1998</year>). <source>Neural Networks and Brain Function</source>. <publisher-loc>Oxford</publisher-loc>: <publisher-name>Oxford University Press</publisher-name>.</citation>
</ref>
<ref id="B103">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Rolls</surname> <given-names>E. T.</given-names></name> <name><surname>Treves</surname> <given-names>A.</given-names></name></person-group> (<year>2011</year>). <article-title>The neuronal encoding of information in the brain</article-title>. <source>Prog. Neurobiol</source>. <volume>95</volume>, <fpage>448</fpage>&#x02013;<lpage>490</lpage>. <pub-id pub-id-type="doi">10.1016/j.pneurobio.2011.08.002</pub-id><pub-id pub-id-type="pmid">21907758</pub-id></citation>
</ref>
<ref id="B104">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Rolls</surname> <given-names>E. T.</given-names></name> <name><surname>Treves</surname> <given-names>A.</given-names></name> <name><surname>Tovee</surname> <given-names>M. J.</given-names></name></person-group> (<year>1997a</year>). <article-title>The representational capacity of the distributed encoding of information provided by populations of neurons in the primate temporal visual cortex</article-title>. <source>Exp. Brain Res</source>. <volume>114</volume>, <fpage>149</fpage>&#x02013;<lpage>162</lpage>. <pub-id pub-id-type="pmid">9125461</pub-id></citation>
</ref>
<ref id="B105">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Rolls</surname> <given-names>E. T.</given-names></name> <name><surname>Treves</surname> <given-names>A.</given-names></name> <name><surname>Tovee</surname> <given-names>M.</given-names></name> <name><surname>Panzeri</surname> <given-names>S.</given-names></name></person-group> (<year>1997b</year>). <article-title>Information in the neuronal representation of individual stimuli in the primate temporal visual cortex</article-title>. <source>J. Comput. Neurosci</source>. <volume>4</volume>, <fpage>309</fpage>&#x02013;<lpage>333</lpage>. <pub-id pub-id-type="pmid">9427118</pub-id></citation>
</ref>
<ref id="B106">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Rolls</surname> <given-names>E. T.</given-names></name> <name><surname>Tromans</surname> <given-names>J. M.</given-names></name> <name><surname>Stringer</surname> <given-names>S. M.</given-names></name></person-group> (<year>2008</year>). <article-title>Spatial scene representations formed by self-organizing learning in a hippocampal extension of the ventral visual system</article-title>. <source>Eur. J. Neurosci</source>. <volume>28</volume>, <fpage>2116</fpage>&#x02013;<lpage>2127</lpage>. <pub-id pub-id-type="doi">10.1111/j.1460-9568.2008.06486.x</pub-id><pub-id pub-id-type="pmid">19046392</pub-id></citation>
</ref>
<ref id="B107">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Serre</surname> <given-names>T.</given-names></name> <name><surname>Kreiman</surname> <given-names>G.</given-names></name> <name><surname>Kouh</surname> <given-names>M.</given-names></name> <name><surname>Cadieu</surname> <given-names>C.</given-names></name> <name><surname>Knoblich</surname> <given-names>U.</given-names></name> <name><surname>Poggio</surname> <given-names>T.</given-names></name></person-group> (<year>2007a</year>). <article-title>A quantitative theory of immediate visual recognition</article-title>. <source>Prog. Brain Res</source>. <volume>165</volume>, <fpage>33</fpage>&#x02013;<lpage>56</lpage>. <pub-id pub-id-type="doi">10.1016/S0079-6123(06)65004-8</pub-id><pub-id pub-id-type="pmid">17925239</pub-id></citation>
</ref>
<ref id="B108">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Serre</surname> <given-names>T.</given-names></name> <name><surname>Oliva</surname> <given-names>A.</given-names></name> <name><surname>Poggio</surname> <given-names>T.</given-names></name></person-group> (<year>2007b</year>). <article-title>A feedforward architecture accounts for rapid categorization</article-title>. <source>Proc. Natl. Acad. Sci. U.S.A</source>. <volume>104</volume>, <fpage>6424</fpage>&#x02013;<lpage>6429</lpage>. <pub-id pub-id-type="doi">10.1073/pnas.0700622104</pub-id><pub-id pub-id-type="pmid">17404214</pub-id></citation>
</ref>
<ref id="B109">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Serre</surname> <given-names>T.</given-names></name> <name><surname>Wolf</surname> <given-names>L.</given-names></name> <name><surname>Bileschi</surname> <given-names>S.</given-names></name> <name><surname>Riesenhuber</surname> <given-names>M.</given-names></name> <name><surname>Poggio</surname> <given-names>T.</given-names></name></person-group> (<year>2007c</year>). <article-title>Robust object recognition with cortex-like mechanisms</article-title>. <source>IEEE Trans. Pattern Anal. Mach. Intell</source>. <volume>29</volume>, <fpage>411</fpage>&#x02013;<lpage>426</lpage>. <pub-id pub-id-type="doi">10.1109/TPAMI.2007.56</pub-id><pub-id pub-id-type="pmid">17224612</pub-id></citation>
</ref>
<ref id="B110">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Sheinberg</surname> <given-names>D. L.</given-names></name> <name><surname>Logothetis</surname> <given-names>N. K.</given-names></name></person-group> (<year>2001</year>). <article-title>Noticing familiar objects in real world scenes: the role of temporal cortical neurons in natural vision</article-title>. <source>J. Neurosci</source>. <volume>21</volume>, <fpage>1340</fpage>&#x02013;<lpage>1350</lpage>. <pub-id pub-id-type="pmid">11160405</pub-id></citation>
</ref>
<ref id="B111">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Soltani</surname> <given-names>A.</given-names></name> <name><surname>Koch</surname> <given-names>C.</given-names></name></person-group> (<year>2010</year>). <article-title>Visual saliency computations: mechanisms, constraints, and the effect of feedback</article-title>. <source>J. Neurosci</source>. <volume>30</volume>, <fpage>12831</fpage>&#x02013;<lpage>12843</lpage>. <pub-id pub-id-type="doi">10.1523/JNEUROSCI.1517-10.2010</pub-id><pub-id pub-id-type="pmid">20861387</pub-id></citation>
</ref>
<ref id="B112">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Spruston</surname> <given-names>N.</given-names></name> <name><surname>Jonas</surname> <given-names>P.</given-names></name> <name><surname>Sakmann</surname> <given-names>B.</given-names></name></person-group> (<year>1995</year>). <article-title>Dendritic glutamate receptor channel in rat hippocampal CA3 and CA1 pyramidal neurons</article-title>. <source>J. Physiol</source>. <volume>482</volume>, <fpage>325</fpage>&#x02013;<lpage>352</lpage>. <pub-id pub-id-type="pmid">7536248</pub-id></citation>
</ref>
<ref id="B113">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Stringer</surname> <given-names>S. M.</given-names></name> <name><surname>Perry</surname> <given-names>G.</given-names></name> <name><surname>Rolls</surname> <given-names>E. T.</given-names></name> <name><surname>Proske</surname> <given-names>J. H.</given-names></name></person-group> (<year>2006</year>). <article-title>Learning invariant object recognition in the visual system with continuous transformations</article-title>. <source>Biol. Cybern</source>. <volume>94</volume>, <fpage>128</fpage>&#x02013;<lpage>142</lpage>. <pub-id pub-id-type="doi">10.1007/s00422-005-0030-z</pub-id><pub-id pub-id-type="pmid">16369795</pub-id></citation>
</ref>
<ref id="B114">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Stringer</surname> <given-names>S. M.</given-names></name> <name><surname>Rolls</surname> <given-names>E. T.</given-names></name></person-group> (<year>2000</year>). <article-title>Position invariant recognition in the visual system with cluttered environments</article-title>. <source>Neural Netw</source>. <volume>13</volume>, <fpage>305</fpage>&#x02013;<lpage>315</lpage>. <pub-id pub-id-type="doi">10.1016/S0893-6080(00)00017-4</pub-id><pub-id pub-id-type="pmid">10937964</pub-id></citation>
</ref>
<ref id="B115">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Stringer</surname> <given-names>S. M.</given-names></name> <name><surname>Rolls</surname> <given-names>E. T.</given-names></name></person-group> (<year>2002</year>). <article-title>Invariant object recognition in the visual system with novel views of 3D objects</article-title>. <source>Neural Comput</source>. <volume>14</volume>, <fpage>2585</fpage>&#x02013;<lpage>2596</lpage>. <pub-id pub-id-type="doi">10.1162/089976602760407982</pub-id><pub-id pub-id-type="pmid">12433291</pub-id></citation>
</ref>
<ref id="B116">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Stringer</surname> <given-names>S. M.</given-names></name> <name><surname>Rolls</surname> <given-names>E. T.</given-names></name></person-group> (<year>2008</year>). <article-title>Learning transform invariant object recognition in the visual system with multiple stimuli present during training</article-title>. <source>Neural Netw</source>. <volume>21</volume>, <fpage>888</fpage>&#x02013;<lpage>903</lpage>. <pub-id pub-id-type="doi">10.1016/j.neunet.2007.11.004</pub-id><pub-id pub-id-type="pmid">18440774</pub-id></citation>
</ref>
<ref id="B117">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Stringer</surname> <given-names>S. M.</given-names></name> <name><surname>Rolls</surname> <given-names>E. T.</given-names></name> <name><surname>Tromans</surname> <given-names>J. M.</given-names></name></person-group> (<year>2007</year>). <article-title>Invariant object recognition with trace learning and multiple stimuli present during training</article-title>. <source>Network</source> <volume>18</volume>, <fpage>161</fpage>&#x02013;<lpage>187</lpage>. <pub-id pub-id-type="doi">10.1080/09548980701556055</pub-id><pub-id pub-id-type="pmid">17966074</pub-id></citation>
</ref>
<ref id="B118">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Sutton</surname> <given-names>R. S.</given-names></name> <name><surname>Barto</surname> <given-names>A. G.</given-names></name></person-group> (<year>1981</year>). <article-title>Towards a modern theory of adaptive networks: expectation and prediction</article-title>. <source>Psychol. Rev</source>. <volume>88</volume>, <fpage>135</fpage>&#x02013;<lpage>170</lpage>. <pub-id pub-id-type="doi">10.1037/0033-295X.88.2.135</pub-id><pub-id pub-id-type="pmid">7291377</pub-id></citation>
</ref>
<ref id="B119">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Thorpe</surname> <given-names>S. J.</given-names></name></person-group> (<year>2009</year>). <article-title>The speed of categorization in the human visual system</article-title>. <source>Neuron</source> <volume>62</volume>, <fpage>168</fpage>&#x02013;<lpage>170</lpage>. <pub-id pub-id-type="doi">10.1016/j.neuron.2009.04.012</pub-id><pub-id pub-id-type="pmid">19409262</pub-id></citation>
</ref>
<ref id="B120">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Torralba</surname> <given-names>A.</given-names></name> <name><surname>Oliva</surname> <given-names>A.</given-names></name> <name><surname>Castelhano</surname> <given-names>M. S.</given-names></name> <name><surname>Henderson</surname> <given-names>J. M.</given-names></name></person-group> (<year>2006</year>). <article-title>Contextual guidance of eye movements and attention in real-world scenes: the role of global features in object search</article-title>. <source>Psychol. Rev</source>. <volume>113</volume>, <fpage>766</fpage>&#x02013;<lpage>786</lpage>. <pub-id pub-id-type="doi">10.1037/0033-295X.113.4.766</pub-id><pub-id pub-id-type="pmid">17014302</pub-id></citation>
</ref>
<ref id="B121">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Tovee</surname> <given-names>M. J.</given-names></name> <name><surname>Rolls</surname> <given-names>E. T.</given-names></name></person-group> (<year>1995</year>). <article-title>Information encoding in short firing rate epochs by single neurons in the primate temporal visual cortex</article-title>. <source>Visual Cogn</source>. <volume>2</volume>, <fpage>35</fpage>&#x02013;<lpage>58</lpage>. <pub-id pub-id-type="doi">10.1080/13506289508401721</pub-id></citation>
</ref>
<ref id="B122">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Tovee</surname> <given-names>M. J.</given-names></name> <name><surname>Rolls</surname> <given-names>E. T.</given-names></name> <name><surname>Azzopardi</surname> <given-names>P.</given-names></name></person-group> (<year>1994</year>). <article-title>Translation invariance and the responses of neurons in the temporal visual cortical areas of primates</article-title>. <source>J. Neurophysiol</source>. <volume>72</volume>, <fpage>1049</fpage>&#x02013;<lpage>1060</lpage>.</citation>
</ref>
<ref id="B123">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Tovee</surname> <given-names>M. J.</given-names></name> <name><surname>Rolls</surname> <given-names>E. T.</given-names></name> <name><surname>Treves</surname> <given-names>A.</given-names></name> <name><surname>Bellis</surname> <given-names>R. P.</given-names></name></person-group> (<year>1993</year>). <article-title>Information encoding and the responses of single neurons in the primate temporal visual cortex</article-title>. <source>J. Neurophysiol</source>. <volume>70</volume>, <fpage>640</fpage>&#x02013;<lpage>654</lpage>. <pub-id pub-id-type="pmid">8410164</pub-id></citation>
</ref>
<ref id="B124">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Trappenberg</surname> <given-names>T. P.</given-names></name> <name><surname>Rolls</surname> <given-names>E. T.</given-names></name> <name><surname>Stringer</surname> <given-names>S. M.</given-names></name></person-group> (<year>2002</year>). <article-title>Effective size of receptive fields of inferior temporal visual cortex neurons in natural scenes</article-title>, in <source>Advances in Neural Information Processing Systems</source>, <volume>Vol. 14</volume>, eds <person-group person-group-type="editor"><name><surname>Dietterich</surname> <given-names>T. G.</given-names></name> <name><surname>Becker</surname> <given-names>S.</given-names></name> <name><surname>Gharamani</surname> <given-names>Z.</given-names></name></person-group> (<publisher-loc>Cambridge, MA</publisher-loc>: <publisher-name>MIT Press</publisher-name>), <fpage>293</fpage>&#x02013;<lpage>300</lpage>. <pub-id pub-id-type="pmid">14693189</pub-id></citation>
</ref>
<ref id="B125">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Treves</surname> <given-names>A.</given-names></name> <name><surname>Panzeri</surname> <given-names>S.</given-names></name> <name><surname>Rolls</surname> <given-names>E. T.</given-names></name> <name><surname>Booth</surname> <given-names>M.</given-names></name> <name><surname>Wakeman</surname> <given-names>E. A.</given-names></name></person-group> (<year>1999</year>). <article-title>Firing rate distributions and efficiency of information transmission of inferior temporal cortex neurons to natural visual stimuli</article-title>. <source>Neural Comput</source>. <volume>11</volume>, <fpage>601</fpage>&#x02013;<lpage>631</lpage>. <pub-id pub-id-type="doi">10.1162/089976699300016593</pub-id><pub-id pub-id-type="pmid">10085423</pub-id></citation>
</ref>
<ref id="B126">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Ungerleider</surname> <given-names>L. G.</given-names></name> <name><surname>Haxby</surname> <given-names>J. V.</given-names></name></person-group> (<year>1994</year>). <article-title>&#x0201C;What&#x0201D; and &#x0201C;Where&#x0201D; in the human brain</article-title>. <source>Curr. Opin. Neurobiol</source>. <volume>4</volume>, <fpage>157</fpage>&#x02013;<lpage>165</lpage>. <pub-id pub-id-type="doi">10.1016/0959-4388(94)90066-3</pub-id><pub-id pub-id-type="pmid">8038571</pub-id></citation>
</ref>
<ref id="B127">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Ungerleider</surname> <given-names>L. G.</given-names></name> <name><surname>Mishkin</surname> <given-names>M.</given-names></name></person-group> (<year>1982</year>). <article-title>Two cortical visual systems</article-title>, in <source>Analysis of Visual Behaviour</source>, eds <person-group person-group-type="editor"><name><surname>Ingle</surname> <given-names>D.</given-names></name> <name><surname>Goodale</surname> <given-names>M. A.</given-names></name> <name><surname>Mansfield</surname> <given-names>R. J. W.</given-names></name></person-group> (<publisher-loc>Cambridge, MA</publisher-loc>: <publisher-name>MIT Press</publisher-name>), <fpage>549</fpage>&#x02013;<lpage>586</lpage>.</citation>
</ref>
<ref id="B128">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Van Essen</surname> <given-names>D.</given-names></name> <name><surname>Anderson</surname> <given-names>C. H.</given-names></name> <name><surname>Felleman</surname> <given-names>D. J.</given-names></name></person-group> (<year>1992</year>). <article-title>Information processing in the primate visual system: an integrated systems perspective</article-title>. <source>Science</source> <volume>255</volume>, <fpage>419</fpage>&#x02013;<lpage>423</lpage>. <pub-id pub-id-type="doi">10.1126/science.1734518</pub-id><pub-id pub-id-type="pmid">1734518</pub-id></citation>
</ref>
<ref id="B129">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Wallis</surname> <given-names>G.</given-names></name></person-group> (<year>2013</year>). <article-title>Toward a unified model of face and object recognition in the human visual system</article-title>. <source>Front. Psychol</source>. <volume>4</volume>:<issue>497</issue>. <pub-id pub-id-type="doi">10.3389/fpsyg.2013.00497</pub-id><pub-id pub-id-type="pmid">23966963</pub-id></citation>
</ref>
<ref id="B130">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Wallis</surname> <given-names>G.</given-names></name> <name><surname>Rolls</surname> <given-names>E. T.</given-names></name></person-group> (<year>1997</year>). <article-title>Invariant face and object recognition in the visual system</article-title>. <source>Prog. Neurobiol</source>. <volume>51</volume>, <fpage>167</fpage>&#x02013;<lpage>194</lpage>. <pub-id pub-id-type="doi">10.1016/S0301-0082(96)00054-8</pub-id><pub-id pub-id-type="pmid">9247963</pub-id></citation>
</ref>
<ref id="B131">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Wallis</surname> <given-names>G.</given-names></name> <name><surname>Rolls</surname> <given-names>E. T.</given-names></name> <name><surname>F&#x000F6;ldi&#x000E1;k</surname> <given-names>P.</given-names></name></person-group> (<year>1993</year>). <article-title>Learning invariant responses to the natural transformations of objects</article-title>. <source>Int. Joint Conf. Neural Netw</source>. <volume>2</volume>, <fpage>1087</fpage>&#x02013;<lpage>1090</lpage>.</citation>
</ref>
<ref id="B132">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Walther</surname> <given-names>D.</given-names></name> <name><surname>Itti</surname> <given-names>L.</given-names></name> <name><surname>Riesenhuber</surname> <given-names>M.</given-names></name> <name><surname>Poggio</surname> <given-names>T.</given-names></name> <name><surname>Koch</surname> <given-names>C.</given-names></name></person-group> (<year>2002</year>). <article-title>Attentional selection for object recognition&#x02013;a gentle way</article-title>. <source>Biol. Mot. Comput. Vis</source>. <fpage>472</fpage>&#x02013;<lpage>479</lpage>.</citation>
</ref>
<ref id="B133">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Webb</surname> <given-names>T. J.</given-names></name> <name><surname>Rolls</surname> <given-names>E. T.</given-names></name></person-group> (<year>2014</year>). <article-title>Deformation-specific and deformation-invariant visual object recognition: pose vs identity recognition of people and deforming objects</article-title>. <source>Front. Comput. Neurosci</source>. <volume>8</volume>:<issue>37</issue>. <pub-id pub-id-type="doi">10.3389/fncom.2014.00037</pub-id><pub-id pub-id-type="pmid">24744725</pub-id></citation>
</ref>
<ref id="B134">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Wiskott</surname> <given-names>L.</given-names></name></person-group> (<year>2003</year>). <article-title>Slow feature analysis: a theoretical analysis of optimal free responses</article-title>. <source>Neural Comput</source>. <volume>15</volume>, <fpage>2147</fpage>&#x02013;<lpage>2177</lpage>. <pub-id pub-id-type="doi">10.1162/089976603322297331</pub-id><pub-id pub-id-type="pmid">12959670</pub-id></citation>
</ref>
<ref id="B135">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Wiskott</surname> <given-names>L.</given-names></name> <name><surname>Sejnowski</surname> <given-names>T. J.</given-names></name></person-group> (<year>2002</year>). <article-title>Slow feature analysis: unsupervised learning of invariances</article-title>. <source>Neural Comput</source>. <volume>14</volume>, <fpage>715</fpage>&#x02013;<lpage>770</lpage>. <pub-id pub-id-type="doi">10.1162/089976602317318938</pub-id><pub-id pub-id-type="pmid">11936959</pub-id></citation>
</ref>
<ref id="B136">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Wyss</surname> <given-names>R.</given-names></name> <name><surname>Konig</surname> <given-names>P.</given-names></name> <name><surname>Verschure</surname> <given-names>P. F.</given-names></name></person-group> (<year>2006</year>). <article-title>A model of the ventral visual system based on temporal stability and local memory</article-title>. <source>PLoS Biol</source>. <volume>4</volume>:<fpage>e120</fpage>. <pub-id pub-id-type="doi">10.1371/journal.pbio.0040120</pub-id><pub-id pub-id-type="pmid">16605306</pub-id></citation>
</ref>
<ref id="B137">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Yamins</surname> <given-names>D. L.</given-names></name> <name><surname>Hong</surname> <given-names>H.</given-names></name> <name><surname>Cadieu</surname> <given-names>C. F.</given-names></name> <name><surname>Solomon</surname> <given-names>E. A.</given-names></name> <name><surname>Seibert</surname> <given-names>D.</given-names></name> <name><surname>DiCarlo</surname> <given-names>J. J.</given-names></name></person-group> (<year>2014</year>). <article-title>Performance-optimized hierarchical models predict neural responses in higher visual cortex</article-title>. <source>Proc. Natl. Acad. Sci. U.S.A</source>. <volume>111</volume>, <fpage>8619</fpage>&#x02013;<lpage>8624</lpage>. <pub-id pub-id-type="doi">10.1073/pnas.1403112111</pub-id><pub-id pub-id-type="pmid">24812127</pub-id></citation>
</ref>
<ref id="B138">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Zhang</surname> <given-names>L.</given-names></name> <name><surname>Tong</surname> <given-names>M. H.</given-names></name> <name><surname>Marks</surname> <given-names>T. K.</given-names></name> <name><surname>Shan</surname> <given-names>H.</given-names></name> <name><surname>Cottrell</surname> <given-names>G. W.</given-names></name></person-group> (<year>2008</year>). <article-title>SUN: A Bayesian framework for saliency using natural statistics</article-title>. <source>J. Vis</source>. <volume>8</volume>:<fpage>32</fpage>. <pub-id pub-id-type="doi">10.1167/8.7.32</pub-id><pub-id pub-id-type="pmid">19146264</pub-id></citation>
</ref>
</ref-list>
<fn-group>
<fn id="fn0001"><p><sup>1</sup>GBVS was used with its default parameters, except as follows: channels &#x0003D; CIO; gaborangles 0, 30, 60, 90, 120, 150; onCenterBias &#x0003D; 1; levels 2 3; sigma_frac_act &#x0003D; 0.35; sigma_frac_norm &#x0003D; 0.26.</p></fn>
</fn-group>
<app-group>
<app id="A1">
<title>A. Appendix: the architecture of VisNet</title>
<p>This Appendix describes the functional architecture, operation, and testing of VisNet as used in this paper. VisNet is a hierarchical feedforward 4-layer network that models properties of the ventral visual system involved in invariant visual object recognition (Rolls, <xref ref-type="bibr" rid="B84">2008</xref>, <xref ref-type="bibr" rid="B85">2012</xref>).</p>
<sec>
<title>A.1 The trace rule</title>
<p>The learning rule implemented in the VisNet simulations utilizes the spatio-temporal constraints placed upon the behavior of &#x0201C;real-world&#x0201D; objects to learn about natural object transformations. By presenting consistent sequences of transforming objects the cells in the network can learn to respond to the same object through all of its naturally transformed states, as described by F&#x000F6;ldi&#x000E1;k (<xref ref-type="bibr" rid="B29">1991</xref>), Rolls (<xref ref-type="bibr" rid="B80">1992</xref>), Wallis et al. (<xref ref-type="bibr" rid="B131">1993</xref>), Wallis and Rolls (<xref ref-type="bibr" rid="B130">1997</xref>), and Rolls (<xref ref-type="bibr" rid="B85">2012</xref>). The learning rule incorporates a decaying trace of previous cell activity and is henceforth referred to simply as the &#x0201C;trace&#x0201D; learning rule. The learning paradigm we describe here is intended in principle to enable learning of any of the transforms tolerated by inferior temporal cortex neurons, including position, size, view, lighting, and spatial frequency (Rolls, <xref ref-type="bibr" rid="B80">1992</xref>, <xref ref-type="bibr" rid="B82">2000</xref>; Rolls and Deco, <xref ref-type="bibr" rid="B93">2002</xref>; Rolls, <xref ref-type="bibr" rid="B84">2008</xref>, <xref ref-type="bibr" rid="B85">2012</xref>).</p>
<p>Various biological bases for this temporal trace have been advanced, as follows: [The precise mechanisms involved may alter the exact form of trace rule that should be used. F&#x000F6;ldi&#x000E1;k (<xref ref-type="bibr" rid="B30">1992</xref>) describes an alternative trace rule that models individual NMDA channels. Equally, temporally extended cell firing in a local cortical attractor could provide a short-term memory of previous neuronal firing and thereby implement the trace (Rolls, <xref ref-type="bibr" rid="B84">2008</xref>).]</p>
<list list-type="bullet">
<list-item><p>The persistent firing of neurons for as long as 100&#x02013;400 ms observed after presentations of stimuli for 16 ms (Rolls and Tovee, <xref ref-type="bibr" rid="B99">1994</xref>) could provide a time window within which to associate subsequent images. Maintained activity may potentially be implemented by recurrent connections between as well as within cortical areas (Rolls and Treves, <xref ref-type="bibr" rid="B102">1998</xref>; Rolls and Deco, <xref ref-type="bibr" rid="B93">2002</xref>; Rolls, <xref ref-type="bibr" rid="B84">2008</xref>). [The prolonged firing of inferior temporal cortex neurons during memory delay periods of several seconds, and associative links reported to develop between stimuli presented several seconds apart (Miyashita, <xref ref-type="bibr" rid="B64">1988</xref>) are on too long a time scale to be immediately relevant to the present theory. In fact, associations between visual events occurring several seconds apart would, under <italic>normal</italic> environmental conditions, be detrimental to the operation of a network of the type described here, because they would probably arise from different objects. In contrast, the system described benefits from associations between visual events which occur close in time (typically within 1 s), as they are likely to be from the same object.]</p></list-item>
<list-item><p>The binding period of glutamate in the NMDA channels, which may last for 100 ms or more, may implement a trace rule by producing a narrow time window over which the <italic>average</italic> activity at each presynaptic site affects learning (F&#x000F6;ldi&#x000E1;k, <xref ref-type="bibr" rid="B30">1992</xref>; Rolls, <xref ref-type="bibr" rid="B80">1992</xref>; Rhodes, <xref ref-type="bibr" rid="B76">1992</xref>; Spruston et al., <xref ref-type="bibr" rid="B112">1995</xref>; Hestrin et al., <xref ref-type="bibr" rid="B45">1990</xref>).</p></list-item>
<list-item><p>Chemicals such as nitric oxide may be released during high neural activity and gradually decay in concentration over a short time window during which learning could be enhanced (F&#x000F6;ldi&#x000E1;k, <xref ref-type="bibr" rid="B30">1992</xref>; Montague et al., <xref ref-type="bibr" rid="B65">1991</xref>; Garthwaite, <xref ref-type="bibr" rid="B35">2008</xref>).</p></list-item>
</list>
<p>The trace update rule used in the baseline simulations of VisNet (Wallis and Rolls, <xref ref-type="bibr" rid="B130">1997</xref>) is equivalent both to the rule used by F&#x000F6;ldi&#x000E1;k in the context of translation invariance (Wallis et al., <xref ref-type="bibr" rid="B131">1993</xref>) and to the earlier rule of Sutton and Barto (<xref ref-type="bibr" rid="B118">1981</xref>), explored in the context of modeling the temporal properties of classical conditioning. It can be summarized as follows (a minimal code sketch of this update follows the table of symbols below):</p>
<disp-formula id="E1"><label>(A1)</label><mml:math id="M1"><mml:mrow><mml:mi>&#x003B4;</mml:mi><mml:msub><mml:mi>w</mml:mi><mml:mi>j</mml:mi></mml:msub><mml:mo>=</mml:mo><mml:mi>&#x003B1;</mml:mi><mml:msup><mml:mover accent='true'><mml:mi>y</mml:mi><mml:mo>&#x000AF;</mml:mo></mml:mover><mml:mi>&#x003C4;</mml:mi></mml:msup><mml:msub><mml:mi>x</mml:mi><mml:mi>j</mml:mi></mml:msub></mml:mrow></mml:math></disp-formula>
<p>where</p>
<disp-formula id="E2"><label>(A2)</label><mml:math id="M2"><mml:mrow><mml:msup><mml:mover accent='true'><mml:mi>y</mml:mi><mml:mo>&#x000AF;</mml:mo></mml:mover><mml:mi>&#x003C4;</mml:mi></mml:msup><mml:mo>=</mml:mo><mml:mo stretchy='false'>(</mml:mo><mml:mn>1</mml:mn><mml:mo>&#x02212;</mml:mo><mml:mi>&#x003B7;</mml:mi><mml:mo stretchy='false'>)</mml:mo><mml:msup><mml:mi>y</mml:mi><mml:mi>&#x003C4;</mml:mi></mml:msup><mml:mo>+</mml:mo><mml:mi>&#x003B7;</mml:mi><mml:msup><mml:mover accent='true'><mml:mi>y</mml:mi><mml:mo>&#x000AF;</mml:mo></mml:mover><mml:mrow><mml:mi>&#x003C4;</mml:mi><mml:mo>&#x02212;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msup></mml:mrow></mml:math></disp-formula>
<p>and</p>
<table-wrap position="float">
<table frame="hsides" rules="groups">
<tbody>
<tr>
<td align="left"><italic>x<sub>j</sub></italic>:</td>
<td align="left"><italic>j<sup>th</sup></italic> input to the neuron.</td>
<td align="left"><italic>y</italic>:</td>
<td align="left">Output from the neuron.</td>
</tr>
<tr>
<td align="left"><overline><italic>y</italic></overline><sup>&#x003C4;</sup>:</td>
<td align="left">Trace value of the output of the neuron at time step &#x003C4;.</td>
<td align="left">&#x003B1;:</td>
<td align="left">Learning rate.</td>
</tr>
<tr>
<td align="left"><italic>w<sub>j</sub></italic>:</td>
<td align="left">Synaptic weight between <italic>j<sup>th</sup></italic> input and the neuron.</td>
<td align="left">&#x003B7;:</td>
<td align="left">Trace parameter weighting the previous trace relative to the current firing (see Equation A2). The optimal value varies with presentation sequence length.</td>
</tr>
</tbody>
</table>
</table-wrap>
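<p>To make the operation of Equations (A1) and (A2) concrete, a minimal sketch in Python/NumPy is given below. The function and variable names and the default parameter values are illustrative assumptions for exposition, not code taken from the VisNet implementation.</p>
<preformat>
import numpy as np

def trace_rule_update(w, x, y, y_trace_prev, alpha=0.05, eta=0.8):
    """One step of the baseline trace rule (Equations A1 and A2) for a
    single postsynaptic neuron.

    w            : synaptic weight vector of the neuron
    x            : presynaptic firing rates for the current transform
    y            : output (firing rate) of the neuron at this time step
    y_trace_prev : trace value of the output at the previous time step
    """
    # Equation (A2): exponentially decaying trace of the output of the neuron
    y_trace = (1.0 - eta) * y + eta * y_trace_prev
    # Equation (A1): Hebbian update driven by the trace of the output
    w = w + alpha * y_trace * x
    return w, y_trace
</preformat>
<p>In use, the trace would be carried across the successive transforms of one object presented in sequence, and reset when a new object is presented, so that associations are formed only between views that occur close together in time.</p>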
<p>At the start of a series of investigations of different forms of the trace learning rule, Rolls and Milward (<xref ref-type="bibr" rid="B95">2000</xref>) demonstrated that VisNet&#x00027;s performance could be greatly enhanced with a modified Hebbian trace learning rule (Equation A3) that incorporated a trace of activity from the preceding time steps, with no contribution from the activity being produced by the stimulus at the current time step. This rule took the form</p>
<disp-formula id="E3"><label>(A3)</label><mml:math id="M3"><mml:mrow><mml:mi>&#x003B4;</mml:mi><mml:msub><mml:mi>w</mml:mi><mml:mi>j</mml:mi></mml:msub><mml:mo>=</mml:mo><mml:mi>&#x003B1;</mml:mi><mml:msup><mml:mover accent='true'><mml:mi>y</mml:mi><mml:mo>&#x000AF;</mml:mo></mml:mover><mml:mrow><mml:mi>&#x003C4;</mml:mi><mml:mo>&#x02212;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msup><mml:msubsup><mml:mi>x</mml:mi><mml:mi>j</mml:mi><mml:mi>&#x003C4;</mml:mi></mml:msubsup><mml:mo>.</mml:mo></mml:mrow></mml:math></disp-formula>
<p>The trace shown in Equation (A3) is in the postsynaptic term. The crucial difference from the earlier rule (see Equation A1) was that the trace should be calculated up to only the preceding timestep, with no contribution to the trace from the firing on the current trial to the current stimulus. This has the effect of updating the weights based on the preceding activity of the neuron, which is likely given the spatio-temporal statistics of the visual world to be from previous transforms of the same object (Rolls and Milward, <xref ref-type="bibr" rid="B95">2000</xref>; Rolls and Stringer, <xref ref-type="bibr" rid="B96">2001</xref>). This is biologically not at all implausible, as considered in more detail elsewhere (Rolls, <xref ref-type="bibr" rid="B84">2008</xref>, <xref ref-type="bibr" rid="B85">2012</xref>), and this version of the trace rule was used in this investigation.</p>
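<p>The change introduced by Equation (A3) lies only in which trace multiplies the current presynaptic input; reusing the hypothetical names from the sketch above, the difference amounts to one line:</p>
<preformat>
def modified_trace_rule_update(w, x, y, y_trace_prev, alpha=0.05, eta=0.8):
    """Modified Hebbian trace rule (Equation A3): the weight change uses
    only the trace calculated up to the preceding time step, with no
    contribution from the firing produced by the current stimulus."""
    w = w + alpha * y_trace_prev * x                  # Equation (A3)
    # the trace itself is still updated for the next time step (Equation A2)
    y_trace = (1.0 - eta) * y + eta * y_trace_prev
    return w, y_trace
</preformat>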
<p>The optimal value of &#x003B7; in the trace rule is likely to be different for different layers of VisNet. For early layers with small receptive fields, few successive transforms are likely to contain similar information within the receptive field, so the value for &#x003B7; might be low to produce a short trace. In later layers of VisNet, successive transforms may be in the receptive field for longer, and invariance may be developing in earlier layers, so a longer trace may be beneficial. In practice, after exploration we used &#x003B7; values of 0.6 for layer 2, and 0.8 for layers 3 and 4. In addition, it is important to form feature combinations with high spatial precision before invariance learning supported by a temporal trace starts, in order that the feature combinations and not the individual features have invariant representations (Rolls, <xref ref-type="bibr" rid="B84">2008</xref>, <xref ref-type="bibr" rid="B85">2012</xref>). For this reason, purely associative learning with no temporal trace was used in layer 1 of VisNet (Rolls and Milward, <xref ref-type="bibr" rid="B95">2000</xref>).</p>
<p>The following principled method was introduced to choose the value of the learning rate &#x003B1; for each layer. The mean weight change across all the neurons in that layer was measured for each epoch of training, and &#x003B1; was set so that, with slow learning over 15&#x02013;50 epochs, the weight changes per epoch would gradually decrease and asymptote over that number of epochs, reflecting convergence. Slow learning rates are useful in competitive nets, for if the learning rates are too high, previous learning in the synaptic weights will be overwritten by large weight changes produced later within the same epoch if a neuron starts to respond to another stimulus (Rolls, <xref ref-type="bibr" rid="B84">2008</xref>). If the learning rates are too low, then no useful learning or convergence will occur. It was found that the following learning rates enabled good operation with the 100 transforms of each of 4 stimuli used in each epoch in the present investigation: Layer 1 &#x003B1; &#x0003D; 0.05; Layer 2 &#x003B1; &#x0003D; 0.03 (this is relatively high to allow for the sparse representations in layer 1); Layer 3 &#x003B1; &#x0003D; 0.005; Layer 4 &#x003B1; &#x0003D; 0.005.</p>
<p>To bound the growth of each neuron&#x00027;s synaptic weight vector, <bold>w</bold><sub><italic>i</italic></sub> for the <italic>i</italic>th neuron, its length is explicitly normalized [a method similar to that employed by Malsburg (<xref ref-type="bibr" rid="B61">1973</xref>) and commonly used in competitive networks (Rolls, <xref ref-type="bibr" rid="B84">2008</xref>)]. An alternative, more biologically relevant implementation, using a local weight bounding operation that utilizes a form of heterosynaptic long-term depression (Rolls, <xref ref-type="bibr" rid="B84">2008</xref>), has in part been explored using a version of the Oja (<xref ref-type="bibr" rid="B67">1982</xref>) rule (see Wallis and Rolls, <xref ref-type="bibr" rid="B130">1997</xref>).</p>
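<p>A minimal sketch of the explicit weight vector normalization, assuming the weights of a layer are held in a matrix with one row per neuron (the function name and the small constant guarding against division by zero are illustrative):</p>
<preformat>
import numpy as np

def normalize_weight_rows(W, eps=1e-12):
    """Explicitly renormalize each neuron's synaptic weight vector to unit
    length after learning, to bound weight growth (a sketch)."""
    norms = np.linalg.norm(W, axis=1, keepdims=True)
    return W / np.maximum(norms, eps)   # avoid division by zero for silent rows
</preformat>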
</sec>
<sec>
<title>A.2 The network implemented in VisNet</title>
<p>The network itself is designed as a series of hierarchical, convergent, competitive networks, in accordance with the hypotheses advanced above. The network consists of a series of four layers, constructed such that the convergence of information from the most disparate parts of the network&#x00027;s input layer can potentially influence firing in a single neuron in the final layer&#x02014;see Figure <xref ref-type="fig" rid="F1">1</xref>. This corresponds to the scheme described by many researchers (Van Essen et al., <xref ref-type="bibr" rid="B128">1992</xref>; Rolls, <xref ref-type="bibr" rid="B80">1992</xref>, <xref ref-type="bibr" rid="B84">2008</xref>, for example) as present in the primate visual system. The forward connections to a cell in one layer are derived from a topologically related and confined region of the preceding layer. Whether a connection between neurons in adjacent layers exists is determined by a Gaussian distribution of connection probabilities that rolls off radially from the focal point of connections for each neuron. (A minor extra constraint precludes the repeated connection of any pair of cells.) In particular, the forward connections to a cell in one layer come from a small region of the preceding layer defined by the radius in Table <xref ref-type="table" rid="TA1">A1</xref>, which contains approximately 67% of the connections from the preceding layer. Table <xref ref-type="table" rid="TA1">A1</xref> shows the dimensions for the research described here, a version 16 times larger than the version of VisNet used in most of our previous investigations, which utilized 32 &#x000D7; 32 neurons per layer. For the research on view and translation invariance learning described here, we decreased the number of connections to layer 1 neurons to 100 (from 272), in order to increase the selectivity of the network between objects. We increased the number of connections to each neuron in layers 2&#x02013;4 to 400 (from 100), because this helped layer 4 neurons to reflect evidence from neurons in previous layers about the large number of transforms (typically 100 transforms, from 4 views of each object and 25 locations), each of which corresponded to a particular object.</p>
<table-wrap position="float" id="TA1">
<label>Table A1</label>
<caption><p><bold>VisNet dimensions</bold>.</p></caption>
<table frame="hsides" rules="groups">
<thead>
<tr>
<th/>
<th align="left" valign="top"><bold>Dimensions</bold></th>
<th align="left" valign="top"><bold>&#x00023; Connections</bold></th>
<th align="left" valign="top"><bold>Radius</bold></th>
</tr>
</thead>
<tbody>
<tr>
<td align="left" valign="top">Layer 4</td>
<td align="center" valign="top">128 &#x000D7; 128</td>
<td align="center" valign="top">400</td>
<td align="center" valign="top">48</td>
</tr>
<tr>
<td align="left" valign="top">Layer 3</td>
<td align="center" valign="top">128 &#x000D7; 128</td>
<td align="center" valign="top">400</td>
<td align="center" valign="top">36</td>
</tr>
<tr>
<td align="left" valign="top">Layer 2</td>
<td align="center" valign="top">128 &#x000D7; 128</td>
<td align="center" valign="top">400</td>
<td align="center" valign="top">24</td>
</tr>
<tr>
<td align="left" valign="top">Layer 1</td>
<td align="center" valign="top">128 &#x000D7; 128</td>
<td align="center" valign="top">100</td>
<td align="center" valign="top">24</td>
</tr>
<tr>
<td align="left" valign="top">Input layer</td>
<td align="center" valign="top">256 &#x000D7; 256 &#x000D7; 16</td>
<td align="center" valign="top">&#x02013;</td>
<td align="center" valign="top">&#x02013;</td>
</tr>
</tbody>
</table>
</table-wrap>
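<p>A schematic sketch of the connectivity scheme just described is given below. It treats the tabulated radius as the scale of the Gaussian from which connection offsets are drawn, maps post-synaptic coordinates onto the preceding layer, and forbids repeated connections; the wrap-around boundary handling and the interpretation of the radius as one standard deviation are simplifying assumptions for illustration, not details of the VisNet implementation.</p>
<preformat>
import numpy as np

def sample_forward_connections(n_pre, n_post, n_conn, radius, seed=None):
    """Sketch of the topological forward connectivity: each post-synaptic
    neuron draws n_conn connections from a Gaussian-distributed region of the
    (n_pre x n_pre) preceding layer centred on its own topological position,
    with no repeated pre-synaptic cell."""
    rng = np.random.default_rng(seed)
    scale = n_pre / n_post          # map post-layer coordinates onto the pre layer
    connections = {}
    for i in range(n_post):
        for j in range(n_post):
            chosen = set()
            cx, cy = i * scale, j * scale
            while len(chosen) != n_conn:
                dx, dy = rng.normal(0.0, radius, size=2)
                px = int(round(cx + dx)) % n_pre    # wrap at the edges, for simplicity
                py = int(round(cy + dy)) % n_pre
                chosen.add((px, py))
            connections[(i, j)] = sorted(chosen)
    return connections
</preformat>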
<p>Figure <xref ref-type="fig" rid="F1">1</xref> shows the general convergent network architecture used. Localization and limitation of connectivity in the network is intended to mimic cortical connectivity, partially because of the clear retention of retinal topology through regions of visual cortex. This architecture also encourages the gradual combination of features from layer to layer which has relevance to the binding problem, as described elsewhere (Rolls, <xref ref-type="bibr" rid="B84">2008</xref>, <xref ref-type="bibr" rid="B85">2012</xref>).</p>
</sec>
<sec>
<title>A.3 Competition and lateral inhibition</title>
<p>In order to act as a competitive network, some form of mutual inhibition is required within each layer, which should help to ensure that all stimuli presented are evenly represented by the neurons in each layer. This is implemented in VisNet by a form of lateral inhibition. The idea behind the lateral inhibition, apart from this being a property of cortical architecture in the brain, was to prevent too many neurons that received inputs from a similar part of the preceding layer responding to the same activity patterns. The purpose of the lateral inhibition was thus to ensure that different receiving neurons coded for different inputs, which is important in reducing redundancy (Rolls, <xref ref-type="bibr" rid="B84">2008</xref>). The lateral inhibition is conceived as operating within a radius similar to that of the region within which a neuron received converging inputs from the preceding layer (because activity in one zone of topologically organized processing within a layer should not inhibit processing in another zone in the same layer, concerned perhaps with another part of the image). The lateral inhibition in this investigation used the parameters for &#x003C3; shown in Table <xref ref-type="table" rid="TA3">A3</xref>.</p>
<p>The lateral inhibition and contrast enhancement just described are actually implemented in VisNet2 (Rolls and Milward, <xref ref-type="bibr" rid="B95">2000</xref>) and VisNetL (Perry et al., <xref ref-type="bibr" rid="B73">2010</xref>) in two stages, to produce filtering of the type illustrated elsewhere (Rolls, <xref ref-type="bibr" rid="B84">2008</xref>, <xref ref-type="bibr" rid="B85">2012</xref>). The lateral inhibition was implemented by convolving the activation of the neurons in a layer with a spatial filter, <italic>I</italic>, where &#x003B4; controls the contrast and &#x003C3; controls the width, and <italic>a</italic> and <italic>b</italic> index the distance away from the center of the filter</p>
<disp-formula id="E4"><label>(A4)</label><mml:math id="M4"><mml:mrow><mml:msub><mml:mi>I</mml:mi><mml:mrow><mml:mi>a</mml:mi><mml:mo>,</mml:mo><mml:mi>b</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mrow><mml:mo>{</mml:mo><mml:mrow><mml:mtable columnalign='left'><mml:mtr columnalign='left'><mml:mtd columnalign='left'><mml:mrow><mml:mo>&#x02212;</mml:mo><mml:mi>&#x003B4;</mml:mi><mml:msup><mml:mi>e</mml:mi><mml:mrow><mml:mo>&#x02212;</mml:mo><mml:mfrac><mml:mrow><mml:msup><mml:mi>a</mml:mi><mml:mn>2</mml:mn></mml:msup><mml:mo>+</mml:mo><mml:msup><mml:mi>b</mml:mi><mml:mn>2</mml:mn></mml:msup></mml:mrow><mml:mrow><mml:msup><mml:mi>&#x003C3;</mml:mi><mml:mn>2</mml:mn></mml:msup></mml:mrow></mml:mfrac></mml:mrow></mml:msup></mml:mrow></mml:mtd><mml:mtd columnalign='left'><mml:mrow><mml:mtext>if&#x02009;a</mml:mtext><mml:mo>&#x02260;</mml:mo><mml:mtext>0&#x02009;or&#x02009;b</mml:mtext><mml:mo>&#x02260;</mml:mo><mml:mtext>0</mml:mtext><mml:mo>,</mml:mo></mml:mrow></mml:mtd></mml:mtr><mml:mtr columnalign='left'><mml:mtd columnalign='left'><mml:mrow><mml:mn>1</mml:mn><mml:mo>&#x02212;</mml:mo><mml:mstyle displaystyle='true'><mml:munder><mml:mo>&#x02211;</mml:mo><mml:mrow><mml:mi>a</mml:mi><mml:mo>&#x02260;</mml:mo><mml:mn>0</mml:mn><mml:mo>,</mml:mo><mml:mi>b</mml:mi><mml:mo>&#x02260;</mml:mo><mml:mn>0</mml:mn></mml:mrow></mml:munder><mml:mrow><mml:msub><mml:mi>I</mml:mi><mml:mrow><mml:mi>a</mml:mi><mml:mo>,</mml:mo><mml:mi>b</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:mstyle></mml:mrow></mml:mtd><mml:mtd columnalign='left'><mml:mrow><mml:mtext>if&#x02009;a&#x02009;=&#x02009;0&#x02009;and&#x02009;b&#x02009;=&#x02009;0</mml:mtext><mml:mo>.</mml:mo></mml:mrow></mml:mtd></mml:mtr></mml:mtable></mml:mrow></mml:mrow></mml:mrow></mml:math></disp-formula>
<p>This is a filter that leaves the average activity unchanged.</p>
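<p>A minimal sketch of this filtering stage, building the kernel of Equation (A4) and convolving it with a layer&#x00027;s activations (the half-width of the kernel and the use of scipy&#x00027;s convolve2d are illustrative choices):</p>
<preformat>
import numpy as np
from scipy.signal import convolve2d

def lateral_inhibition_filter(sigma, delta, half_width):
    """Build the filter of Equation (A4): a negative Gaussian surround whose
    centre coefficient is chosen so that the filter sums to 1, leaving the
    mean activity of the layer unchanged (a sketch)."""
    coords = np.arange(-half_width, half_width + 1)
    a, b = np.meshgrid(coords, coords, indexing="ij")
    I = -delta * np.exp(-(a**2 + b**2) / sigma**2)
    centre = half_width
    surround_sum = I.sum() - I[centre, centre]
    I[centre, centre] = 1.0 - surround_sum
    return I

def apply_lateral_inhibition(activations, sigma, delta, half_width=10):
    """Convolve a layer's 2D activation map with the inhibition filter."""
    I = lateral_inhibition_filter(sigma, delta, half_width)
    return convolve2d(activations, I, mode="same", boundary="symm")
</preformat>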
<p>The second stage involves contrast enhancement. A sigmoid activation function was used in the way described previously (Rolls and Milward, <xref ref-type="bibr" rid="B95">2000</xref>):</p>
<disp-formula id="E5"><label>(A5)</label><mml:math id="M5"><mml:mrow><mml:mi>y</mml:mi><mml:mo>=</mml:mo><mml:msup><mml:mtext>f</mml:mtext><mml:mrow><mml:mtext>sigmoid</mml:mtext></mml:mrow></mml:msup><mml:mo stretchy='false'>(</mml:mo><mml:mi>r</mml:mi><mml:mo stretchy='false'>)</mml:mo><mml:mo>=</mml:mo><mml:mfrac><mml:mn>1</mml:mn><mml:mrow><mml:mn>1</mml:mn><mml:mo>+</mml:mo><mml:msup><mml:mi>e</mml:mi><mml:mrow><mml:mo>&#x02212;</mml:mo><mml:mn>2</mml:mn><mml:mi>&#x003B2;</mml:mi><mml:mo stretchy='false'>(</mml:mo><mml:mi>r</mml:mi><mml:mo>&#x02212;</mml:mo><mml:mi>&#x003B1;</mml:mi><mml:mo stretchy='false'>)</mml:mo></mml:mrow></mml:msup></mml:mrow></mml:mfrac></mml:mrow></mml:math></disp-formula>
<p>where <italic>r</italic> is the activation (or firing rate) of the neuron after the lateral inhibition, <italic>y</italic> is the firing rate after the contrast enhancement produced by the activation function, &#x003B2; is the slope or gain, and &#x003B1; is the threshold or bias of the activation function. The sigmoid bounds the firing rate between 0 and 1, so global normalization is not required. The slope and threshold are held constant within each layer. The slope is constant throughout training, whereas the threshold is used to control the sparseness of firing rates within each layer. The (population) sparseness of the firing within a layer is defined (Rolls and Treves, <xref ref-type="bibr" rid="B102">1998</xref>; Franco et al., <xref ref-type="bibr" rid="B31">2007</xref>; Rolls, <xref ref-type="bibr" rid="B84">2008</xref>; Rolls and Treves, <xref ref-type="bibr" rid="B103">2011</xref>) as:</p>
<disp-formula id="E6"><label>(A6)</label><mml:math id="M6"><mml:mrow><mml:mi>a</mml:mi><mml:mo>=</mml:mo><mml:mfrac><mml:mrow><mml:mo stretchy='false'>(</mml:mo><mml:mstyle displaystyle='true'><mml:msub><mml:mo>&#x02211;</mml:mo><mml:mi>i</mml:mi></mml:msub><mml:mrow><mml:msub><mml:mi>y</mml:mi><mml:mi>i</mml:mi></mml:msub><mml:mo>/</mml:mo><mml:mi>n</mml:mi><mml:mo stretchy='false'>)</mml:mo><mml:msup><mml:mn>2</mml:mn><mml:mtext>&#x000A0;</mml:mtext></mml:msup></mml:mrow></mml:mstyle></mml:mrow><mml:mrow><mml:mstyle displaystyle='true'><mml:msub><mml:mo>&#x02211;</mml:mo><mml:mi>i</mml:mi></mml:msub><mml:mrow><mml:msubsup><mml:mi>y</mml:mi><mml:mi>i</mml:mi><mml:mn>2</mml:mn></mml:msubsup><mml:mo>/</mml:mo><mml:mi>n</mml:mi></mml:mrow></mml:mstyle></mml:mrow></mml:mfrac></mml:mrow></mml:math></disp-formula>
<p>where <italic>n</italic> is the number of neurons in the layer. To set the sparseness to a given value, e.g., 5%, the threshold is set to the value of the 95th percentile point of the activations within the layer.</p>
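<p>The contrast enhancement of Equation (A5) with a percentile-derived threshold, and the sparseness measure of Equation (A6), can be sketched as follows (the function names are illustrative):</p>
<preformat>
import numpy as np

def sigmoid_contrast_enhancement(r, percentile, beta):
    """Equation (A5), with the threshold alpha set from the given percentile
    of the activations so that the layer's sparseness is controlled (a sketch;
    e.g. percentile=95 leaves roughly 5% of neurons above threshold)."""
    alpha = np.percentile(r, percentile)
    return 1.0 / (1.0 + np.exp(-2.0 * beta * (r - alpha)))

def population_sparseness(y):
    """Equation (A6): a = (sum(y)/n)^2 / (sum(y^2)/n)."""
    n = y.size
    return (y.sum() / n) ** 2 / (np.square(y).sum() / n)
</preformat>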
<p>The sigmoid activation function was used with parameters (selected after a number of optimization runs) as shown in Table <xref ref-type="table" rid="TA2">A2</xref>.</p>
<table-wrap position="float" id="TA2">
<label>Table A2</label>
<caption><p><bold>Sigmoid parameters for the runs with 25 locations by Rolls and Milward (<xref ref-type="bibr" rid="B95">2000</xref>)</bold>.</p></caption>
<table frame="hsides" rules="groups">
<tbody>
<tr>
<td align="left" valign="top">Layer</td>
<td align="left" valign="top">1</td>
<td align="left" valign="top">2</td>
<td align="left" valign="top">3</td>
<td align="left" valign="top">4</td>
</tr>
<tr>
<td align="left" valign="top">Percentile</td>
<td align="left" valign="top">99.2</td>
<td align="left" valign="top">98</td>
<td align="left" valign="top">88</td>
<td align="left" valign="top">95</td>
</tr>
<tr>
<td align="left" valign="top">Slope &#x003B2;</td>
<td align="left" valign="top">190</td>
<td align="left" valign="top">40</td>
<td align="left" valign="top">75</td>
<td align="left" valign="top">26</td>
</tr>
</tbody>
</table>
</table-wrap>
<p>In addition, the lateral inhibition parameters are as shown in Table <xref ref-type="table" rid="TA3">A3</xref>.</p>
<table-wrap position="float" id="TA3">
<label>Table A3</label>
<caption><p><bold>Lateral inhibition parameters for the 25-location runs</bold>.</p></caption>
<table frame="hsides" rules="groups">
<tbody>
<tr>
<td align="left" valign="top">Layer</td>
<td align="left" valign="top">1</td>
<td align="left" valign="top">2</td>
<td align="left" valign="top">3</td>
<td align="left" valign="top">4</td>
</tr>
<tr>
<td align="left" valign="top">Radius, &#x003C3;</td>
<td align="left" valign="top">1.38</td>
<td align="left" valign="top">2.7</td>
<td align="left" valign="top">4.0</td>
<td align="left" valign="top">6.0</td>
</tr>
<tr>
<td align="left" valign="top">Contrast, &#x003B4;</td>
<td align="left" valign="top">1.5</td>
<td align="left" valign="top">1.5</td>
<td align="left" valign="top">1.6</td>
<td align="left" valign="top">1.4</td>
</tr>
</tbody>
</table>
</table-wrap>
</sec>
<sec>
<title>A.4 The input to VisNet</title>
<p>VisNet is provided with a set of input filters which can be applied to an image to produce inputs to the network which correspond to those provided by simple cells in visual cortical area 1 (V1). The purpose of this is to enable the more complicated response properties of cells between V1 and the inferior temporal cortex (IT) to be investigated within VisNet, using as inputs natural stimuli such as those that could be applied to the retina of the real visual system. This facilitates comparisons between the responses of neurons in VisNet and those of neurons in the real visual system to the same stimuli. In VisNet no attempt is made to train the response properties of simple cells; instead we start with a defined series of filters that perform fixed feature extraction to a level equivalent to that of simple cells in V1, as have other researchers in the field (Hummel and Biederman, <xref ref-type="bibr" rid="B46">1992</xref>; Buhmann et al., <xref ref-type="bibr" rid="B14">1991</xref>; Fukushima, <xref ref-type="bibr" rid="B34">1980</xref>), because we wish to simulate the more complicated response properties of cells between V1 and IT. The elongated orientation-tuned input filters used accord with the general tuning profiles of simple cells in V1 (Hawken and Parker, <xref ref-type="bibr" rid="B43">1987</xref>) and were computed by Gabor filters. Each individual filter is tuned to spatial frequency (0.0625 to 0.5 cycles/pixel, at four frequencies spaced one octave apart); orientation (0&#x000B0; to 135&#x000B0; in steps of 45&#x000B0;); and sign (&#x000B1;1). Of the 100 layer 1 connections, the number to each spatial frequency group in VisNetL is as shown in Table <xref ref-type="table" rid="TA4">A4</xref>. Any zero D.C. filter can of course produce a negative as well as a positive output, which would mean that this simulation of a simple cell would permit negative as well as positive firing. The response of each filter is therefore zero thresholded, and the negative results are used to form a separate anti-phase input to the network. The filter outputs are also normalized across scales to compensate for the low-frequency bias in the images of natural objects.</p>
<table-wrap position="float" id="TA4">
<label>Table A4</label>
<caption><p><bold>VisNet Layer 1 Connectivity</bold>.</p></caption>
<table frame="hsides" rules="groups">
<tbody>
<tr>
<td align="left" valign="top">Frequency</td>
<td align="left" valign="top">0.5</td>
<td align="left" valign="top">0.25</td>
<td align="left" valign="top">0.125</td>
<td align="left" valign="top">0.0625</td>
</tr>
<tr>
<td align="left" valign="top">&#x00023; Connections</td>
<td align="left" valign="top">74</td>
<td align="left" valign="top">19</td>
<td align="left" valign="top">5</td>
<td align="left" valign="top">2</td>
</tr>
</tbody>
</table>
<table-wrap-foot>
<p><italic>The frequency is in cycles per pixel</italic>.</p>
</table-wrap-foot>
</table-wrap>
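<p>The treatment of the filter outputs described above can be sketched as follows; the particular per-scale normalization constant used here is one simple choice, since the exact normalization is not specified in this appendix, and the function name is illustrative.</p>
<preformat>
import numpy as np

def split_and_normalize_filter_outputs(responses):
    """Zero-threshold each filter's output, use the negative part as a
    separate anti-phase input, and normalize across spatial frequencies to
    offset the low-frequency bias of natural images (a sketch).

    responses : dict mapping spatial frequency to an array of raw (signed)
                filter outputs at that scale.
    """
    inputs = {}
    for freq, r in responses.items():
        pos = np.maximum(r, 0.0)            # on-phase channel
        neg = np.maximum(-r, 0.0)           # anti-phase channel
        scale = np.abs(r).mean() + 1e-12    # per-scale normalization (one simple choice)
        inputs[freq] = (pos / scale, neg / scale)
    return inputs
</preformat>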
<p>The Gabor filters used were similar to those used previously (Deco and Rolls, <xref ref-type="bibr" rid="B17">2004</xref>). Following Daugman (<xref ref-type="bibr" rid="B16">1988</xref>), the receptive fields of the simple cell-like input neurons are modeled by 2D-Gabor functions. The Gabor receptive fields have five degrees of freedom, given essentially by the product of an elliptical Gaussian and a complex plane wave. The first two degrees of freedom are the 2D-locations of the receptive field&#x00027;s center; the third is the size of the receptive field; the fourth is the orientation of the boundaries separating excitatory and inhibitory regions; and the fifth is the symmetry. This fifth degree of freedom is given in the standard Gabor transform by the real and imaginary part, i.e., by the phase of the complex function representing it, whereas in a biological context this can be achieved by combining pairs of neurons with even and odd receptive fields. This design is supported by the experimental work of Pollen and Ronner (<xref ref-type="bibr" rid="B75">1981</xref>), who found simple cells in quadrature-phase pairs. Moreover, Daugman (<xref ref-type="bibr" rid="B16">1988</xref>) proposed that an ensemble of simple cells is best modeled as a family of 2D-Gabor wavelets sampling the frequency domain in a log-polar manner as a function of eccentricity. Experimental neurophysiological evidence constrains the relation between the free parameters that define a 2D-Gabor receptive field (De Valois and De Valois, <xref ref-type="bibr" rid="B22">1988</xref>). There are three constraints fixing the relation between the width, height, orientation, and spatial frequency (Lee, <xref ref-type="bibr" rid="B56">1996</xref>). The first constraint posits that the aspect ratio of the elliptical Gaussian envelope is 2:1. The second constraint postulates that the plane wave tends to propagate along the short axis of the elliptical Gaussian. The third constraint assumes that the half-amplitude bandwidth of the frequency response is about 1 to 1.5 octaves along the optimal orientation. Further, we assume that the mean is zero in order to have an admissible wavelet basis (Lee, <xref ref-type="bibr" rid="B56">1996</xref>).</p>
<p>In more detail, the Gabor filters are constructed as follows (Deco and Rolls, <xref ref-type="bibr" rid="B17">2004</xref>). We consider a pixelized grey-scale image given by an <italic>N</italic> &#x000D7; <italic>N</italic> matrix &#x00393;<sup>orig</sup><italic><sub>ij</sub></italic>. The subscripts <italic>ij</italic> denote the spatial position of the pixel. Each pixel is given a grey-level brightness value coded on a scale between 0 (black) and 255 (white). The first step in the preprocessing consists of removing the DC component of the image (i.e., the mean value of the grey-scale intensity of the pixels). (The equivalent in the brain is the low-pass filtering performed by the retinal ganglion cells and lateral geniculate cells. The visual representation in the LGN is essentially a contrast-invariant pixel representation of the image, i.e., each neuron encodes the relative brightness value at one location in visual space referred to the mean value of the image brightness.) We denote this contrast-invariant LGN representation by the <italic>N</italic> &#x000D7; <italic>N</italic> matrix &#x00393;<sub><italic>ij</italic></sub> defined by the equation</p>
<disp-formula id="E7"><label>(A7)</label><mml:math id="M7"><mml:mrow><mml:msub><mml:mi>&#x00393;</mml:mi><mml:mrow><mml:mi>i</mml:mi><mml:mi>j</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:msubsup><mml:mi>&#x00393;</mml:mi><mml:mrow><mml:mi>i</mml:mi><mml:mi>j</mml:mi></mml:mrow><mml:mrow><mml:mtext>orig</mml:mtext></mml:mrow></mml:msubsup><mml:mo>&#x02212;</mml:mo><mml:mfrac><mml:mn>1</mml:mn><mml:mrow><mml:msup><mml:mi>N</mml:mi><mml:mn>2</mml:mn></mml:msup></mml:mrow></mml:mfrac><mml:mstyle displaystyle='true'><mml:munderover><mml:mo>&#x02211;</mml:mo><mml:mrow><mml:mi>i</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mi>N</mml:mi></mml:munderover><mml:mrow><mml:mstyle displaystyle='true'><mml:munderover><mml:mo>&#x02211;</mml:mo><mml:mrow><mml:mi>j</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mi>N</mml:mi></mml:munderover><mml:mrow><mml:msubsup><mml:mi>&#x00393;</mml:mi><mml:mrow><mml:mi>i</mml:mi><mml:mi>j</mml:mi></mml:mrow><mml:mrow><mml:mtext>orig</mml:mtext></mml:mrow></mml:msubsup></mml:mrow></mml:mstyle></mml:mrow></mml:mstyle><mml:mo>.</mml:mo></mml:mrow></mml:math></disp-formula>
<p>Feedforward connections to a layer of V1 neurons perform the extraction of simple features like bars at different locations, orientations and sizes. Realistic receptive fields for V1 neurons that extract these simple features can be represented by 2D-Gabor wavelets. Lee (<xref ref-type="bibr" rid="B56">1996</xref>) derived a family of discretized 2D-Gabor wavelets that satisfy the wavelet theory and the neurophysiological constraints for simple cells mentioned above. They are given by an expression of the form</p>
<disp-formula id="E8"><label>(A8)</label><mml:math id="M8"><mml:mrow><mml:msub><mml:mi>G</mml:mi><mml:mrow><mml:mi>p</mml:mi><mml:mi>q</mml:mi><mml:mi>k</mml:mi><mml:mi>l</mml:mi></mml:mrow></mml:msub><mml:mo stretchy='false'>(</mml:mo><mml:mi>x</mml:mi><mml:mo>,</mml:mo><mml:mi>y</mml:mi><mml:mo stretchy='false'>)</mml:mo><mml:mo>=</mml:mo><mml:msup><mml:mi>a</mml:mi><mml:mrow><mml:mo>&#x02212;</mml:mo><mml:mi>k</mml:mi></mml:mrow></mml:msup><mml:msub><mml:mi>&#x003A8;</mml:mi><mml:mrow><mml:msub><mml:mi>&#x00398;</mml:mi><mml:mi>l</mml:mi></mml:msub></mml:mrow></mml:msub><mml:mo stretchy='false'>(</mml:mo><mml:msup><mml:mi>a</mml:mi><mml:mrow><mml:mo>&#x02212;</mml:mo><mml:mi>k</mml:mi></mml:mrow></mml:msup><mml:mo stretchy='false'>(</mml:mo><mml:mi>x</mml:mi><mml:mo>&#x02212;</mml:mo><mml:mn>2</mml:mn><mml:mi>p</mml:mi><mml:mo stretchy='false'>)</mml:mo><mml:mo>,</mml:mo><mml:msup><mml:mi>a</mml:mi><mml:mrow><mml:mo>&#x02212;</mml:mo><mml:mi>k</mml:mi></mml:mrow></mml:msup><mml:mo stretchy='false'>(</mml:mo><mml:mi>y</mml:mi><mml:mo>&#x02212;</mml:mo><mml:mn>2</mml:mn><mml:mi>q</mml:mi><mml:mo stretchy='false'>)</mml:mo><mml:mo stretchy='false'>)</mml:mo></mml:mrow></mml:math></disp-formula>
<p>where
<disp-formula id="E9"><label>(A9)</label><mml:math id="M9"><mml:mrow><mml:msub><mml:mi>&#x003A8;</mml:mi><mml:mrow><mml:msub><mml:mi>&#x00398;</mml:mi><mml:mi>l</mml:mi></mml:msub></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mi>&#x003A8;</mml:mi><mml:mo stretchy='false'>(</mml:mo><mml:mi>x</mml:mi><mml:mi>cos</mml:mi><mml:mo stretchy='false'>(</mml:mo><mml:mi>l</mml:mi><mml:msub><mml:mi>&#x00398;</mml:mi><mml:mn>0</mml:mn></mml:msub><mml:mo stretchy='false'>)</mml:mo><mml:mo>+</mml:mo><mml:mi>y</mml:mi><mml:mi>sin</mml:mi><mml:mo stretchy='false'>(</mml:mo><mml:mi>l</mml:mi><mml:msub><mml:mi>&#x00398;</mml:mi><mml:mn>0</mml:mn></mml:msub><mml:mo stretchy='false'>)</mml:mo><mml:mo>,</mml:mo><mml:mo>&#x02212;</mml:mo><mml:mi>x</mml:mi><mml:mi>sin</mml:mi><mml:mo stretchy='false'>(</mml:mo><mml:mi>l</mml:mi><mml:msub><mml:mi>&#x00398;</mml:mi><mml:mn>0</mml:mn></mml:msub><mml:mo stretchy='false'>)</mml:mo><mml:mo>+</mml:mo><mml:mi>y</mml:mi><mml:mi>cos</mml:mi><mml:mo stretchy='false'>(</mml:mo><mml:mi>l</mml:mi><mml:msub><mml:mi>&#x00398;</mml:mi><mml:mn>0</mml:mn></mml:msub><mml:mo stretchy='false'>)</mml:mo><mml:mo stretchy='false'>)</mml:mo><mml:mo>,</mml:mo></mml:mrow></mml:math></disp-formula>
and the mother wavelet is given by</p>
<disp-formula id="E10"><label>(A10)</label><mml:math id="M10"><mml:mrow><mml:mi>&#x003A8;</mml:mi><mml:mo stretchy='false'>(</mml:mo><mml:mi>x</mml:mi><mml:mo>,</mml:mo><mml:mi>y</mml:mi><mml:mo stretchy='false'>)</mml:mo><mml:mo>=</mml:mo><mml:mfrac><mml:mn>1</mml:mn><mml:mrow><mml:msqrt><mml:mrow><mml:mn>2</mml:mn><mml:mi>&#x003C0;</mml:mi></mml:mrow></mml:msqrt></mml:mrow></mml:mfrac><mml:msup><mml:mi>e</mml:mi><mml:mrow><mml:mo>&#x02212;</mml:mo><mml:mfrac><mml:mn>1</mml:mn><mml:mn>8</mml:mn></mml:mfrac><mml:mo stretchy='false'>(</mml:mo><mml:mn>4</mml:mn><mml:msup><mml:mi>x</mml:mi><mml:mn>2</mml:mn></mml:msup><mml:mo>+</mml:mo><mml:msup><mml:mi>y</mml:mi><mml:mn>2</mml:mn></mml:msup><mml:mo stretchy='false'>)</mml:mo></mml:mrow></mml:msup><mml:mo stretchy='false'>[</mml:mo><mml:msup><mml:mi>e</mml:mi><mml:mrow><mml:mi>i</mml:mi><mml:mi>&#x003BA;</mml:mi><mml:mi>x</mml:mi></mml:mrow></mml:msup><mml:mo>&#x02212;</mml:mo><mml:msup><mml:mi>e</mml:mi><mml:mrow><mml:mo>&#x02212;</mml:mo><mml:mfrac><mml:mrow><mml:msup><mml:mi>&#x003BA;</mml:mi><mml:mn>2</mml:mn></mml:msup></mml:mrow><mml:mn>2</mml:mn></mml:mfrac></mml:mrow></mml:msup><mml:mo stretchy='false'>]</mml:mo><mml:mo>.</mml:mo></mml:mrow></mml:math></disp-formula>
<p>In the above equations &#x00398;<sub>0</sub> &#x0003D; &#x003C0;/<italic>L</italic> denotes the step size of each angular rotation; <italic>l</italic> the index of rotation corresponding to the preferred orientation &#x00398;<sub><italic>l</italic></sub> &#x0003D; <italic>l</italic>&#x003C0;/<italic>L</italic>; <italic>k</italic> denotes the octave; and the indices <italic>pq</italic> the position of the receptive field center at <italic>c<sub>x</sub></italic> &#x0003D; <italic>p</italic> and <italic>c<sub>y</sub></italic> &#x0003D; <italic>q</italic>. In this form, the receptive fields at all levels cover the spatial domain in the same way, i.e., by always overlapping the receptive fields in the same fashion. In the model we use <italic>a</italic>&#x0003D;2, <italic>b</italic>&#x0003D;1 and &#x003BA;&#x0003D;&#x003C0; corresponding to a spatial frequency bandwidth of one octave. We used symmetric filters with the angular spacing between the different orientations set to 45 degrees; and with 4 filter frequencies spaced one octave apart starting with 0.5 cycles per pixel, and with the sampling from the spatial frequencies set as shown in Table <xref ref-type="table" rid="TA4">A4</xref>.</p>
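<p>A direct transcription of Equations (A8)&#x02013;(A10) into a short sketch is given below; the function names are illustrative, and the sampling of the filters on a pixel grid is left out.</p>
<preformat>
import numpy as np

def gabor_mother_wavelet(x, y, kappa=np.pi):
    """Equation (A10): complex 2D-Gabor mother wavelet with a 2:1 aspect ratio;
    the second term of the carrier removes the DC response so the mean is zero."""
    envelope = np.exp(-(4.0 * x**2 + y**2) / 8.0) / np.sqrt(2.0 * np.pi)
    carrier = np.exp(1j * kappa * x) - np.exp(-kappa**2 / 2.0)
    return envelope * carrier

def gabor_filter(x, y, k, l, p, q, a=2.0, L=4, kappa=np.pi):
    """Equations (A8)-(A9): scaled (octave k), rotated (orientation index l)
    and shifted (centre 2p, 2q) copy of the mother wavelet (a sketch)."""
    theta = l * np.pi / L                       # preferred orientation Theta_l
    xs = a**(-k) * (x - 2.0 * p)                # scale and shift
    ys = a**(-k) * (y - 2.0 * q)
    xr = xs * np.cos(theta) + ys * np.sin(theta)    # rotate into the filter frame
    yr = -xs * np.sin(theta) + ys * np.cos(theta)
    return a**(-k) * gabor_mother_wavelet(xr, yr, kappa)
</preformat>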
<p>Cells of layer 1 receive a topologically consistent, localized, random selection of the filter responses in the input layer, under the constraint that each cell samples every filter spatial frequency and receives a constant number of inputs.</p>
</sec>
<sec>
<title>A.5 Measures for network performance</title>
<sec>
<title>A.5.1 Information theory measures</title>
<p>A neuron can be said to have learnt an invariant representation if it discriminates one set of stimuli from another set across all transforms. For example, a neuron&#x00027;s response is translation invariant if its response to one set of stimuli, irrespective of presentation location, is consistently higher than its response to all other stimuli, irrespective of presentation location. Note that we state &#x02018;set of stimuli&#x02019; since neurons in the inferior temporal cortex are not generally selective for a single stimulus but rather for a subpopulation of stimuli (Baylis et al., <xref ref-type="bibr" rid="B9">1985</xref>; Abbott et al., <xref ref-type="bibr" rid="B1">1996</xref>; Rolls et al., <xref ref-type="bibr" rid="B104">1997a</xref>; Rolls and Treves, <xref ref-type="bibr" rid="B102">1998</xref>; Rolls and Deco, <xref ref-type="bibr" rid="B93">2002</xref>; Franco et al., <xref ref-type="bibr" rid="B31">2007</xref>; Rolls, <xref ref-type="bibr" rid="B83">2007</xref>, <xref ref-type="bibr" rid="B84">2008</xref>; Rolls and Treves, <xref ref-type="bibr" rid="B103">2011</xref>). We used measures of network performance (Rolls and Milward, <xref ref-type="bibr" rid="B95">2000</xref>) based on information theory and similar to those used in the analysis of the firing of real neurons in the brain (Rolls, <xref ref-type="bibr" rid="B84">2008</xref>; Rolls and Treves, <xref ref-type="bibr" rid="B103">2011</xref>). A single cell information measure was introduced which is the maximum amount of information the cell has about any one object independently of which transform (here position on the retina and view) is shown. Because the competitive algorithm used in VisNet tends to produce local representations (in which single cells become tuned to one stimulus or object), this information measure can approach log<sub>2</sub> <italic>N<sub>S</sub></italic> bits, where <italic>N<sub>S</sub></italic> is the number of different stimuli. Indeed, it is an advantage of this measure that it has a defined maximal value, which enables how well the network is performing to be quantified. Rolls and Milward (<xref ref-type="bibr" rid="B95">2000</xref>) also introduced a multiple cell information measure used here, which has the advantage that it provides a measure of whether all stimuli are encoded by different neurons in the network. Again, a high value of this measure indicates good performance.</p>
<p>For completeness, we provide further specification of the two information theoretic measures, which are described in detail by Rolls and Milward (<xref ref-type="bibr" rid="B95">2000</xref>) (see Rolls, <xref ref-type="bibr" rid="B84">2008</xref> and Rolls and Treves, <xref ref-type="bibr" rid="B103">2011</xref> for an introduction to the concepts). The measures assess the extent to which either a single cell, or a population of cells, responds to the same stimulus invariantly with respect to its location, yet responds differently to different stimuli. The measures effectively show what one learns about which stimulus was presented from a single presentation of the stimulus at any randomly chosen location. Results for top (4th) layer cells are shown. High information measures thus show that cells fire similarly to the different transforms of a given stimulus (object), and differently to the other stimuli. The single cell stimulus-specific information, <italic>I(s,R)</italic>, is the amount of information the set of responses, <italic>R</italic>, has about a specific stimulus, <italic>s</italic> (see Rolls et al., <xref ref-type="bibr" rid="B105">1997b</xref> and Rolls and Milward, <xref ref-type="bibr" rid="B95">2000</xref>). <italic>I(s,R)</italic> is given by</p>
<disp-formula id="E11"><label>(A11)</label><mml:math id="M11"><mml:mrow><mml:mi>I</mml:mi><mml:mo stretchy='false'>(</mml:mo><mml:mi>s</mml:mi><mml:mo>,</mml:mo><mml:mi>R</mml:mi><mml:mo stretchy='false'>)</mml:mo><mml:mo>=</mml:mo><mml:mstyle displaystyle='true'><mml:munder><mml:mo>&#x02211;</mml:mo><mml:mrow><mml:mi>r</mml:mi><mml:mo>&#x02208;</mml:mo><mml:mi>R</mml:mi></mml:mrow></mml:munder><mml:mi>P</mml:mi></mml:mstyle><mml:mo stretchy='false'>(</mml:mo><mml:mi>r</mml:mi><mml:mo>&#x0007C;</mml:mo><mml:mi>s</mml:mi><mml:mo stretchy='false'>)</mml:mo><mml:msub><mml:mrow><mml:mi>log</mml:mi></mml:mrow><mml:mn>2</mml:mn></mml:msub><mml:mfrac><mml:mrow><mml:mi>P</mml:mi><mml:mo stretchy='false'>(</mml:mo><mml:mi>r</mml:mi><mml:mo>&#x0007C;</mml:mo><mml:mi>s</mml:mi><mml:mo stretchy='false'>)</mml:mo></mml:mrow><mml:mrow><mml:mi>P</mml:mi><mml:mo stretchy='false'>(</mml:mo><mml:mi>r</mml:mi><mml:mo stretchy='false'>)</mml:mo></mml:mrow></mml:mfrac></mml:mrow></mml:math></disp-formula>
<p>where <italic>r</italic> is an individual response from the set of responses <italic>R</italic> of the neuron. For each cell the performance measure used was the maximum amount of information a cell conveyed about any one stimulus. This (rather than the mutual information, <italic>I(S,R)</italic> where <italic>S</italic> is the whole set of stimuli <italic>s</italic>), is appropriate for a competitive network in which the cells tend to become tuned to one stimulus. (<italic>I(s,R)</italic> has more recently been called the stimulus-specific surprise (DeWeese and Meister, <xref ref-type="bibr" rid="B23">1999</xref>; Rolls and Treves, <xref ref-type="bibr" rid="B103">2011</xref>). Its average across stimuli is the mutual information <italic>I(S,R)</italic>.)</p>
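<p>A minimal sketch of Equation (A11), computed from discretized response probabilities (the function name is illustrative; it assumes that any response with non-zero probability given the stimulus also has non-zero unconditional probability):</p>
<preformat>
import numpy as np

def stimulus_specific_information(p_r_given_s, p_r):
    """Equation (A11): I(s,R) = sum over r of P(r|s) * log2( P(r|s) / P(r) ).

    p_r_given_s : probabilities of each (discretized) response r given the
                  stimulus s of interest
    p_r         : unconditional probabilities of each response r
    """
    p_r_given_s = np.asarray(p_r_given_s, dtype=float)
    p_r = np.asarray(p_r, dtype=float)
    mask = p_r_given_s != 0.0        # 0 * log 0 is treated as 0
    return np.sum(p_r_given_s[mask] * np.log2(p_r_given_s[mask] / p_r[mask]))
</preformat>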
<p>If all the output cells of VisNet learned to respond to the same stimulus, then the information about the set of stimuli <italic>S</italic> would be very poor, and would not reach its maximal value of log<sub>2</sub> of the number of stimuli (in bits). The second measure that is used here is the information provided by a set of cells about the stimulus set, using the procedures described by Rolls et al. (<xref ref-type="bibr" rid="B104">1997a</xref>) and Rolls and Milward (<xref ref-type="bibr" rid="B95">2000</xref>). The multiple cell information is the mutual information between the whole set of stimuli <italic>S</italic> and of responses <italic>R</italic> calculated using a decoding procedure in which the stimulus <italic>s</italic>&#x00027; that gave rise to the particular firing rate response vector on each trial is estimated. [The decoding step is needed because the high dimensionality of the response space would lead to an inaccurate estimate of the information if the responses were used directly, as described by Rolls et al. (<xref ref-type="bibr" rid="B104">1997a</xref>) and Rolls and Treves (<xref ref-type="bibr" rid="B102">1998</xref>).] A probability table is then constructed of the real stimuli <italic>s</italic> and the decoded stimuli <italic>s</italic>&#x00027;. From this probability table, the mutual information between the set of actual stimuli <italic>S</italic> and the decoded estimates <italic>S</italic>&#x00027; is calculated as</p>
<disp-formula id="E12"><label>(A12)</label><mml:math id="M12"><mml:mrow><mml:mi>I</mml:mi><mml:mo stretchy='false'>(</mml:mo><mml:mi>S</mml:mi><mml:mo>,</mml:mo><mml:msup><mml:mi>S</mml:mi><mml:mo>&#x02032;</mml:mo></mml:msup><mml:mo stretchy='false'>)</mml:mo><mml:mo>=</mml:mo><mml:mstyle displaystyle='true'><mml:munder><mml:mo>&#x02211;</mml:mo><mml:mrow><mml:mi>s</mml:mi><mml:mo>,</mml:mo><mml:msup><mml:mi>s</mml:mi><mml:mo>&#x02032;</mml:mo></mml:msup></mml:mrow></mml:munder><mml:mi>P</mml:mi></mml:mstyle><mml:mo stretchy='false'>(</mml:mo><mml:mi>s</mml:mi><mml:mo>,</mml:mo><mml:msup><mml:mi>s</mml:mi><mml:mo>&#x02032;</mml:mo></mml:msup><mml:mo stretchy='false'>)</mml:mo><mml:msub><mml:mrow><mml:mi>log</mml:mi></mml:mrow><mml:mn>2</mml:mn></mml:msub><mml:mfrac><mml:mrow><mml:mi>P</mml:mi><mml:mo stretchy='false'>(</mml:mo><mml:mi>s</mml:mi><mml:mo>,</mml:mo><mml:msup><mml:mi>s</mml:mi><mml:mo>&#x02032;</mml:mo></mml:msup><mml:mo stretchy='false'>)</mml:mo></mml:mrow><mml:mrow><mml:mi>P</mml:mi><mml:mo stretchy='false'>(</mml:mo><mml:mi>s</mml:mi><mml:mo stretchy='false'>)</mml:mo><mml:mi>P</mml:mi><mml:mo stretchy='false'>(</mml:mo><mml:msup><mml:mi>s</mml:mi><mml:mo>&#x02032;</mml:mo></mml:msup><mml:mo stretchy='false'>)</mml:mo></mml:mrow></mml:mfrac></mml:mrow></mml:math></disp-formula>
<p>This was calculated for the subset of cells which, as single cells, had the most information about which stimulus was shown. In particular, in Rolls and Milward (<xref ref-type="bibr" rid="B95">2000</xref>) and subsequent papers, the multiple cell information was calculated from the first five cells for each stimulus that had maximal single cell information about that stimulus, that is, from a population of 35 cells if there were seven stimuli (each of which might have been shown in, for example, 9 or 25 positions on the retina).</p>
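<p>Given the probability table of actual versus decoded stimuli, the multiple cell measure of Equation (A12) can be sketched as follows (the function name is illustrative):</p>
<preformat>
import numpy as np

def mutual_information_from_table(joint):
    """Equation (A12): mutual information I(S,S') computed from the joint
    probability table P(s, s') of actual versus decoded stimuli (a sketch)."""
    joint = np.asarray(joint, dtype=float)
    joint = joint / joint.sum()          # ensure the table is a probability table
    p_s = joint.sum(axis=1)              # marginal over actual stimuli
    p_sd = joint.sum(axis=0)             # marginal over decoded stimuli
    total = 0.0
    for i in range(joint.shape[0]):
        for j in range(joint.shape[1]):
            if joint[i, j] != 0.0:       # 0 * log 0 treated as 0
                total += joint[i, j] * np.log2(joint[i, j] / (p_s[i] * p_sd[j]))
    return total
</preformat>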
</sec>
<sec>
<title>A.5.2 Pattern association decoding</title>
<p>The output of the inferior temporal visual cortex reaches structures such as the orbitofrontal cortex and amygdala, where associations to other stimuli are learned by a pattern association network with an associative (Hebbian) learning rule (Rolls, <xref ref-type="bibr" rid="B84">2008</xref>, <xref ref-type="bibr" rid="B86">2014</xref>). We therefore used a one-layer pattern association network (Rolls, <xref ref-type="bibr" rid="B84">2008</xref>) to measure how well the output of VisNet could be classified into one of the objects. The pattern association network had four output neurons, one for each object. The inputs were, for each of the four objects, the ten layer 4 VisNet neurons with the best single cell information for that object, making 40 inputs to each output neuron. The network was trained with the Hebb rule:</p>
<disp-formula id="E13"><label>(A13)</label><mml:math id="M13"><mml:mrow><mml:mi>&#x003B4;</mml:mi><mml:msub><mml:mi>w</mml:mi><mml:mrow><mml:mi>i</mml:mi><mml:mi>j</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mi>&#x003B1;</mml:mi><mml:msub><mml:mi>y</mml:mi><mml:mi>i</mml:mi></mml:msub><mml:msub><mml:mi>x</mml:mi><mml:mi>j</mml:mi></mml:msub></mml:mrow></mml:math></disp-formula>
<p>where &#x003B4;<italic>w<sub>ij</sub></italic> is the change of the synaptic weight <italic>w<sub>ij</sub></italic> that results from the simultaneous (or conjunctive) presence of presynaptic firing <italic>x<sub>j</sub></italic> and postsynaptic firing or activation <italic>y<sub>i</sub></italic>, and &#x003B1; is a learning rate constant that specifies how much the synapses alter on any one pairing. The pattern associator was trained for one trial on the output of VisNet produced by every transform of each object.</p>
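<p>A minimal sketch of this pattern associator, trained with Equation (A13) for one pass over the training patterns and then used to classify a test pattern by the most activated output neuron (the function names, the one-hot teaching signal, and the learning rate value are illustrative):</p>
<preformat>
import numpy as np

def train_pattern_associator(X, targets, alpha=0.1):
    """Train the one-layer pattern associator with the Hebb rule of Equation
    (A13) (a sketch).  X is (n_patterns, n_inputs) of VisNet layer-4 firing
    rates; targets is (n_patterns, n_objects) one-hot teaching output, one
    output neuron per object."""
    n_inputs, n_outputs = X.shape[1], targets.shape[1]
    W = np.zeros((n_outputs, n_inputs))
    for x, y in zip(X, targets):
        W += alpha * np.outer(y, x)      # delta w_ij = alpha * y_i * x_j
    return W

def classify(W, x):
    """Return the object whose output neuron has the largest activation."""
    return int(np.argmax(W @ x))
</preformat>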
<p>Performance on the test images extracted from the scenes was assessed by presenting an image to VisNet, and then measuring the classification produced by the pattern associator. Performance was measured as the percentage of images classified as the correct object.</p>
<p>This approach to measuring the performance is very biologically appropriate, for it models the type of learning thought to be implemented in structures that receive information from the inferior temporal visual cortex, such as the orbitofrontal cortex and amygdala (Rolls, <xref ref-type="bibr" rid="B84">2008</xref>, <xref ref-type="bibr" rid="B86">2014</xref>). The small number of neurons selected from layer 4 of VisNet might correspond to the most selective for this stimulus set in a sparse distributed representation (Rolls, <xref ref-type="bibr" rid="B84">2008</xref>; Rolls and Treves, <xref ref-type="bibr" rid="B103">2011</xref>). The method would measure whether neurons of the type recorded in the inferior temporal visual cortex with good view and position invariance are developed in VisNet. In fact, an appropriate neuron for an input to such a decoding mechanism might have high firing rates to all or most of the view and position transforms of one of the stimuli, and smaller or no responses to any of the transforms of other objects, as found in the inferior temporal cortex for some neurons (Hasselmo et al., <xref ref-type="bibr" rid="B42">1989</xref>; Perrett et al., <xref ref-type="bibr" rid="B71">1991</xref>; Booth and Rolls, <xref ref-type="bibr" rid="B12">1998</xref>), and as illustrated for a VisNet layer 4 neuron in this investigation in Figure <xref ref-type="fig" rid="F5">5B</xref>. Moreover, it would be inappropriate to train a device such as a support vector machine or even an error correction perceptron on the outputs of all the neurons in layer 4 of VisNet to produce 4 classifications, for such learning procedures, which are not biologically plausible (Rolls, <xref ref-type="bibr" rid="B84">2008</xref>), could map the responses produced by a multilayer network with untrained random weights to obtain good classifications.</p>
</sec>
</sec>
</app>
</app-group>
</back>
</article>
