<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<!DOCTYPE article PUBLIC "-//NLM//DTD Journal Publishing DTD v2.3 20070202//EN" "journalpublishing.dtd">
<article xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink" article-type="research-article">
<front>
<journal-meta>
<journal-id journal-id-type="publisher-id">Front. Comput. Neurosci.</journal-id>
<journal-title>Frontiers in Computational Neuroscience</journal-title>
<abbrev-journal-title abbrev-type="pubmed">Front. Comput. Neurosci.</abbrev-journal-title>
<issn pub-type="epub">1662-5188</issn>
<publisher>
<publisher-name>Frontiers Media S.A.</publisher-name>
</publisher>
</journal-meta>
<article-meta>
<article-id pub-id-type="doi">10.3389/fncom.2018.00057</article-id>
<article-categories>
<subj-group subj-group-type="heading">
<subject>Neuroscience</subject>
<subj-group>
<subject>Original Research</subject>
</subj-group>
</subj-group>
</article-categories>
<title-group>
<article-title>Perceptual Dominance in Brief Presentations of Mixed Images: Human Perception vs. Deep Neural Networks</article-title>
</title-group>
<contrib-group>
<contrib contrib-type="author" corresp="yes">
<name><surname>Gruber</surname> <given-names>Liron Z.</given-names></name>
<xref ref-type="aff" rid="aff1"><sup>1</sup></xref>
<xref ref-type="corresp" rid="c001"><sup>&#x0002A;</sup></xref>
<xref ref-type="author-notes" rid="fn001"><sup>&#x02020;</sup></xref>
<uri xlink:href="http://loop.frontiersin.org/people/543661/overview"/>
</contrib>
<contrib contrib-type="author">
<name><surname>Haruvi</surname> <given-names>Aia</given-names></name>
<xref ref-type="aff" rid="aff1"><sup>1</sup></xref>
<xref ref-type="author-notes" rid="fn001"><sup>&#x02020;</sup></xref>
<uri xlink:href="http://loop.frontiersin.org/people/567212/overview"/>
</contrib>
<contrib contrib-type="author">
<name><surname>Basri</surname> <given-names>Ronen</given-names></name>
<xref ref-type="aff" rid="aff2"><sup>2</sup></xref>
</contrib>
<contrib contrib-type="author">
<name><surname>Irani</surname> <given-names>Michal</given-names></name>
<xref ref-type="aff" rid="aff2"><sup>2</sup></xref>
<uri xlink:href="http://loop.frontiersin.org/people/583815/overview"/>
</contrib>
</contrib-group>
<aff id="aff1"><sup>1</sup><institution>Department of Neurobiology, Weizmann Institute of Science</institution>, <addr-line>Rehovot</addr-line>, <country>Israel</country></aff>
<aff id="aff2"><sup>2</sup><institution>Department of Computer Science and Applied Mathematics, Weizmann Institute of Science</institution>, <addr-line>Rehovot</addr-line>, <country>Israel</country></aff>
<author-notes>
<fn fn-type="edited-by"><p>Edited by: Jonathan D. Victor, Weill Cornell Medicine, Cornell University, United States</p></fn>
<fn fn-type="edited-by"><p>Reviewed by: Ning Qian, Columbia University, United States; Ruben Moreno-Bote, Universidad Pompeu Fabra, Spain</p></fn>
<corresp id="c001">&#x0002A;Correspondence: Liron Z. Gruber <email>liron.gruber&#x00040;weizmann.ac.il</email></corresp>
<fn fn-type="other" id="fn001"><p>&#x02020;These authors have contributed equally to this work.</p></fn></author-notes>
<pub-date pub-type="epub">
<day>24</day>
<month>07</month>
<year>2018</year>
</pub-date>
<pub-date pub-type="collection">
<year>2018</year>
</pub-date>
<volume>12</volume>
<elocation-id>57</elocation-id>
<history>
<date date-type="received">
<day>24</day>
<month>04</month>
<year>2018</year>
</date>
<date date-type="accepted">
<day>03</day>
<month>07</month>
<year>2018</year>
</date>
</history>
<permissions>
<copyright-statement>Copyright &#x000A9; 2018 Gruber, Haruvi, Basri and Irani.</copyright-statement>
<copyright-year>2018</copyright-year>
<copyright-holder>Gruber, Haruvi, Basri and Irani</copyright-holder>
<license xlink:href="http://creativecommons.org/licenses/by/4.0/"><p>This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.</p></license>
</permissions>
<abstract><p>Visual perception involves continuously choosing the most prominent inputs while suppressing others. Neuroscientists induce visual competitions in various ways to study why and how the brain makes choices of what to perceive. Recently, deep neural networks (DNNs) have been used as models of the ventral stream of the visual system, due to similarities in both accuracy and hierarchy of feature representation. In this study we created non-dynamic visual competitions for humans by briefly presenting mixtures of two images. We then tested feed-forward DNNs with similar mixtures and examined their behavior. We found that both humans and DNNs tend to perceive only one image when presented with a mixture of two. We revealed image parameters that predict this perceptual dominance and compared their predictability for the two visual systems. Our findings can be used both to improve DNNs as models and, potentially, to improve their performance by imitating biological behaviors.</p></abstract>
<kwd-group>
<kwd>deep neural networks</kwd>
<kwd>object recognition</kwd>
<kwd>visual perception</kwd>
<kwd>vision</kwd>
<kwd>visual competition</kwd>
</kwd-group>
<contract-sponsor id="cn001">Weizmann Institute of Science<named-content content-type="fundref-id">10.13039/501100001735</named-content></contract-sponsor>
<counts>
<fig-count count="7"/>
<table-count count="1"/>
<equation-count count="4"/>
<ref-count count="52"/>
<page-count count="10"/>
<word-count count="6838"/>
</counts>
</article-meta>
</front>
<body>
<sec sec-type="intro" id="s1">
<title>1. Introduction</title>
<p>Deep Neural Networks (DNNs), a class of machine learning algorithms that have become widely used in recent years (Lecun et al., <xref ref-type="bibr" rid="B24">2015</xref>), are currently the leading algorithms for many computer vision tasks, as well as for modeling the visual system specifically. Interestingly, some current DNNs demonstrate a surprising degree of generalization to a variety of other visual tasks (Hue et al., <xref ref-type="bibr" rid="B16">2016</xref>): DNNs trained for image recognition (Russakovsky et al., <xref ref-type="bibr" rid="B39">2015</xref>) have been found useful for solving entirely different visual tasks (Yosinski et al., <xref ref-type="bibr" rid="B51">2014</xref>). These general-purpose algorithms have been suggested to be computationally similar to biological visual systems, even more so than less biologically plausible simulations (Kriegeskorte, <xref ref-type="bibr" rid="B19">2015</xref>; Yamins and Dicarlo, <xref ref-type="bibr" rid="B49">2016</xref>).</p>
<p>Moreover, image representation may be similar in trained DNNs and in biological visual systems. A recent study found that humans and DNNs largely agree on the relative difficulties of variations of images (Kheradpisheh et al., <xref ref-type="bibr" rid="B18">2016</xref>). Researchers also found that when the same image is processed by DNNs and by humans or monkeys, the DNN computation stages are strong predictors of human fMRI, MEG, and monkey electrophysiology data collected from visual areas (Cadieu et al., <xref ref-type="bibr" rid="B6">2014</xref>; Khaligh et al., <xref ref-type="bibr" rid="B17">2014</xref>; Yamins et al., <xref ref-type="bibr" rid="B50">2014</xref>; G&#x000FC;&#x000E7;l&#x000FC; and van Gerven, <xref ref-type="bibr" rid="B12">2015</xref>; Cichy et al., <xref ref-type="bibr" rid="B8">2017</xref>; Seeliger et al., <xref ref-type="bibr" rid="B41">2017</xref>). A different study showed that the final DNN computation stage is even a strong predictor of human-perceived shape discrimination (Kubilius et al., <xref ref-type="bibr" rid="B21">2016</xref>). These studies also showed that the more accurate a DNN model is, the stronger its predictive power, challenging vision researchers to create more accurate models based on biological studies of vision.</p>
<p>Alongside these similarities, the gap between DNN visual processing and biological visual processing remains significant. Identifying differences, such as robustness to manipulations (Geirhos et al., <xref ref-type="bibr" rid="B11">2017</xref>) and causes of errors (Nguyen et al., <xref ref-type="bibr" rid="B34">2015</xref>), is of great importance to this field (Moosavi-Dezfooli et al., <xref ref-type="bibr" rid="B30">2017</xref>). Exploring these differences by studying known visual phenomena in DNNs enables both improving current models and studying the possible computational nature of the visual system (Rajalingham et al., <xref ref-type="bibr" rid="B38">2018</xref>). Informative phenomena usually involve some kind of challenge to the visual system&#x02014;multi-stability, illusions, partially informative images, etc. An example of a human visual phenomenon that was studied using computer vision algorithms is the existence of Minimal Recognizable Configurations (MIRCs) for the human visual system (Ullman et al., <xref ref-type="bibr" rid="B47">2016</xref>). The differences in recognition rates and behavior between humans and the DNNs tested shed light on the possible nature of this phenomenon. DNNs were also used to explain the emergence of lightness illusions (Corney and Lotto, <xref ref-type="bibr" rid="B9">2007</xref>), which suggests general conclusions about perception&#x00027;s computational nature. Another illusion that emerged from DNN training is the Muller-Lyer geometrical illusion of size (Zeman et al., <xref ref-type="bibr" rid="B52">2013</xref>).</p>
<p>Other perceptual phenomena that can be studied using DNNs are &#x0201C;visual competition&#x0201D; phenomena, in which several competing percepts are potentially perceived. Most visual competition phenomena are dynamic and involve fluctuations in perception over time; they are usually referred to as &#x0201C;multi-stable perception.&#x0201D; They differ from our task (detailed below) and are more complex to model, as the main challenge is describing the fluctuations&#x00027; causes and dynamics. When perceptual grouping, for example, is not unique (as in the interpretation of the Necker cube), a specifically designed DNN model can be used to describe the computation behind the changes in perception over time (Kudo et al., <xref ref-type="bibr" rid="B22">1999</xref>). A well-studied dynamic visual competition phenomenon is binocular rivalry, which occurs when dissimilar monocular stimuli are presented to the two eyes. Rather than perceiving a stable, single mixture of the two stimuli, one experiences alternations in perceptual awareness over time (Blake and Tong, <xref ref-type="bibr" rid="B4">2008</xref>). The neuronal source of these visual competition dynamics is still debated; studies have revealed evidence both in early visual processing and in higher stages along the ventral stream (Logothetis et al., <xref ref-type="bibr" rid="B28">1996</xref>; Logothetis, <xref ref-type="bibr" rid="B27">1998</xref>; Polonsky et al., <xref ref-type="bibr" rid="B37">2000</xref>; Blake and Logothetis, <xref ref-type="bibr" rid="B3">2002</xref>; Wilson, <xref ref-type="bibr" rid="B48">2003</xref>; Tong et al., <xref ref-type="bibr" rid="B46">2006</xref>).</p>
<p>A biologically plausible model for the duration of perceptual alternations was offered in (Laing and Chow, <xref ref-type="bibr" rid="B23">2002</xref>), and studies have shown that the dynamic switching could be both adaptation- and noise-driven (Shpiro et al., <xref ref-type="bibr" rid="B43">2009</xref>). Noise-driven time alternations were further modeled using attractor models (Moreno-Bote et al., <xref ref-type="bibr" rid="B31">2007</xref>). Another dynamic multi-stable phenomenon is monocular rivalry, which differs from the binocular one in that the same image is presented to both eyes. This image is a superimposed one, and the clarity of its component images fluctuates alternately in time (O&#x00027;Shea et al., <xref ref-type="bibr" rid="B36">2017</xref>). Another study showed that bi-stable perception is a form of Bayesian sampling, and further demonstrated that a neural network can capture several aspects of the experimental data (Moreno-Bote et al., <xref ref-type="bibr" rid="B32">2011</xref>). Whether the processes or computational bases underlying binocular and monocular rivalry are similar, and how they differ, is still being studied (O&#x00027;Shea et al., <xref ref-type="bibr" rid="B35">2009</xref>). In this study, as our task did not involve time, we are merely interested in the causes of the perceptual dominance that occurs already in brief exposures to superimposed images.</p>
<p>Accordingly, different image parameters have been shown to affect the competing percepts of multi-stable phenomena. Motion of objects, contrast, luminance, etc. influence these perceptual alternations (Logothetis et al., <xref ref-type="bibr" rid="B28">1996</xref>). Low-level effects were also shown in masking, where a target image is followed by or mixed with a mask (Alam et al., <xref ref-type="bibr" rid="B1">2014</xref>). Practical models predicting detectability have been suggested based on the biological visual system (Bradley et al., <xref ref-type="bibr" rid="B5">2014</xref>) and even further tuned to natural image constraints (Sch&#x000FC;tt and Wichmann, <xref ref-type="bibr" rid="B40">2017</xref>).</p>
<p>In this study, we propose a different visual competition task by briefly presenting mixed images to both humans and pre-trained object recognition DNNs. Similar mixed images were used to study the effects of attention manipulations in a pre-trained DNN (Lindsay, <xref ref-type="bibr" rid="B25">2015</xref>; Lindsay and Miller, <xref ref-type="bibr" rid="B26">2017</xref>). The model was re-trained as a binary classifier and manipulated at different layers to test performance changes. We created a non-dynamic visual competition that enables a comparison with common recognition DNNs, without manipulating their architecture or their training. By mixing two target images we introduced a similar challenge for both the DNN (trained on regular images) and humans (briefly presented with the mixtures). Brief presentations are ideal for investigating early stages of perceptual competition (Carter and Cavanagh, <xref ref-type="bibr" rid="B7">2007</xref>), and eliminate effects of time that are generally not comparable with most DNNs. Inspired by visual competition research, we generated a static biological competition and compared biological and artificial visual sensitivities (Alam et al., <xref ref-type="bibr" rid="B1">2014</xref>). Our work does not model the dynamics of bi-stable perception; it is only a window into the perceptual preferences and the image parameters predicting visual sensitivities, as well as the evolution of the inner preferences throughout the DNN&#x00027;s layers.</p>
</sec>
<sec sec-type="methods" id="s2">
<title>2. Methods</title>
<sec>
<title>2.1. Data formation</title>
<p>To induce perceptual competition between two different visual stimuli that would enable us to test both human participants and DNN algorithms, we used the ImageNet dataset (Russakovsky et al., <xref ref-type="bibr" rid="B39">2015</xref>). We chose 180 images from different categories from the ImageNet validation set and created mixtures of images using two morphing methods (Figure <xref ref-type="fig" rid="F1">1</xref>). For the DNN we generated all pairwise mixtures, and humans were tested on one set of unique mixtures. In the first method, named &#x0201C;50/50,&#x0201D; we averaged the RGB values of all pixels in the two images (Figure <xref ref-type="fig" rid="F1">1</xref>, top row). In the second method, named &#x0201C;phs/mag,&#x0201D; we Fourier-transformed each image to get its magnitude and phase values in the frequency domain, then used the magnitude of one image with the phase of the other image, and transformed back using the inverse Fourier transform to get the final mix (Figure <xref ref-type="fig" rid="F1">1</xref>, bottom row). The second morphing method was inspired by a known visual phenomenon, according to which humans are sensitive to the phase rather than the magnitude of frequencies in natural images (Thomson et al., <xref ref-type="bibr" rid="B45">2000</xref>).</p>
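The two morphing procedures lend themselves to a compact sketch. The following is a minimal illustration in NumPy (the function names are ours, and we assume equally sized RGB arrays with values in [0, 1]; the published code may differ):

```python
import numpy as np

def mix_5050(img_a, img_b):
    """50/50 mix: average the RGB values of all pixels in the two images."""
    return (img_a.astype(float) + img_b.astype(float)) / 2.0

def mix_phs_mag(phase_img, mag_img):
    """phs/mag mix: combine the Fourier phase of one image with the
    Fourier magnitude of the other, channel by channel, then invert."""
    out = np.empty(phase_img.shape, dtype=float)
    for c in range(phase_img.shape[2]):
        phase = np.angle(np.fft.fft2(phase_img[:, :, c]))
        magnitude = np.abs(np.fft.fft2(mag_img[:, :, c]))
        out[:, :, c] = np.real(np.fft.ifft2(magnitude * np.exp(1j * phase)))
    return out
```

Mixing an image with itself under either method returns the original image, which is a convenient sanity check for the Fourier round trip.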
<fig id="F1" position="float">
<label>Figure 1</label>
<caption><p>Two data sets of mixed images were created using images from the validation set of ImageNet. <bold>(Top)</bold> Example of the 50/50 morphing method (see text). <bold>(Bottom)</bold> Example of the phs/mag morphing method (see text). <bold>(Middle)</bold> Example of images from the original set.</p></caption>
<graphic xlink:href="fncom-12-00057-g0001.tif"/>
</fig>
</sec>
<sec>
<title>2.2. DNN output classification</title>
<p>To decide which original image &#x0201C;wins&#x0201D; the visual competition, or which image is &#x0201C;chosen&#x0201D; by the network to be &#x0201C;perceived,&#x0201D; we used the two sets of mixed images as inputs to pre-trained feed-forward convolutional neural networks (Figures <xref ref-type="fig" rid="F2">2A,B</xref>)&#x02014;VGG19 (Krizhevsky et al., <xref ref-type="bibr" rid="B20">2012</xref>; Simonyan and Zisserman, <xref ref-type="bibr" rid="B44">2014</xref>) and ResNet (He et al., <xref ref-type="bibr" rid="B13">2016</xref>). We chose VGG19 as a representative network based on its high performance in the ImageNet Large Scale Visual Recognition Challenge (ILSVRC). We preferred VGG19 over other similar networks due to its relatively high accuracy rate when tested on our dataset [Top5 accuracy: AlexNet-0.77, VGG S-0.83, VGG16-0.90, VGG19-0.92 (Krizhevsky et al., <xref ref-type="bibr" rid="B20">2012</xref>; Simonyan and Zisserman, <xref ref-type="bibr" rid="B44">2014</xref>)]. We also validated our results using ResNet (which achieved even higher accuracies than the above networks in ILSVRC, and a Top5 accuracy of 0.92 on our dataset), but here we present the results of VGG19, as it is more similar in depth and architecture to the networks used in previous studies demonstrating the similarities to the primate ventral stream (Cadieu et al., <xref ref-type="bibr" rid="B6">2014</xref>; Yamins et al., <xref ref-type="bibr" rid="B50">2014</xref>; Kubilius et al., <xref ref-type="bibr" rid="B21">2016</xref>; Yamins and Dicarlo, <xref ref-type="bibr" rid="B49">2016</xref>). We then compared the output probability vectors of the SoftMax layer when the input was each one of the original images and when the input was their mix. We classified the output vectors of the mixed images into four types of scenarios (Figure <xref ref-type="fig" rid="F2">2C</xref>): the network did not choose any of the images; it chose the first image; the second image; or both of them. 
We defined &#x0201C;choosing an image&#x0201D; based on the top <italic>N</italic> categories in the output probability vectors: if one of the top <italic>N</italic> categories of the mixed image is also one of the top <italic>N</italic> categories of an original image, we say that the network chose to see that original image. In other words, we look for the top <italic>N</italic> categories of the mixed image in the top <italic>N</italic> categories of each of its two original images; if found, we consider that original image &#x0201C;chosen.&#x0201D; In this study we mainly used <italic>N</italic> &#x0003D; 5, as it is the leading metric when testing classification DNNs with 1,000 categories, due to the use of over-specific categories in the dataset: ImageNet is a single-label dataset containing images that can fall into several categories, and the order of those categories is ambiguous. Moreover, we show the network choices for <italic>N</italic> &#x0003D; 2 as well, which is the smallest relevant <italic>N</italic> for this task. We verified that using a different <italic>N</italic> within this range did not change the subsequent analysis, as the dominance of choosing one image is highly similar for <italic>N</italic> &#x0003D; 2 and <italic>N</italic> &#x0003D; 5, and it does not change the winning image within each pair (red curve in Figure <xref ref-type="fig" rid="F2">2D</xref>). We randomly sampled 90 mixed images and calculated the probability of each scenario (none, choose one image, both). For each <italic>N</italic>, we averaged these probabilities over 100 iterations. To account for the stochastic nature of human choices (Moreno-Bote et al., <xref ref-type="bibr" rid="B31">2007</xref>, <xref ref-type="bibr" rid="B32">2011</xref>), we further calculated the network choices when injected with Gaussian noise in the last layer before the SoftMax. Hence, the output layer is given by:</p>
<disp-formula id="E1"><label>(1)</label><mml:math id="M1"><mml:mtable columnalign="left"><mml:mtr><mml:mtd><mml:mi>P</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>c</mml:mi><mml:mi>l</mml:mi><mml:mi>a</mml:mi><mml:mi>s</mml:mi><mml:msub><mml:mrow><mml:mi>s</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>=</mml:mo><mml:mfrac><mml:mrow><mml:mi>e</mml:mi><mml:mi>x</mml:mi><mml:mi>p</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi>x</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mo>&#x0002B;</mml:mo><mml:mi>n</mml:mi><mml:mi>o</mml:mi><mml:mi>i</mml:mi><mml:mi>s</mml:mi><mml:mi>e</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mrow><mml:mrow><mml:mstyle displaystyle="true"><mml:msub class="msub"><mml:mrow><mml:mo>&#x02211;</mml:mo></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:mstyle><mml:mi>e</mml:mi><mml:mi>x</mml:mi><mml:mi>p</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi>x</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mo>&#x0002B;</mml:mo><mml:mi>n</mml:mi><mml:mi>o</mml:mi><mml:mi>i</mml:mi><mml:mi>s</mml:mi><mml:mi>e</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mrow></mml:mfrac><mml:mo>,</mml:mo><mml:mtext>&#x000A0;</mml:mtext><mml:mi>n</mml:mi><mml:mi>o</mml:mi><mml:mi>i</mml:mi><mml:mi>s</mml:mi><mml:mi>e</mml:mi><mml:mo>=</mml:mo><mml:mrow><mml:mi mathvariant="-tex-caligraphic">N</mml:mi></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mn>0</mml:mn><mml:mo>,</mml:mo><mml:msup><mml:mrow><mml:mi>&#x003C3;</mml:mi></mml:mrow><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msup></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<p>We again averaged over 100 iterations, varying the standard deviation of the noise (&#x003C3;) from 0 to 5. We present the level of noise that best resembled human choices. We further verified that using the noise-injected results did not change the subsequent analysis, similar to using top2 accuracy, as explained above.</p>
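The noise-injected softmax of Equation (1) and the top-<italic>N</italic> choice rule described above can be sketched as follows (a minimal illustration; the function names and toy probability vectors are ours, not taken from the original code):

```python
import numpy as np

def noisy_softmax(logits, sigma, rng):
    """Equation (1): softmax over logits perturbed by Gaussian noise
    with standard deviation sigma."""
    z = logits + rng.normal(0.0, sigma, size=logits.shape)
    e = np.exp(z - z.max())  # subtract max for numerical stability
    return e / e.sum()

def classify_choice(p_mix, p_a, p_b, n=5):
    """Top-n choice rule: the mixture 'chose' an original image if the
    mixture's top-n categories overlap that image's top-n categories."""
    top = lambda p: set(np.argsort(p)[-n:])
    mix_top = top(p_mix)
    chose_a = bool(mix_top & top(p_a))
    chose_b = bool(mix_top & top(p_b))
    if chose_a and chose_b:
        return "both"
    if chose_a:
        return "first"
    if chose_b:
        return "second"
    return "none"
```

With sigma set to 0 the choice rule is deterministic; raising sigma makes repeated evaluations of the same mixture stochastic, mimicking the variability of human reports.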
<fig id="F2" position="float">
<label>Figure 2</label>
<caption><p><bold>(A)</bold> One hundred and eighty original images and all pairwise mixtures between them were used as inputs to a pre-trained convolutional neural network (VGG19). <bold>(B)</bold> The network architecture. <bold>(C)</bold> Four possible softmax outputs when inserting a mixed image as an input (see text). <bold>(D)</bold> Detection threshold for output classification. The network choice was defined based on the overlapping top <italic>N</italic> categories of the original images and the mixed image (see text).</p></caption>
<graphic xlink:href="fncom-12-00057-g0002.tif"/>
</fig>
</sec>
<sec>
<title>2.3. Human experiment</title>
<p>The 180 images were uniquely paired to avoid repetitions that might cause memory biases. The 90 pairs were randomly divided into three groups of 30 mixtures each, yielding six conditions (three for &#x0201C;50/50&#x0201D; and three for &#x0201C;phs/mag,&#x0201D; <ext-link ext-link-type="uri" xlink:href="https://github.com/lirongruber/Visual-Competition/tree/master/human%20experiment/img">github.com/lirongruber/Visual-Competition/tree/master/human%20experiment/img</ext-link>). We used Amazon Mechanical Turk to test 600 participants in an online experiment, 100 per condition (participants were 36.6 &#x000B1; 10.6 years old, 303 of them male). Ethics approval was obtained from the IRB (Institutional Review Board) of the Weizmann Institute of Science. Each participant signed an informed consent form before participation and was paid $0.50.</p>
<p>Each trial began with 1 second of fixation (&#x0002B; at the screen center) followed by the brief image presentation. We presented the mixed images to participants for 100 ms (different browsers cause jitters of 7.5 &#x000B1; 0.7 ms), as this brief exposure allows full recognition of regular images while challenging the recognition of objects in the mixed images (Sheinberg and Logothetis, <xref ref-type="bibr" rid="B42">1997</xref>; Cadieu et al., <xref ref-type="bibr" rid="B6">2014</xref>). This time frame is commonly used in similar studies, as it eliminates the effect of eye movements, which would enable humans to resample the image and impair the comparison (see Fig2S in Cadieu et al., <xref ref-type="bibr" rid="B6">2014</xref>; Rajalingham et al., <xref ref-type="bibr" rid="B38">2018</xref>).</p>
<p>Each trial ended with a free written report, usually one to three words. Participants were instructed to report the object or objects they perceived, or type &#x0201C;none&#x0201D; if no object was recognized (empty reports were not accepted). Even though the networks rank 1,000 pre-determined categories, the open report is a better comparison than providing humans with a long list of options. An open report allows more authentic recognition answers, by not providing hints, not encouraging guessing, and allowing the &#x0201C;none&#x0201D; option. An alternative solution, proposed in Kubilius et al. (<xref ref-type="bibr" rid="B21">2016</xref>), shortens the list but still has the above weaknesses of a closed report. Each written report was manually encoded into one of the four types of scenarios (Figure <xref ref-type="fig" rid="F2">2C</xref>). Decisions were made separately by two independent examiners, and the few disagreements were discarded (1.1%).</p>
</sec>
</sec>
<sec sec-type="results" id="s3">
<title>3. Results</title>
<sec>
<title>3.1. Comparing DNN and human choices</title>
<p>We calculated the probability of both humans and the DNN to perceive either one image, both, or none of them. Figure <xref ref-type="fig" rid="F3">3A</xref> shows the results of the 50/50 dataset and Figure <xref ref-type="fig" rid="F3">3B</xref> shows the results of the phs/mag dataset, for VGG-19.</p>
<fig id="F3" position="float">
<label>Figure 3</label>
<caption><p>Histograms of choices classification. <bold>(A)</bold> DNN&#x00027;s top5, noise-injected-top5, top2 and humans&#x00027; reports probability when observing the 50/50 dataset. <bold>(B)</bold> DNN&#x00027;s top5, noise-injected-top5, top2 and humans&#x00027; reports probability when observing the phs/mag dataset.</p></caption>
<graphic xlink:href="fncom-12-00057-g0003.tif"/>
</fig>
<p>For the 50/50 case, humans reported recognizing only one image in 70.5 &#x000B1; 1.6% of the trials. Similarly, the DNN chose only one image and suppressed the other in 76.5 &#x000B1; 0.5% (ResNet&#x02014;74.2 &#x000B1; 0.4%) for <italic>N</italic> &#x0003D; 5 and 74.5 &#x000B1; 0.4% for <italic>N</italic> &#x0003D; 2. For <italic>N</italic> &#x0003D; 5, the DNN successfully recognized the two images in 17.4 &#x000B1; 0.4% (ResNet&#x02014;18.5 &#x000B1; 0.4%) of the trials and missed only 6.0 &#x000B1; 0.3% (ResNet&#x02014;7.1 &#x000B1; 0.4%). On the other hand, humans recognized both images in only 6.0 &#x000B1; 0.6% and reported not perceiving anything in 23.2 &#x000B1; 1.7% of the trials. When using <italic>N</italic> &#x0003D; 2, the DNN successfully recognized the two images in only 4.1 &#x000B1; 0.2% and missed 21.4 &#x000B1; 0.4%. While this seems to better replicate the human results, one has to keep in mind the problematic use of the top2 accuracy rate, as described in section 2. In an attempt to account for the stochastic nature of human choices compared with the deterministic one of the network, we injected Gaussian noise before the SoftMax layer of the network (see section 2). We present the DNN results with noise STD &#x0003D; 2.25, which best resembled human results: 20.6 &#x000B1; 0.5% none, 68.8 &#x000B1; 0.05% choose one image, 10.0 &#x000B1; 0.3% both (Figure <xref ref-type="fig" rid="F3">3A</xref>).</p>
<p>On the other hand, in the phs/mag mixture, for <italic>N</italic> &#x0003D; 5, the DNN did not recognize any of the images in 59.6 &#x000B1; 0.4% (ResNet&#x02014;53.6 &#x000B1; 0.4%) of the trials, while humans missed only 45.0 &#x000B1; 1.0% of the trials. In the recognized trials, humans always perceived the phase image (54.7 &#x000B1; 1.0% of all trials), while the DNN was less sensitive to it (36.3 &#x000B1; 0.4% of all trials, ResNet&#x02014;42.1 &#x000B1; 0.3%). While humans never saw the magnitude image, the DNN had a few successful trials choosing it or both images (4.0 &#x000B1; 0.1% of all trials, chance level is 2.0%, ResNet&#x02014;3.5 &#x000B1; 0.1%). Using top2 results or the noise-injected ones only further damaged the network&#x00027;s success rates, increasing the number of unrecognized images (Figure <xref ref-type="fig" rid="F3">3B</xref>).</p>
</sec>
<sec>
<title>3.2. Single-parameter predictability</title>
<p>Of the mixtures that were perceived as one image (Figure <xref ref-type="fig" rid="F3">3A</xref>, middle bars), the DNN and humans chose the same image in only 79.0% of the trials (humans&#x00027; mode). To further characterize the differences between them, we extracted image parameters that may predict the DNN&#x00027;s and humans&#x00027; tendency to prefer specific images over others. Based on vision research dealing with perceptual dominance (Logothetis, <xref ref-type="bibr" rid="B27">1998</xref>; Blake and Logothetis, <xref ref-type="bibr" rid="B3">2002</xref>; Tong et al., <xref ref-type="bibr" rid="B46">2006</xref>; Blake and Tong, <xref ref-type="bibr" rid="B4">2008</xref>), we extracted 12 initial features (average red, blue, and green component, colorfulness, luminance, saturation, global contrast, local contrast, horizontal and vertical gradient, 2D gradient, low frequencies, high frequencies) and then chose the least correlated among them (Table <xref ref-type="table" rid="T1">1</xref>). We calculated the probability of an image to be chosen over another image, as a function of the ratio between their parameters. To quantify the predictability of each parameter we fitted the probability with a logistic regression model (as in Equation 2 for a single parameter <italic>i</italic>), where the model parameter (|&#x003B2;|) represents the degree of predictability. By knowing the value of a predictive parameter, one can estimate with high probability which image will be chosen.</p>
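The per-parameter fit can be sketched as follows (a minimal illustration: here <italic>x</italic> holds, say, the log ratio of one image parameter within each pair, <italic>y</italic> indicates whether the first image was chosen, and the simple gradient-ascent routine is our stand-in for whatever logistic-regression fitter was actually used; |b1| plays the role of |&#x003B2;| in Equation 2):

```python
import numpy as np

def fit_logistic(x, y, lr=0.1, steps=5000):
    """Fit P(first image chosen) = 1 / (1 + exp(-(b0 + b1 * x))) by
    gradient ascent on the log-likelihood; |b1| measures how well the
    parameter ratio x predicts the choice y (1 = chosen, 0 = not)."""
    b0, b1 = 0.0, 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(b0 + b1 * x)))
        b0 += lr * np.mean(y - p)        # log-likelihood gradient w.r.t. b0
        b1 += lr * np.mean((y - p) * x)  # log-likelihood gradient w.r.t. b1
    return b0, b1
```

A strongly predictive parameter yields a large |b1|, while a parameter unrelated to the choices yields |b1| near zero.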
<table-wrap position="float" id="T1">
<label>Table 1</label>
<caption><p>Image parameters.</p></caption>
<table frame="hsides" rules="groups">
<thead><tr>
<th valign="top" align="left"><bold>Parameter</bold></th>
<th valign="top" align="left"><bold>Description</bold></th>
</tr>
</thead>
<tbody>
<tr>
<td valign="top" align="left">Gradient</td>
<td valign="top" align="left"><inline-formula><mml:math id="M2"><mml:munder class="msub"><mml:mrow><mml:mo>&#x02211;</mml:mo></mml:mrow><mml:mrow><mml:mi>p</mml:mi><mml:mi>i</mml:mi><mml:mi>x</mml:mi><mml:mi>e</mml:mi><mml:mi>l</mml:mi><mml:mi>s</mml:mi></mml:mrow></mml:munder><mml:msup><mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mo>&#x02207;</mml:mo><mml:mi>i</mml:mi><mml:mi>m</mml:mi><mml:mi>a</mml:mi><mml:mi>g</mml:mi><mml:mi>e</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mrow><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msup></mml:math></inline-formula></td>
</tr>

<tr>
<td valign="top" align="left">Low frequencies</td>
<td valign="top" align="left"><inline-formula><mml:math id="M3"><mml:munderover accentunder="false" accent="false"><mml:mrow><mml:mo>&#x02211;</mml:mo></mml:mrow><mml:mrow><mml:mi>i</mml:mi><mml:mo>=</mml:mo><mml:mn>0</mml:mn></mml:mrow><mml:mrow><mml:mi>i</mml:mi><mml:mo>=</mml:mo><mml:mn>0</mml:mn><mml:mo>.</mml:mo><mml:mn>25</mml:mn><mml:mo>&#x000B7;</mml:mo><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>m</mml:mi><mml:mi>a</mml:mi><mml:mi>x</mml:mi><mml:mi>f</mml:mi><mml:mi>r</mml:mi><mml:mi>e</mml:mi><mml:mi>q</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mrow></mml:munderover><mml:mo>|</mml:mo><mml:mi>F</mml:mi><mml:mi>F</mml:mi><mml:mi>T</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>i</mml:mi><mml:mi>m</mml:mi><mml:mi>a</mml:mi><mml:mi>g</mml:mi><mml:mi>e</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>|</mml:mo></mml:math></inline-formula></td>
</tr>

<tr>
<td valign="top" align="left">Luminance</td>
<td valign="top" align="left">&#x0003C; 0.299<italic>R</italic> &#x0002B; 0.587<italic>G</italic> &#x0002B; 0.114<italic>B</italic> &#x0003E;<sub><italic>pixels</italic></sub></td>
</tr>

<tr>
<td valign="top" align="left">Global contrast</td>
<td valign="top" align="left"><italic>std</italic>(0.299<italic>R</italic>&#x0002B;0.587<italic>G</italic>&#x0002B;0.114<italic>B</italic>)</td>
</tr>

<tr>
<td valign="top" align="left">Colorfulness</td>
<td valign="top" align="left"><inline-formula><mml:math id="M4"><mml:munder class="msub"><mml:mrow><mml:mo>&#x02211;</mml:mo></mml:mrow><mml:mrow><mml:mi>p</mml:mi><mml:mi>i</mml:mi><mml:mi>x</mml:mi><mml:mi>e</mml:mi><mml:mi>l</mml:mi><mml:mi>s</mml:mi></mml:mrow></mml:munder><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msup><mml:mrow><mml:mi>I</mml:mi></mml:mrow><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msup><mml:mo>&#x0002B;</mml:mo><mml:msup><mml:mrow><mml:mi>Q</mml:mi></mml:mrow><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msup></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:math></inline-formula> [YIQ coordinate system ]</td>
</tr>

<tr>
<td valign="top" align="left">Saturation</td>
<td valign="top" align="left"><inline-formula><mml:math id="M5"><mml:mo>&#x0003C;</mml:mo><mml:mfrac><mml:mrow><mml:mn>255</mml:mn><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>m</mml:mi><mml:mi>a</mml:mi><mml:mi>x</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>R</mml:mi><mml:mo>,</mml:mo><mml:mi>G</mml:mi><mml:mo>,</mml:mo><mml:mi>B</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>-</mml:mo><mml:mi>m</mml:mi><mml:mi>i</mml:mi><mml:mi>n</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>R</mml:mi><mml:mo>,</mml:mo><mml:mi>G</mml:mi><mml:mo>,</mml:mo><mml:mi>B</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mrow><mml:mrow><mml:mi>m</mml:mi><mml:mi>a</mml:mi><mml:mi>x</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>R</mml:mi><mml:mo>,</mml:mo><mml:mi>G</mml:mi><mml:mo>,</mml:mo><mml:mi>B</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mrow></mml:mfrac><mml:msub><mml:mrow><mml:mo>&#x0003E;</mml:mo></mml:mrow><mml:mrow><mml:mi>p</mml:mi><mml:mi>i</mml:mi><mml:mi>x</mml:mi><mml:mi>e</mml:mi><mml:mi>l</mml:mi><mml:mi>s</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula></td>
</tr>
</tbody>
</table>
</table-wrap>
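The parameters in Table 1 can be computed directly from an RGB array. The sketch below follows the table's formulas (Rec. 601 luma weights for luminance and contrast, YIQ chroma for colorfulness, the max&#x02013;min ratio for saturation but without the 8-bit 255 scaling, and finite differences for the gradient); it is a plain illustration, not the study's code:

```python
import numpy as np

def image_parameters(img):
    """img: (H, W, 3) float RGB array with values in [0, 1]."""
    R, G, B = img[..., 0], img[..., 1], img[..., 2]
    luma = 0.299 * R + 0.587 * G + 0.114 * B        # Rec. 601 luminance

    gy, gx = np.gradient(luma)                      # finite differences
    gradient = np.sum(gx ** 2 + gy ** 2)            # sum over pixels of squared gradient

    # YIQ chroma components (NTSC transform) for colorfulness
    I = 0.596 * R - 0.274 * G - 0.322 * B
    Q = 0.211 * R - 0.523 * G + 0.312 * B
    colorfulness = np.sum(I ** 2 + Q ** 2)

    mx, mn = img.max(axis=-1), img.min(axis=-1)
    sat = np.where(mx > 0, (mx - mn) / np.where(mx > 0, mx, 1.0), 0.0)

    # spectral energy below 0.25 of the maximal spatial frequency
    F = np.abs(np.fft.fftshift(np.fft.fft2(luma)))
    H, W = luma.shape
    yy, xx = np.mgrid[:H, :W]
    radius = np.hypot(yy - H / 2, xx - W / 2)
    low_freq = np.sum(F[radius <= 0.25 * radius.max()])

    return {"luminance": luma.mean(),
            "global_contrast": luma.std(),
            "gradient": gradient,
            "colorfulness": colorfulness,
            "saturation": sat.mean(),
            "low_frequencies": low_freq}
```

For a uniform gray image the gradient, contrast, colorfulness, and saturation all vanish, which makes a quick sanity check.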
<p>As can be seen in Figure <xref ref-type="fig" rid="F4">4</xref>, the gradient and the low frequencies were good predictors of both humans&#x00027; choices (&#x003B2; &#x0003D; 1.38 &#x000B1; 0.06 and &#x003B2; &#x0003D; 1.14 &#x000B1; 0.06, respectively) and the DNN&#x00027;s choices (&#x003B2; &#x0003D; 1.72 &#x000B1; 0.05 and &#x003B2; &#x0003D; 1.11 &#x000B1; 0.04, respectively), and slightly better for the DNN at higher parameter ratios. Luminance was not predictive at all, again similarly for humans (&#x003B2; &#x0003D; 0.07 &#x000B1; 0.04) and the DNN (&#x003B2; &#x0003D; 0.04 &#x000B1; 0.03). Differences were found for global contrast, which was a better predictor for humans (especially at low and high ratios, &#x003B2; &#x0003D; 0.73 &#x000B1; 0.05) than for the DNN (&#x003B2; &#x0003D; 0.34 &#x000B1; 0.03), while colorfulness and saturation seemed irrelevant for humans (&#x003B2; &#x0003D; 0.13 &#x000B1; 0.04 and &#x003B2; &#x0003D; 0.02 &#x000B1; 0.04, respectively) but predicted the DNN&#x00027;s choices to some extent (&#x003B2; &#x0003D; 0.56 &#x000B1; 0.03 and &#x003B2; &#x0003D; 0.47 &#x000B1; 0.03, respectively).</p>
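The 95% error bars in Figure 4 are exact binomial (Clopper&#x02013;Pearson) intervals. These follow a standard beta-quantile construction, sketched here for completeness (the counts below are made-up examples, not data from the study):

```python
from scipy.stats import beta

def clopper_pearson(k, n, alpha=0.05):
    """Exact two-sided (1 - alpha) confidence interval for a binomial
    proportion after observing k successes in n trials."""
    lo = beta.ppf(alpha / 2, k, n - k + 1) if k > 0 else 0.0
    hi = beta.ppf(1 - alpha / 2, k + 1, n - k) if k < n else 1.0
    return lo, hi

lo, hi = clopper_pearson(42, 60)  # e.g. one image chosen in 42 of 60 trials
```

Unlike the normal approximation, this interval stays valid for the small per-bin counts that arise at extreme parameter ratios.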
<fig id="F4" position="float">
<label>Figure 4</label>
<caption><p>Image parameters as predictors for the DNN&#x00027;s and humans&#x00027; choices for the 50/50 dataset (red and blue, respectively). The probability to choose <italic>I</italic><sub>1</sub> vs. <inline-formula><mml:math id="M6"><mml:mfrac><mml:mrow><mml:mi>f</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi>I</mml:mi></mml:mrow><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msub></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mrow><mml:mrow><mml:mi>f</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi>I</mml:mi></mml:mrow><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msub></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mrow></mml:mfrac></mml:math></inline-formula>, where <italic>f</italic>(<italic>I</italic><sub>1</sub>) is the parameter value of <italic>I</italic><sub>1</sub>. X axis is log scaled. Error bars are the confidence intervals (95%) of a binomial distribution calculated with Clopper-Pearson method. Inner bar plots show &#x003B2; parameters of logistic regression (see text) for humans and DNN.</p></caption>
<graphic xlink:href="fncom-12-00057-g0004.tif"/>
</fig>
</sec>
<sec>
<title>3.3. Multiple-parameter predictability</title>
<p>We next looked for combinations of parameters that could increase the predictability. We optimized a regularized generalized linear model (GLM) for each subset of our six parameters and calculated the average prediction accuracy. The regularization parameter was determined via cross-validation. As the two classes were balanced [P(pick <italic>I</italic><sub>1</sub>) &#x0003D; P(pick <italic>I</italic><sub>2</sub>)], we optimized an unbiased model (intercept &#x0003D; 0).
<disp-formula id="E2"><label>(2)</label><mml:math id="M7"><mml:mtable columnalign="left"><mml:mtr><mml:mtd><mml:mi>P</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>p</mml:mi><mml:mi>i</mml:mi><mml:mi>c</mml:mi><mml:mi>k</mml:mi><mml:mtext>&#x000A0;</mml:mtext><mml:msub><mml:mrow><mml:mi>I</mml:mi></mml:mrow><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mo>|</mml:mo><mml:msub><mml:mrow><mml:mi>I</mml:mi></mml:mrow><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mrow><mml:mi>I</mml:mi></mml:mrow><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msub></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>=</mml:mo><mml:mfrac><mml:mrow><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mn>1</mml:mn><mml:mo>&#x0002B;</mml:mo><mml:mi>e</mml:mi><mml:mi>x</mml:mi><mml:mi>p</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mstyle displaystyle="true"><mml:msub class="msub"><mml:mrow><mml:mo>&#x02211;</mml:mo></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:mstyle><mml:msub><mml:mrow><mml:mi>&#x003B2;</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mi>l</mml:mi><mml:mi>o</mml:mi><mml:mi>g</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mfrac><mml:mrow><mml:msub><mml:mrow><mml:mi>f</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi>I</mml:mi></mml:mrow><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msub></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mrow><mml:mrow><mml:msub><mml:mrow><mml:mi>f</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi>I</mml:mi></mml:mrow><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msub></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mrow></mml:mfrac></mml:mrow><mml:mo 
stretchy="false">)</mml:mo></mml:mrow></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mrow></mml:mfrac></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
<p><italic>I</italic><sub>1</sub>, <italic>I</italic><sub>2</sub> are the images, <inline-formula><mml:math id="M8"><mml:mfrac><mml:mrow><mml:msub><mml:mrow><mml:mi>f</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi>I</mml:mi></mml:mrow><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msub></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mrow><mml:mrow><mml:msub><mml:mrow><mml:mi>f</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi>I</mml:mi></mml:mrow><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msub></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mrow></mml:mfrac></mml:math></inline-formula> is the ratio of parameter <italic>i</italic> between the images, and &#x003B2;<sub><italic>i</italic></sub> is the coefficient of parameter <italic>i</italic>. After the model was trained, the decision and accuracy were calculated using:
<disp-formula id="E3"><label>(3)</label><mml:math id="M9"><mml:mtable columnalign="left"><mml:mtr><mml:mtd><mml:msup><mml:mrow><mml:mi>y</mml:mi></mml:mrow><mml:mrow><mml:mi>m</mml:mi><mml:mi>o</mml:mi><mml:mi>d</mml:mi><mml:mi>e</mml:mi><mml:mi>l</mml:mi></mml:mrow></mml:msup><mml:mo>=</mml:mo><mml:mrow><mml:mo>{</mml:mo><mml:mtable style="text-align:axis;" equalrows="false" columnlines="none" equalcolumns="false" class="array"><mml:mtr><mml:mtd><mml:mn>1</mml:mn><mml:mo>,</mml:mo><mml:mtext>&#x02003;</mml:mtext></mml:mtd><mml:mtd><mml:mi>P</mml:mi><mml:mo>&#x0003E;</mml:mo><mml:mn>0</mml:mn><mml:mo>.</mml:mo><mml:mn>5</mml:mn></mml:mtd></mml:mtr><mml:mtr><mml:mtd><mml:mn>0</mml:mn><mml:mo>,</mml:mo><mml:mtext>&#x02003;</mml:mtext></mml:mtd><mml:mtd><mml:mi>P</mml:mi><mml:mo>&#x0003C;</mml:mo><mml:mn>0</mml:mn><mml:mo>.</mml:mo><mml:mn>5</mml:mn></mml:mtd></mml:mtr></mml:mtable></mml:mrow></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<disp-formula id="E4"><label>(4)</label><mml:math id="M10"><mml:mtable columnalign="left"><mml:mtr><mml:mtd><mml:mi>a</mml:mi><mml:mi>c</mml:mi><mml:mi>c</mml:mi><mml:mi>u</mml:mi><mml:mi>r</mml:mi><mml:mi>a</mml:mi><mml:mi>c</mml:mi><mml:mi>y</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn><mml:mo>-</mml:mo><mml:mfrac><mml:mrow><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi>N</mml:mi></mml:mrow></mml:mfrac><mml:mo>&#x02211;</mml:mo><mml:mo>|</mml:mo><mml:msup><mml:mrow><mml:mi>y</mml:mi></mml:mrow><mml:mrow><mml:mi>m</mml:mi><mml:mi>o</mml:mi><mml:mi>d</mml:mi><mml:mi>e</mml:mi><mml:mi>l</mml:mi></mml:mrow></mml:msup><mml:mo>-</mml:mo><mml:msup><mml:mrow><mml:mi>y</mml:mi></mml:mrow><mml:mrow><mml:mi>n</mml:mi><mml:mi>e</mml:mi><mml:mi>t</mml:mi></mml:mrow></mml:msup><mml:mo>|</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
<p><italic>y</italic><sup><italic>model</italic></sup> is the model choice (1/0 for choosing the first/second image, respectively), <italic>y</italic><sup><italic>net</italic></sup> is the DNN choice, and <italic>N</italic> is the number of images in each test set.</p>
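Equations 2&#x02013;4 describe a logistic model on the log parameter ratios. A minimal numpy sketch of the fit and evaluation follows; it uses plain gradient ascent on synthetic data in place of the regularized GLM optimizer described above, and keeps Equation 2's sign convention, P = 1/(1 + exp(&#x02211;&#x003B2;<sub>i</sub> log r<sub>i</sub>)):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-np.clip(z, -500, 500)))

def fit_glm(log_ratios, y, lr=0.5, steps=2000):
    """Fit Equation 2, P(pick I1) = 1 / (1 + exp(sum_i beta_i log r_i)),
    with no intercept (balanced classes). log_ratios is (N, d); y is the
    (N,) vector of observed choices (1 = first image picked)."""
    beta = np.zeros(log_ratios.shape[1])
    for _ in range(steps):
        p = sigmoid(-(log_ratios @ beta))
        beta += lr * log_ratios.T @ (p - y) / len(y)  # ascend the log-likelihood
    return beta

def glm_accuracy(beta, log_ratios, y_obs):
    p = sigmoid(-(log_ratios @ beta))
    y_model = (p > 0.5).astype(int)                 # decision rule of Equation 3
    return 1.0 - np.mean(np.abs(y_model - y_obs))   # fraction of agreement

# synthetic check: recover choices generated from a known beta
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2))
y = (rng.random(500) < sigmoid(-(X @ np.array([2.0, -1.0])))).astype(int)
beta = fit_glm(X, y)
```

The fitted coefficients recover the signs of the generating weights, and the agreement fraction plays the role of the prediction accuracy reported in the next paragraph.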
<p>Figure <xref ref-type="fig" rid="F5">5A</xref> shows the average accuracy of the best subset for one, two, and six parameters. The best single parameter for both humans and the DNN was the gradient, which predicted the DNN&#x00027;s and humans&#x00027; choices in 77.2 and 74.0% of the cases, respectively. The best pair of parameters differed: for humans, adding the low frequencies yielded 76.5% success, while for the DNN, adding colorfulness reached 79.4%. The best accuracy achieved was 81.0% for the DNN and 78.6% for humans. In both cases, using all parameters was not significantly different from adding any third parameter.</p>
<fig id="F5" position="float">
<label>Figure 5</label>
<caption><p>Average accuracy in predicting the winning image using multi-dimensional GLM for the DNN <bold>(A)</bold> and humans <bold>(B)</bold>. The figures represent the best subset of parameters when using one or two parameters and the maximum accuracy when using all of them. Using all six parameters yielded the same result as adding any third one in both cases. Error bars represent standard errors.</p></caption>
<graphic xlink:href="fncom-12-00057-g0005.tif"/>
</fig>
</sec>
<sec>
<title>3.4. Activity throughout the DNN layers</title>
<sec>
<title>3.4.1. 50/50 mixed images</title>
<p>As we are also interested in where this kind of competition is resolved, we further examined the activity of the network throughout the process of categorization, before the last softmax layer. We compared the activity of each neuron in each layer of the network when &#x0201C;observing&#x0201D; each of the original images and their mix. We calculated the correlations between those activity maps and averaged them per layer. To understand where the network&#x00027;s &#x0201C;decision&#x0201D; occurred, we calculated the average activity map correlations separately for the &#x0201C;winning&#x0201D; and the &#x0201C;losing&#x0201D; images (Figure <xref ref-type="fig" rid="F6">6</xref>). In both cases, the correlations in the first layers were high (0.7/0.6), decreased deeper into the net, and increased toward the end. Looking at the difference between these correlations (Figure <xref ref-type="fig" rid="F6">6B</xref>), although a difference already existed in the first layers, it increased dramatically in the last three layers. Surprisingly, we did not find any effect before/after max pooling (layers 3, 6, 11, 16, 21). Rather, the dramatic increase occurs in the fully connected layers (layers 22, 23, 24).</p>
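The per-layer comparison amounts to correlating the flattened activation maps of the mixture and of each original image. A minimal sketch on toy arrays follows (collecting the real activations would require forward hooks on the VGG19 layers, which we omit here):

```python
import numpy as np

def layer_correlations(acts_mix, acts_orig):
    """Pearson correlation between the flattened activation maps of the
    mixed image and an original image, computed per layer.

    acts_mix, acts_orig: lists of same-shaped arrays, one per layer.
    """
    return [float(np.corrcoef(m.ravel(), o.ravel())[0, 1])
            for m, o in zip(acts_mix, acts_orig)]

rng = np.random.default_rng(1)
orig = [rng.normal(size=(4, 8, 8)) for _ in range(3)]      # toy "layers"
mix = [o + 0.5 * rng.normal(size=o.shape) for o in orig]   # noisy copies
corrs = layer_correlations(mix, orig)
```

Averaging these per-layer correlations separately over winning and losing images reproduces the curves of Figure 6A.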
<fig id="F6" position="float">
<label>Figure 6</label>
<caption><p><bold>(A)</bold> The average correlations between the activity maps of the 50/50 mixed image units and the winning/losing image (blue/red) units. <bold>(B)</bold> Differences between correlations of the winning and losing image [i.e., difference between the blue and red curves in <bold>(A)</bold>, respectively].</p></caption>
<graphic xlink:href="fncom-12-00057-g0006.tif"/>
</fig>
</sec>
<sec>
<title>3.4.2. phs/mag mixed images</title>
<p>Though most of the time the network did not recognize both images, we aimed to understand whether the response throughout the layers differed when it did recognize one of them. We therefore averaged separately the mixtures for which the net chose the phase image, the magnitude image, or neither. Figure <xref ref-type="fig" rid="F7">7</xref> shows the average correlations throughout the layers with the phase image (Figure <xref ref-type="fig" rid="F7">7A</xref>), the magnitude image (Figure <xref ref-type="fig" rid="F7">7B</xref>), and the difference between them (Figure <xref ref-type="fig" rid="F7">7C</xref>). According to Figure <xref ref-type="fig" rid="F7">7C</xref>, a large difference in favor of the phase image appears already in the first layers, but it cannot serve as a predictor, as it occurred also for mixtures where the magnitude image &#x0201C;won&#x0201D; (red) or neither did (yellow). In the cases where the phase image &#x0201C;won,&#x0201D; the decision occurred only toward the end, where we observed a larger difference between the correlation with the phase image and the correlation with the magnitude image.</p>
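The phs/mag mixtures pair the Fourier phase of one image with the Fourier magnitude of another, a standard construction; a grayscale sketch:

```python
import numpy as np

def phase_magnitude_mix(phase_img, magnitude_img):
    """Return the real image whose Fourier phase comes from `phase_img`
    and whose Fourier magnitude comes from `magnitude_img`."""
    phase = np.angle(np.fft.fft2(phase_img))
    magnitude = np.abs(np.fft.fft2(magnitude_img))
    return np.real(np.fft.ifft2(magnitude * np.exp(1j * phase)))
```

Mixing an image with itself returns the image unchanged, which makes a convenient sanity check.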
<fig id="F7" position="float">
<label>Figure 7</label>
<caption><p>The average correlations between the activity maps of the phs/mag mixtures and the original phase image <bold>(A)</bold> or the magnitude image <bold>(B)</bold>. The difference between them is presented in <bold>(C)</bold>. The blue curves represent images where the network chose the phase image as the &#x0201C;winner,&#x0201D; the red is when the magnitude image &#x0201C;won&#x0201D; and the yellow is for the cases where neither &#x0201C;won.&#x0201D; Error bars represent standard errors.</p></caption>
<graphic xlink:href="fncom-12-00057-g0007.tif"/>
</fig>
</sec>
</sec>
</sec>
<sec sec-type="discussion" id="s4">
<title>4. Discussion</title>
<p>To this day, a key challenge for neuroscientists is describing and understanding the nature of computation in the brain (Marr and Poggio, <xref ref-type="bibr" rid="B29">1976</xref>). The rising success of artificial DNNs in object recognition tasks raises new questions about their resemblance to computations in the human visual system. Does the similarity between the biological and artificial systems go beyond high accuracy? This study asserts a connection between deep networks and the human visual processing mechanism, adding to a growing body of studies showing that DNNs can be used for modeling different phenomena of the visual system (Cadieu et al., <xref ref-type="bibr" rid="B6">2014</xref>; Khaligh et al., <xref ref-type="bibr" rid="B17">2014</xref>; Yamins et al., <xref ref-type="bibr" rid="B50">2014</xref>; G&#x000FC;&#x000E7;l&#x000FC; and van Gerven, <xref ref-type="bibr" rid="B12">2015</xref>; Kubilius et al., <xref ref-type="bibr" rid="B21">2016</xref>; Cichy et al., <xref ref-type="bibr" rid="B8">2017</xref>; Seeliger et al., <xref ref-type="bibr" rid="B41">2017</xref>). It further reveals remaining divergences to guide future model improvement. In this study, we created a non-dynamic human visual competition. When briefly presented with a mixture of two images, humans tended to perceive only one image (70.7%). Remarkably, when we tested DNNs on the same mixtures, only one of the images appeared in the top5 categories of the DNN (VGG19&#x02014;76.3%, ResNet&#x02014;74.2%). The top5 categories are the leading evaluation metric for networks with 1,000 categories, specifically when working with the ImageNet dataset. The categories of this dataset are over-specific, as they contain types of animals and parts of objects (e.g., green mamba, Passerina cyanea, modem, nail, etc.). Some of the images may also fall into more than one category (e.g., the man on the boat in Figure <xref ref-type="fig" rid="F1">1</xref>).
As our goal was to determine which of the images was better perceived, or popped up more readily during the brief exposure, we accepted any human answer referring to any part of an image, and likewise used the top5 categories of the network. Moreover, we verified that evaluating the network&#x00027;s perception by its top2 categories would not change the main tendency to perceive only one image. This result implies that the &#x0201C;suppression&#x0201D; of the unperceived stimulus can be explained without any top-down processes, using only a feed-forward architecture. While referring to the network&#x00027;s output as perception is still controversial, we refer here to a narrower definition, namely the task-related categorization. Our visual task involves two stimuli competing for the system&#x00027;s perception&#x02014;whether biological or artificial. This comparison is powerful, as exactly the same stimulus was presented to both humans and a DNN.</p>
<p>While using only the top2 categories seemed to cover up the discrepancies in perceiving both images or none of them, we believe, for the reasons listed above, that it is a worse candidate for comparison with humans. When using top5 accuracy, however, one has to account for a discrepancy in performance. In the current dataset, using the top5 categories, the net recognized both images at almost three times the rate of humans (Figure <xref ref-type="fig" rid="F3">3A</xref>). One plausible source for this difference is the deterministic nature of the DNN, compared with the stochastic nature of humans. Inspired by studies using noise to model human stochasticity (Daw et al., <xref ref-type="bibr" rid="B10">2006</xref>; Moreno-Bote et al., <xref ref-type="bibr" rid="B31">2007</xref>, <xref ref-type="bibr" rid="B32">2011</xref>), we examined the effect of injecting noise into the decision-making process of the network. We showed that adding noise before the last layer enabled the network to reach results similar to those of humans. In other words, the disparities we have mentioned so far might result from the lack of stochasticity in the DNN. Importantly, though, neither using top2 accuracy nor noise injection changed the winning image within each pair. This strengthens the robustness of the tendency to perceive only one image, and cannot account for the similarities and differences found in the preceding analyses.</p>
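The noise-injection test amounts to perturbing the pre-softmax activations before taking the top categories. A toy sketch follows; the Gaussian form and the scale sigma are illustrative assumptions, not the exact values used in the study:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def noisy_top5(logits, sigma=1.0, rng=None):
    """Add Gaussian noise to the pre-softmax logits, then return the
    indices of the five highest-probability categories."""
    rng = np.random.default_rng() if rng is None else rng
    p = softmax(logits + rng.normal(scale=sigma, size=logits.shape))
    return np.argsort(p)[::-1][:5]

# two nearly tied categories: noise makes the winner stochastic
logits = np.zeros(1000)
logits[3], logits[7] = 5.0, 4.8
rng = np.random.default_rng(0)
winners = {noisy_top5(logits, sigma=1.0, rng=rng)[0] for _ in range(200)}
```

With noise, either of the two closely ranked categories can come out on top across repeated presentations, mimicking human trial-to-trial variability; with sigma = 0 the choice is deterministic.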
<p>Finally, we note that humans were better than the DNN at recognizing images in the phase/magnitude mixtures (Figure <xref ref-type="fig" rid="F3">3B</xref>), and that this advantage was mainly due to increased sensitivity to the image phase. This sensitivity was previously shown to reflect natural images variability (Thomson et al., <xref ref-type="bibr" rid="B45">2000</xref>), and our finding implies that the DNN model we used is lacking in this regard.</p>
<p>We further attempted to regress the performance of both systems onto image attributes. Our analysis revealed that frequencies, both high (as captured by the gradient) and low, are common predictors of humans&#x00027; and the DNN&#x00027;s choices. The influence of the image gradient on human perception has been shown previously in different paradigms (Hollins, <xref ref-type="bibr" rid="B15">1980</xref>; Mueller and Blake, <xref ref-type="bibr" rid="B33">1989</xref>); here, we show that this sensitivity exists for the DNN model as well. On the other hand, although commonly used in psychophysical studies, luminance was not a good predictor for either the DNN or humans. Global contrast was a good predictor only for human performance, which might be explained by the low resolution enforced by the short exposure, while colorfulness and saturation were predictive only of the DNN&#x00027;s choices. The DNN&#x00027;s sensitivity to colorfulness was also observed using a generalized linear model, which further emphasizes the gradient&#x00027;s role as the common and most predictive parameter.</p>
<p>The parameters which predicted performance similarly for both systems may now offer a platform on which computational explanations of human sensitivities can be tested. These visual sensitivities emerge spontaneously from training an artificial system for classification, suggesting a similar mechanism in biological systems. Parameters which predicted performance differently point to a possible disparity between the two perceptual implementations&#x02014;the biological and the artificial. These differences may aid vision researchers in developing more human-like artificial networks, e.g., by reducing a network&#x00027;s sensitivity to color through augmenting the training dataset with color manipulations. Alternatively, one could re-train the networks using the mixed images labeled with humans&#x00027; choices.</p>
<p>Finally, we attempted to locate where in the computational process the perceptual competition is resolved. The activity throughout the layers of the DNN indicates that a preference for the perceived image existed already at early processing levels, though the difference increased dramatically in the last layers. This late preference in the fully-connected layers was also observed in the phase/magnitude competition. This result is consistent with a previous study showing that in neural networks trained for binary choices, information regarding both choices can be tracked throughout the layers (Balasubramani et al., <xref ref-type="bibr" rid="B2">2018</xref>). It is further consistent with the primary functions of the different layers: convolutional layers serve as feature extractors, while fully-connected layers are responsible for the classification (Hertel et al., <xref ref-type="bibr" rid="B14">2015</xref>).</p>
<p>Our results offer a two-fold benefit for future work. First, they can be used to improve the validity of DNNs as models, as well as to boost their performance (by imitating biological behaviors). Second, testing DNNs&#x00027; outputs on manipulated inputs provides a new approach for vision researchers to study how the brain chooses what to perceive. In conclusion, this work is yet another step toward a valid computational model of the ventral stream of the visual system. The differences we found can be used to bridge the gaps between biological and artificial visual perception.</p>
</sec>
<sec id="s5">
<title>Data availability statement</title>
<p>The dataset generated for the human experiment and the results can be found in <ext-link ext-link-type="uri" xlink:href="https://github.com/lirongruber/Visual-Competition">https://github.com/lirongruber/Visual-Competition</ext-link>.</p>
</sec>
<sec id="s6">
<title>Author contributions</title>
<p>LG and AH designed the research, conducted the human experiment, analyzed the data and wrote the paper. RB and MI supervised the analysis and contributed by reviewing and editing the manuscript.</p>
<sec>
<title>Conflict of interest statement</title>
<p>The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.</p>
</sec>
</sec>
</body>
<back>
<ack><p>We thank Ehud Ahissar for helpful comments and review, Ron Dekel for technical advice and support, and Guy Nelinger for insightful comments and editing. This work was supported by the Weizmann Institute of Science.</p>
</ack>
<ref-list>
<title>References</title>
<ref id="B1">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Alam</surname> <given-names>M. M.</given-names></name> <name><surname>Vilankar</surname> <given-names>K. P.</given-names></name> <name><surname>Field</surname> <given-names>D. J.</given-names></name> <name><surname>Chandler</surname> <given-names>D. M.</given-names></name></person-group> (<year>2014</year>). <article-title>Local masking in natural images: a database and analysis</article-title>. <source>J. Vis.</source> <volume>14</volume>:<fpage>22</fpage>. <pub-id pub-id-type="doi">10.1167/14.8.22</pub-id><pub-id pub-id-type="pmid">25074900</pub-id></citation></ref>
<ref id="B2">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Balasubramani</surname> <given-names>P. P.</given-names></name> <name><surname>Moreno-Bote</surname> <given-names>R.</given-names></name> <name><surname>Hayden</surname> <given-names>B. Y.</given-names></name></person-group> (<year>2018</year>). <article-title>Using a simple neural network to delineate some principles of distributed economic choice</article-title>. <source>Front. Comput. Neurosci.</source> <volume>12</volume>:<fpage>22</fpage>. <pub-id pub-id-type="doi">10.3389/fncom.2018.00022</pub-id><pub-id pub-id-type="pmid">29643773</pub-id></citation></ref>
<ref id="B3">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Blake</surname> <given-names>R.</given-names></name> <name><surname>Logothetis</surname> <given-names>N. K.</given-names></name></person-group> (<year>2002</year>). <article-title>Visual competition</article-title>. <source>Nat. Rev. Neurosci.</source> <volume>3</volume>, <fpage>13</fpage>&#x02013;<lpage>21</lpage>. <pub-id pub-id-type="doi">10.1038/nrn701</pub-id><pub-id pub-id-type="pmid">11823801</pub-id></citation></ref>
<ref id="B4">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Blake</surname> <given-names>R.</given-names></name> <name><surname>Tong</surname> <given-names>F.</given-names></name></person-group> (<year>2008</year>). <article-title>Binocular rivalry</article-title>. <source>Scholarpedia</source> <volume>3</volume>:<fpage>1578</fpage>. <pub-id pub-id-type="doi">10.4249/scholarpedia.1578</pub-id></citation></ref>
<ref id="B5">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Bradley</surname> <given-names>C.</given-names></name> <name><surname>Abrams</surname> <given-names>J.</given-names></name> <name><surname>Geisler</surname> <given-names>W. S.</given-names></name></person-group> (<year>2014</year>). <article-title>Retina-V1 model of detectability across the visual field</article-title>. <source>J. Vis.</source> <volume>14</volume>:<fpage>22</fpage>. <pub-id pub-id-type="doi">10.1167/14.12.22</pub-id><pub-id pub-id-type="pmid">25336179</pub-id></citation></ref>
<ref id="B6">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Cadieu</surname> <given-names>C. F.</given-names></name> <name><surname>Hong</surname> <given-names>H.</given-names></name> <name><surname>Yamins</surname> <given-names>D. L.</given-names></name> <name><surname>Pinto</surname> <given-names>N.</given-names></name> <name><surname>Ardila</surname> <given-names>D.</given-names></name> <name><surname>Solomon</surname> <given-names>E. A.</given-names></name> <etal/></person-group>. (<year>2014</year>). <article-title>Deep neural networks rival the representation of primate IT cortex for core visual object recognition</article-title>. <source>PLoS Comput. Biol.</source> <volume>10</volume>:<fpage>e1003963</fpage>. <pub-id pub-id-type="doi">10.1371/journal.pcbi.1003963</pub-id><pub-id pub-id-type="pmid">25521294</pub-id></citation></ref>
<ref id="B7">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Carter</surname> <given-names>O.</given-names></name> <name><surname>Cavanagh</surname> <given-names>P.</given-names></name></person-group> (<year>2007</year>). <article-title>Onset rivalry: brief presentation isolates an early independent phase of perceptual competition</article-title>. <source>PLoS ONE</source> <volume>2</volume>:<fpage>e343</fpage>. <pub-id pub-id-type="doi">10.1371/journal.pone.0000343</pub-id><pub-id pub-id-type="pmid">17406667</pub-id></citation></ref>
<ref id="B8">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Cichy</surname> <given-names>R. M.</given-names></name> <name><surname>Khosla</surname> <given-names>A.</given-names></name> <name><surname>Pantazis</surname> <given-names>D.</given-names></name> <name><surname>Oliva</surname> <given-names>A.</given-names></name></person-group> (<year>2017</year>). <article-title>Dynamics of scene representations in the human brain revealed by magnetoencephalography and deep neural networks</article-title>. <source>Neuroimage</source> <volume>153</volume>, <fpage>346</fpage>&#x02013;<lpage>358</lpage>. <pub-id pub-id-type="doi">10.1016/j.neuroimage.2016.03.063</pub-id><pub-id pub-id-type="pmid">27039703</pub-id></citation></ref>
<ref id="B9">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Corney</surname> <given-names>D.</given-names></name> <name><surname>Lotto</surname> <given-names>R. B.</given-names></name></person-group> (<year>2007</year>). <article-title>What are lightness illusions and why do we see them?</article-title>. <source>PLoS Comput. Biol.</source> <volume>3</volume>:<fpage>e180</fpage>. <pub-id pub-id-type="doi">10.1371/journal.pcbi.0030180</pub-id><pub-id pub-id-type="pmid">17907795</pub-id></citation></ref>
<ref id="B10">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Daw</surname> <given-names>N. D.</given-names></name> <name><surname>O&#x00027;Doherty</surname> <given-names>J. P.</given-names></name> <name><surname>Dayan</surname> <given-names>P.</given-names></name> <name><surname>Seymour</surname> <given-names>B.</given-names></name> <name><surname>Dolan</surname> <given-names>R. J.</given-names></name></person-group> (<year>2006</year>). <article-title>Cortical substrates for exploratory decisions in humans</article-title>. <source>Nature</source> <volume>441</volume>, <fpage>876</fpage>&#x02013;<lpage>879</lpage>. <pub-id pub-id-type="doi">10.1038/nature04766</pub-id><pub-id pub-id-type="pmid">16778890</pub-id></citation></ref>
<ref id="B11">
<citation citation-type="other"><person-group person-group-type="author"><name><surname>Geirhos</surname> <given-names>R.</given-names></name> <name><surname>Janssen</surname> <given-names>D. H.</given-names></name> <name><surname>Sch&#x000FC;tt</surname> <given-names>H. H.</given-names></name> <name><surname>Rauber</surname> <given-names>J.</given-names></name> <name><surname>Bethge</surname> <given-names>M.</given-names></name> <name><surname>Wichmann</surname> <given-names>F. A.</given-names></name></person-group> (<year>2017</year>). <article-title>Comparing deep neural networks against humans: object recognition when the signal gets weaker</article-title>. <source>arXiv preprint arXiv:1706.06969</source>.</citation></ref>
<ref id="B12">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>G&#x000FC;&#x000E7;l&#x000FC;</surname> <given-names>U.</given-names></name> <name><surname>van Gerven</surname> <given-names>M. A.</given-names></name></person-group> (<year>2015</year>). <article-title>Deep neural networks reveal a gradient in the complexity of neural representations across the ventral stream</article-title>. <source>J. Neurosci.</source> <volume>35</volume>, <fpage>10005</fpage>&#x02013;<lpage>10014</lpage>. <pub-id pub-id-type="doi">10.1523/JNEUROSCI.5023-14.2015</pub-id><pub-id pub-id-type="pmid">26157000</pub-id></citation></ref>
<ref id="B13">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>He</surname> <given-names>K.</given-names></name> <name><surname>Zhang</surname> <given-names>X.</given-names></name> <name><surname>Ren</surname> <given-names>S.</given-names></name> <name><surname>Sun</surname> <given-names>J.</given-names></name></person-group> (<year>2016</year>). <article-title>Deep residual learning for image recognition</article-title>, in <source>Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition</source> (<publisher-loc>Las Vegas, NV</publisher-loc>), <fpage>770</fpage>&#x02013;<lpage>778</lpage>. <pub-id pub-id-type="doi">10.1109/CVPR.2016.90</pub-id></citation></ref>
<ref id="B14">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Hertel</surname> <given-names>L.</given-names></name> <name><surname>Barth</surname> <given-names>E.</given-names></name> <name><surname>K&#x000E4;ster</surname> <given-names>T.</given-names></name> <name><surname>Martinetz</surname> <given-names>T.</given-names></name></person-group> (<year>2015</year>). <article-title>Deep convolutional neural networks as generic feature extractors</article-title>, in <source>2015 International Joint Conference on Neural Networks (IJCNN)</source> (<publisher-loc>Killarney</publisher-loc>), <fpage>1</fpage>&#x02013;<lpage>4</lpage>.</citation></ref>
<ref id="B15">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Hollins</surname> <given-names>M.</given-names></name></person-group> (<year>1980</year>). <article-title>The effect of contrast on the completeness of binocular rivalry suppression</article-title>. <source>Percept. Psychophys.</source> <volume>27</volume>, <fpage>550</fpage>&#x02013;<lpage>556</lpage>. <pub-id pub-id-type="doi">10.3758/BF03198684</pub-id><pub-id pub-id-type="pmid">7393703</pub-id></citation></ref>
<ref id="B16">
<citation citation-type="other"><person-group person-group-type="author"><name><surname>Huh</surname> <given-names>M.</given-names></name> <name><surname>Agrawal</surname> <given-names>P.</given-names></name> <name><surname>Efros</surname> <given-names>A. A.</given-names></name></person-group> (<year>2016</year>). <article-title>What makes ImageNet good for transfer learning?</article-title>. <source>arXiv preprint arXiv:1608.08614</source>.</citation></ref>
<ref id="B17">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Khaligh-Razavi</surname> <given-names>S. M.</given-names></name> <name><surname>Kriegeskorte</surname> <given-names>N.</given-names></name></person-group> (<year>2014</year>). <article-title>Deep supervised, but not unsupervised, models may explain IT cortical representation</article-title>. <source>PLoS Comput. Biol.</source> <volume>10</volume>:<fpage>e1003915</fpage>. <pub-id pub-id-type="doi">10.1371/journal.pcbi.1003915</pub-id><pub-id pub-id-type="pmid">25375136</pub-id></citation></ref>
<ref id="B18">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Kheradpisheh</surname> <given-names>S. R.</given-names></name> <name><surname>Ghodrati</surname> <given-names>M.</given-names></name> <name><surname>Ganjtabesh</surname> <given-names>M.</given-names></name> <name><surname>Masquelier</surname> <given-names>T.</given-names></name></person-group> (<year>2016</year>). <article-title>Humans and deep networks largely agree on which kinds of variation make object recognition harder</article-title>. <source>Front. Comput. Neurosci.</source> <volume>10</volume>:<fpage>92</fpage>. <pub-id pub-id-type="doi">10.3389/fncom.2016.00092</pub-id><pub-id pub-id-type="pmid">27642281</pub-id></citation></ref>
<ref id="B19">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Kriegeskorte</surname> <given-names>N.</given-names></name></person-group> (<year>2015</year>). <article-title>Deep neural networks: a new framework for modeling biological vision and brain information processing</article-title>. <source>Annu. Rev. Vis. Sci.</source> <volume>1</volume>, <fpage>417</fpage>&#x02013;<lpage>446</lpage>. <pub-id pub-id-type="doi">10.1146/annurev-vision-082114-035447</pub-id><pub-id pub-id-type="pmid">28532370</pub-id></citation></ref>
<ref id="B20">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Krizhevsky</surname> <given-names>A.</given-names></name> <name><surname>Sutskever</surname> <given-names>I.</given-names></name> <name><surname>Hinton</surname> <given-names>G. E.</given-names></name></person-group> (<year>2012</year>). <article-title>Imagenet classification with deep convolutional neural networks</article-title>, in <source>Advances in Neural Information Processing Systems</source> (<publisher-loc>Lake Tahoe, NV</publisher-loc>), <fpage>1097</fpage>&#x02013;<lpage>1105</lpage>.</citation></ref>
<ref id="B21">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Kubilius</surname> <given-names>J.</given-names></name> <name><surname>Bracci</surname> <given-names>S.</given-names></name> <name><surname>Op de Beeck</surname> <given-names>H. P.</given-names></name></person-group> (<year>2016</year>). <article-title>Deep neural networks as a computational model for human shape sensitivity</article-title>. <source>PLoS Comput. Biol.</source> <volume>12</volume>:<fpage>e1004896</fpage>. <pub-id pub-id-type="doi">10.1371/journal.pcbi.1004896</pub-id><pub-id pub-id-type="pmid">27124699</pub-id></citation></ref>
<ref id="B22">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Kudo</surname> <given-names>H.</given-names></name> <name><surname>Yamamura</surname> <given-names>T.</given-names></name> <name><surname>Ohnishi</surname> <given-names>N.</given-names></name> <name><surname>Kobayashi</surname> <given-names>S.</given-names></name> <name><surname>Sugie</surname> <given-names>N.</given-names></name></person-group> (<year>1999</year>). <article-title>A neural network model of dynamically fluctuating perception of Necker cube as well as dot patterns</article-title>, in <source>AAAI/IAAI</source> (<publisher-loc>Orlando, FL</publisher-loc>), <fpage>194</fpage>&#x02013;<lpage>199</lpage>.</citation></ref>
<ref id="B23">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Laing</surname> <given-names>C. R.</given-names></name> <name><surname>Chow</surname> <given-names>C. C.</given-names></name></person-group> (<year>2002</year>). <article-title>A spiking neuron model for binocular rivalry</article-title>. <source>J. Comput. Neurosci.</source> <volume>12</volume>, <fpage>39</fpage>&#x02013;<lpage>53</lpage>. <pub-id pub-id-type="doi">10.1023/A:1014942129705</pub-id><pub-id pub-id-type="pmid">11932559</pub-id></citation></ref>
<ref id="B24">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>LeCun</surname> <given-names>Y.</given-names></name> <name><surname>Bengio</surname> <given-names>Y.</given-names></name> <name><surname>Hinton</surname> <given-names>G.</given-names></name></person-group> (<year>2015</year>). <article-title>Deep learning</article-title>. <source>Nature</source> <volume>521</volume>, <fpage>436</fpage>&#x02013;<lpage>444</lpage>. <pub-id pub-id-type="doi">10.1038/nature14539</pub-id><pub-id pub-id-type="pmid">26017442</pub-id></citation></ref>
<ref id="B25">
<citation citation-type="other"><person-group person-group-type="author"><name><surname>Lindsay</surname> <given-names>G. W.</given-names></name></person-group> (<year>2015</year>). <article-title>Feature-based attention in convolutional neural networks</article-title>. <source>arXiv preprint arXiv:1511.06408</source>.</citation></ref>
<ref id="B26">
<citation citation-type="other"><person-group person-group-type="author"><name><surname>Lindsay</surname> <given-names>G. W.</given-names></name> <name><surname>Miller</surname> <given-names>K. D.</given-names></name></person-group> (<year>2017</year>). <article-title>Understanding biological visual attention using convolutional neural networks</article-title>. <source>bioRxiv 233338</source>. <pub-id pub-id-type="doi">10.1101/233338</pub-id></citation></ref>
<ref id="B27">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Logothetis</surname> <given-names>N. K.</given-names></name></person-group> (<year>1998</year>). <article-title>Single units and conscious vision</article-title>. <source>Philos. Trans. R. Soc. B Biol. Sci.</source> <volume>353</volume>, <fpage>1801</fpage>&#x02013;<lpage>1818</lpage>. <pub-id pub-id-type="doi">10.1098/rstb.1998.0333</pub-id><pub-id pub-id-type="pmid">9854253</pub-id></citation></ref>
<ref id="B28">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Logothetis</surname> <given-names>N. K.</given-names></name> <name><surname>Leopold</surname> <given-names>D. A.</given-names></name> <name><surname>Sheinberg</surname> <given-names>D. L.</given-names></name></person-group> (<year>1996</year>). <article-title>What is rivalling during binocular rivalry?</article-title> <source>Nature</source> <volume>380</volume>, <fpage>621</fpage>&#x02013;<lpage>624</lpage>. <pub-id pub-id-type="doi">10.1038/380621a0</pub-id><pub-id pub-id-type="pmid">8602261</pub-id></citation></ref>
<ref id="B29">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Marr</surname> <given-names>D.</given-names></name> <name><surname>Poggio</surname> <given-names>T.</given-names></name></person-group> (<year>1976</year>). <source>From Understanding Computation to Understanding Neural Circuitry</source>. <publisher-loc>Cambridge, MA</publisher-loc>: <publisher-name>MIT Artificial Intelligence Laboratory</publisher-name>.</citation></ref>
<ref id="B30">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Moosavi-Dezfooli</surname> <given-names>S. M.</given-names></name> <name><surname>Fawzi</surname> <given-names>A.</given-names></name> <name><surname>Fawzi</surname> <given-names>O.</given-names></name> <name><surname>Frossard</surname> <given-names>P.</given-names></name></person-group> (<year>2017</year>). <article-title>Universal adversarial perturbations</article-title>, in <source>Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition</source> (<publisher-loc>Honolulu, HI</publisher-loc>). <pub-id pub-id-type="doi">10.1109/CVPR.2017.17</pub-id></citation></ref>
<ref id="B31">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Moreno-Bote</surname> <given-names>R.</given-names></name> <name><surname>Rinzel</surname> <given-names>J.</given-names></name> <name><surname>Rubin</surname> <given-names>N.</given-names></name></person-group> (<year>2007</year>). <article-title>Noise-induced alternations in an attractor network model of perceptual bistability</article-title>. <source>J. Neurophysiol.</source> <volume>98</volume>, <fpage>1125</fpage>&#x02013;<lpage>1139</lpage>. <pub-id pub-id-type="doi">10.1152/jn.00116.2007</pub-id><pub-id pub-id-type="pmid">17615138</pub-id></citation></ref>
<ref id="B32">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Moreno-Bote</surname> <given-names>R.</given-names></name> <name><surname>Knill</surname> <given-names>D. C.</given-names></name> <name><surname>Pouget</surname> <given-names>A.</given-names></name></person-group> (<year>2011</year>). <article-title>Bayesian sampling in visual perception</article-title>. <source>Proc. Natl. Acad. Sci. U.S.A.</source> <volume>108</volume>, <fpage>12491</fpage>&#x02013;<lpage>12496</lpage>. <pub-id pub-id-type="doi">10.1073/pnas.1101430108</pub-id><pub-id pub-id-type="pmid">21742982</pub-id></citation></ref>
<ref id="B33">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Mueller</surname> <given-names>T. J.</given-names></name> <name><surname>Blake</surname> <given-names>R.</given-names></name></person-group> (<year>1989</year>). <article-title>A fresh look at the temporal dynamics of binocular rivalry</article-title>. <source>Biol. Cybern.</source> <volume>61</volume>, <fpage>223</fpage>&#x02013;<lpage>232</lpage>. <pub-id pub-id-type="doi">10.1007/BF00198769</pub-id><pub-id pub-id-type="pmid">2765591</pub-id></citation></ref>
<ref id="B34">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Nguyen</surname> <given-names>A.</given-names></name> <name><surname>Yosinski</surname> <given-names>J.</given-names></name> <name><surname>Clune</surname> <given-names>J.</given-names></name></person-group> (<year>2015</year>). <article-title>Deep neural networks are easily fooled: high confidence predictions for unrecognizable images</article-title>, in <source>Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition</source> (<publisher-loc>Boston, MA</publisher-loc>), <fpage>427</fpage>&#x02013;<lpage>436</lpage>.</citation></ref>
<ref id="B35">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>O&#x00027;Shea</surname> <given-names>R. P.</given-names></name> <name><surname>Parker</surname> <given-names>A.</given-names></name> <name><surname>La Rooy</surname> <given-names>D.</given-names></name> <name><surname>Alais</surname> <given-names>D.</given-names></name></person-group> (<year>2009</year>). <article-title>Monocular rivalry exhibits three hallmarks of binocular rivalry: evidence for common processes</article-title>. <source>Vis. Res.</source> <volume>49</volume>, <fpage>671</fpage>&#x02013;<lpage>681</lpage>. <pub-id pub-id-type="doi">10.1016/j.visres.2009.01.020</pub-id><pub-id pub-id-type="pmid">19232529</pub-id></citation></ref>
<ref id="B36">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>O&#x00027;Shea</surname> <given-names>R. P.</given-names></name> <name><surname>Roeber</surname> <given-names>U.</given-names></name> <name><surname>Wade</surname> <given-names>N. J.</given-names></name></person-group> (<year>2017</year>). <article-title>On the discovery of monocular rivalry by Tscherning in 1898: translation and review</article-title>. <source>i-Perception</source> <volume>8</volume>:<fpage>2041669517743523</fpage>. <pub-id pub-id-type="doi">10.1177/2041669517743523</pub-id><pub-id pub-id-type="pmid">29225766</pub-id></citation></ref>
<ref id="B37">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Polonsky</surname> <given-names>A.</given-names></name> <name><surname>Blake</surname> <given-names>R.</given-names></name> <name><surname>Braun</surname> <given-names>J.</given-names></name> <name><surname>Heeger</surname> <given-names>D. J.</given-names></name></person-group> (<year>2000</year>). <article-title>Neuronal activity in human primary visual cortex correlates with perception during binocular rivalry</article-title>. <source>Nat. Neurosci.</source> <volume>3</volume>, <fpage>1153</fpage>&#x02013;<lpage>1159</lpage>. <pub-id pub-id-type="doi">10.1038/80676</pub-id><pub-id pub-id-type="pmid">11036274</pub-id></citation></ref>
<ref id="B38">
<citation citation-type="other"><person-group person-group-type="author"><name><surname>Rajalingham</surname> <given-names>R.</given-names></name> <name><surname>Issa</surname> <given-names>E. B.</given-names></name> <name><surname>Bashivan</surname> <given-names>P.</given-names></name> <name><surname>Kar</surname> <given-names>K.</given-names></name> <name><surname>Schmidt</surname> <given-names>K.</given-names></name> <name><surname>DiCarlo</surname> <given-names>J. J.</given-names></name></person-group> (<year>2018</year>). <article-title>Large-scale, high-resolution comparison of the core visual object recognition behavior of humans, monkeys, and state-of-the-art deep artificial neural networks</article-title>. <source>bioRxiv 240614</source>. <pub-id pub-id-type="doi">10.1101/240614</pub-id><pub-id pub-id-type="pmid">30006365</pub-id></citation></ref>
<ref id="B39">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Russakovsky</surname> <given-names>O.</given-names></name> <name><surname>Deng</surname> <given-names>J.</given-names></name> <name><surname>Su</surname> <given-names>H.</given-names></name> <name><surname>Krause</surname> <given-names>J.</given-names></name> <name><surname>Satheesh</surname> <given-names>S.</given-names></name> <name><surname>Ma</surname> <given-names>S.</given-names></name> <etal/></person-group>. (<year>2015</year>). <article-title>Imagenet large scale visual recognition challenge</article-title>. <source>Int. J. Comput. Vis.</source> <volume>115</volume>, <fpage>211</fpage>&#x02013;<lpage>252</lpage>. <pub-id pub-id-type="doi">10.1007/s11263-015-0816-y</pub-id></citation></ref>
<ref id="B40">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Sch&#x000FC;tt</surname> <given-names>H. H.</given-names></name> <name><surname>Wichmann</surname> <given-names>F. A.</given-names></name></person-group> (<year>2017</year>). <article-title>An image-computable psychophysical spatial vision model</article-title>. <source>J. Vis.</source> <volume>17</volume>:<fpage>12</fpage>. <pub-id pub-id-type="doi">10.1167/17.12.12</pub-id><pub-id pub-id-type="pmid">29053781</pub-id></citation></ref>
<ref id="B41">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Seeliger</surname> <given-names>K.</given-names></name> <name><surname>Fritsche</surname> <given-names>M.</given-names></name> <name><surname>G&#x000FC;&#x000E7;l&#x000FC;</surname> <given-names>U.</given-names></name> <name><surname>Schoenmakers</surname> <given-names>S.</given-names></name> <name><surname>Schoffelen</surname> <given-names>J. M.</given-names></name> <name><surname>Bosch</surname> <given-names>S. E.</given-names></name> <etal/></person-group>. (<year>2017</year>). <article-title>Convolutional neural network-based encoding and decoding of visual object recognition in space and time</article-title>. <source>Neuroimage</source> <volume>180</volume>, <fpage>253</fpage>&#x02013;<lpage>266</lpage>. <pub-id pub-id-type="doi">10.1016/j.neuroimage.2017.07.018</pub-id></citation></ref>
<ref id="B42">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Sheinberg</surname> <given-names>D. L.</given-names></name> <name><surname>Logothetis</surname> <given-names>N. K.</given-names></name></person-group> (<year>1997</year>). <article-title>The role of temporal cortical areas in perceptual organization</article-title>. <source>Proc. Natl. Acad. Sci. U.S.A.</source> <volume>94</volume>, <fpage>3408</fpage>&#x02013;<lpage>3413</lpage>. <pub-id pub-id-type="doi">10.1073/pnas.94.7.3408</pub-id><pub-id pub-id-type="pmid">9096407</pub-id></citation></ref>
<ref id="B43">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Shpiro</surname> <given-names>A.</given-names></name> <name><surname>Moreno-Bote</surname> <given-names>R.</given-names></name> <name><surname>Rubin</surname> <given-names>N.</given-names></name> <name><surname>Rinzel</surname> <given-names>J.</given-names></name></person-group> (<year>2009</year>). <article-title>Balance between noise and adaptation in competition models of perceptual bistability</article-title>. <source>J. Comput. Neurosci.</source> <volume>27</volume>, <fpage>37</fpage>&#x02013;<lpage>54</lpage>. <pub-id pub-id-type="doi">10.1007/s10827-008-0125-3</pub-id><pub-id pub-id-type="pmid">19125318</pub-id></citation></ref>
<ref id="B44">
<citation citation-type="other"><person-group person-group-type="author"><name><surname>Simonyan</surname> <given-names>K.</given-names></name> <name><surname>Zisserman</surname> <given-names>A.</given-names></name></person-group> (<year>2014</year>). <article-title>Very deep convolutional networks for large-scale image recognition</article-title>. <source>arXiv preprint arXiv:1409.1556</source>.</citation></ref>
<ref id="B45">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Thomson</surname> <given-names>M. G.</given-names></name> <name><surname>Foster</surname> <given-names>D. H.</given-names></name> <name><surname>Summers</surname> <given-names>R. J.</given-names></name></person-group> (<year>2000</year>). <article-title>Human sensitivity to phase perturbations in natural images: a statistical framework</article-title>. <source>Perception</source> <volume>29</volume>, <fpage>1057</fpage>&#x02013;<lpage>1069</lpage>. <pub-id pub-id-type="doi">10.1068/p2867</pub-id><pub-id pub-id-type="pmid">11144819</pub-id></citation></ref>
<ref id="B46">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Tong</surname> <given-names>F.</given-names></name> <name><surname>Meng</surname> <given-names>M.</given-names></name> <name><surname>Blake</surname> <given-names>R.</given-names></name></person-group> (<year>2006</year>). <article-title>Neural bases of binocular rivalry</article-title>. <source>Trends Cogn. Sci.</source> <volume>10</volume>, <fpage>502</fpage>&#x02013;<lpage>511</lpage>. <pub-id pub-id-type="doi">10.1016/j.tics.2006.09.003</pub-id><pub-id pub-id-type="pmid">16997612</pub-id></citation></ref>
<ref id="B47">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Ullman</surname> <given-names>S.</given-names></name> <name><surname>Assif</surname> <given-names>L.</given-names></name> <name><surname>Fetaya</surname> <given-names>E.</given-names></name> <name><surname>Harari</surname> <given-names>D.</given-names></name></person-group> (<year>2016</year>). <article-title>Atoms of recognition in human and computer vision</article-title>. <source>Proc. Natl. Acad. Sci. U.S.A.</source> <volume>113</volume>, <fpage>2744</fpage>&#x02013;<lpage>2749</lpage>. <pub-id pub-id-type="doi">10.1073/pnas.1513198113</pub-id><pub-id pub-id-type="pmid">26884200</pub-id></citation></ref>
<ref id="B48">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Wilson</surname> <given-names>H. R.</given-names></name></person-group> (<year>2003</year>). <article-title>Computational evidence for a rivalry hierarchy in vision</article-title>. <source>Proc. Natl. Acad. Sci. U.S.A.</source> <volume>100</volume>, <fpage>14499</fpage>&#x02013;<lpage>14503</lpage>. <pub-id pub-id-type="doi">10.1073/pnas.2333622100</pub-id><pub-id pub-id-type="pmid">14612564</pub-id></citation></ref>
<ref id="B49">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Yamins</surname> <given-names>D. L.</given-names></name> <name><surname>DiCarlo</surname> <given-names>J. J.</given-names></name></person-group> (<year>2016</year>). <article-title>Using goal-driven deep learning models to understand sensory cortex</article-title>. <source>Nat. Neurosci.</source> <volume>19</volume>, <fpage>356</fpage>&#x02013;<lpage>365</lpage>. <pub-id pub-id-type="doi">10.1038/nn.4244</pub-id><pub-id pub-id-type="pmid">26906502</pub-id></citation></ref>
<ref id="B50">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Yamins</surname> <given-names>D. L.</given-names></name> <name><surname>Hong</surname> <given-names>H.</given-names></name> <name><surname>Cadieu</surname> <given-names>C. F.</given-names></name> <name><surname>Solomon</surname> <given-names>E. A.</given-names></name> <name><surname>Seibert</surname> <given-names>D.</given-names></name> <name><surname>DiCarlo</surname> <given-names>J. J.</given-names></name></person-group> (<year>2014</year>). <article-title>Performance-optimized hierarchical models predict neural responses in higher visual cortex</article-title>. <source>Proc. Natl. Acad. Sci. U.S.A.</source> <volume>111</volume>, <fpage>8619</fpage>&#x02013;<lpage>8624</lpage>. <pub-id pub-id-type="doi">10.1073/pnas.1403112111</pub-id><pub-id pub-id-type="pmid">24812127</pub-id></citation></ref>
<ref id="B51">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Yosinski</surname> <given-names>J.</given-names></name> <name><surname>Clune</surname> <given-names>J.</given-names></name> <name><surname>Bengio</surname> <given-names>Y.</given-names></name> <name><surname>Lipson</surname> <given-names>H.</given-names></name></person-group> (<year>2014</year>). <article-title>How transferable are features in deep neural networks?</article-title>, in <source>Advances in Neural Information Processing Systems</source> (<publisher-loc>Montreal, QC</publisher-loc>), <fpage>3320</fpage>&#x02013;<lpage>3328</lpage>.</citation></ref>
<ref id="B52">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Zeman</surname> <given-names>A.</given-names></name> <name><surname>Obst</surname> <given-names>O.</given-names></name> <name><surname>Brooks</surname> <given-names>K. R.</given-names></name> <name><surname>Rich</surname> <given-names>A. N.</given-names></name></person-group> (<year>2013</year>). <article-title>The M&#x000FC;ller-Lyer illusion in a computational model of biological object recognition</article-title>. <source>PLoS ONE</source> <volume>8</volume>:<fpage>e56126</fpage>. <pub-id pub-id-type="doi">10.1371/journal.pone.0056126</pub-id><pub-id pub-id-type="pmid">23457510</pub-id></citation></ref>
</ref-list> 
</back>
</article>