Perception Science in the Age of Deep Neural Networks

VanRullen, Rufin

doi:10.3389/fpsyg.2017.00142

SPECIALTY GRAND CHALLENGE article

Front. Psychol., 02 February 2017

Sec. Perception Science

Volume 8 - 2017 | https://doi.org/10.3389/fpsyg.2017.00142

This article is part of the Research Topic: Grand Challenges for Psychological Science in the 21st Century

Rufin VanRullen¹,²*

¹Centre National de la Recherche Scientifique, UMR 5549, Faculté de Médecine Purpan, Toulouse, France
²Université de Toulouse, Centre de Recherche Cerveau et Cognition, Université Paul Sabatier, Toulouse, France

For decades, perception was considered a unique ability of biological systems, little understood in its inner workings, and virtually impossible to match in artificial systems. But this status quo was upturned in recent years, with dramatic improvements in computer models of perception brought about by “deep learning” approaches. What does all the ruckus about a “new dawn of artificial intelligence” imply for the neuroscientific and psychological study of perception? Is it a threat, an opportunity, or maybe a little of both?

While We Were Sleeping…

My personal journey in the field of perception science started about 20 years ago. For as long as I can remember, we perception scientists have exploited in our papers and grant proposals the lack of human-level artificial perception systems, both as a justification for scientific inquiry, and as a convenient excuse for using a cautious, methodical approach—i.e., “baby steps.” Visual object recognition, for example, seemed such an intractable problem that it was obviously more reasonable to study simple stimuli (e.g., Gabor patches), or to focus on highly specific sub-components of object recognition (e.g., symmetry invariance). But now neural networks, loosely inspired by the hierarchical architecture of the primate visual system, routinely outperform humans in object recognition tasks (Krizhevsky et al., 2012; Sermanet et al., 2013; Simonyan and Zisserman, 2014; He et al., 2015; Szegedy et al., 2015). Our excuse is gone—and yet we are still nowhere near a complete description and understanding of biological vision.

It would take a monastic life over the last 5 years to be fully unaware of the recent developments in machine learning and artificial intelligence. Things that robots could only do in science fiction movies can now be performed by our smartphones, sometimes without our even noticing. We talk to Siri, Cortana, Google Assistant, or Alexa; they understand, obey, and respond with naturalistic speech and an occasional joke. Any language can be comprehended and translated near-instantaneously (Johnson et al., 2016; van den Oord et al., 2016). The same methods that have been used to crack Natural Language Processing (NLP) have also been applied to the creation of novel music (Hadjeres and Pachet, 2016; van den Oord et al., 2016) (youtube.com/watch?v=LSHZ_b05W7o or youtu.be/QiBM7-5hA6o), or to writing new texts, from novels to TV show scripts to fake (but eerily credible) Donald Trump tweets (twitter.com/deepdrumpf). Chatbots based on these algorithms are set to replace humans in many online services.

The staggering “creativity” of machines is also expressed in the field of image processing and machine vision. Human-level object recognition networks trained by “deep learning” were only the beginning. Now complex scenes can be analyzed to precisely localize and identify each object and its relation to others, and to provide a natural text description, e.g., “two children are playing ball on the beach” (Karpathy and Fei-Fei, 2015; Vinyals et al., 2016). By inverting the analysis process (“deconvolution”), novel images can be synthesized, giving such networks the ability to “dream” (Mordvintsev et al., 2015), but also to perform useful image processing feats. You can take a portrait and make the person smile, or look younger (Figure 1). You can give a holiday picture and have it painted like a Renoir (Gatys et al., 2015; Dumoulin et al., 2016). You can input an old black-and-white photo and have it colorized (Isola et al., 2016; Zhang et al., 2016). You can give a 3-color doodle (“here goes the lake, here are some trees, and there is the sky”) and have a realistic photo synthesized (Champandard, 2016; Isola et al., 2016). You can give a line drawing and turn it into a real object (Isola et al., 2016). You can give a low-resolution picture and have its original resolution restored (Dong et al., 2015; Romano et al., 2016). You can give a text description, and have a novel, never-before-seen picture generated from scratch (Mansimov et al., 2015; Nguyen et al., 2016). There does not seem to be any limit to what can be done, except for human imagination (and training datasets).

Figure 1. A “variational auto-encoder” (VAE) deep network (13 layers) was trained using an unsupervised “generative adversarial network” procedure (VAE/GAN, Goodfellow et al., 2014; Larsen et al., 2015) on a labeled database of 202,599 celebrity faces (15 epochs). The latent space (1024-dimensional) of the resulting network provides a description of numerous facial features that could approximate face representations in the human brain. (A) A picture of the author as seen (i.e., “encoded”) by the network is rendered (i.e., “decoded”) in the center of the panel. After encoding, the latent space can be sampled with simple linear algebra. For example, adding a “beard vector” (obtained by subtracting the average latent description of 1000 faces having a “no-beard” label from the average latent description of 1000 faces having a “beard” label) before decoding creates a realistic image of the author with a beard. The same operation can be done (clockwise, from right) by adding average vectors reflecting the labels “bald,” “old,” “young,” or “smile.” In short, the network manipulates concepts, which it can extract from and render to pixel-based representations. It is tempting to envision that the 1024 “hidden neurons” forming this latent space could display a pattern of stimulus selectivity comparable to that observed in certain human face-selective regions (Kanwisher et al., 1997; Tsao et al., 2006; Freiwald et al., 2009; Freiwald and Tsao, 2010). (B) Since the network (much like the human brain) was trained solely with upright faces, it inappropriately encodes an upside-down face, partly erasing important facial features (the mouth) and “hallucinating” nonexistent features (a faint nose and mouth in the forehead region). This illustrates how human-like perceptual behavior (here, the face inversion effect) can emerge from computational principles. The database used for training this network is accessible from mmlab.ie.cuhk.edu.hk/projects/CelebA.html (Liu et al., 2015).
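
The latent-space arithmetic described in the caption can be illustrated with a minimal sketch. It is not the code used to produce Figure 1: the latent codes below are synthetic random vectors standing in for encoder outputs, and the names `attribute_vector` and `add_attribute`, as well as the choice of dimension 0 as the "beard" dimension, are hypothetical. Only the averaging-and-subtracting recipe and the 1024-dimensional space come from the caption.

```python
import numpy as np

def attribute_vector(latents, has_attribute):
    """Mean latent code of labeled faces minus mean latent code of unlabeled faces."""
    latents = np.asarray(latents, dtype=float)
    has_attribute = np.asarray(has_attribute, dtype=bool)
    return latents[has_attribute].mean(axis=0) - latents[~has_attribute].mean(axis=0)

def add_attribute(z, direction, strength=1.0):
    """Shift one latent code along an attribute direction before decoding."""
    return np.asarray(z, dtype=float) + strength * direction

# Synthetic stand-in for encoder outputs (assumption): 2000 codes, 1024 dimensions.
rng = np.random.default_rng(0)
latent_dim = 1024
true_direction = np.zeros(latent_dim)
true_direction[0] = 3.0                       # pretend dimension 0 encodes "beard"
labels = np.arange(2000) < 1000               # first 1000 faces carry the "beard" label
codes = rng.normal(size=(2000, latent_dim)) + np.outer(labels.astype(float), true_direction)

beard_vector = attribute_vector(codes, labels)
author_code = rng.normal(size=latent_dim)     # hypothetical encoding of one face
bearded_code = add_attribute(author_code, beard_vector)
# A real pipeline would now pass bearded_code to the network's decoder.
```

In this toy setting the recovered `beard_vector` is concentrated on the one dimension that was shifted, which is the property that makes the "simple linear algebra" of the caption work.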

Meanwhile, the field of Perception Science still struggles to explain how sensory information is turned into meaningful concepts by the human (or animal) brain, let alone understanding imagination, or artistic creativity. This, then, is the rather pessimistic take on the impact of this machine learning revolution for Perception Science: It forces us to take a good, hard look at our slow progress. While we were arguing over the details, somebody figured out the big picture.

What Dreams May Come

But there are, of course, arguments against such a dark depiction. For one thing, machine learning still has a long way to go. There are many areas of perception science where deep neural networks (DNNs) haven't been applied yet, or have not yet met the anticipated success: For example, motion processing, ocular disparity and depth processing, color constancy, grouping and Gestalt laws, attention, perceptual multi-stability, or multi-sensory integration, just to name a few. On the other hand, it can be mathematically demonstrated that whenever there exists a reasonable solution to map inputs onto outputs, deep learning has the ability to find it. And by definition, for any perceptual science problem there is at least one reasonable solution: The one implemented in our brains. So these apparent limitations of deep learning are unlikely to hold for very long: They will be easily cracked, as soon as scientists harness sufficient motivation (which often hinges on the prospect of commercial applications), can properly assess the relevant input and output spaces, and can gather enough training data.

Moreover, there are concerns about the biological plausibility of current machine learning approaches. If our brains' abilities are emulated by algorithms that could not possibly exist in the human brain, then these artificial networks, however powerful, cannot really inform us about the brain's behavior. Such concerns include the great reliance of deep neural networks on supervised learning methods using large datasets of labeled exemplars. In contrast, humans can often learn without explicit supervision or “labels.” Unsupervised learning methods do exist for artificial neural networks, but they often give rise to a feature space that is insufficiently powerful and needs to be complemented by supervised fine-tuning in order to allow, for example, for accurate object recognition (Hinton et al., 1995, 2006; Hinton and Salakhutdinov, 2006). The large amounts of labeled training data required for deep learning can themselves be viewed as implausible. Most important perhaps is the absence of a generally accepted equivalent solution to the back-propagation algorithm in biological brains: This algorithm is the cornerstone of deep learning (LeCun et al., 2015), and allows gradient-descent optimization of connection weights to be performed iteratively (via the so-called “chain rule”) through the multiple layers of a network. Furthermore, there are crucial aspects of biological neural networks that are plainly disregarded in the major deep learning approaches. In particular, most state-of-the-art deep neural networks do not use spikes, and thus have no real temporal dynamics to speak of (just arbitrary, discrete time steps). This simplification implies that such networks cannot help us in understanding dynamic aspects of brain function, such as neural synchronization and oscillatory communication. Finally, the most successful deep networks so far have strongly relied on feed-forward architectures, whereas the brain includes massive feedback connections. The popular recurrent neural networks (RNN) are an exception (Hochreiter and Schmidhuber, 1997; Pascanu et al., 2013), but even they have specific short-range feedback loops that do not compare with the brain's long-range connectivity (and the existence of communication “hubs,” like the thalamus).
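
For readers less familiar with the back-propagation step mentioned above, here is a minimal sketch of gradient descent through two layers using the chain rule. The network size, the tanh nonlinearity, the squared-error loss, the learning rate, and the toy regression target are all illustrative assumptions, not taken from any of the cited models.

```python
import numpy as np

def forward(params, x):
    W1, W2 = params
    h = np.tanh(x @ W1)          # hidden-layer activations
    y = h @ W2                   # linear output layer
    return h, y

def loss_and_grads(params, x, target):
    W1, W2 = params
    h, y = forward(params, x)
    err = y - target
    loss = 0.5 * np.mean(np.sum(err ** 2, axis=1))
    # Chain rule, applied from the output back toward the input.
    dy = err / x.shape[0]        # dL/dy
    dW2 = h.T @ dy               # dL/dW2
    dh = dy @ W2.T               # dL/dh
    dpre = dh * (1.0 - h ** 2)   # back through tanh
    dW1 = x.T @ dpre             # dL/dW1
    return loss, (dW1, dW2)

def train(params, x, target, lr=0.1, steps=500):
    losses = []
    for _ in range(steps):
        loss, grads = loss_and_grads(params, x, target)
        params = tuple(p - lr * g for p, g in zip(params, grads))
        losses.append(loss)
    return params, losses

rng = np.random.default_rng(0)
x = rng.normal(size=(64, 3))                        # toy inputs (assumption)
target = np.sin(x.sum(axis=1, keepdims=True))       # toy regression target (assumption)
params = (0.5 * rng.normal(size=(3, 8)), 0.5 * rng.normal(size=(8, 1)))
trained, losses = train(params, x, target)
```

Note that the update of `W1` needs the error signal `dy` carried backward through `W2`. This backward transport of error through the same weights used in the forward pass is the step for which no generally accepted biological equivalent is known.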

All these deviations from known biological properties, often motivated by considerations of computational efficiency, do not constitute real barriers, and recent work is starting to reconcile machine learning and brain reality on most of these fronts. Unsupervised and semi-supervised learning methods have been suggested that require no or only few occasional labels to be provided (Anselmi et al., 2013; Doersch et al., 2015; Wang and Gupta, 2015). Some of these methods can also learn features and representations from one or just a few exemplars, a form of “one-shot learning” on par with human capabilities (Anselmi et al., 2013; Rezende et al., 2016; Santoro et al., 2016). At least certain forms of backpropagation appear compatible with a number of biological observations, e.g., spike timing-dependent plasticity (Scellier and Bengio, 2016). Deep neural networks that use spikes are becoming commonplace (Yu et al., 2013; Cao et al., 2015; Diehl et al., 2015; Hunsberger and Eliasmith, 2016; Kheradpisheh et al., 2016a; Lee et al., 2016; Zambrano and Bohte, 2016), and attempts have also been made to introduce oscillatory components in deep networks (Rao and Cecchi, 2011, 2013; Reichert and Serre, 2013). Finally, new DNN architectures are emerging with long-range feedforward (Huang et al., 2016a,b) and feedback connectivity (Pascanu et al., 2013; Zilly et al., 2016). In summary, it would be shortsighted to discard deep learning as irrelevant for understanding biological perception, simply based on its currently imperfect biological plausibility.

A possibly deeper limitation of machine learning lies in the argument that merely replicating behavior in an artificial system does not imply any understanding of the underlying function. In this view, we perception scientists are still left with all the work to do for the latter. But now, we are not limited anymore to studying biological systems through measurements of external behavior or through sparse and nearly-random samplings of neural activity—we can also scrutinize their artificial cousins, the deep neural networks, for which every neuron's activation function is readily accessible, and in which systematic investigations can thus prove much easier.

A Wake-Up Call for Perception Science

Thankfully, there are many other reasons to view the recent machine learning advances in an optimistic light. It is likely that the image and sound synthesis abilities of deep networks (e.g., Figure 1) will serve in the near future as a significant source of well-controlled experimental stimuli, and innovative new experimental designs. Gradient descent can be applied, for example, to create images of objects that will be recognized by humans but not by state-of-the-art deep networks (by designing a “loss function” ensuring that image content is preserved in early layers of the network, but abolished in the final layers), or conversely, non-sense images that “fool” a deep network into recognizing a given object (by inverting the aforementioned loss function) (Nguyen et al., 2014). Which brain regions would respond to the latter, and which to the former? How would event-related potentials, or brain oscillatory activity, react to each image type? Could certain “selective” behaviors (e.g., rapid selective eye movements) be preserved in the absence of explicit recognition?
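
The "fooling" direction of this idea can be sketched in a few lines, under heavy simplifying assumptions. A fixed random linear-softmax classifier stands in for a trained deep network, and the loss is simply the log-probability of the target class rather than the layer-wise loss described above. The function name `fool` and all sizes are hypothetical.

```python
import numpy as np

def softmax(z):
    z = z - z.max()              # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def fool(W, b, image, target, lr=0.01, steps=500):
    """Gradient ascent on the input so that a fixed classifier reports `target`."""
    x = image.astype(float).copy()
    for _ in range(steps):
        p = softmax(W @ x + b)
        # Gradient of log p[target] with respect to x for a linear-softmax model.
        grad = W[target] - p @ W
        x = np.clip(x + lr * grad, 0.0, 1.0)   # keep pixel values in [0, 1]
    return x

rng = np.random.default_rng(0)
n_pixels, n_classes = 64, 10
W = rng.normal(size=(n_classes, n_pixels))     # stand-in "network" weights (assumption)
b = np.zeros(n_classes)
noise = rng.uniform(size=n_pixels)             # start from a meaningless image
target_class = 3
fooling = fool(W, b, noise, target_class)
confidence = softmax(W @ fooling + b)[target_class]
```

The weights stay fixed and only the image is optimized. The resulting pixel pattern remains noise-like to a human observer while the classifier assigns it to the target class. With a real deep network, the gradient with respect to the input would be obtained by back-propagation instead of the closed-form expression used here.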

Deep learning can also turn out to be a source of powerful new data analysis tools. Neuroscience and psychological experiments produce masses of data that can prove challenging for conventional analysis methods. Some 10 or 12 years ago, multivariate pattern analysis (MVPA) methods promised to open new avenues for neuroscience research (Haynes and Rees, 2005; Kamitani and Tong, 2005). Similarly, deep networks could now become a key to reveal the complex mapping between sensory inputs, brain signals and behavioral outputs, and unlock the mysteries of the brain.

Moreover, deep neural networks are also suited to serve a more indirect role in Perception Science, not as a methods tool but as a source of inspiration for existing and novel theories about brain function. Many studies have already started to characterize the existing relations (and differences) between patterns of activity obtained from specific layers of deep networks, and from specific brain regions (Cadieu et al., 2014; Khaligh-Razavi and Kriegeskorte, 2014; Güçlü and van Gerven, 2015; Cichy et al., 2016a,b) or from human behavior (Kheradpisheh et al., 2016b,c). As alluded to in Figure 1B, the powerful latent representation spaces generated by deep neural networks could be used, for example, to study the face inversion effect. They could also help address the debate between expertise vs. domain-specificity in face processing (Kanwisher et al., 1997; Gauthier et al., 1999, 2000; Tarr and Gauthier, 2000; Rossion et al., 2004; Tsao et al., 2006; Freiwald et al., 2009; Freiwald and Tsao, 2010), or between modular vs. distributed object representations (Haxby et al., 2001; Reddy and Kanwisher, 2006), and possibly many others.
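
One common way to compare activity patterns from a network layer with those from a brain region is representational similarity analysis; the text above does not name a method, so this choice is an assumption. The sketch below uses synthetic data only: the "layer" and "region" patterns, the four stimulus categories, and the function names `rdm` and `rdm_similarity` are hypothetical stand-ins.

```python
import numpy as np

def rdm(activity):
    """Representational dissimilarity matrix: 1 - Pearson r between stimulus patterns."""
    return 1.0 - np.corrcoef(activity)

def rdm_similarity(rdm_a, rdm_b):
    """Pearson correlation between the upper triangles of two RDMs."""
    iu = np.triu_indices_from(rdm_a, k=1)
    return np.corrcoef(rdm_a[iu], rdm_b[iu])[0, 1]

rng = np.random.default_rng(0)
categories = np.repeat(np.arange(4), 5)                # 4 categories x 5 exemplars
prototypes = rng.normal(size=(4, 100))
# Rows are stimuli, columns are units (network) or voxels/neurons (brain).
layer = prototypes[categories] + 0.5 * rng.normal(size=(20, 100))
mixing = rng.normal(size=(100, 50))
region = layer @ mixing + rng.normal(size=(20, 50))    # shares the layer's structure
unrelated = rng.normal(size=(20, 50))                  # shares nothing with the layer

score_related = rdm_similarity(rdm(layer), rdm(region))
score_unrelated = rdm_similarity(rdm(layer), rdm(unrelated))
```

Because the comparison is made between dissimilarity matrices rather than raw activations, the two systems do not need the same number of units or any unit-to-unit correspondence, which is what makes it applicable to a network layer on one side and a brain region on the other.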

Finally, and perhaps most importantly, we should view the amazing recent progress of machine learning as a wake-up call, an occasion to abandon our excuses, and a reason to embolden our approaches. No more “baby steps” for us—the time is ripe to address the big picture.

Forward-Looking Statement

How does our journal fit in this global context? As usual, Frontiers in Perception Science will continue to welcome all original research papers that explore perception in and across any modalities, whether in animals, humans or (why not?) machines, using methods drawn from neuroscience and psychology (but also mathematics, engineering, and computer science). The main criterion for publication is scientific rigor and soundness applied to the study's motivations, methods, and interpretation. Perceived impact and newsworthiness are not relevant factors. While plagiarism is evidently prohibited, explicit replications of previous studies will be viewed favorably. Importantly, these (and any other) papers can equally report positive or negative outcomes, as long as the methodology is rigorous. We hope that we can thereby contribute to resolving the current confidence crisis in neuroscience and psychology (Ioannidis, 2005; Simmons et al., 2011; Open Science, 2015; Gilbert et al., 2016). Finally, the journal publishes a number of article formats that are complementary to original research and constitute an important resource for the field, such as methods articles, reviews or mini-reviews, perspectives, opinions, commentaries, and hypothesis & theory papers. For these publications as well, the main criterion remains scientific rigor and soundness.

To conclude, as the above arguments should make clear, I believe that the success of deep learning at emulating biological perception is a game-changer that our field cannot ignore. Ignoring it would be like lighting a fire by striking stones together while a flamethrower lies at our side. On the other hand, while I formulate the convergence between biological and machine perception (Cox and Dean, 2014; Kriegeskorte, 2015; Marblestone et al., 2016) as both a wish and a prediction for the future of Perception Science as a whole, it is evident that many individual papers or researchers in the field will not be systematically concerned with deep learning. That's still okay: if that is your case, Frontiers in Perception Science will remain a venue of choice for your paper. Just don't motivate it by the “inability of machine perception to achieve human-level performance”: That would be shortsighted.

Author Contributions

The author confirms being the sole contributor of this work and approved it for publication.

Conflict of Interest Statement

The author declares that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Acknowledgments

This research was supported by an ERC Consolidator grant P-CYCLES number 614244. I wish to thank Parag K. Mital for his open-source code (github.com/pkmital/CADL), and Leila Reddy for useful comments on the manuscript.

References

Anselmi, F., Leibo, J. Z., Rosasco, L., Mutch, J., Tacchetti, A., and Poggio, T. A. (2013). Unsupervised learning of invariant representations in hierarchical architectures. CoRR 1311.4158.