Non-uniqueness phenomenon of object representation in modelling IT cortex by deep convolutional neural network (DCNN)

Recently, the DCNN (Deep Convolutional Neural Network) has been advocated as a general and promising modelling approach for neural object representation in primate inferotemporal cortex. In this work, we show that an inherent non-uniqueness problem exists in the DCNN-based modelling of image object representations. This non-uniqueness phenomenon reveals, to some extent, the theoretical limitation of this general modelling approach, and calls for due attention in practice.


INTRODUCTION
Object recognition is a fundamental task of a biological vision system. It is widely believed that the primate inferotemporal (IT) cortex is the final neural site for visual object representation. Due to viewpoint change, illumination variation and other factors, how visual objects are represented in IT cortex, which manifests sufficient invariance to such identity-orthogonal factors, is still largely an open issue in neuroscience.
There are many different natural and manmade object categories, and each category in turn contains various different members. Currently, a number of works in neuroscience advocate the DCNN (Deep Convolutional Neural Network) as a new framework for modeling vision and brain information processing (Khaligh and Kriegeskorte, 2014; Kriegeskorte, 2015). In Yamins et al. (2014) and Yamins and DiCarlo (2016), the DCNN is regarded as a promising general modeling approach for understanding sensory cortex, called "the goal-driven approach." The basic idea of the goal-driven approach for IT cortex modeling can be summarized as follows: a multi-layered DCNN is trained by ONLY optimizing the object categorization performance on a large set of category-labeled object images. Once a high categorization performance is achieved, the outputs of the penultimate layer neurons of the trained DCNN, which are regarded as the object representation, can reliably predict the IT neuron spikes for other visual stimuli in rapid object recognition. In addition, the outputs of the upstream layer neurons can also predict the V4 neuron spikes. The goal-driven approach is conceptually elegant and has been successfully used to model IT cortex in rapid object recognition and to predict category-orthogonal properties (Hong et al., 2016).

Motivation
Although some experimental results have demonstrated the success of the goal-driven approach in modeling IT cortex to some extent as mentioned above, the following uniqueness problem on the fundamental premise of the goal-driven approach is still unclear: does there exist a unique pattern of activations of the neurons (units) in the penultimate layer of a DCNN to a given set of image stimuli by only optimizing the object categorization performance? This uniqueness problem on object representation via a DCNN has a great influence on the theoretical foundation and generality of the goal-driven approach in particular, and the DCNN as a new framework for vision modeling in general.
In this work, we aim to provide a theoretical analysis of this problem as well as some supporting experimental results. Note that our current work only clarifies the non-uniqueness problem in object representation modeling with DCNNs under the goal-driven approach; it does not imply that DCNNs could account for the diverse specifications of IT cortex, as revealed in numerous works (Elston, 2002, 2007; Jacobs and Scheibel, 2002; Spruston, 2008; Elston and Fujita, 2014; Luebke, 2017).
To analyse this problem more clearly, we first introduce the definition of a DCNN layer's object representation, as used for predicting the neuron responses of primate IT cortex in the aforementioned goal-driven approach:
Definition 1. For a layer of a DCNN for object recognition, the activations of the neurons in this layer to an input object image are defined as its object representation.
Following the convention in computational neuroscience, the following representation equivalence is introduced to evaluate whether the object representations learnt by two DCNNs are the same or not:
Definition 2. Given a set of object image stimuli, if the object representations of two DCNNs on these stimuli can be related by a linear transformation, they are considered equivalent, or the same representation. Otherwise, they are different representations.
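As an illustrative sketch of Definition 2 (not part of the original analysis), the check below fits the best least-squares linear map between two sets of activations and reports the relative residual. The function name `linear_fit_residual`, the synthetic activations, and the element-wise cube used to produce a non-linearly related representation are all assumptions made for illustration. A residual near zero corresponds to equivalent representations in the sense of Definition 2.

```python
import numpy as np


def linear_fit_residual(X, Y):
    """Relative residual of the best least-squares linear map from X to Y.

    X: (n_stimuli, m1) activations of one network.
    Y: (n_stimuli, m2) activations of the other network.
    """
    coef, *_ = np.linalg.lstsq(X, Y, rcond=None)
    return np.linalg.norm(Y - X @ coef) / np.linalg.norm(Y)


rng = np.random.default_rng(0)
X = rng.normal(size=(200, 8))      # synthetic stand-in for a representation
A = rng.normal(size=(8, 8))        # an arbitrary linear transformation

Y_linear = X @ A                   # linearly related to X
Y_nonlinear = X ** 3               # related to X only through a non-linear map

res_linear = linear_fit_residual(X, Y_linear)
res_nonlinear = linear_fit_residual(X, Y_nonlinear)
print(res_linear, res_nonlinear)
```

With real networks, `X` and `Y` would hold the activations of the two penultimate layers on the same stimulus set.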
In the deep learning community, a recent active research topic is called "convergent learning" (Li et al., 2016), referring to whether different DCNNs can learn the same representation at the level of neurons or groups of neurons. A generally reached conclusion is that different DCNNs with the same network architecture, trained only with different random initializations, have largely different representations at the level of neurons or groups of neurons, although their image categorization performances are similar. Note that although Li et al.'s work and the goal-driven approach focus on the representation from different points of view, the representations in the two works are closely related. Hence, the results in Li et al. (2016) also re-highlight, to some extent, the aforementioned uniqueness problem in object representation via a DCNN.
Addressing this uniqueness problem, we show in the following section that, in theory, by only optimizing the image categorization accuracy, different DCNNs can give different object representations though they have exactly the same categorization accuracy. In other words, the obtained object representations by DCNNs under the goal-driven approach could be inherently non-unique, at least in theory.

Theoretical Analysis and Experimental Results
Proposition 1. Suppose the "Softmax" function is used as the final classifier for image categorization in modeling $N$ categories of objects via a DCNN, and the object category with the largest probability is chosen as the final categorization. If $x = (x_1, x_2, \cdots, x_N)^T \in \mathbb{R}^N$ is the final output of this DCNN for an input image object $I$, $f(\cdot)$ is a univariate non-linear monotonically increasing function, and $y = (y_1, y_2, \cdots, y_N)^T = F(x) = (f(x_1), f(x_2), \cdots, f(x_N))^T$, then $x$ and $y$ give exactly the same categorization result.
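Proof: For $x$ and $y$, their corresponding probability vectors by Softmax are respectively:
$$C_x = \left(\frac{e^{x_1}}{\sum_{j=1}^{N} e^{x_j}}, \frac{e^{x_2}}{\sum_{j=1}^{N} e^{x_j}}, \cdots, \frac{e^{x_N}}{\sum_{j=1}^{N} e^{x_j}}\right)^T \quad (1)$$
$$C_y = \left(\frac{e^{y_1}}{\sum_{j=1}^{N} e^{y_j}}, \frac{e^{y_2}}{\sum_{j=1}^{N} e^{y_j}}, \cdots, \frac{e^{y_N}}{\sum_{j=1}^{N} e^{y_j}}\right)^T \quad (2)$$
Since $y_i = f(x_i)$ $(i = 1, 2, \cdots, N)$ and $f(\cdot)$ is a monotonically increasing function, the elements of $x$ and $y$ have the same magnitude order. Because the exponential function is increasing and the Softmax denominator is common to all entries, the two probability vectors $C_x$ and $C_y$ share this magnitude order as well. Since the object category with the largest probability is chosen as the final categorization, the indices of the largest elements in $C_x$ and $C_y$ are the same, hence the same categorization results are obtained for $x$ and $y$.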
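The argument of Proposition 1 can be checked numerically with a small sketch. The signed square root used for `f` below is only a hypothetical example of a non-linear monotonically increasing function, and the vector `x` is made up for illustration.

```python
import math


def softmax(v):
    """Softmax of a list of floats (shifted by the maximum for stability)."""
    m = max(v)
    exps = [math.exp(a - m) for a in v]
    total = sum(exps)
    return [e / total for e in exps]


def f(a):
    # Hypothetical monotonically increasing non-linear function: sign(a) * sqrt(|a|).
    return math.copysign(math.sqrt(abs(a)), a)


x = [2.0, -1.0, 0.5, 3.5, -4.0]    # illustrative final-layer output
y = [f(a) for a in x]              # y = F(x), applied element-wise

c_x, c_y = softmax(x), softmax(y)

# Predicted category: index of the largest probability.
pred_x = c_x.index(max(c_x))
pred_y = c_y.index(max(c_y))

# Full magnitude order of the two probability vectors (ascending indices).
rank_x = sorted(range(len(c_x)), key=lambda i: c_x[i])
rank_y = sorted(range(len(c_y)), key=lambda i: c_y[i])

print(pred_x, pred_y, rank_x == rank_y)
```

The probabilities themselves differ between `c_x` and `c_y`; only their ordering, and therefore the chosen category, is preserved.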
Remark 1: Since f (·) is a non-linear function, x and y cannot be related by a linear transformation. In addition, in the deep learning community, the Softmax function is commonly used to convert the output vector of the network into a probability vector, and the category with the largest probability value is chosen as the final category.
Remark 2: In theory, $f(\cdot)$ could be different for different input images $I$. More generally, even the monotonicity of $f(\cdot)$ is unnecessary: we only need the index of the largest value in $y$ to be the same as that in $x$, because only the largest value determines the categorization. For the Top-K categorization accuracy, we need the index set of the $K$ largest values in $y$ to be the same as that in $x$, and nothing is required of the remaining elements. Hereinafter, for notational convenience in discussion and practicality of implementation, we always assume $f(\cdot)$ is a univariate non-linear monotonically increasing function.
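A minimal sketch of the weaker requirement in Remark 2 follows. The two vectors are made-up values chosen so that no monotone function maps one to the other, while the Top-1 and Top-2 index sets still coincide; `top_k_set` is a hypothetical helper name.

```python
def top_k_set(v, k):
    """Set of indices of the k largest entries of v."""
    order = sorted(range(len(v)), key=lambda i: v[i], reverse=True)
    return set(order[:k])


x = [2.0, -1.0, 0.5, 3.5, -4.0]
# Entries 1, 2 and 4 are reordered relative to x, so y is not a monotone
# function of x, but the two largest entries stay at indices 3 and 0.
y = [5.0, 0.3, -0.2, 9.0, 0.1]

same_top1 = top_k_set(x, 1) == top_k_set(y, 1)
same_top2 = top_k_set(x, 2) == top_k_set(y, 2)
same_top3 = top_k_set(x, 3) == top_k_set(y, 3)
print(same_top1, same_top2, same_top3)
```

Here the Top-1 and Top-2 decisions agree, whereas the Top-3 index sets differ, which is the case the remark excludes for K = 3.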
Frontiers in Computational Neuroscience | www.frontiersin.org
Proposition 2. As shown in Figure 1, assume that DCNN$_1$ is a multi-layered network, concatenating a sub-network DCNN$^P_1$ whose output is $x$, and a fully connected layer with weight matrix $W_1 \in \mathbb{R}^{N \times M}$ and bias $b_1 \in \mathbb{R}^{N \times 1}$ ($M$ and $N$ are the numbers of neurons at the penultimate layer and last layer of DCNN$_1$, respectively, with $M > N$), with $x' = W_1 x + b_1$. And assume that DCNN$_2$ is a multi-layered network, concatenating a sub-network DCNN$^P_2$ whose output is $y$, and a fully connected layer with weight matrix $W_2 \in \mathbb{R}^{N \times M}$ and bias $b_2 \in \mathbb{R}^{N \times 1}$, with $y' = W_2 y + b_2$. If $y' = F(x') = (f(x'_1), f(x'_2), \cdots, f(x'_N))^T$, where $f(\cdot)$ is a monotonically increasing function, then the object representation $x$ under DCNN$_1$ cannot be related by a linear transformation to the object representation $y$ under DCNN$_2$, or $x$ and $y$ are two different object representations under the goal-driven approach.
Proof: Since $y' = F(x')$ and $f(\cdot)$ is a monotonically increasing function, according to Proposition 1, DCNN$_1$ and DCNN$_2$ have the identical image object categorization performance. Since $x' = W_1 x + b_1$ and $y' = W_2 y + b_2$, we have
$$y = W_2^{+}\left(F(W_1 x + b_1) - b_2\right) \quad (3)$$
where $W_2^{+}$ denotes the pseudo-inverse of $W_2$. By Proposition 1, $x'$ and $y'$ are related by a non-linear function, so $x$ and $y$ cannot be related by a linear transformation either. In other words, $x$ and $y$ are two different object representations under the goal-driven approach.
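The construction in Proposition 2 can be illustrated with synthetic data. The sketch below assumes random weights and representations, illustrative sizes, and a hypothetical signed square root for $f$; it builds $y$ from $x$ through the pseudo-inverse of $W_2$, confirms that both networks pick the same category, and measures how far $y$ is from the best linear function of $x$.

```python
import numpy as np

rng = np.random.default_rng(0)
M, N, n = 16, 5, 300          # penultimate units, categories, stimuli (illustrative)


def f(a):
    # Hypothetical monotonically increasing non-linear function.
    return np.sign(a) * np.sqrt(np.abs(a))


X = rng.normal(size=(n, M))   # each row: representation x of DCNN_1 for one stimulus
W1, b1 = rng.normal(size=(N, M)), rng.normal(size=N)
W2, b2 = rng.normal(size=(N, M)), rng.normal(size=N)

X_out = X @ W1.T + b1                             # x' = W1 x + b1
Y_out_target = f(X_out)                           # required y' = F(x')
Y = (Y_out_target - b2) @ np.linalg.pinv(W2).T    # y = W2^+ (F(x') - b2)

Y_out = Y @ W2.T + b2                             # y' recomputed from y
same_labels = np.array_equal(X_out.argmax(axis=1), Y_out.argmax(axis=1))

# Best least-squares linear map from x to y and its relative residual.
coef, *_ = np.linalg.lstsq(X, Y, rcond=None)
rel_resid = np.linalg.norm(Y - X @ coef) / np.linalg.norm(Y)
print(same_labels, rel_resid)
```

The residual is clearly non-zero, yet a linear map can still capture a sizeable part of `Y`, which is why a graded measure of linearity is more informative than a yes/no test.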
Remark 3: Since $W_1, W_2 \in \mathbb{R}^{N \times M}$ and $M > N$ in Proposition 2, the pseudo-inverse operator is used in the above proof. Here are a few words on the pseudo-inverse: since $M > N$, which is the usual case in most existing DCNNs for object categorization (Krizhevsky et al., 2012; Simonyan and Zisserman, 2014; Szegedy et al., 2015), the inverse $(W_2 W_2^T)^{-1}$ exists whenever $W_2$ has full row rank, so that $W_2^{+} = W_2^T (W_2 W_2^T)^{-1}$, $W_2 W_2^{+}$ is the identity matrix, and $y' = W_2 y + b_2$ can be strictly met. Proposition 2 indicates that, given DCNN$_1$ with output $x'$, if there exists another multi-layered network DCNN$_2$ with output $y' = f(x')$, their representations $x$ and $y$ would be different but with identical categorization performance. This means that the aforementioned non-uniqueness problem in object representation modeling under the goal-driven approach would arise regardless of how many training images are used and how many exemplar images in each category are included. In other words, the non-uniqueness problem is an inherent problem in DCNN modeling under the goal-driven approach, and it cannot be completely removed by using more training data, at least in theory.
In the above, an implicit assumption is that, given a DCNN$_1$ with the output $x'_i$, there always exists a DCNN$_2$ with the output $y'_i = f(x'_i)$. Does such a DCNN$_2$ really always exist? This issue can be addressed separately for the following two cases. The first is that DCNN$_1$ and DCNN$_2$ could be of different architectures, and the second is that they are of the same architecture, but merely initialized differently during training.

The Different Architecture Case
Proposition 3. There always exists a multi-layered network to map $I_i$ to $y_i$ for the given input-output pairs $\{(I_i \leftrightarrow y_i),\ i = 1, 2, \cdots, n\}$ in Proposition 2.
Proof: As shown in Proposition 2 and Figure 1, since DCNN$_1$ exists, it maps $I$ to $x$. Denote this mapping function as $x = S_1(I) = \mathrm{DCNN}^P_1(I)$. Since $y = W_2^{+}(F(W_1 x + b_1) - b_2)$, we have $y = W_2^{+}(F(W_1 S_1(I) + b_1) - b_2)$. This is just the required mapping function. According to the Universal Approximation Theorem in Csáji (2001), it could be straightforwardly inferred that there always exists a DCNN with an arbitrary number $k + 1$ ($k \geq 1$) of hidden layers, denoted as DCNN$_2$, whose sub-network DCNN$^P_2$ with $k$ hidden layers is able to approximate this function.
Proposition 3 indicates that, given a DCNN$_1$, there always exists a DCNN$_2$ whose architecture may be different from that of DCNN$_1$, so that the object representations of the two DCNNs are different but with the same categorization performance. A training procedure is described in the Appendix, to show how to train such a pair of DCNN$_1$ and DCNN$_2$.
Remark 4: In the proof, the only requirement for DCNN$_2$ is that it should have sufficient capacity to represent the input object set, but it does not necessarily have a network architecture similar to that of DCNN$_1$. Note that sufficient representational capacity is an implicit necessary requirement for any DCNN-based application.
Remark 5: In the proof, the number of input images is assumed to be unknown. However, for the finite-input case, Theorem 1 in Tian (2017) guarantees that there exists a two-layered neural network with ReLU activation and $(2n + d)$ weights that can represent any mapping function from input to output on a sample of size $n$ in $d$ dimensions. Of course, such a constructed network could be a memorizing network, i.e., it can ensure that the given finite inputs are mapped to the required outputs, but it cannot guarantee sufficient generalization ability for new samples.

The Same Architecture Case
When DCNN$_1$ and DCNN$_2$ are obtained with the same network architecture but only trained under different random initializations, clearly a theoretical proof is impossible. However, based on the reported results in the "convergent learning" literature as well as our simulated experimental results, it seems they still largely have non-equivalent object representations although they have similar categorization performances.
(1) Non-uniqueness results from the "convergent learning" literature
Using AlexNet (Krizhevsky et al., 2012) as a benchmark, Li et al. (2016) showed that, by keeping the architecture unchanged and training only with different random initializations, the four DCNNs obtained have similar categorization performances, but their object representations are largely different in terms of one-to-one, one-to-many, and many-to-many linear representation mapping. Note that the many-to-many mapping in Li et al. (2016) is closely related to the representation equivalence in Definition 2. Hence, the four representations are largely non-equivalent, and this non-equivalence becomes more prevalent with increasing convolutional layers.
By introducing the concepts of "ε-simple match set" and "ε-maximum match set," Wang et al. (2018) showed that for two representative DCNNs, VGG (Simonyan and Zisserman, 2014) and ResNet (He et al., 2016), the size of the maximum match set between the activation vectors of individual neurons at the same layer of two DCNNs, which are also obtained with only different initializations as in Li et al. (2016), is tiny compared with the number of neurons at that layer. It was further found that only the outputs of neurons in the ε-maximum match set can be approximated within the ε-error bound by a linear transformation, which indicates that for the majority of the neurons at the same layer, their outputs cannot be reasonably approximated by a linear transformation, or the corresponding object representations are largely not equivalent.
(2) Non-uniqueness results from our experiments
Definition 3. If two DCNNs, DCNN$_1$ and DCNN$_2$, have similar image categorization performances with the same network architecture but different parameter configurations, they are called a similar performing pair of DCNNs.
Generally speaking, our results further confirm the non-uniqueness phenomenon of object representation under the goal-driven approach. We systematically investigated the representation differences between a similar performing pair of DCNNs on two public object image datasets: CIFAR-10, which contains 60,000 images belonging to 10 categories of objects, and CIFAR-100, which contains 60,000 images belonging to 100 categories of objects (Krizhevsky, 2009). In our experiments, 5,000 images per category in CIFAR-10 (and 500 images per category in CIFAR-100) were randomly selected for network training, and the rest were used for testing. Six network architectures with different configurations (denoted as {D1, D2, D3, D4, D5, D6}) were employed for evaluation, where {D1, D2, D3, D5, D6} were used for CIFAR-10 and {D3, D4, D6} for CIFAR-100, as shown in Table 1.
The traditionally used measure, "explained variance" (EV), was employed to assess the degree of linearity between the object representations learnt by a similar performing pair of DCNNs, and we trained similar performing pairs of DCNNs under the following two schemes:
• Scheme-1: Both DCNN$_1$ and DCNN$_2$ were trained with random initializations.
• Scheme-2: Similar to the training procedure in the Appendix, DCNN$_1$ was first trained with the Softmax loss, and then DCNN$_2$ was trained by combining the Softmax loss on the neuron outputs of the last layer and the Euclidean loss on the differences between the neuron outputs of the penultimate layer in DCNN$_2$ and the corresponding terms calculated according to Equation (3) (in our experiments, f (x) = |x| √ x).
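The text does not spell out how EV is computed, so the following is only one common way to do it, given as a sketch under stated assumptions: a least-squares linear map with intercept is fitted from one representation to the other, and the fraction of variance it accounts for is averaged over units. The function name `explained_variance` and the synthetic data are illustrative, and the fit is evaluated on the same samples for brevity; a held-out or cross-validated estimate would normally be preferred.

```python
import numpy as np


def explained_variance(X, Y):
    """Mean explained variance of the units of Y under a linear map from X.

    X: (n_stimuli, m1) activations of one network.
    Y: (n_stimuli, m2) activations of the other network.
    """
    X1 = np.hstack([X, np.ones((X.shape[0], 1))])   # add an intercept column
    coef, *_ = np.linalg.lstsq(X1, Y, rcond=None)
    resid = Y - X1 @ coef
    ev_per_unit = 1.0 - resid.var(axis=0) / Y.var(axis=0)
    return float(ev_per_unit.mean())


rng = np.random.default_rng(0)
X = rng.normal(size=(500, 8))
A = rng.normal(size=(8, 6))

ev_lin = explained_variance(X, X @ A + 2.0)                 # linearly related
ev_noise = explained_variance(X, rng.normal(size=(500, 6))) # unrelated
print(ev_lin, ev_noise)
```

An EV close to 1 indicates that the two representations are nearly linearly related on the given stimuli, and an EV close to 0 indicates that a linear map captures almost nothing.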
Here are some main results from our experiments:
(i) Explained variance on standard data
The results using the training Scheme-1 are shown in Figure 2. Two points are revealed from these results:
• Given a similar performing pair of DCNNs, although the representations of the two DCNNs cannot in theory be related by a linear transformation, the explained variance between the two representations is relatively large.
• A similar performing pair of DCNNs with a deeper architecture, or having more layers, will generally have a larger explained variance between the two representations. The underlying reason seems to be that, since a DCNN with a deeper architecture will generally have a larger representational capacity and a fixed task has a fixed representation demand, a DCNN with a larger capacity will give a more linear representation.
In addition, for a similar performing pair, although their categorization performances are similar, it does not mean that the two DCNNs have the identical categorization label for each input sample, either correct or wrong. We have manually checked the categorization results for CIFAR-10 and CIFAR-100. The orange bars of Figures 2B,D show the computed mean EVs for only those inputs correctly categorized. As seen from Figure 2, the discrepancy of the explained variances between the representations of only the correctly categorized inputs and those of the whole inputs is insignificant and negligible in most cases, and it is perhaps due to the already high categorization rate of the two DCNNs such that the incorrectly categorized inputs only take a small fraction of a relatively large test set.
(ii) Explained variance on noisy data
In Szegedy et al. (2014), it is reported that DCNNs are sometimes sensitive to adversarial images, that is, slightly perturbed images that do not pose any significant problem for human perception but dramatically alter the categorization performance of DCNNs. Here, we assessed the noise effects on the representation equivalence on CIFAR-10. The input images are normalized to the range [0, 1], and Gaussian noise with mean 0 and standard deviation σ ∈ {0.01, 0.02, 0.03, 0.04, 0.05, 0.07, 0.1} is added to these images, respectively. Figure 3A shows the corresponding categorization accuracies of similar performing pairs of DCNNs under different architectures, while Figure 3B shows the corresponding mean EVs. We find that even under the noise level σ = 0.1, the explained variance does not change much, although the categorization accuracy decreases notably.
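The noise protocol described above can be sketched as follows. The random array stands in for normalized CIFAR-10 images, `add_gaussian_noise` is a hypothetical helper name, and the noisy images are left unclipped because the text does not say whether values were clipped back to [0, 1].

```python
import numpy as np

SIGMAS = [0.01, 0.02, 0.03, 0.04, 0.05, 0.07, 0.1]   # noise levels from the text


def add_gaussian_noise(images, sigma, rng):
    """Return a copy of `images` with zero-mean Gaussian noise of std `sigma` added.

    `images` is a float array already scaled to [0, 1]; no clipping is applied
    (an assumption, since the source does not specify it).
    """
    return images + rng.normal(loc=0.0, scale=sigma, size=images.shape)


rng = np.random.default_rng(0)
# Stand-in for a batch of normalized 32 x 32 RGB test images.
images = rng.uniform(0.0, 1.0, size=(64, 32, 32, 3))

noisy = {sigma: add_gaussian_noise(images, sigma, rng) for sigma in SIGMAS}
measured = {sigma: float((noisy[sigma] - images).std()) for sigma in SIGMAS}
print(measured)
```

Each noisy batch would then be passed through both networks of a pair to obtain the accuracies and the EVs at that noise level.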

(iii) Variations of explained variance by changing stimuli size
In neuroscience experiments, the number of stimuli cannot be too large. However, for image categorization by DCNNs, the size of the test set could be very large. Does the size of the stimuli set play a role in the explained variance? To address this issue, we assessed the explained variance as the dataset size increases by resampling subsets from the original test set of images in CIFAR-10. Here, image subset sizes of [1000, 2000, · · · , 10000] are evaluated. Figures 4A,B show the results on the resampled subsets from the whole set of test data and the set of only those images which are correctly categorized, respectively. Our results show that once the size of the stimuli set reaches a modestly large number (around 3000), the explained variance stabilizes. That is to say, we do not need too large a number of stimuli for reliably estimating the explained variance. In other words, stimuli in the order of thousands could already reveal the essence, and a further increase of stimuli would not alter the estimation much.
(iv) Explained variance vs. neuron selectivity
Clearly, some DCNN neurons are more selective than others (Dong et al., 2017, 2018). Using the kurtosis (Lehky et al., 2011) of the neuron's response distribution to image stimuli, we investigated whether neuron selectivity has some correlation with the explained variance. We chose the top {10%, 20%, · · · , 100%} most selective neurons from each DCNN in a similar performing pair, respectively, then computed the explained variance between the two chosen subsets, and the results are shown in Figure 5. As seen from Figure 5, with the increase of the percentage of selective neurons, the explained variance increases accordingly. This indicates that for the object representations of a similar performing pair of DCNNs, neuron selectivity is also an influential factor on their explained variance.
The explained variance between the subsets of more selective neurons is smaller, and this result seems to be in concert with the conclusion in Morcos et al. (2018) where it is shown that neuron selectivity does not imply the importance in object generalization ability.
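The selection of the most selective neurons can be sketched as below. The exact kurtosis estimator is not given in the text, so the excess kurtosis used here is an assumption (the ranking of neurons is the same for raw kurtosis), and `kurtosis`, `most_selective`, and the synthetic responses are hypothetical.

```python
import numpy as np


def kurtosis(responses):
    """Excess kurtosis of each unit's responses; responses: (n_stimuli, n_units)."""
    centered = responses - responses.mean(axis=0)
    var = (centered ** 2).mean(axis=0)
    return (centered ** 4).mean(axis=0) / var ** 2 - 3.0


def most_selective(responses, fraction):
    """Indices of the top `fraction` of units, ranked by decreasing kurtosis."""
    k = max(1, int(round(fraction * responses.shape[1])))
    order = np.argsort(kurtosis(responses))[::-1]
    return order[:k]


rng = np.random.default_rng(0)
responses = rng.normal(size=(1000, 10))   # ten units with Gaussian responses
responses[:, 3] = 0.0
responses[:20, 3] = 10.0                  # unit 3 responds to only 2% of stimuli

top_10_percent = most_selective(responses, 0.1)
top_50_percent = most_selective(responses, 0.5)
print(top_10_percent, top_50_percent)
```

The explained variance between two networks would then be computed on the columns returned by `most_selective` for each network, for each percentage in turn.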
(v) A good representation need not be IT-like
In the literature (Khaligh and Kriegeskorte, 2014), it is shown that if an object representation is IT-like, it can give a good object recognition performance. This work shows that the converse is not necessarily true, at least theoretically speaking. That is, as shown in the above experiments and discussions, many different representations can give the same or quite similar recognition results with or without noise.
Remark 6: In this work, we assume the final classifier is a Softmax classifier. For other linear classifiers, the general conclusion of non-equivalence can be similarly derived. Of course, if the classifier used is a non-linear one, or the output of the penultimate layer is further processed by a non-linear operator before being input to a linear classifier, as done in Chang and Tsao (2017), where a 3-order polynomial is used as a preprocessing step for the final classification, our results will no longer hold. But since, as shown in Majaj et al. (2015), monkey IT neuron responses can be reliably decoded by a linear classifier, we think that using Softmax as the final classifier for DCNN-based IT cortex modeling does not constitute a major problem for our results.

CONCLUSION
We emphasize that we are not against using DCNNs to model sensory cortex. In fact, their potential and usefulness have been demonstrated in Yamins et al. (2014) and Yamins and DiCarlo (2016). Here, we only provide a theoretical reminder on the possible non-uniqueness phenomenon of the object representations learnt by DCNNs, in particular, by the goal-driven approach proposed in Yamins and DiCarlo (2016). As shown in the convergent-learning literature, such a non-uniqueness phenomenon is prevalent in deep learning. Hence, when DCNNs are used as a general framework for modeling sensory cortex, people should be aware of this potential and inherent non-uniqueness problem, and appropriate network architectures in DCNN learning should be carefully considered.

DATA AVAILABILITY STATEMENT
Publicly available datasets were analyzed in this study. This data can be found here: http://www.cs.toronto.edu/~kriz/cifar.html.

AUTHOR CONTRIBUTIONS
ZH conceived of the non-uniqueness phenomenon of object representation in modeling IT cortex by DCNN. QD and ZH explored the method. QD and BL implemented the explored method and performed the validation. QD and ZH wrote the paper.