Frontiers in Computational Neuroscience Computational Neuroscience

relations. In mature cortex, there is strong evidence for a basic vocabulary of shape primitives and elementary object parts in the TEO and TE areas of posterior and anterior IT (Fujita et al., 1992; Tanaka, 2003) as well as for identity and category specifi c neurons in anterior IT, PFC and hippocampus (Freedman et al., 2003; Quiroga et al., 2005). Further fi ndings indicate that the encoding of visual objects involves the formation of sparse clusters of distributed activity across the processing hierarchy within IT cortex (Tsunoda et al., 2001; Reddy and Kanwisher, 2006). This seems to be a neuronal basis for the parts-based representation that the visual system employs to construct objects from their constituent part elements (Ullman et al., 2002; Hayworth and Biederman, 2006). In the light of these fi ndings, we may ask ourselves whether the observed memory organization happens to be the outcome of a self-organization process that would have to fi nd solution to a number of developmental tasks. To provide a neural substrate for the parts-based representation, memory traces have to be formed and maintained in an unsupervised fashion to span the basic vocabulary for the visual elements and to defi ne associative links between them. Subsets of associatively linked complex features can then be interpreted as coherent objects composed of the respective parts. As there is a virtually unlimited number of visual objects in the environment, the limited resources spent on formation of these memory traces have to be carefully allocated to avoid unfavorable interference effects and information loss caused by

relations.In mature cortex, there is strong evidence for a basic vocabulary of shape primitives and elementary object parts in the TEO and TE areas of posterior and anterior IT (Fujita et al., 1992;Tanaka, 2003) as well as for identity and category specifi c neurons in anterior IT, PFC and hippocampus (Freedman et al., 2003;Quiroga et al., 2005).Further fi ndings indicate that the encoding of visual objects involves the formation of sparse clusters of distributed activity across the processing hierarchy within IT cortex (Tsunoda et al., 2001;Reddy and Kanwisher, 2006).This seems to be a neuronal basis for the parts-based representation that the visual system employs to construct objects from their constituent part elements (Ullman et al., 2002;Hayworth and Biederman, 2006).
In the light of these fi ndings, we may ask ourselves whether the observed memory organization happens to be the outcome of a self-organization process that would have to fi nd solution to a number of developmental tasks.To provide a neural substrate for the parts-based representation, memory traces have to be formed and maintained in an unsupervised fashion to span the basic vocabulary for the visual elements and to defi ne associative links between them.Subsets of associatively linked complex features can then be interpreted as coherent objects composed of the respective parts.As there is a virtually unlimited number of visual objects in the environment, the limited resources spent on formation of these memory traces have to be carefully allocated to avoid unfavorable interference effects and information loss caused by

INTRODUCTION
A working hypothesis of cognitive neuroscience states that the higher functions of the brain require coordinated interplay of multiple cortical areas distributed over the brain-wide network.For instance, the mechanisms of memory are thought to be subserved by various cortical and subcortical regions, including the medial temporal lobe (MTL), inferior temporal (IT) and prefrontal (PFC) cortex areas (Fuster, 1997;Miyashita, 2004) to name only few of them prominent in the function of the visual memory.Studies of information processing going on in the course of encoding, consolidation and retrieval of visual representations reveal a hierarchical organization, sparse distributed activity and massive recurrent communication within the memory structure (Tsao et al., 2006;Konen and Kastner, 2008;Osada et al., 2008).Here we focus our attention on developmental issues and discuss the process of self-organization that may lead to the formation of the core structure responsible for fl exible, rapid and effi cient memory function, with organizational properties as inferred from the experimental works.
It is widely held that processes responsible for memory formation rely on activity-dependent modifi cation of the synaptic transmission and on regulation of the intrinsic properties of single neurons (Bear, 1996;Miyashita, 1988;Zhang and Linden, 2003).However, it is far from clear how these local processes could be orchestrated for memorizing complex visual objects composed of many spatially distributed subparts arranged in stereotypic Experience-driven formation of parts-based representations in a model of layered visual memory

Jenia Jitsev 1,2 * and Christoph von der Malsburg 1
1 Frankfurt Institute of Advanced Studies, Frankfurt Am Main, Germany 2 Johann Wolfgang Goethe University, Frankfurt Am Main, Germany Growing neuropsychological and neurophysiological evidence suggests that the visual cortex uses parts-based representations to encode, store and retrieve relevant objects.In such a scheme, objects are represented as a set of spatially distributed local features, or parts, arranged in stereotypical fashion.To encode the local appearance and to represent the relations between the constituent parts, there has to be an appropriate memory structure formed by previous experience with visual objects.Here, we propose a model how a hierarchical memory structure supporting effi cient storage and rapid recall of parts-based representations can be established by an experience-driven process of self-organization.The process is based on the collaboration of slow bidirectional synaptic plasticity and homeostatic unit activity regulation, both running at the top of fast activity dynamics with winner-take-all character modulated by an oscillatory rhythm.These neural mechanisms lay down the basis for cooperation and competition between the distributed units and their synaptic connections.Choosing human face recognition as a test task, we show that, under the condition of open-ended, unsupervised incremental learning, the system is able to form memory traces for individual faces in a parts-based fashion.On a lower memory layer the synaptic structure is developed to represent local facial features and their interrelations, while the identities of different persons are captured explicitly on a higher layer.An additional property of the resulting representations is the sparseness of both the activity during the recall and the synaptic patterns comprising the memory traces.
potential memory content overlap.Thus, the system is permanently confronted with the problem of selecting the right small population out of the totally available, potentially confl icting synaptic facilities which has to be modifi ed for acquisition and consolidation of a novel stimulus.Moreover, if objects stored in memory are supposed to share common parts, a regulation mechanism would be required to balance the usage load of part-specifi c units and minimize the interference, reassuring their optimal participation in memory content formation and encoding.Another issue is the timing of the modifi cations, which have to be coordinated properly if the correct relational structure of distributed parts constituting the object's identity is to be stored in the memory.
The same selection problem arises on the fast time scale, during memory recall or for encoding of a novel object.Currently, there is a broad agreement on the sparseness of the activity patterns evoked by the presentation of a complex visual object, where only a small fraction of the available neurons in the higher visual cortex participate in the stimulus-related response (Rolls and Tovee, 1995;Olshausen and Field, 2004;Quiroga et al., 2008).In the context of the parts-based representation scheme, one possible interpretation of sparse activation would be the selection of few parts from a large overcomplete vocabulary for the composition of the global visual object.Considering the speed of object recognition measured in psychophysical experiments on humans and primates (Thorpe and Fabre-Thorpe, 2001), there have to be neural mechanisms allowing this selection procedure to happen within the very short time of a few hundred milliseconds.Moreover, if relations are to be represented by dynamic assemblies of co-activated part-specifi c neurons, such a combinatorial selection would require clear unambiguous temporal correlations between the constituent neurons to identify them and only them as being part of the same assembly encoding the object (Singer, 1999;von der Malsburg, 1999).
Hypothesizing that the process of neural resource selection and its coordination across distributed units is a crucial ingredient for successful structure formation and learning, we address in this study the neural mechanisms behind the selection process by incorporating them in a model of a layered visual memory.Here we take the competition and cooperation between the neuronal units as the functional basis for the structure formation (von der Malsburg and Singer, 1988;Edelman, 1993) and provide modifi cation mechanisms based on activity-dependent bidirectional plasticity (Bienenstock et al., 1982;Artola and Singer, 1993) and homeostatic activity regulation (Desai et al., 1999).We confront the system with a task of unsupervised learning and human face recognition using a database of natural face images.Our aim is then to demonstrate the formation of synaptic memory structure comprising bottom-up, lateral and top-down connectivity.
Starting from an initial undifferentiated connectivity state, the system is able to form a representational basis for the storage of individual faces in a parts-based fashion by developing memory traces for each individual person over repetitive presentations of the face images.The memory traces are residing in the scaffold of lateral and top-down connectivity making up the content of the associative memory that holds the associatively linked local features on the lower and the confi gurational global identity on the higher memory layer.The recognition of face identity can then be explicitly signaled by the units on the higher memory layer (Figure 1).
By performing this self-organization, the system solves a highly non-trivial and important problem of capturing simultaneously local and global signal structure in an unsupervised, open-ended fashion, learning not only the appearance of local parts, but also memorizing their combinations to represent the global stimulus identity explicitly in lateral and top-down connectivity.None of the previous works on unsupervised learning of natural object representation were able to solve this problem in this explicit form (Wallis et al., 2008;Waydo and Koch, 2008).
As a consequence of this explicit representation, the local facial features are interpreted in the global context of the identity of a person, making use of the structure formed in the course of previous experience.This contextual structure can also be utilized in generative fashion to replay the memory content in absence of external stimuli, also supporting the mechanism of selective object-based attention.The binding of the local features and their identity label into a coherent assembly is done in the course of a decision cycle spanned by a common oscillatory rhythm.The rhythm modulates the competition strength and builds up a frame for repetitive local winner-take-all computation.As the agreement between incoming bottom-up, lateral and top-down signals gets continuously improved during the competitive learning, the bound assemblies tend to refl ect more and more consistently the face identities stored in the memory, so that the recognition error progressively decreases.Moreover, the employment of the contextual connectivity speeds up the learning progress and leads to a greater capability to generalize over novel data not shown before.The advanced view on the structure formation as an optimization process driven by evolutionary mechanisms of selection and amplifi cation may also serve as a conceptual basis for studying self-organization of generic subsystem coordination, independent of the nature of the cognitive task.

VISUAL MEMORY NETWORK ORGANIZATION
Our model is based on two consecutive interconnected layers (Figure 1), which we tend to identify with the hierarchically organized regions of IT and PFC, containing a number of segregated cortical modules that will be termed columns (Fujita et al., 1992;Mountcastle, 1997;Tanaka, 2003).The columns situated on the lower layer will be termed here bunch columns, as each of them are supposed to hold a set of local facial features acquired in the course of learning.The column on the higher memory layer will be called identity column as its task will be to learn the global face identity for each individual person composed out of distributed local features on the lower memory layer.Being a local processing module, each column contains further a number of subunits we call core units (or simply units), which receive common excitatory afferents and are bound by common lateral inhibition.Acting as elementary processing units of the network, the core units represent an analogy to a tightly coupled population of excitatory pyramidal neurons ("pyramidal core") as documented in cortical layers II/III and V (Peters et al., 1997;Rockland and Ichinohe, 2004;Yoshimura et al., 2005).These populations are thought to be capable of sustaining their own activity even if afferent drive is removed.
On the lower level of processing, each bunch column is attached to a dedicated landmark on the face to process the sensory signal represented by a Gabor fi lter bank extracted locally from the image (Daugman, 1985;Wiskott et al., 1997).The connections bunch units receive from the image constitute their bottom-up receptive fi elds (here, referring to a receptive fi eld we always mean the pattern of synaptic connections converging on a unit).Furthermore, there are excitatory lateral connections between the bunch columns on the lower layer binding the core units across the modules.The bunch units also send bottom-up efferents to and get top-down afferent projections from the identity units situated on the higher level of processing.All the types of intercolumnar synapses are excitatory and plastic, the connectivity structure being all-to-all homogeneous in the initial state.

DYNAMICS OF A CORE UNIT
A cortical column module containing a set of n core units is modeled by a set of n differential equations each describing the dynamic behavior of the unit's activity variable p.The basic form of the equation, ignoring the afferent inputs for the time being, is motivated by a previous computational study on a cortical column (Lücke, 2005): where τ is the time constant, α the strength of the self-excitatory, β the strength of self-inhibitory effects, λ the strength of the lateral inhibition between the units, ν the inhibitory oscillation signal and max ( → P t ) the activity of the strongest unit in the column module.In this study we set for all units τ = 0.02 ms, α = β = 1, λ = 2.As p refl ects the activity of a whole neuronal population receiving common afferents, we may assume a small time constant value, referring to an almost instantaneous response behavior of a suffi ciently large (n = 100 or more) population of neurons (Gerstner, 2000).
A crucial property of the column dynamics is the ability to change the structure of the stable activity states by variation of the parameter ν.We take the oscillatory inhibition activity ν (Figure 2) to be of a form: with its period T = 25 ms being in the gamma range.ν min and ν max are the lower and upper bounds for oscillation amplitude, T init , k, g parameterize the form of the sigmoid activity curve.Here the values are set to ν min = 0.005, ν max = 1.0, T init = 5 ms, g = 0.5, k = 2.With the rising strength of inhibition, the parameter ν crosses a critical bifurcation point of structural instability ν c , given by the ratio between the self-excitation and self-inhibition coupling strength: so that here ν c = 0.5.For the range ν < ν c any units subset can remain active (with the stationary activity level p = α α β + ), as these states are stable given the low strength of lateral inhibition.After crossing the critical value ν c , all those states having more than one unit active loose stability, so that only a single winner unit can remain active on the level α α β + .The bifurcation property realizes winner-take-all behavior of the column acting as a competitive decision unit (Lücke, 2005) to select the best response alternative on the basis of the incoming input.
The qualitative dynamical behavior stays the same in the extended formulation of the activity equation, which is: where I BU , I LAT , I TD are the afferent inputs of respective bottom-up, lateral and top-down origin, κ BU = κ LAT = κ TD = 1 are their coupling coeffi cients, ω is an excitatory oscillatory signal, θ an excitability threshold of the unit, σ = 0.001 is parameterizing the multiplicative Gaussian white noise η t and ε is an unspecifi c excitatory drive.θ is a dynamic threshold variable used for homeostatic activity regulation of the unit, it will be described later in detail; ε depends on the total number of core units n, ε = 1 5n .An important modeling assumption is the separation of the synapses of different origin as implemented in Eq. 4. This separation causes different synaptic inputs to have different impact on the activity of the unit.The functional difference can be made explicit by taking a glance at the stable state of the winner unit (assuming for clarity σ = ε = θ = 0), which takes the value: where bottom-up (BU) input I BU contributes to the activity level in a linear fashion, while the contribution of lateral (LAT) and top-down (TD) inputs I LAT and I TD is non-linear, resembling the purely driving and hybrid driving-modulating roles of afferents from different origin commonly assumed for cortical processing (Sherman and Guillery, 1998;Friston, 2005).Simply stated, the separation of incoming synapses across the cortical layers follows a generic scheme where bottom-up incoming afferents arrive in layer IV on the spiny stellate cells, while the vast majority of LAT and TD synapses contacts the apical dendrites of pyramidal neurons from the layers II/III and V (Felleman and Essen, 1991;Douglas and Martin, 2004;Thomson and Lamy, 2007).Here, we oppose the functional role of LAT and TD afferents to the purely driving character of BU input by using the former inputs for the modulation of the self-excitation term in the unit's dynamics (Eq.4).The stronger the input from LAT or TD afferents, the stronger is the self-excitatory coupling within the core unit.This potentiates the core unit to amplify its activity stronger and faster than the units with lower coupling strength, thus favoring it in the competition.The course of the activity is also infl uenced by the excitatory oscillatory activity ω (Figure 2), which is given by: where ω min = 0.25 and ω max = 0.75 are the lower and upper bounds for oscillation amplitude.The excitatory oscillation doesn't have any impact on the critical bifurcation point ν c , as it modulates the self-excitation coupling strength α and the lateral inhibition strength λ to the same extent (Eq.4).Instead, it elevates the activity level of the units as long as they manage to resist the rising inhibition and remain in the active state.In the state where lateral inhibition gets strong enough to shut down all but the strongest core unit, only this winner unit is affected by the elevating impact of the excitatory oscillation, being able to further amplify its activity at the cost of suppressing the others.Moreover, ω controls the impact of LAT and TD afferents on the unit's activity, the impact being weak at the begin of the decision cycle with low ω, getting then stronger as ω grows and reaching maximum at the peak of the oscillation at the cycle's end.Thus, the contextual infl uence on local decision making is gradually adjusted in tune with the amount of evidence available during the cycle.Both inhibitory and excitatory oscillations may have presumably different sources, the former being generated by the interneuron network of fast-spiking inhibitory cells (Whittington et al., 1995) and the latter having its origin in activities of fast rhythmic bursting, or chattering, excitatory neurons (Gray and McCormick, 1996).
In addition to the local competitive mechanism supported by the lateral inhibition within a column, we use a simple form of forward inhibition (FFI) acting on the incoming afferents (Douglas and Martin, 1991).To model this, the incoming presynaptic activities are transformed as following before they make up the afferent input via the respective receptive fi eld of a unit: where p pre stands for raw presynaptic activity, pre p is the presynaptic activity transformed by FFI, K is the total number of incoming synapses of a certain origin, the weights w i Source constitute the receptive fi eld and I Source designates the fi nal computed value of the afferent input from the respective origin.Although all plastic synaptic connections in the network are taken to be of excitatory nature, FFI allows units to exert inhibitory action across the columns.An important effect of this processing is the selection and amplification of strong incoming activities at the cost of weaker ones, which can be interpreted as presynaptic competition among the afferent signals (Douglas and Martin, 1991;Swadlow, 2003).This is supposed to enhance the effect of competition between assemblies coding for different faces, as strong assemblies become able to disrupt cooperation within weaker ones.Another advantage of FFI is that it helps to avoid useless computation on the postsynaptic side by canceling the incoming excitation if the activity differences within the transmitting column are too small, indicating only little progress in the decision process.This functionality has roughly the meaning of "no decision -nothing to react to".
An additional property of the dynamics is the natural restriction of the population activity values p to the interval between 0 and 1 (Eq.5), given that the afferent input also stays in the same range.This allows both interpretations of the variable as either the population rate or the probability of an arbitrary neuron from the population to generate a spike.

HOMEOSTATIC ACTIVITY REGULATION
The activity dynamics equation (Eq.4) contains the variable threshold θ, which regulates the excitability of the unit.Here, higher values of θ stand for higher unit excitability, implying a greater potential to become active given a certain amount of input.The threshold is updated according to the following rule: where < > ( ) is the average activity of the unit measured over the period T of a decision cycle, p aim specifi es the target activity level and τ θ 10 ms is the inverse time constant.The target activity level p aim depends on the number of units n in a column, p n aim = .
1 The initial value of the excitability threshold is zero, θ(0) = 0.As of its direct relation to the unit's excitability, we will term θ simply excitability whenever it is more suitable in the context.
The motivation behind this homeostatic regulation of unit's activity (Desai et al., 1999;Zhang and Linden, 2003) is to encourage a uniform usage load across units in the network, so that their participation on the formation of the memory traces is balanced.
Bearing in mind the strongly competitive character of the columnar dynamics, the regulation of the excitability threshold changes the a-priori probability of a unit to be winner of a decision cycle.So, if a certain unit happens to take part too frequently in encoding of the memory content, violating the requirement of the uniform win probability across the units, its excitability will be downregulated so that the core unit becomes more diffi cult to activate, giving an opportunity for other units to participate in the representation.Reversely, a unit being silent for too long is upregulated, so it can get excited more easily and contribute to memory formation.

ACTIVITY-DEPENDENT BIDIRECTIONAL PLASTICITY
We choose a bidirectional modifi cation rule to specify how a synapse connecting one core unit to another may undergo a change in its strength w: with the sign switch functions H(x) and H − + ( ) x given as following: providing the bidirectional form of the synaptic modifi cation.The amplitude of the change is determined by the correlation between the presynaptic activity p pre and the postsynaptic activity p post , both variables being non-negative due to the properties of the unit activity dynamics.The learning rate ε = 5 × 10 −4 ms −1 specifi es the speed of modifi cation being the inverse time constant.Other variables determine the sign of the modifi cation.The threshold θ − + = max( ) p is used to compare the postsynaptic activity against current maximum activity in the column.A(t) is the the total activity level in the postsynaptic column at time point t, A t p t , where n is the number of units in the column and p i (t) their activities at time point t.A(t) is compared to a variable gating threshold χ, which pursues the average total activity level <A(t)> computed over the period T of a decision cycle: 10 ms as inverse time constant, the threshold initial value set to χ(0) = 0.5.Furthermore, the postsynaptic activity p post is compared to the sliding threshold θ 0 − that follows the average postsynaptic activity <p post (t)> computed over the period T of a decision cycle: with the inverse time constant τ θ 0 The rule employed here is a simplifi ed version of a bidirectional modifi cation assuming the existence of two sliding thresholds θ 0 − and θ − + (Figure 3), which subdivide the range of postsynaptic activity into zones where no modifi cation, depression or potentiation may occur, resembling BCM and ABS learning rules rooted in neurophysiological fi ndings (Bienenstock et al., 1982;Artola and Singer, 1993;Bear, 1996;Cho et al., 2001).If the postsynaptic activity level is too low ( ) , p post < − θ 0 no modifi cation can be triggered.A mediocre level of activation ( ) promotes longterm depression (LTD, negative sign), and a high level of activity ( ) p post > − + θ makes long-term potentiation (LTP, positive sign) possible.Combined with the winner-take-all-like behavior of the core units, the intended effect of the rule is to introduce the competition in synaptic formation across the receptive fi elds of the units, enabling them to separate patterns even if they are highly similar and overlap strongly.If multiple core units are frequently co-activated by a stimulus, the winner unit gets an advantage in potentiating its stimulated synapses, while the stimulated synapses of the units with lower activity either do not change or are affected by the depression.If this situation occurs over and over, the receptive fi elds of previously co-activated units are supposed to drift apart preferring the structure where strong synapses are not in confl ict with each other anymore.This should dampen the overlapping features and emphasize the discriminative features of the patterns preferred by the units.
In addition, we here use multiplicative synaptic scaling applied to synapses grouped according to their origin (bottom-up, lateral and top-down).We model this simply by L 2 -normalization of the receptive fi eld vector, w w i i Source Source = / , w Source 2 with w i Source as a weight of the receptive fi eld comprising the synapses of the respective origin Source ∈ {BU, LAT, TD}, and i w Source its normalized version.The normalizing procedure can be applied after a number of decision cycles, here we choose this number to be 10 cycles.The scaling mechanism promotes competition between synapses within the receptive fi eld, as the growth of one synapse happens at the cost of the weakening the others (Miller and MacKay, 1994).

Data format
To provide the system with natural image input, we choose the AR database containing grayscale human face photographs of 126 persons in total (Martinez and Benavente, 1998).For each person, there is a number of views taken under different conditions (Figure 1B).The original view with neutral facial expression is accompanied by a duplicate view depicting the same person at a later time point (2 weeks after the original shot).Furthermore, there are variations in emotional expression such as smiling or sad for both original and duplicate views.The images were automatically prelabeled with a graph structure put upon the face, positioning nodes on consistent landmarks across different individuals with a software (EAGLE) based on the algorithm described in (Wiskott et al., 1997).A subset of L = 6 facial landmarks was selected around the eyes, nose and mouth regions (Figure 1C), each landmark being subserved by a single bunch column.Being attached to a dedicated facial landmark, each bunch column is provided with a sensory image signal represented by a Gabor fi lter bank extracted locally.The Gabor wavelet family used for the fi lter operation is parameterized by the frequency k and orientation ϕ of the sinusoidal wave and the width of the Gaussian envelope σ (Daugman, 1985).We use s = 5 different frequencies and r = 8 different orientations sampled uniformly to construct the full fi lter bank (for more details refer to Wiskott et al., 1997).The local fi ltering of the image produces a complex vector of responses, containing both amplitude and phase information.We use only the amplitude part consisting of s × r = 40 real coeffi cients to model the responses of complex cells.This amplitude vector is further normalized by L 2 -Norm to serve as bottom-up input for the respective landmark bunch column of the lower memory layer.

Network confi gurations
Selecting randomly P = 40 persons from a database, we allocate n = 20 core units for each bunch column to ensure that multiple persons have to share some common parts.The identity column then contains m = 40 units corresponding to the number of persons we want be able to recall explicitly.Two different confi gurations of the memory system are employed to test our hypothesis about the functional advantage of a fully recurrent structure over the purely feed-forward one.Each confi guration is supposed to form the memory structure in the course of the learning phase.While the fully recurrent confi guration learns bottom-up, lateral and top-down connectivity, the purely feed-forward confi guration is a stripped-off version using only the bottom-up pathways.Observing these different confi gurations during the learning phase and testing them on novel face views subsequently, we are able to compare both in terms of learning progress and performance on the recognition task to fi nd out potential functional differences between them.

Simulation
In order to run the memory network, the solutions for the differential equations governing the behavior of dynamical variables have to be computed numerically in an iterative fashion.We use a simple Euler method with a fi xed time step Δt = 0.02 ms to do this.To save computational time, slow threshold variables are updated once in a decision cycle, correcting the time steps accordingly.

Open-ended unsupervised learning
The system starts with homogeneously initialized structure parameters, all threshold values and all synaptic weights being undifferentiated, so that intercolumnar all-to-all connectivity is the initial structure of the memory network.During the iterative learning procedure, for each decision cycle a face image is selected from a database randomly and presented to the system, evoking a pattern of activity on both memory layers and triggering synaptic and threshold modifi cation mechanisms.The learning procedure is open-ended as there is neither a stop condition nor an explicitly defi ned time-dependent learning rate variables which would decrease with time progress and freeze modifi cations at some point.The learning progress can be assessed directly by evaluating the recognition error on the basis of the previous network responses.Further, the inspection of the structure of the receptive fi elds delivers hints about their maturation progress.Investigating the rate of ongoing modifi cations of the synaptic weights and dynamic thresholds could give a hint on whether the changes in the network structure are still taking place in signifi cant proportion, providing a basis for a stop condition if necessary.In the later learning phase the general stability of the established structure can be also verifi ed by simple visual inspection.

Performance evaluation
To assess the recognition performance of the system, we make a distinction between the learning and generalization error.The learning error is defi ned as a rate of wrong responses to person identity from the training data set containing the original face views with neutral expression.The statistics of response behavior to each particular person is gathered for each identity core unit over the history of the network stimulation.The learning error rate can then be computed for each small interval during the learning phase by using the preferences the identity units have developed for the individual persons during the preceding stimulation.Opposed to this, the generalization error is computed on the set of novel views not presented before.During the test for generalization error, all the synaptic weights are frozen, which is done to exclude the possibility that recognition rate improves during the testing phase due to potential benefi t of synaptic modifi cations.The generalization error is assessed for each view type separately to see potential performance differences between different views (the duplicate view and the views with two different emotional expressions, smiling and sad).The history of network behavior during the learning phase is used again in the same way for the computation of the error rate, as done for the learning error evaluation.

ASSESSING NETWORK'S ORGANIZATION
To analyze the progress of structure formation, we use measures describing different properties of the receptive fi elds.The distance measure calculates the distance between two synaptic weight vectors w i and w j : where φ denotes the angle between the two synaptic weight vectors each comprising a receptive fi eld.The value lies in the interval between zero and one.If the weight vectors are the same, the distance value is zero, if their dissimilarity is maximal (φ = π), the value is one.Utilizing this basic distance measure, we further construct a differentiation measure, which is supposed to refl ect the grade of differentiation between the receptive fi elds of the same type across the whole network.The differentiation grade D k Source is computed for each column for the receptive fi elds of a given type Source ∈ {BU, LAT, TD} and then an average differentiation value D Source is built from the values of all K columns: where n is the number of units in the column.The differentiation grade measure is evaluated separately for bunch columns on the lower memory layer and for the identity column on the higher memory layer.
Further we employ a measure refl ecting the property of the inner structure of a receptive fi eld to be sparse, that is, possessing few strong synapses and many weak synapses comprising the receptive fi eld.If the inner receptive fi eld structure is poorly differentiated the sparseness value will be low; if differentiation within the receptive fi eld is strong, then the value will be high.To assess the same property not only within, but also across receptive fi elds, the overlap measure is defi ned.If the receptive fi elds of the same type have many strong overlapping synapses in common the value will be high, if there are only few such overlapping synapses the value will be low.The overlap measure is thus closely related to the differentiation grade between the receptive fi elds as assessed using the distance measure.Both sparseness denoted as ζ and overlap denoted as ξ have the same scheme behind their computation, with the only difference that the former is computed within while the latter across the receptive fi eld vectors using a common selectivity measure A Source (s) as defi ned in (Rolls and Tovee, 1995).Again, the computation is done for each column on receptive fi elds of the same type Source ∈ {BU, LAT, TD}, building then type-specifi c average values C Source and ε Source over all K columns: where r is the number of synapses comprising a receptive fi eld of type Source ∈ {BU, LAT, TD}, n is the number of units in a column, and K is the total number of assessed columns.The evaluation is done separately for the bunch columns and the identity column.

STRUCTURE FORMATION
Facing a task of unsupervised learning, the system develops a structural basis for storing the faces of individual persons shown during the learning phase.The vocabularies for the distributed local features are created on the lower memory layer to represent facial parts.These vocabularies are formed by the bottom-up synaptic connections of the bunch columns attached to their facial landmarks (Figure 4A).Each core unit of the bunch columns becomes thus sensitive to a particular local facial appearance due to the established structure of its bottom-up receptive fi eld.At the same time, the lateral connectivity between the bunch columns gets shaped capturing the associative relations between the distributed features (Figure 4B).These relations are represented by associative links between those core units that are regularly used in the composition of a particular individual face.The same compositional information enters into the structure of bottom-up connectivity converging on the identity column units (Figure 4D), being also represented in the top-down connections projecting from the identity column back on the lower layer (Figure 4C).Each person repeatedly presented to the system during the learning phase leaves a memory trace comprising the parts-based representation of its face on the lower layer and the explicit confi gurational identity on the higher layer of the memory (Figure 4).The course of gradual differentiation of bottom-up, lateral and top-down connectivity reveals the ongoing process of memory consolidation, where memory traces induced by the face images become more stable and get opportunity to amplify their structure.A common developmental pattern seems to underlie the time courses of structure organization (see Assessing Network's Organization).There is an initial resting phase, where no structural changes appear, followed by a maturation phase, where massive reorganization occurs and change rate peaks at its maximum value (Figures 5 and 6).Finally a saturation phase is reached, where the structure stabilizes at a certain level of organization and the change rate goes down close to zero.
Different connectivity types get organized preferentially within a specifi c time window (Figures 5 and 6).There is a clear temporal sequence of connectivity development, starting with maturation of lower layer bottom-up connections, followed by maturation of lateral connections between the bunch columns and by the maturation of bottom-up connectivity of the identity column, ending with the formation of top-down connectivity.Because the development of different connectivity types is highly interdependent, their developmental phases are not disjunct in time, but overlap substantially.In parallel, there is a gradual increase in sparseness within the receptive fi elds and progressive reduction of the overlap between them (Figure 6).The remaining overlap in associative lateral and confi gurational bottom-up connectivity refl ects the extent to which the parts are shared among different stored face representations.
In the late learning phase, the state of the synaptic structure stabilizes until no substantial changes in the established memory structure can be observed (Figures 5 and 6).Remarkably, the bottom-up connectivity of the bunch columns stays well behind other connectivity types in terms of differentiation grade, sparseness within the receptive fi elds and their overlap reduction achieved in the fi nal stable state (Figures 5 and 6).While being the latest to initiate its maturation, the top-down connectivity reaches the highest grades of differentiation and sparseness, also being most successful in reducing the overlap.The lateral connectivity between the bunch columns and bottom-up connectivity of the identity column also show comparably high level of organization.These relationships refl ect the distinct functional roles the different connectivity types play in their contribution to the memory traces -capturing strongly similar local feature appearance in case of lower layer bottom-up connectivity on the one hand and on the other hand storing weakly overlapping associative and confi gurational information for different faces in case of lateral and top-down connectivity.
The changes in the synaptic structure are accompanied by the use-dependent regulation of the excitability thresholds of the core units across the network.Three developmental phases can be distinguished in the time course of excitability modifi cations (Figure 7).The fi rst phase is characterized by strong and rapid excitability downregulation in the network.This downregulation settles down the core units toward the range of the targeted average activity level p aim (Eq.8).In this phase, almost no differences between the individual thresholds are present (Figure 8).After downregulation crosses its peak, a common upregulation sets in and the differences between the excitability thresholds become much more prominent.The upregulation phase leads to a slight increase of the average excitability and is followed by a saturation phase where the average threshold value stabilizes around certain level.
Excitability regulation runs differently on different memory layers.On the lower layer the down-and upregulation phases are   shorter and occur earlier than the corresponding phases on the higher layer.Moreover, the differences in excitability between the units on the lower layer are much stronger pronounced compared to the rather equalized excitability levels of the higher layer units (Figures 7 and 8).
These differences refl ect the distinct functional roles the lower and higher layer play in the memory organization.The lower layer serves as a storehouse for associatively linked distributed facial parts that can be shared by multiple face representations, while the identity units are conjunction-sensitive units representing the confi gurational identity of the face.Because each memorized person is equally likely to appear on the input, the long-term usage load of the identity units is essentially the same, so no need for a systematic differentiation of excitability thresholds arises there.Part sharing on the other hand imposes different usage frequency on different core units sensitive to different parts, leading to pronounced usedependent differences in excitability between the bunch column core units.

ACTIVITY FORMATION AND COORDINATION
The established synaptic structure supports the parts-based representation scheme by encoding the relations between the parts in two alternative ways.First, the relations can be explicitly signaled by the responses of conjunction, or confi guration, specifi c identity core units on the higher layer, each responsible for one of the face identities stored in the memory.Second, the relations can be represented by dynamic assemblies of co-activated part-specifi c bunch core units, which can be constructed on demand to encode a novel face or to recall an already stored one as a composition of its constituent parts.The selection and binding of the parts-specifi c and identity-specifi c units into a coherent assembly coding for an individual face is done in the course of a decision cycle defi ned by common unspecifi c excitatory and inhibitory signals oscillating in the gamma range (Singer, 1999;Fries et al., 2007).
There, the global decision process which may be called binding by competition is responsible for assembly formation, providing clear and unambiguous temporal correlations between the selected units and setting them apart against the rest by amplifi cation of their response strength (Figure 9).The initial phase of the decision cycle, where the oscillatory inhibition and excitation are low, is characterized by low undifferentiated activities of the network units.With both inhibition and excitation rising, only some of the units are able to resist the inhibition pressure and continue increasing their activity being selected as candidates for assembly formation in the selection phase.Ultimately, the growing competition leads to a series of local winner-take-all decisions across the columns sparsening the activity in the network by strong amplifi cation of a small unit subset at the cost of suppression of the others.In the late phase of a decision cycle, this amplifi ed subset of winner units can be then clearly interpreted as an individual face composed of the local features from respective landmarks and labeled with person's identity, solving the assembly binding problem (Singer, 1999;von der Malsburg, 1999).
A combined view on the mean activity within the columns reveals once more the competitive nature of activity formation in the network (Figure 9).While the winner unit subset concentrates increasingly high activity, the mean network activation gets progressively reduced at the end of the decision cycle after crossing its peak in the selection phase, indicating that winner subset amplifi cation occurs at the cost of suppressing the rest.Generally, during the whole decision cycle the mean network activity stays at a low level (p = 0.08 − 0.09), far below the activity level reached by the winner units subset at the end of the cycle (p = 0.4 − 0.6).
One may ask to what extent the competitive activity formation becomes more organized or coherent in terms of representing the memory content as the learning progresses.In other words, we are interested in the level of coherence, or agreement, between the local competitive decisions made in the distributed columns and how it may change with the learning time.One indicator of such coherent behavior is the agreement achieved at the end of the decision cycle between the afferent signals that arrive at network units from different sources such as bottom-up, lateral or top-down.By computing the standard correlation coeffi cient ρ (DeGroot and Schervish, 2001), we obtain for each afferent signal pair of different sources a course showing the development of the coordination between the signals over the learning time.
The coordination level between the bottom-up, lateral and topdown signals increases gradually from the initially very low value close to zero toward higher and higher grade (Figure 10).The low coherence value in the early learning phase reveals the inability of the signals converging on the network units to be in consensus with each other about the local decision outcome, deranging the global decision making.As learning progresses, the signal pathway structure is gradually improved for the storage and representation of the content, leading to stronger and stronger consistency FIGURE 8 | (A) Time course of average excitability regulation.Above the whole course, below the zoom into down-and upregulation phases.On the left for the bunch units, on the right for the identity units.Black solid curve is the average value, gray curves mark the standard deviation range.The same nomenclature applies for the time course of the average unit activity visualized in (B).As visible in (A), the differences in excitability between the units are more pronounced on the lower layer compared to the higher one.This is refl ected again in the greater dispersion of the unit activities around the average activity level on the lower layer, as shown in (B).During the course of a single decision cycle, a co-activation measure can be used to check whether the incoming signals are coordinated properly to make up the decisions.The relationship between the afferent signal coordination and the function of the memory is particularly clear if the coordination level in a successful recall is compared to the coordination shown during a failed recall, where the identity of the person is misclassifi ed (Figure 11).In a successful recall, where the facial representation and person's identity are correctly retrieved from the memory, a well-established coordination can be observed between the co-active afferent signals converging on the winner units.In a failed recall, the identity column making a wrong decision sends top-down signals that are not in agreement with the bottom-up and lateral signals conveyed by the bunch columns.As consequence, the signal coordination breaks down, serving as a reliable indicator of a recall failure (Figure 11D).This disagreement between the sensory and contextual signaling can be interpreted as an error signal, indicating a deviation between the bottom-up signal and the top-down prediction.Although currently not represented explicitly by the activity of a dedicated unit, this signal could be potentially of great use for determining the state of the recognition process and for guiding learning as an explicit reinforcement signal.
A further indicator that can help in differentiating a successful from a partially or completely failed recall is the activity level of the winner units at the end of the decision cycle.A successful recall is accompanied by a high degree of cooperation between the participating winner units, so that the level of their fi nal activation is high.At the same time, the competitive action of the winner units subset suppresses strongly the rest activity, so that the overall network activity is substantially diminished.Contrarily, a failed recall has something to do with disagreement between some local decisions, resulting in decreased afferent signal coherence, which in turn leads to a much lower level of fi nal activity in the winner units.Their competitive infl uence is also weakened, leading to a higher overall network activity (Figure 11F).Thus, a simple comparison of the winner activities to their average level can already provide enough information to conclude about the quality of recall.The recall quality can be assessed on the global level of identity as well as on the component level, where either identity recognition failure or part assignment failure might be stated.

RECOGNITION PERFORMANCE
To assess the recognition capability of the memory, we evaluate the learning and generalization error of two different system confi gurations.These different confi gurations, the fully recurrent and purely feed-forward one, are set up to substantiate the hypothesis stating the functional advantage of the recurrent memory structure over the structure with purely feed-forward connectivity.Both confi gurations were trained under equal conditions and then tested to compare their performance against each other (refer to Open-Ended Unsupervised Learning and Performance Evaluation).
Both the purely feed-forward and fully recurrent confi gurations are able to successfully store the facial identities of the persons (40 in total) in the memory structure.Strong decay of the learning error over the time is clearly evident for both network confi gurations.The learning error rate falls rapidly in the early learning phase (fi rst 5 × 10 4 decision cycles) until it saturates at the values slightly below 5% in the later phase beyond 10 5 cycles (Figure 12).Although there is no signifi cant difference in the learning error rate between the both confi gurations after the saturation level is reached, the time needed to reach the saturation level is substantially shorter for the fully recurrent confi guration (saturates around 10 5 cycles) than for the purely feed-forward one (saturates around 1.5 × 10 5 cycles).Thus, the learning progresses about 33% faster for the fully recurrent system than for the purely feed-forward one.The fully recurrent confi guration seems to speed-up the learning progress in the critical early learning phase, probably taking benefi t of additional assistance provided by lateral and top-down connectivity for the organization, amplifi cation and stabilization of the memory traces.
At fi rst glance, analysis of the learning error time course suggests that the only functional advantage of the fully recurrent confi guration is the learning speed-up observed in the early phase.However, another important functional advantage is revealed if the generalization error rates are compared.The generalization error is measured on the alternative face views not shown during the learning phase (see Table 1).A striking result is the signifi cant discrepancy in performance between the two confi gurations manifested on the duplicate views containing emotional expressions (smiling and sad).There, the error rate difference is about 5% in favor of the fully recurrent memory confi guration.The generalization error of  (E) Course of mean activity in the columns.In the failed recall, a substantially increased overall activation is clearly seen as well as the shift of its broader peak to a later time point.(F) Winner unit activities at the end of the decision cycle on the left and mean unit activities (excluding the winners) over whole cycle on the right for each column.In the failed recall, winner activities are consistently lower, while the mean rest unit activities are consistently higher than in the successful recall.
purely feed-forward confi guration is 38.46% larger on the duplicate smiling view and 62.5% larger on the duplicate sad view than the generalization error of the fully recurrent confi guration.On the other views, no signifi cant difference in error rate can be detected between both confi gurations.
These results highlight an interesting property of the functional advantage as it has been assessed for the fully recurrent memory confi guration.The purely feed-forward confi guration falls signifi cantly behind the fully recurrent one only on certain views, performing comparably well on the others.Apparently, the stronger the deviation of the alternative view from the original view showed during the learning, the more evident is the enhancement in generalization capability.Even if given only a short time of a single decision cycle, the recurrent connectivity seems to gain benefi t particularly in novel situations, where purely feed-forward processing alone has more diffi culties in achieving correct interpretation of the less familiar face view.The purely feed-forward confi guration relies only on the local similarity computation, not being able to utilize evidence for likely compositions of the local parts to support the local decision making.Therefore, the interpretation of the global identity made on the basis of the less familiar local features from the alternative face views is more probable to suffer from the mistakes in local feature detection that cannot be corrected in the absence of contextual support, provided otherwise by the fully recurrent confi guration.

DISCUSSION
To identify potential neural mechanisms that are responsible for the formation of parts-based representations in visual memory, we examined the process of experience-driven structure selforganization in a model of layered memory.We chose the task of unsupervised open-ended learning and recognition applied to human faces from a database of natural face images.The fi nal goal was to build up a hierarchically organized associative memory structure storing faces of individual persons in a parts-based fashion.Employing slow activity-dependent bidirectional plasticity (Bienenstock et al., 1982;Artola and Singer, 1993;Cho et al., 2001) together with homeostatic activity regulation (Desai et al., 1999;Zhang and Linden, 2003) and a fast neuronal population dynamics with a strongly competitive nature, the proposed system performed impressively well on the posed task.It demonstrated the ability to simultaneously develop local feature vocabularies and put them in a global context by establishing associative links between the distributed features on the lower memory layer.On the higher layer, the system was able to use the confi gurational information about relatedness of the sparse distributed features to memorize the face identity explicitly in the bottom-up connectivity of identity units.The captured feature constellations were also projected back to the lower layer via top-down connectivity providing additional contextual support for learning and recognition.The identity recognition performance of the system on the original and alternative face views confi rmed the functionality of the established memory structure.

GENERIC MEMORY ARCHITECTURE
When thinking about the processes underlying the memory formation and function, it is remarkable that the structure and activity formation in the model network can be governed by a set of local mechanisms which are the same for all neuronal units and all synapses comprising the network.Saying that they are the same here means that for instance the bidirectional plasticity rule for any synapse has not only the same functional description, but also shares a common set of parameter values such as time constant, etc.This supports the view that the synapses arriving from different origins and contacting their target neuron at different sites of the dendritic tree and soma are a kind of universal learning machines, which may well differ in their impact on the fi ring behavior of the neuron (Sherman and Guillery, 1998;Larkum et al., 2004;Friston, 2005), while obeying the same generic modifi cation rules.Whether this is indeed the case, is currently a subject of intense debates (Sjöström et al., 2008;Spruston, 2008).Overall, the organization of the system supports the idea of universal cortical operations involving strong competitive and cooperative effects (von der Malsburg and Singer, 1988), which are building up on essentially the same local circuitry and the same plasticity mechanisms utilized in different cortical areas (Mountcastle, 1997;Phillips and Singer, 1997;Douglas and Martin, 2004).

COMPETITION AND COOPERATION IN ACTIVITY AND STRUCTURE FORMATION
In our study, it becomes clear that learning itself has to rely on certain important properties of the processing on the fast as well as on the slow time scale.To capture statistical regularities hidden in the local sensory inputs and their global compositions, there have to be mechanisms for selecting and amplifying only a small fraction of available neuronal resources, which then become dedicated to a particular object, specializing more and more for the processing of its local features and their relations.Without proper selection, no learning will succeed.However, without proper learning, no reasonable selection can be expected either.Here, we break this circularity by proposing strong competitive interaction between the units on the fast activity time scale.Given a small amount of neural threshold noise, this interaction is able to break the symmetry of the initial condition due to the bifurcation property of the activity dynamics (Lücke, 2005), enforcing the unit selection and amplifi cation in the initial learning phase even in the absence of differentiated structure.The response patterns enforced by competition offer suffi cient playground for the learning to ignite and move on to organize and amplify some synaptic structure that is suitable for laying down specifi c memory content via ongoing slow bidirectional Hebbian plasticity.In combination with competitive activity dynamics, the bidirectional nature of synaptic modifi cation assists further the competition between memory traces as it attempts to reduce the overlap between the patterns the network units preferentially respond to, segregating memory traces in the network structure whenever possible.The state of undifferentiated structure is however the worst-case scenario and not necessarily the initial condition for learning, as there may be basis structures prepared for the representation of many behavioral relevant patterns, like for instance faces (Johnson et al., 1991).Interestingly, the progress from an undifferentiated to a highly organized state via selection and amplifi cation of a small subset of totally available resources is a general feature in evolutionary and ontogenetic development of biological organisms.The notion that the very same principles may guide the activity and structure formation in the brain supports the view of learning as an optimization procedure adapting the nervous structure to the demands put on it by the environment (von der Malsburg and Singer, 1988;Edelman, 1993).Noteworthy, there is a very important difference in the way how the unit selection, or decision making, is implemented by competition given the early, immature or late, mature state of the connectivity structure.In the immature state where the contextual connectivity is not established yet, the local decisions in the lower layer bunch columns are made completely independent from each other.On the contrary, decision making in the mature state involves interactions between the local decisions via already established lateral and top-down connections.These associative connections enable cooperation within and competition between unit assemblies, promoting a coordinated global decision.The separation of synaptic inputs enables decision making to use information from different origins according to its functional signifi cance -carrying either sensory bottom-up evidence for a local appearance or providing clues for relational binding of distributed parts into a global confi guration (Phillips and Singer, 1997).The agreement between sensory and contextual signaling about the outcome of local decisions improves continuously as learning progresses, while the disagreement between them can be interpreted as an error signal, offering the possibility to modulate plasticity in an explicit errordependent way.The initially independent local decision making becomes thus orchestrated by contextual support formed in the course of previous experience with visual stimuli.

SIGNAL AND PLASTICITY COORDINATION
The coherency of cooperative and competitive activity formation cannot be guaranteed by the contextual support alone, as the time coordination of decision making across distributed units also matters.The decision cycle, which defi nes a common reference time window for decision making, orchestrates not only the activities, but also bidirectional synaptic modifi cations across the units.This reassures that structure modifi cation amplifi es the connections within the right subset of simultaneously highly active units encoding a particular face on the way to the peak of the decision cycle, while punishing the connections between the competitor units that were active during the cycle but were not able to survive until the end.The cortical processing seems to be reminiscent of oscillatory rhythms in the gamma range used here to model the decision cycle.Particularly, there is evidence that oscillatory activity may serve as reference signal coordinating plasticity mechanisms in cortical neurons (Huerta and Lisman, 1995;Wespatat et al., 2004).There is also support for a phase reset mechanism locking the oscillatory activity on the currently presented stimulus (Makeig et al., 2002;Axmacher et al., 2006).Taken together, current evidence suggests the possible interpretation of the gamma cycle as a rapidly repeating winner-take-all algorithm as it is modeled in this work (Fries et al., 2007).The winner-take-all competition can be carried out rapidly due to low latencies of fast inhibition and its result can be read out fast (on the scale of few milliseconds) due to the response characteristics of the population rate code (Gerstner, 2000).Here, the synchronization imposed by the common rhythms is selectively restricted to the small clusters embodied by the columns that participate in encoding or retrieval of some particular coarse object category or class (a face).Such assembly recruiting is experimentally observed and is supposed to happen within a gamma cycle, which, in turn, can occur on the top of a slower rhythm, e.g. in the theta range as known from hippocampal and cortical processing (Lisman, 2005;Sirota et al., 2008).So, decision cycles can be seen as fragments of a longer perception process, each fragment involving a selectively recruited assembly of synchronized units dedicated to the processing of that fragment.What neuronal processes are responsible for selective formation and coordination of the common rhythms is an important and challenging question, which has yet to be resolved and is a subject of our further studies.

HIERARCHICAL PARTS-BASED REPRESENTATION
An essential property of our memory system is parts sharing, as it allows the same basic set of the elementary parts to be used for the combinatorial composition of familiar and novel objects without the need to add new physical units into the system.Endowed with this ability, the memory network can be also interpreted as a layered neuronal bunch graph (Wiskott et al., 1997), without taking into account the topological information.Here, the graph nodes are columns, each holding a set of features with similar physical (visual appearance) or semantical (category or identity) properties (Tanaka, 2003).In such a graph, new object representations can be instantiated in a combinatorial fashion by selecting candidate features from each node.The candidate selection here depends critically on the homeostatic regulation of activity, which reassures that each unit is able to participate in memory formation to an equal extent.By introducing the hierarchy in the graph structure, higher order symbols, like identity of a person, can be explicitly represented by assigning the chosen set of candidate features from the lower memory layer to an identity unit on the higher layer.These higher symbols may be used for a compact representation of exceptionally important persons (VIPs), without discarding the information about their composition which is kept in the top-down connections projecting back to the lower layer.The identity column demonstrates the potential to develop this kind of representation in a restricted region of a higher visual area, where a single unit becomes extremely specialized for a particular object, as observed experimentally in IT, PFC and MTL (Freedman et al., 2003;Quiroga et al., 2008).We believe that such VIP units may exist for entities extensively dealt with on day-to-day basis, like intimate persons or a favorite teddy bear.But in principle, the system's functionality as associative memory doesn't need to rely on such a representational scheme, and it could use exclusively the distributed coding, utilized on the lower layer, without creating dedicated, localist-fashioned representations.Furthermore, each node itself is not stuck to the strict hard winner-take-all operation.Potentially, it would be also possible to select multiple candidates from a single node, or column, to encode a part of a particular face.Here we use very strong competition leading to a form of activity sparseness termed hard sparseness (Rehn and Sommer, 2007), limiting the number of active units to one per column.While this kind of sparse coding is advantageous for learning individual faces, it may be generally too sparse for representing coarser categories (like male of female).However, the competition strength can in principle be adjusted arbitrarily in a task-dependent manner, either by tuning the core unit gain or by balancing the self-excitation and lateral inhibition.The latter can be easily implemented by altering the amplitude of inhibitory or excitatory oscillations.The alteration could be initiated by some kind of internal cortical signal or state, indicating the task-dependent need for the competition strength.The tuning of the competition strength would allow the formation of less sparse activity distributions, representing the stimulus on a coarser categorical level (Kim et al., 2008).The intercolumnar cross-talk shouldn't become a problem in case of multiple active units per column after a preceding learning phase as long as the representation remains suffi ciently sparse, avoiding too much overall activation.

ATTENTIONAL AND GENERATIVE MECHANISMS IN THE MEMORY
Interestingly, contextual lateral and top-down connectivity endows the system with further general capabilities.For instance, selective object-based attention is naturally given in our model, because the priming of the identity units on the higher memory layer by preceding sensory or direct external stimulation would also prime and facilitate the part-specifi c units on the lower layer via top-down connections, providing them with a clear advantage in the competition against other candidates.This priming can mediate covert attention directed to a specifi c object, promoting the pop out of its stored parts-based representation while suppressing the rest of the memory content.Generally,the selection and amplifi cation by competition can be interpreted as an attentional mechanism, which focuses the neural resources on processing one object or category at the cost of suppressing the rest (Lee et al., 1999;Reynolds et al., 1999).Although not exploited in this study, the network model is also able to self-generate activity patterns that correspond to the object representations stored in the memory content in absence of any external input.This ability relies heavily on the lateral and top-down connectivity established by previous experience with visual stimuli, placing the model in remarkable relation to generative approaches explaining construction of data representations in machine learning (Ulusoy and Bishop, 2005).From this perspective, each face identity can be interpreted as a global cause producing the specifi c activity patterns in the network.The identities are in turn composed of many local causes, i.e. their constituent parts.The memory structure captures all the relations between local and global causes, being able to reproduce data explicitly in an autonomous mode.

PERFORMANCE ADVANTAGE OVER THE PURELY FEED-FORWARD STRUCTURE
Finally, we presented sound evidence for the functional advantage of lateral and top-down connectivity over the purely feed-forward structure in the memory formation and recall.First, the recurrent context-based connectivity seems to speed-up the learning progress.Second, and at least as essential, recurrent confi guration outperforms signifi cantly the purely feed-forward confi guration on the test views which deviate strongly from the original views shown during learning.This suggests that contextual processing is able to generalize over new data better than the purely feedforward solution, which performs comparably on original or only slightly deviating views.This outcome indicates that different processing strategies may prove more useful in different situations.While the recurrent connectivity is mostly benefi cial in novel situations, which require additional effort for the interpretation and learning of less familiar stimuli confi gurations, the feed-forward processing already suffi ces to do a good and quick job when facing well-known, overlearned situations, where effortful disambiguation is not required due to the strong familiarity of the sensory input.There, the feed-forward processing could benefi t from the bottom-up pathway structures formed by previous experience and evoke clear, unambiguous, easily interpretable activity patterns along the processing hierarchy without requiring additional contextual support from lateral and top-down connectivity.There are two predictions arising from this outcome, which can be tested in a behavioral experiment involving subordinate level recognition tasks.First, deactivation of lateral and top-down connectivity in the IT would not change performance for overlearned content, but would impair recognition for less familiar instances of the same stimuli viewed under different conditions, the impairment being the more visible the stronger the viewing condition deviates from the overlearned one.Second, the same deactivation should lead to a measurable decrease in the learning speed, increasing the time needed to reach a certain low level of recognition error.

MODEL PREDICTIONS
There are some more predictions that can be derived from the system's behavior.One general prediction is that failed memory recall should be accompanied by the higher overall activation along the IT processing hierarchy within the gamma or theta cycle, with the activity of the strongest units at cycle's peak being on contrary diminished.Reversely, a successful recall should be characterized by decreased overall activity in the IT and by increased activity in the winner units cluster.This is also interpretable in terms of signaling the degree of decision certainty, the successful recall being accompanied by greater certainty about the recognition result.Further, a failed recall should induce much more depression (LTD) than potentiation (LTP), a successful recall much more LTP than LTD on the active synapses.In addition, if required to memorize and distinguish very similar stimuli, the recall of such an item should lead to a higher overall activity in the IT network than for items with less similar appearance.The winner units, on contrary, should exhibit a reduced activation due to the inhibition originating from the competing similar content.Again, certainty interpretation of the activity level is possible here: the more similar the stimuli to be discriminated, the lower is the winner activation signaling the decision made, indicating lower certainty about the recognition result.An interesting prediction concerning the bidirectional plasticity mechanism is the erasure of a memory trace after repetitive stimulus-induced recall if LTD/LTP transition threshold is shifted to the higher values, for example due to an artifi cial manipulation, as performed in experiments of selective memory erasure in mice (Cao et al., 2008).So far, we provided a demonstration of experience-driven structure formation and its functional benefi ts in a basic core of what we think can be further developed into a full-featured, hierarchically organized visual memory domain for all kind of natural objects.As usual, several open questions remain, such as invariant or transformation-tolerant processing, development of a full hierarchy from elementary visual features to object categories and identities, establishing the interface for behaviorally relevant context as proposed in the framework of reinforcement learning, incorporating the mechanisms of active vision and so on.Nevertheless, with this work we hope we succeeded not only to highlight the crucial importance of coherent interplay between the bottom-up and top-down infl uences in the process of memory formation and recognition, but also to gain more insight into the basic principles behind the self-organization (von der Malsburg, 2003) of a successful subsystem coordination across different time scales.Aiming for real world applications, we believe that the incremental, unsupervised open-ended learning design instantiated in this work provides an inspiring and guiding paradigm for developing systems capable of discovering and storing complex structural regularities from natural sensory streams over multiple descriptional levels.

FIGURE 1 |
FIGURE 1 | Layered visual memory model.(A) Two consecutive interconnected layers for hierarchical processing.On the lower bunch layer (IT, each column contains n = 20 units), a storehouse of local parts linked associatively via lateral connections is formed by unsupervised learning.On the higher identity layer (PFC, column contains m = 40 units), symbols for person identities emerge, being semantically rooted in parts-based representations of the lower layer.The identity units provide further contextual support for the lower layer by establishing top-down projections to the corresponding part-specifi c units.(B) Different face views used as input to the memory (one person out of total 40 used for learning shown).Top left is the original view with neutral expression used for learning.Other views were used for testing the generalization performance (bottom row shows the duplicate views taken 2 weeks after the original series.).(C) Facial landmarks used for the sensory input to the memory, provided by Gabor fi lter banks extracted at each landmark point.
aim being equal to the target postsynaptic activity level (see Eq. 8).

FIGURE 4 |
FIGURE 4 | Time snapshots of structure formation.From left to right, snapshots from early, middle and late formation phase of (A) lower layer bottom-up connectivity containing local facial parts, (B) lower layer associative

FIGURE 5 |
FIGURE 5 | Differentiation time course over 5 × 10 5 decision cycles for different connectivity types; on the left the grade of differentiation, on the right its rate.Clear is the general tendency to greater connectivity differentiation with the learning progress as well as the temporal sequence of connectivity maturation (see the text).BU, LAT, hBU, TD denote respectively lower layer bottom-up, lateral, higher layer bottom-up and top-down connectivity types.

FIGURE 6 |
FIGURE 6 | Overlap (A) and sparseness (B) time course over 5 × 10 5 decision cycles for different connectivity types.As the learning progresses, the overlap between the receptive fi elds is continuously reduced, the connectivity sparseness increases.Again, the temporal sequence of connectivity development is clearly visible (see the text).BU, LAT, hBU, TD denote respectively lower layer bottom-up, lateral, higher layer bottom-up and top-down connectivity types.

FIGURE 7 |
FIGURE 7 | Time course of excitability regulation.Above the lower, below the higher memory layer.Obvious are the much stronger pronounced differences in excitability between the units on the lower layer.

FIGURE 9 |
FIGURE 9 | Activity formation during the decision cycle.(A) A sequence of six successive cycles, each representing a successful recall of a stored individual face.On the top, the activity course is shown, arrows pointing to constituent parts shared by two different face identities.Second and forth cycles show recall of the same face identity.Below is the mean activity course for each column and the oscillation rhythms defi ning the decision cycle.(B) A zoom into a single decision cycle (on the top) to visualize the activity formation phases.Below is the mean activity course for each column and distribution of average unit activities over the decision cycle showing the highly competitive nature of activity formation, where winner units get amplifi ed at the cost of suppressing the others.

FIGURE 10 |
FIGURE 10 | Improvement of signal coordination in the course of learning.Standard correlation coeffi cients ρ were computed for each signal pair.BU, LAT, TD denote respectively bottom-up, lateral and top-down signals.

FIGURE 11 |
FIGURE 11 | Coordination and activity formation in successful and failed recall.Two decision cycles showing failed and successful recall.(A) Network activity course.(B) Bottom-up afferent signals course.(C) Lateral and top-down afferent signals course.(D) Signal coordination course assessed by measuring the co-activation of bottom-up, lateral and top-down signals converging on the network units.In the failed recall, there is a clear break-down of signal coordination in afferents converging on the winner units.

Table 1 | Comparison of generalization error between the purely feed- forward and fully recurrent memory confi guration.
The confi gurations were tested after learning time of 5 × 10 5 cycles.Fully recurrent confi guration shows a signifi cantly better performance on the duplicate views with emotional expressions, while comparable performance is shown on the other views.