Criteria for Optimizing Cortical Hierarchies with Continuous Ranges

In a recent paper (Reid et al., 2009) we introduced a method to calculate optimal hierarchies in the visual network that utilizes continuous, rather than discrete, hierarchical levels, and permits a range of acceptable values rather than attempting to fit fixed hierarchical distances. There, to obtain a hierarchy, the sum of deviations from the constraints that define the hierarchy was minimized using linear optimization. In the short time since publication of that paper we noticed that many colleagues misinterpreted the meaning of the term “optimal hierarchy”. In particular, a majority of them were under the impression that there was perhaps only one optimal hierarchy, but a substantial difficulty in finding that one. However, there is not only more than one optimal hierarchy but also more than one option for defining optimality. Continuing the line of this work we look at additional options for optimizing the visual hierarchy: minimizing the number of violated constraints and minimizing the maximal size of a constraint violation using linear optimization and mixed integer programming. The implementation of both optimization criteria is explained in detail. In addition, using constraint sets based on the data from Felleman and Van Essen (1991), optimal hierarchies for the visual network are calculated for both optimization methods.


INTRODUCTION
In 1991, Felleman and Van Essen formalized the idea of a visual cortical hierarchy using a large number of tract tracing results obtained from macaque monkeys. Their general premise was that the laminar source and termination patterns of corticocortical projections contained information about their hierarchical directionality, which allowed projections to be labelled as ascending, descending, or lateral. Using this information as a constraint, the authors presented a cortical hierarchy in which most of these direction relationships were satisfi ed (see Figure 4, Felleman and Van Essen, 1991).
Since the publication of this article, the question emerged: what is the optimal visual cortical hierarchy? Using the same set of criteria and notion of optimality, Hilgetag et al. (1996) demonstrated that there are at least 100,000 hierarchies which are even more optimal than that introduced by Felleman and Van Essen (i.e., having less constraint violations). In Reid et al. (2009), we introduced a new approach to calculating hierarchies, which combined the laminar data from Felleman and Van Essen (1991) with new concepts from Vezoli et al. (2004) that permitted the additional representation of the hierarchical distance of a projection. Our approach utilized a continuous, rather than discrete scale for describing hierarchical Criteria for optimizing cortical hierarchies with continuous ranges With these inequalities we allow a deviation of Δ (u,v) from the constraints assigned to an edge (u, v) by the interval [x, y]. The Δ (u,v) is specifi c for every edge (u, v) which means one distinct variable Δ (u,v) for every edge is needed. Ideally, the value of these Δ (u,v) should be kept as small as possible, preferable 0. Since all Δ (u,v) measure a deviation, their values are always non-negative. Also note that at most one of the two conditions for an edge (u,v) can require an Δ (u,v) larger than 0. The objective is now to fi nd a hierarchy that best fi ts the data, i.e., with as little overall deviation as possible. To accomplish this the sum of all deviations Δ (u,v) should be minimal.
To calculate such an hierarchy we use a well known method called linear programming. A detailed introduction on linear programming can be found (for example) in the book of Papadimitiou and Steiglitz (1998). In brief the aim of linear programming is the optimization of a linear objective function, subject to linear equality and inequality constraints. Those linear problems have the following form, which is also called linear program: Maximize/minimize the expression c x c x c x n n 1 1 2 2 Here x 1 ,x 2 ,…,x n are variables, for which values need to be found. All other elements are constants.
There are different ways to solve these kinds of problems. The oldest and most widely used is the simplex algorithm which was developed in 1947 by Dantzig (see for example Dantzig, 1963). It has an exponential worst case run time but most instances can be solved much faster. Today the inner point method invented by Karmarkar (1984) is often used as well, which has a polynomial runtime. For this work the optimization procedure was performed using the Gnu Linear Problem Kit 1 which can be used for linear programming as well as mixed integer programming.

EXAMPLE
Consider the example in Figure 1. We are looking for a hierarchy function h: V → R with h (v 1 ) = 0 and for all (u, v) From these inequalities we receive the following two conditions for every edge: Since we want to calculate the values of the hierarchy function h for every v ∈V we also include h(v) as a variable in the linear program. To make the notation easier these variables will receive the names of the vertices in the linear program. So for every v ∈V there is a variable v that represents the function value h(v) in the program.

GRAPH REPRESENTATION AND THE HIERARCHY FUNCTION
Consider a network representing a hierarchy given as a directed graph G = (V, E), where V is a set of vertices and E a set of directed edges. Each edge in the graph is assigned a weight that signifi es the possible range of distances of its endpoints within the hierarchy (see Figure 1 for an example).
If the edge runs from vertex u to vertex v then an interval [x, y], x, y ∈R is assigned to that edge and the hierarchical distance between u and v should lie within this interval (see Figure 2). A positive distance value implies the edge is going up and v is above u in the hierarchy. A negative value means that the edge is descending with regard to the hierarchy and v is below u in the hierarchy. The value 0 signifi es that the edge does not cross levels in the hierarchy and u and v are on the same level.
To fi nd the hierarchy implied by these edge values we are looking for a function h: V → R so that for every edge (u, v) with assigned interval [x, y], x, y ∈R the following conditions hold: This function h is called the hierarchy function. However for the example in Figure 1 it is not possible to fi nd such hierarchy functions because the distance information is not consistent. The alternative is to fi nd a hierarchy function that "best" fi ts the data. To do so it is necessary to allow some deviations from the given data. To measure this deviation we introduce a variable Δ (u,v) for every edge (u, v). A variable Δ (u,v) measures for the two conditions defi ned by the edge (u, v) how much the hierarchy violates these conditions. This alters the condition for the hierarchy function resulting from an edge (u, v) with a range [x, y] in the following way (compare Figure 2 and Eq. 1):  ) .

and (5)
With those inequalities we can create a linear program to calculate an optimal hierarchy for the example in Figure 1. The objective of the program is to minimize the sum of all deviations. (see Figure 3, left: the variable sumΔ is the sum of all deviations.) The node v 1 is supposed to be the starting point of the hierarchy, and therefore its value is fi xed at 0. The variables v 1 to v 7 are not limited by the optimization objective (c20), and thus can become as big or small as necessary to attain the optimal value for sumΔ.

ADDITIONAL OBJECTIVES
So far our objective for fi nding an optimal hierarchy is to keep the sum of deviations as small as possible for the given edge values. But since there is normally more than one hierarchy that fulfi ls this objective, it can be useful to employ additional secondary objectives. We will illustrate this by an example.
Consider graph A of Figure 4. Since the sum of all edge values in that circle is 1 the minimal sum of deviations of any hierarchy for this graph is 1. The question is how that sum is best distributed on the involved edges. One option is to concentrate the deviation on as few edges as possible. That does not change the sum of deviations but it keeps the number of violated conditions small (see Figure 4B).
Another possibility can be seen in Figure 4C, there the deviation is distributed on all edges. This keeps the maximum deviation small and therefore all distances imposed by the hierarchy close to the original edge weights. Figures 4B,C are not the only options for an optimal hierarchy. Note that there is an infi nite number of other hierarchies with a minimal sum of deviations for this example.
The question is: which hierarchy describes the information given by the edge values best? The advantage of minimizing the number of violations is that most of the original distance information is preserved in the hierarchy, and we ideally disregard only the information that fi ts the model the least. Minimizing the maximum deviation, alternatively, has the advantage that all the information is treated equally and non is disregarded completely. The rationale for this is that it is better to change many distances a little than few distances a lot.
To implement these additional objectives, additional variables are needed in the program. In the fi rst case it is necessary to count and minimize the number of violations, which can not be done with just linear programming, but rather requires the use of mixed integer programming. For the second option we need to minimize the maximal deviation, which can easily be integrated in the linear program, so it will be discussed fi rst.

Maximal deviation
To calculate the maximal deviation of the hierarchy we introduce a variable Δ max which measures the maximal deviation from any constraints. To implement this every constraint of the original If one of the original constraints only holds if Δ (u,v) is greater than 0, then Δ max needs to be as least as big as Δ (u,v) for the doubled constraints to hold as well. Since this is true for every constraint it means Δ max needs to be at least as large as the largest edge specifi c deviation. If Δ max is included in the objective to be minimized it will take exactly the value of the maximal edge specifi c deviation Δ (u,v) .
Since the prime objective is still to minimize the sum of all deviations we introduce a factor in the objective to give the sum of deviations a bigger weight than the maximal deviation Δ max . The factor needs to be large enough that the smallest edge-specifi c deviation Δ (u,v) multiplied by the factor is considerably larger than Δ max . This ensures that Δ max is not of the same magnitude as the sum of deviations with regard to the optimization and therefore does not infl uence the primary optimization goal. The result is a hierarchy with a minimal sum of deviations, but among those hierarchies one with the smallest possible Δ max is chosen.
As factor we used 100 which proved to be large enough that the resulting hierarchy is still one with a minimal sum of deviations, so it fulfi lled the requirements outlined in the previous paragraph. The results for the graph from

Number of violations
To minimize the number of violations in a hierarchy it is necessary to count them within the program. To accomplish this we introduce an edge-specifi c integer variable B (u,v) which assumes the value 1 if Δ (u,v) > 0 and 0 if Δ (u,v) = 0. The sum of these violation counters can then be included in the objective to be minimized along with the sum of the deviations. These variables are not reals but integers, so it is necessary to use mixed integer programming. Mixed integer problems look like the linear problems we have seen so far (compare again to Eq. 3), but some of the variables can only take integer values. This gives the optimization problem a higher complexity: other than linear optimization problems, mixed integer problems can in general not be solved effi ciently, they are NP-hard. While there are well-known algorithms to solve these problems, one of the fi rst was developed by Land and Doig (1960), this does not mean that solutions can actually be found for all mixed integer problems as there are limitations of computer memory and time.
A problem with the implementation is that an integer variables B (u,v) could take any non-negative integer value, but we want them to only take the values 0 and 1. To ensure this we use a trick: For the program we double every constraint from the original program and get four conditions for an edge (u,v) with edge value [x,y] (compare Eqs 5 and 6): x x.
The two original conditions measure once again the size of a deviation from a constraint and the two new conditions test if there actually is a deviation. The factor 100 in front of the B (u,v) is much bigger than any number that actually occurs in the calculation. If we look at one of the conditions from Eq. 7, say v − u − 100·B (u,v) ≤ y then if v − u ≤ y the variable B (u,v) can assume the value 0. If on the other hand v − u > y then the variable B (u,v) needs to be at least 1 for the second inequality in Eq. 7 to hold. (Remember, unlike Δ (u,v) the variable B (u,v) is an integer and does not take values between 0 and 1.) Since the factor 100 is chosen to be a lot bigger than |v − u − y| the inequality v − u − y − 100·1 ≤ 0 always holds. Therefore, with B (u,v) = 1 the constraint is fulfi lled, and there is no need for any B (u,v) to be bigger than 1. By the same argument as above for Δ max we get if the sum of all B (u,v) is included in the objective to be minimized. The results for the graph from Figure 1 is shown in Figure 5B (compare to Figure 3).
Note that all three examples have the same sum of all deviations since it is primarily being optimized in all three cases. The differences lie in the second part of the objective, i.e. the maximal deviation and the number of deviations. Also note that the hierarchies shown here are again not the only optimal hierarchies that can be found for these criteria.

EMPIRICAL DATA
We use the data set described by Felleman and Van Essen (1991) for the visual system of the macaque monkey (FV91). As regions MDP and MIP had no constraints defi ned, they are not included here. The projections in this network were assigned ranges according to a modifi cation of the original relationship classifi cation scheme, which incorporates ideas presented in subsequent publications (Kennedy and Bullier, 1985;Barone et al., 2000;Batardiere et al., 2002), and permits a richer representation of hierarchical distance and the assignment of refi ned ranges (compare to Reid et al., 2009). As in Reid et al. (2009) we use the notation A+ for strongly ascending, A for ascending, L for lateral, D for descending and D+ for strongly descending projections. To investigate the effect of range sizes upon the resulting optimal hierarchies, ranges were varied from disjoint intervals with clear gaps between the fi ve connection types to intervals that overlap so that the connection types are merging into one another. To implement this ranges were systematically expanded by 0.1 at each limit, resulting in 10 range sets, as presented in Table 1. The outer bounds for these hierarchies were chosen as 32, which is the total number of regions; this ensures that no individual projection can have a hierarchical distance greater than the total number of regions.

RESULTS
The two optimization methods produced similar, but not identical results. Table 2 shows the values for the sum of deviations, the maximal deviation and the number of violations for the optimal hierarchies. For comparison the values from Reid et al. (2009) are also included. There only the sum of deviations was minimized and the optimization was performed using the QS-Opt Linear Problem Solver 2 . Note that the sum of deviations is the same for each set for all optimizations, since this was always the fi rst objective.
With the exception of number of violations for set 0 the optimized values are getting smaller when the size of the constraining intervals is getting bigger. The sum of deviations goes down from 60.0 for set 0 to 8.8 for set 9, the maximal deviation is reduced from 3.0 for set 0 to 0.7 for set 9 with optimization method 2, the number of violations decreases from 49 for set 1 to 13 for set 9 with optimization method 1 and from 50 for set 1 to 17 for set 9 with optimization method 2.
For three sets (0, 5 and 6) we get exactly the same optimal values for both optimization methods, but only for one of these sets (set 0) are the calculated hierarchies identical. For set 5 and set 6 the two calculated hierarchies are not identical. However, for each set both hierarchies have the same number of violations and maximal devia-tions, which means that they are both optimal hierarchies for both optimization methods. The corresponding hierarchies from Reid et al. (2009) all have the same maximal deviation but larger numbers of violations, so regarding the number of violations these hierarchies are not optimal. In general the number of violations for the hierarchies from Reid et al. (2009) are higher than for the new optimization methods, but since the violations were not minimized before this is not surprising. However the maximal deviations tend to be small, often minimal (as seen in comparison with optimization methods 2), for these hierarchies, even though they were not minimized.
For all other sets than 0, 5 and 6, we see differences in values between the two new optimization methods, therefore there is a trade-off between the number of violations and the maximal deviations. Therefore these values cannot both be at their minimum for 7 of the 10 sets.
When we look at the violated constraints we fi nd only a total number of 54 of the 386 constraints being violated by either of the two optimization methods. Of those nine constraints were violated by every set for both methods (see Table 3). Note that everything is counted as a constraint violation that does not exactly fi t the classifi cation of a connection. For example, if a connection is classifi ed as ascending (A) but ends up being strongly ascending (A+) in the calculated hierarchy this counts as a violation. Also note that for all the connections listed in Table 3 there are reciprocal connections  that were not consistently violated. This means that the classifi cation of these reciprocal connections is not complementary. For example the connection V2 to V3 is classifi ed A+, but the connection V3 to V2 is classifi ed D (not D+). The two optimization methods generally violated the same constraints: For the fi rst optimization method 53 and for the second optimization method 52 different constraints (of 386) were violated over the 10 sets and of those 51 constraints were violated by both methods. In addition constraint V1 to V3 (A+), was violated by the second optimization in four sets and constraints LIP to V4 (D), and 46 to TH (L), were each violated by the fi rst optimization in one set.
For comparison between sets we normalized the resulting hierarchies: by defi nition V1 always had the hierarchical value 0, the region with the highest hierarchical level was assigned the value 1, the hierarchical levels of the other areas were transformed accordingly. Figure 6 shows the average hierarchical levels over the 10 sets for the fi rst optimization method. The equivalent fi gure for the second optimization method looks very similar and is therefore not included here.

DISCUSSION
While the parameters for the hierarchies do show some differences between the optimization methods, the resulting hierarchies were remarkably similar. This does not mean that all hierarchies that are optimal for the two optimizations are similar: since we only have one possible solution per optimization method per set there might also be solutions that differ substantially. What it does mean is that there are examples for optimal hierarchies for the two optimization methods for which the differences between the optimized values seem to be accomplished through minor changes within the hierarchies.
The boundaries chosen for the optimization constraints (set 0 to 9) seem to have a big infl uence on the optimization results. All optimized values for set 0 are several times as big as the corresponding values for set 9. The "looser" the boundaries are (i.e., the larger the defi ning intervals) the smaller the optimized values. This is because in the bigger intervals for the connection classes more values can be assumed in the optimization without violating the constrains. For example, if an ascending connection is assigned the value 1.5, this is a violation in sets 0 through 4, but not in sets 5 through 9. Therefore there are fewer violations and, as a result, smaller optimized values for sets that use bigger intervals for connection classes. However in Reid et al. (2009) we found that the resulting hierarchies are very comparable across sets after normalization. There were only minimal changes in the order of the areas in the calculated hierarchies for the different conditions.
The similarity in the violated constraints across methods suggest that these constraints may be erroneous. In particular, of the nine constraints that are violated in this study by all constraint sets, and for both objectives, eight were violated in the calculations of Reid et al. (2009) for all constraint sets as well. [The last of the nine constraints, PO to LIP (A+), was violated by eight sets in Reid et al. (2009)]. While it is possible that a solution exists where FIGURE 6 | Geometrical distribution of hierarchy levels in the FV91 visual network of the macaque, both as three-dimensional cortical surface renderings (left), and as a two-dimensional "fl at map" representation of the cortical sheet (right). Regions are coloured by their mean normalized hierarchical position, obtained by the fi rst optimization method over the 10 constraint sets described in Table 1. Directed edges in the fl at map illustration represent interregional connections, and are coloured according to projection class: A+ (purple, opaque), A (purple, transparent), L (black), D (green, transparent), and D+ (green, opaque). Compare this representation to Figures 2 and 4 in Felleman and Van Essen (1991). none of these constraints are violated, this consistency suggests a pattern which is worth further investigation. The fact that all of these connections have reciprocal connections whose constraints are not consistently violated seems to suggest that the classifi cation of the reciprocal connections fi ts an optimal hierarchy better than the classifi cation of the nine violated connections. Since the classifi cation of the reciprocal connections is not complementary at most one of the classifi cations (the one for the connection or the one for the reciprocal connection) can be correct. This does not necessarily mean that the classifi cation is better for the reciprocal connection since in some cases this classifi cation is very broad, spanning several of our fi ve classes.
The classical optimization problem of minimizing the number of violations [as done by Felleman and Van Essen (1991) as well as Hilgetag et al. (1996)] can theoretically be solved by the method presented here, where it was only used as an secondary objective. In practical terms, solving for number of violations as a primary objective has proven too computationally expensive, whereas using it only as a secondary objective made the calculation easier, since the solutions were already limited by the minimal sum of deviations. However, we were able to minimize just the number of violations for sets 8 and 9, which produced the smallest optimal values for all the other optimization methods. The results (12 violated constraints with a sum of deviation of 13.4 for set 8, and 11 violated constraints with a sum of deviations of 10.7 for set 9) are only slightly below the number of violated constraints for optimization method 1.
The hierarchies calculated here are of course also solutions for the original optimization problem from Reid et al. (2009), since they have the same minimal sum of deviations. Additionally, however, they also have either a minimal number of violations or a minimal maximum deviation, and in some cases even both. This added optimality does not seem to be especially advantageous, however, given that the resulting hierarchies remained for the most part unchanged, which suggests that they are not highly sensitive to the addition of these criteria. Other objectives are also conceivable, of course, which may result in more strongly altered hierarchies than we report here. For instance, it is conceivable that further knowledge about the reliability of the anatomical data underlying our constraints may yield more informative objective criteria, that would allow a Bayesian approach to this problem. Another option is not to allow any deviation for connections that are classifi ed with a great certainty, ensuring that the connection appears as classifi ed in the hierarchy. However, this cannot be done for all connections since the resulting constraints are not consistent and some deviation needs to be allowed to fi nd an hierarchy.
In the choice of the optimization method the data quality should be considered. If we expect that most of the connections are correctly classifi ed but there might be some classifi cations that are erroneous then minimizing the number of violations is the better option. Most of the distance information of the classifi cation should then be preserved in the hierarchy. Assuming that the wrongly classifi ed connections are the ones that do not fi t the hierarchy these would be the ones that are disregarded. They would be "taken out" of the hierarchy. If on the other hand, we expect all the classifi cations to be equally correct or faulty (or just an approximation of the true value), then we want to consider all the information to the same degree. This can imply changing many classifi cations a little to "squeeze" the information into a hierarchy, such that, ideally, none of the information is disregarded completely, while many of the classifi cations might be "corrected" a little.
Since our results produced highly similar hierarchies across a variety of constraint sets for the two optimization methods presented here, both appear equally suited for the calculation of optimal hierarchies. It is thus a matter of personal preferences which one to employ, and this choice is one that can conceivably be expanded to accommodate alternative defi nitions of optimality. This leads us to conclude that there is no unique optimal hierarchy, not only because of the quality of the empirical data and the freedom to choose boundaries for the defi ning intervals, but also because there is more than one way to defi ne optimality. While the method described does not provide a unique optimal hierarchy, it can produce hierarchies that are optimal in more than one way.

ACKNOWLEDGMENTS
Rolf Kotter and Egon Wanke received funding from the DFG (KO 1560/6-2, WA 674/10-2). Andrew T. Reid, Gleb Bezgin and Rolf Kotter were funded by a collaborative network grant from the McDonnell Foundation. The authors acknowledge funding received by the Initiative and Networking Fund of the Helmholtz Association within the Helmholtz Alliance on Systems Biology. could be construed as a potential confl ict of interest.