A Low-Rank Method for Characterizing High-Level Neural Computations

The signal transformations that take place in high-level sensory regions of the brain remain enigmatic because of the many nonlinear transformations that separate responses of these neurons from the input stimuli. One would like to have dimensionality reduction methods that can describe responses of such neurons in terms of operations on a large but still manageable set of relevant input features. A number of methods have been developed for this purpose, but often these methods rely on the expansion of the input space to capture as many relevant stimulus components as statistically possible. This expansion leads to a lower effective sampling thereby reducing the accuracy of the estimated components. Alternatively, so-called low-rank methods explicitly search for a small number of components in the hope of achieving higher estimation accuracy. Even with these methods, however, noise in the neural responses can force the models to estimate more components than necessary, again reducing the methods' accuracy. Here we describe how a flexible regularization procedure, together with an explicit rank constraint, can strongly improve the estimation accuracy compared to previous methods suitable for characterizing neural responses to natural stimuli. Applying the proposed low-rank method to responses of auditory neurons in the songbird brain, we find multiple relevant components making up the receptive field for each neuron and characterize their computations in terms of logical OR and AND computations. The results highlight potential differences in how invariances are constructed in visual and auditory systems.


Alternative constraint formulations
The advantage of the linear equality constraints (Eq 6) is that they provide strong optimality guarantees while scaling as a function of rD but come with the disadvantage of requiring the choice of π k for each pair U •,k and V •,k . There are several other possible equality constraints that satisfy rank(J sym ) = rank(J) and do not require choosing π k but they each are accompanied by their own disadvantages. A few possible alternatives are summarized below.
Two simple choices are the bilinear constraints UV T = VU T which enforce symmetry and the quadratic constraints the latter of which can be proven to satisfy rank(J sym ) = rank(J) by showing P N (U)V = 0 and P N (V)U = 0 (Fazel et al., 2003). The trouble with the bilinear and quadratic equality constraints is that they require optimizing D(D − 1)/2 and D(D + 1)/2 unique constraints making optimization with interior-point methods only tenable when D is small. Since the number of constraints does not depend on r, it means that block coordinate descent is not an option to decrease dimensionality in these cases. Furthermore, Hessian approximation techniques like L-BFGS usually do not attempt to approximate the Jacobian of the constraints and therefore will also not lead to a sufficient decrease in the dimensionality of the problem (Nocedal and Wright, 2006;Wächter and Biegler, 2006). With present memory and speed limitations, the best option in this case would be to use a constrained optimization technique that does not require second-order information; for instance, optimizing an augmented Lagrangian with conjugate gradient descent. In our experience, however, first-order methods may require more problem-specific tweaking of the optimization parameters (e.g. update parameters for the barrier function in an augmented Lagrangian method) to attain reasonable convergence speed compared to second-order methods. This may make first-order methods less attractive for large datasets.
Another possible alternative to the linear constraints is to instead insist that each U •,k and V •,k pair is parallel via the equality constraint which comes from setting the square of the dot product of normalized U •,k and V •,k to 1. The advantages of this form for the equality constraints is that it scales as r and is thus lower dimensional than the linear equality constraints while not requiring a choice of π k . Compared to the other constraints, however, this constraint is quite nonlinear as the Jacobians in the Hessian will depend explicitly on U and V. This constraint may still behave well in practice, but it is not clear if convergence to a solution can be guaranteed. Indeed, experiments with this formulation often got stuck at non-stationary points which would appear to indicate that it does not behave well at least with interior-point methods. There are other variations one can try, including discontinuous variations of Eq 24 where the power-2 is replaced by the absolute value, but we will not explore these further.
Lastly, an alternative form of the constraints that may be attempted is a modification of the linear equality constraints (Eq 6) where the discretization of the π k parameters is relaxed such that π k can take on any real number. In this case, π k would ensure that U •,k and V •,k are equal up to a signed scaling factor.
The downside of the relaxation is that one either needs to insert U •,k = −π k V •,k leading to a nonlinear optimization over a third-order polynomial or use the Lagrangian method in which case the Hessian of f (Eq 10) would be rank-deficient.

Optimality conditions
Here we show the conditions under which a stationary point of L (Eq 9) is a feasible local minimum of f (Eq 10) and derive some useful quantities for the subsequent discussion. In order to certify that a weight vector, is a feasible local minimizer of f (x), the first-order necessary conditions, also known as the Karush-Kuhn-Tucker (KKT) conditions, and the second-order sufficient conditions must be satisfied (Nocedal and Wright, 2006). The KKT conditions state that the gradient of L(x, Λ) with respect to x is zero and x must be a feasible point in weight space. The second-order sufficient conditions require that the Hessian of f (x) is positive semidefinite along feasible descent directions near x.
The necessary conditions for x to be a feasible local minimizer of f appear in Prop 1 where ∇ a , ∇ h , and ∇ Q •,k are the gradient operators with respect to a, h, and Q •,k , respectively, and is a quadratic feature matrix.
PROPOSITION 1. Karush-Kuhn-Tucker (KKT) conditions: the first-order necessary conditions for a feasible local minimum of f (x) are is the Jacobian of the constraint vector with respect to Q •,k and are the feasibility conditions.
With the constraints being linear and the Jacobian being full row-rank, all w k satisfy the linear independence constraint qualification regularity condition and therefore the KKT conditions are guaranteed to be satisfied at a feasible stationary point of L (Nocedal and Wright, 2006). Notably, when the KKT conditions are satisfied Q •,k is complementary to (i.e. is a null space component of) Eq 29 because Q T •,k A T k,k = w T k = 0 (Eq 8) and therefore where the right-hand-side of the arrow follows from the fact that the bracketed term is a symmetric matrix. In other words, Q •,k is complementary to the quadratic gradient found in the bracketed term. This also implies that the Lagrange multipliers, Λ •,k , are components of the null column-space of the partial Jacobian matrix, A k,k , for each k. This result will become important in the following discussion of locally/globally optimal regularization domains.
For nonconvex problems, the KKT conditions in Prop 1 are insufficient to guarantee that a stationary point is a feasible local minimum of f . Instead, we turn to the second-order sufficient conditions. PROPOSITION 2. Second-order sufficient conditions: if the KKT conditions in Prop 1 are satisfied at point x * in weight space and where A is the full rD × (1 + D + 2rD) Jacobian matrix of the constraints and N (A) returns the null space of A, then x * is a feasible local minimum of f .
Intuitively, the second-order sufficient conditions mean that the Hessian must be positive semidefinite along feasible descent directions arbitrarily close to the stationary point x * .
In terms of z t (Eq 1), the Hessian with respect to x may be written as where the Hessian operator is defined as is a symmetric block diagonal indefinite matrix and is a strictly positive semidefinite diagonal matrix. Due to the indefinite matrix, M (see definition in Eq 33), f is a nonconvex function.

Locally and globally optimal regularization domains
Generally speaking, the fact that the low-rank MNE optimization problem is nonconvex invites the possibility that there are multiple local minima, many of which may be suboptimal. However, since the nuclear-norm regularization penalty is convex, it can be shown that there is a regularization domain for which any solution to the low-rank MNE problem (Eq 8) is guaranteed to be globally optimal. We first observe that there is some value of the regularization parameters for which the Hessian of f (Eq 33) becomes positive semidefinite at a given x.
PROPOSITION 3. For a given x, there is a threshold value of k that satisfies k < λ max (M) where λ max (M) is the largest eigenvalue of M such that if all k are greater than or equal to this threshold then ∇ 2 xx f evaluated at x is guaranteed to be positive semidefinite.
PROOF. Under the assumption of fixed x, the characteristic polynomial of M + ∇ 2 xx * (Eq 33) is (using the block LDU decomposition) where λ is an eigenvalue of the Hessian and I is the identity matrix. For any λ that is a solution, there is a corresponding eigenvalue, λ = k ± |λ − k |, symmetric across k that is also a solution. Since all k ≥ 0, the minimum possible eigenvalue of M + ∇ 2 xx * at a given x occurs when any k = 0 and is equal to the smallest eigenvalue of M, λ min (M) ≡ −λ max (M). Therefore, if k ≥ λ max (M) for all k at a given x, then the Hessian is guaranteed to be positive semidefinite at x. Since λ min (∇ 2 xx f ) ≥ λ min (RR T ) + λ min (M + ∇ 2 xx * ), k = λ max (M) is an upper bound on the value of k above which the Hessian becomes positive semidefinite for a given x. Now, suppose we design a (nonlinear) semidefinite program (SDP) equivalent to the low-rank MNE problem (Eq 8), min a,h,X 1 ,··· ,Xr f (a, h, X 1 , · · · , X r ) where X k ≡ Q •,k Q T •,k is a 2D ×2D rank-one matrix with regularization parameter k , J k is the upper-right D × D block of X k , and X k ≥ 0 indicates X k is constrained to be positive semidefinite. In this form, the SDP is nonconvex with potentially many local minima. If, however, we relax the SDP by eliminating the rank constraint (or equivalently modifying it to rank(X k ) ≤ 2D for all k), then the problem becomes convex.
solved at x * then x * is a feasible global minimizer of the equivalent SDP (note that ∇ X k L is a 2D × 2D gradient matrix instead of a gradient vector).
PROOF. According to Prop 3, there is some value for each k above which the eigenvalue spectrum of 2∇ X k f is strictly positive for a given x. However, since ∇ X j L = ∇ X k L for any j and k, the threshold value of k such that the Hessian is positive semidefinite is the same for all k. Thus, if the minimal assigned value of k across all k satisfies Eq 38, the Hessian is positive semidefinite at x. If x * is a solution to the low-rank MNE minimization problem (Eq 8) and Prop 4 is true, then all ∇ X k f are positive semidefinite matrices and ∇ X k f X * k = 0 (Eq 31) at x * . It follows that x * is then a solution to the relaxed SDP and a global minimizer of f (Burer and Monteiro, 2003;Bach et al., 2008;Haeffele et al., 2014). This is true because the corresponding SDP weights for a, h, and all X k are a feasible local minimizer of the relaxed SDP along feasible descent directions shown via the first-order Taylor series expansion of f with respect to all X k about solution a * , h * , X * = k X * k : No feasible X k can locally decrease f because the trace of a product of two positive semidefinite matrices is greater than or equal to zero. Furthermore, since f is a convex function of a, h, and X k , this local information is sufficient to conclude that x * is at a feasible global minimum of f . Therefore, solutions to the low-rank MNE minimization problem (Eq 8) and the rank-constrained SDP (Eq 37) are globally optimal for a given rank and set of regularization parameters provided Prop 4 is satisfied.
Low-rank MNE models optimized to satisfy Prop 4 constitute a regularization domain of globally optimal solutions to the low-rank MNE minimization problem. Low-rank models within this globally optimal domain can be a consistent, good approximation to J of a given maximum rank r provided the training data is sufficiently representative of the underlying ground truth value of J. This can be particularly helpful when attempting to extract low-rank J from exceptionally high-dimensional problems that are impractical to solve at full-rank.
A secondary consequence of the above analysis is that, in the locally optimal regularization domain, solutions are unique along a given unit matrix,Q = Q Q F , where · F is the Frobenius norm.
PROPOSITION 5. Given a solution x * to the low-rank MNE problem, the associated quadratic weights, Q * , are a unique solution along the unit matrixQ * up to a change in sign of the columns (i.e. ±Q * •,k are equivalent solutions).
PROOF. Unlike in the globally optimal regularization domain where ∇ X k f is a positive semidefinite matrix at solution x * , in the locally optimal regularization domain ∇ X k f may be indefinite at a solution x * . However, f is still a convex function of the SDP weights, a, h, and X k . Furthermore, X * k is still complementary to ∇ X k f (Eq 39) at the solution. Therefore, X * k is a unique solution along unit matrixX * k . This unit direction matrix is represented in factorized space byX * k = (±Q * •,k )(±Q * T •,k ) =Q * •,kQ * T •,k . This factorization of the unit matrix,X * k , defines a unique direction in factorized space,Q * •,k , and thus Q * •,k from solution x * is also a unique solution up to a change in sign.
Intuitively, this means that a local minimum constrained to the unit directionQ is a global minimum alonĝ Q. In the case of degeneracy in Q, arbitrary rotations of degenerate vectors are globally optimal along any unit direction in this degenerate subspace. This result does not find use in this paper but we provide it here for completeness in case future advancements in numerical optimization techniques are able to exploit this structure to efficiently obtain certifiable global minima of problems in the locally optimal domain. At the moment, this property would theoretically decrease the number of iterations necessary to reach the global minimum using a branch and bound algorithm but the algorithm would still be at worst EXPTIME and impractical to implement for large D.
To investigate the existence of suboptimal local minima in the locally optimal domain, we took an empirical approach where we searched for a counterexample to the hypothesis that, despite nonconvexity, there are no suboptimal local minima. This was done by drawing random weights, x, from a normal distribution with D = 2 and r opt = 2. For each problem, N = 100 stimulus feature vectors were also drawn from a normal distribution. The response was generated using Eq 1. Then r = 1 low-rank MNE models were fit 10 times for each randomly drawn problem with different random initializations of the weights. If two models for a given random problem differed in negative log-likelihood by 1 · 10 −4 at their respective minima, then that problem qualified as a potential counterexample and was stored. To ensure that the difference in negative log-likelihood was not due to imprecise fitting, the detected counterexamples were verified by plotting f (Eq 10) in U space and observing the existence of spatially separated minima. We found that there are indeed counterexamples to the hypothesis and it is therefore possible that suboptimal local minima will exist in a given problem. However, we remark that suboptimal local minima are seemingly rare among problems drawn from a normal distribution, often requiring the generation of on the order of ∼ 10 3 random problems to find one with a suboptimal local minimum.

Convergence of the block coordinate descent algorithm
Since the interior-point algorithm (Nocedal and Wright, 2006) used to solve each block problem already guarantees global convergence to a feasible local minimizer of a given block of weights, x k , with fixed regularization parameter, k , each cycle through the r blocks (Eq 13) of the block coordinate descent algorithm leads to a monotonically decreasing series f (x r ) for the jth cycle. Because f ≥ 0 is bounded from below on the domain of the weights, the cost function value cannot decrease indefinitely and the series saturates as j → ∞ to a stationary point. At this stationary point, the gradient of the Lagrangian with respect to each block is simultaneously zero and all ∇ 2 x k x k f are simultaneously positive semidefinite when projected into the null space of the constraints, However, satisfying these conditions alone does not guarantee that this stationary point is a feasible local minimizer of the low-rank MNE problem (Eq 8). It can be shown that this stationary point is, in fact, a feasible local minimum of L by verifying that the necessary and sufficient conditions of the full problem (Prop 1 & 2 in appendix) are satisfied when the block KKT conditions (right-hand-side of Eq 14) and block second-order sufficient conditions (Eq 40) are satisfied.
In the following discussion, we will assume that the data has been prepared sufficiently according to Assumption 1, which may be done without loss of generality. We formally state the necessary and sufficient conditions under which a stationary point of the block coordinate descent is a feasible local minimizer of the low-rank MNE problem (Eq 8) in Prop 6. ASSUMPTION 1. The feature space satisfies 41) or has been transformed in such a way that this is true.
Intuitively, this assumption means that the covariance of the stimulus feature space must be full-rank. If the stimulus space does not satisfy this assumption, one can, for example, project the stimuli into the subset of principal components with non-zero variance without loss of generality.
PROPOSITION 6. If, for all k, x k are feasible local minima of the block subproblems (Eq 13) where the block KKT conditions and block second-order sufficient conditions S k ≥ 0 (Eq 40) are satisfied, then the equivalent weight vector, x, of the full problem is a feasible local minimizer of the low-rank MNE problem (Eq 8).
PROOF. (Part 1: necessary conditions) The KKT conditions in Eq 27, 28, & 29 are trivially satisfied when the block gradients are zero. Also, each constraint vector w k is only dependent on the kth block and therefore the feasibility conditions in Eq 30 are satisfied as well.
Showing that the second-order sufficient conditions are satisfied is a bit more involved. To make the proof less cumbersome, we define the abbreviations for 0 < i ≤ j ≤ r where i and j can be thought of as slicing indices (the same conditions apply to i and j ) that define a submatrix of a larger matrix. Under Assumption 1, R i:j,• R T i:j,• is positive definite provided all Q •,i , · · · , Q •,j = 0. If any Q •,k = 0 for i ≤ k ≤ j, then R i:j,• R T i:j,• is strictly positive semidefinite. The matrix Z i,j is generally an indefinite matrix and has rank-deficiency rank N (Z i,j ) ≥ rank ([Q •,i , Q •,i+1 , · · · , Q •,j ]) (recall from Eq 31 that Q •,k is complementary to Z k,k ). The matrix B is always positive definite.
In terms of these abbreviations, we rewrite S k (Eq 40) as and take the Shur complement over B which is positive semidefinite and therefore is positive semidefinite. Provided Q •,k = 0, it is much more likely that Θ k,k will be positive definite than strictly positive semidefinite given the structural dissimilarity of R k,• R T k,• and Z k,k . Even if Θ k,k is singular, an infinitesimal increase to k would theoretically produce a full-rank approximation to Θ k,k with negligible impact on the weights. From here forward, it is assumed that Θ k,k is positive definite unless Q •,k = 0.
We will now use the result in Eq 50 (and the surrounding discussion) and recursive application of the Schur complement to show that the second-order sufficient conditions (Prop 2) of the full problem are satisfied.