Correlation Constraints for Regression Models: Controlling Bias in Brain Age Prediction

In neuroimaging, the difference between chronological age and predicted brain age, also known as brain age delta, has been proposed as a pathology marker linked to a range of phenotypes. Brain age delta is estimated using regression, which is subject to a frequently observed bias: a negative correlation between chronological age and brain age delta. In brain age prediction models, this correlation manifests as an overprediction of the age of young brains and an underprediction for elderly ones. We show that this bias can be controlled for by adding correlation constraints to the model training procedure. We develop an analytical solution to this constrained optimization problem for Linear, Ridge, and Kernel Ridge regression. The solution is optimal in the least-squares sense, i.e., there is no other model that satisfies the correlation constraints and has a better fit. Analyses on the PAC2019 competition data demonstrate that this approach produces optimal unbiased predictive models with a number of advantages over existing approaches. Finally, we introduce regression toolboxes for Python and MATLAB that implement our algorithm.
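To see the bias concretely, here is a minimal simulation (synthetic data, not PAC2019; ordinary least squares stands in for the models treated below) in which the fitted brain age delta correlates negatively with chronological age:

```python
# Sketch with synthetic data: regression to the mean makes the brain age
# delta (here taken as predicted minus chronological age, one common sign
# convention) correlate negatively with chronological age.
import numpy as np

rng = np.random.default_rng(0)
n, p = 200, 10
age = rng.uniform(20, 80, size=n)                # chronological age
X = np.outer(age, rng.normal(size=p))            # age-related features
X = X + rng.normal(scale=20.0, size=(n, p))      # plus measurement noise

# Plain OLS fit on centered data.
Xc, yc = X - X.mean(axis=0), age - age.mean()
beta = np.linalg.lstsq(Xc, yc, rcond=None)[0]
pred = Xc @ beta + age.mean()

delta = pred - age                               # brain age delta
print(np.corrcoef(age, delta)[0, 1])             # negative correlation
```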


EQUAL COEFFICIENTS FOR MODELS WITH AND WITHOUT INTERCEPT
The coefficient vector $\beta_{1:p}$ (excluding the intercept $\beta_0$) obtained from a model including an intercept is the same as the $\beta$ obtained from a model without an intercept but where $X$ and $y$ have been centered. To prove this, let us start with Ridge regression. First, let us consider a model without an intercept where we center the data using $\bar{x} = \frac{1}{n}\, X^\top \mathbf{1} \in \mathbb{R}^p$ (column means of $X$) and $\bar{y} = \frac{1}{n}\, \mathbf{1}^\top y \in \mathbb{R}$ (mean of $y$). Writing $X_c = X - \mathbf{1}\bar{x}^\top$ and $y_c = y - \bar{y}\mathbf{1}$ for the centered data, the loss function and its gradient are given by

\[
L(\beta) = \|y_c - X_c \beta\|_2^2 + \lambda \|\beta\|_2^2, \qquad \nabla L(\beta) = -2\, X_c^\top (y_c - X_c \beta) + 2\lambda \beta .
\]

Setting the gradient to zero and rearranging we arrive at

\[
\beta = (X_c^\top X_c + \lambda I)^{-1} X_c^\top y_c . \tag{S1}
\]

Conversely, let us assume that the data has not been centered and the model includes an intercept $\beta_0$, that is, $\beta = [\beta_0, \beta_{1:p}]$. We will not augment $X$ with a column of ones but instead write the intercept explicitly. This leads to the loss function

\[
L(\beta) = \|y - X \beta_{1:p} - \mathbf{1} \beta_0\|_2^2 + \lambda \|\beta_{1:p}\|_2^2 .
\]

Setting the derivative with respect to $\beta_0$ to zero yields $\beta_0 = \bar{y} - \bar{x}^\top \beta_{1:p}$. Substituting this back turns the residual $y - X \beta_{1:p} - \mathbf{1} \beta_0$ into $y_c - X_c \beta_{1:p}$, so minimizing over $\beta_{1:p}$ gives

\[
\beta_{1:p} = (X_c^\top X_c + \lambda I)^{-1} X_c^\top y_c . \tag{S2}
\]
Comparing Eq. (S1) and Eq. (S2) we can see that the solutions for models with and without intercept are identical. This result immediately generalizes to OLS (set $\lambda = 0$) and Kernel Ridge regression (replace $X$ by $\Phi(X)$).
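This equivalence is easy to verify numerically. The following NumPy sketch (our own check, not code from the accompanying toolboxes) fits Ridge regression both ways, leaving the intercept unpenalized in the explicit-intercept model:

```python
# Sketch: ridge coefficients from a model with an explicit (unpenalized)
# intercept equal those of an intercept-free model fit on centered data.
import numpy as np

rng = np.random.default_rng(1)
n, p, lam = 50, 5, 2.0
X = rng.normal(size=(n, p))
y = rng.normal(size=n)

# Model 1: no intercept, centered data (Eq. S1).
Xc, yc = X - X.mean(axis=0), y - y.mean()
beta_centered = np.linalg.solve(Xc.T @ Xc + lam * np.eye(p), Xc.T @ yc)

# Model 2: explicit intercept on raw data; beta_0 is not penalized.
Xa = np.hstack([np.ones((n, 1)), X])
P = np.eye(p + 1)
P[0, 0] = 0.0                                    # exclude the intercept
beta_full = np.linalg.solve(Xa.T @ Xa + lam * P, Xa.T @ y)

print(np.allclose(beta_centered, beta_full[1:]))  # True
```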

ZERO CORRELATION CONSTRAINT
Here, we use the method of Lagrange multipliers to derive the solution for a zero correlation constraint. We will start with the OLS model and then show the equivalent results for Ridge and Kernel Ridge regression. By the previous section we may assume centered data and omit the intercept. In matrix notation, the constrained OLS optimization problem can be written as

\[
\min_{\beta}\; \tfrac{1}{2}\, \|y - X\beta\|_2^2 \quad \text{subject to} \quad \operatorname{corr}(y,\, y - X\beta) = 0,
\]

where the scaling factor $\frac{1}{2}$ does not affect the solution. Note that $\operatorname{corr}(y, y - \hat{y}) = 0$ implies zero covariance, that is, $\operatorname{cov}(y, y - \hat{y}) = 0$. If we expand the covariance term, the constraint simplifies to

\[
y^\top X \beta - y^\top y = 0 .
\]

We can now use the Lagrange multiplier $\mu$ to incorporate the constraint into the objective function:

\[
L(\beta, \mu) = \tfrac{1}{2}\, \|y - X\beta\|_2^2 + \mu \left( y^\top X \beta - y^\top y \right).
\]

The gradient of $L$ w.r.t. $\beta$ is given by

\[
\nabla_\beta L = X^\top X \beta - X^\top y + \mu\, X^\top y .
\]

Setting the gradient to zero, writing $\theta = 1 - \mu$ and solving for $\beta$ yields

\[
\beta = \theta\, (X^\top X)^{-1} X^\top y \quad \text{(OLS)}. \tag{S3}
\]
In other words, the solution is the standard OLS solution $\beta_{\mathrm{ols}} = (X^\top X)^{-1} X^\top y$ scaled by the factor $\theta$. Equivalently, in terms of the hat matrix $H = X (X^\top X)^{-1} X^\top$, the constrained prediction is

\[
\hat{y} = \theta\, H y . \tag{S4}
\]
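For $\rho = 0$ the constraint determines the scaling directly: substituting $\hat{y} = \theta H y$ into $y^\top (y - \hat{y}) = 0$ gives $\theta = \|y\|_2^2 / (y^\top H y)$. The following NumPy sketch (our own illustration on synthetic centered data, not code from the accompanying toolboxes) applies this scaling to an OLS fit:

```python
# Sketch: zero-correlation-constrained OLS is the OLS fit scaled by
# theta = (y'y) / (y'yhat), which forces cov(y, y - theta*yhat) = 0.
import numpy as np

rng = np.random.default_rng(2)
n, p = 100, 5
X = rng.normal(size=(n, p))
y = X @ rng.normal(size=p) + rng.normal(scale=2.0, size=n)
Xc, yc = X - X.mean(axis=0), y - y.mean()        # centered data

beta_ols = np.linalg.lstsq(Xc, yc, rcond=None)[0]
yhat = Xc @ beta_ols                             # unconstrained prediction
theta = (yc @ yc) / (yc @ yhat)                  # scaling factor
resid = yc - theta * yhat

print(np.corrcoef(yc, resid)[0, 1])              # ~0 up to rounding error
```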
For Ridge and Kernel Ridge regression, Eq. (S4) still holds, but the hat matrices are given by

\[
H = X (X^\top X + \lambda I)^{-1} X^\top \quad \text{(Ridge)}, \qquad H = K (K + \lambda I)^{-1} \quad \text{(Kernel Ridge)},
\]

where $K$ is the kernel matrix. In all cases, the loss function is convex and the constraint is linear, so the optimization problem is convex. Therefore, the first-order stationarity condition is sufficient and the solution represents a unique global minimum.
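The same scaling can be applied through either hat matrix. A short NumPy sketch (a linear kernel $K = X X^\top$ is chosen purely for illustration):

```python
# Sketch: zero-correlation scaling with the Ridge and Kernel Ridge
# hat matrices; the linear kernel is an illustrative choice.
import numpy as np

rng = np.random.default_rng(3)
n, p, lam = 100, 5, 1.0
X = rng.normal(size=(n, p))
y = X @ rng.normal(size=p) + rng.normal(size=n)
Xc, yc = X - X.mean(axis=0), y - y.mean()

H_ridge = Xc @ np.linalg.solve(Xc.T @ Xc + lam * np.eye(p), Xc.T)
K = Xc @ Xc.T                                    # linear kernel matrix
H_kernel = K @ np.linalg.inv(K + lam * np.eye(n))

for H in (H_ridge, H_kernel):
    yhat = H @ yc
    theta = (yc @ yc) / (yc @ yhat)              # zero-correlation scaling
    print(np.corrcoef(yc, yc - theta * yhat)[0, 1])  # ~0 in both cases
```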

NONZERO CORRELATION CONSTRAINT
We now turn to a nonzero constraint $\operatorname{corr}(y, e) = \rho$ on the residuals $e = y - X\beta$. For centered data, expanding the correlation turns the constraint into $y^\top (y - X\beta) = \rho\, \|y\|_2\, \|y - X\beta\|_2$, and the Lagrangian becomes

\[
L(\beta, \mu) = \tfrac{1}{2}\, \|y - X\beta\|_2^2 + \mu \left( y^\top X\beta - y^\top y + \rho\, \|y\|_2\, \|y - X\beta\|_2 \right).
\]

Setting the derivative to zero gives us the solution

\[
\beta = \frac{1 - \mu + \mu\rho\, \|y\|_2 / \|y - X\beta\|_2}{1 + \mu\rho\, \|y\|_2 / \|y - X\beta\|_2}\; (X^\top X)^{-1} X^\top y .
\]

Since the term $\|y\|_2 / \|y - X\beta\|_2$ is a scalar, the solution, if it exists, is again a scaled version of the OLS solution. The same result is obtained for the case $\operatorname{corr}(y, e) = -\rho$, with a sign reversal for the terms containing $\rho$. In other words, we are looking for an optimal scaling factor $\theta \in \mathbb{R}$. Since the scaling of $\beta$ implies scaling of $\hat{y}$ by the same amount, we can find the solution by performing a line search and solving Eq. (S6) for $\theta$:

\[
y^\top (y - \theta \hat{y}) = \rho\, \|y\|_2\, \|y - \theta \hat{y}\|_2 . \tag{S6}
\]

After squaring both sides and rearranging one obtains the quadratic equation

\[
\theta^2 \left( (y^\top \hat{y})^2 - \rho^2\, \|y\|_2^2\, \|\hat{y}\|_2^2 \right) - 2\theta\, \|y\|_2^2\, (y^\top \hat{y})\, (1 - \rho^2) + \|y\|_2^4\, (1 - \rho^2) = 0 .
\]

We can divide by $c := (y^\top \hat{y})^2 - \rho^2\, \|y\|_2^2\, \|\hat{y}\|_2^2$ and complete the square. This requires that $c \neq 0$ and hence $\operatorname{corr}(y, \hat{y})^2 \neq \rho^2$, which needs to be verified first. We then arrive at the two solutions

\[
\theta_{1,2} = \frac{\|y\|_2^2\, (y^\top \hat{y})\, (1 - \rho^2)}{c} \;\pm\; \frac{\|y\|_2^2}{|c|}\, \sqrt{\rho^2 (1 - \rho^2) \left( \|y\|_2^2\, \|\hat{y}\|_2^2 - (y^\top \hat{y})^2 \right)} .
\]
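The two roots can be checked numerically; after squaring, one root generically realizes $\operatorname{corr}(y, e) = +\rho$ and the other $-\rho$. A NumPy sketch on synthetic centered data (illustrative only):

```python
# Sketch: closed-form scaling factors theta_{1,2} for a target residual
# correlation rho; centered data and an OLS base fit are assumed.
import numpy as np

rng = np.random.default_rng(4)
n, p, rho = 200, 5, 0.3
X = rng.normal(size=(n, p))
y = X @ rng.normal(size=p) + rng.normal(size=n)
Xc, yc = X - X.mean(axis=0), y - y.mean()
yhat = Xc @ np.linalg.lstsq(Xc, yc, rcond=None)[0]

yy, hh, yh = yc @ yc, yhat @ yhat, yc @ yhat
c = yh**2 - rho**2 * yy * hh                     # must be checked: c != 0
root = np.sqrt(rho**2 * (1 - rho**2) * (yy * hh - yh**2))
theta1 = yy * yh * (1 - rho**2) / c + yy / abs(c) * root
theta2 = yy * yh * (1 - rho**2) / c - yy / abs(c) * root

for theta in (theta1, theta2):
    resid = yc - theta * yhat
    print(np.corrcoef(yc, resid)[0, 1])          # approx. +rho and -rho
```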