Identifying Imaging Markers for Predicting Cognitive Assessments Using Wasserstein Distances Based Matrix Regression

Alzheimer's disease (AD) is a severe type of neurodegeneration which worsens human memory, thinking and cognition along a temporal continuum. How to identify the informative phenotypic neuroimaging markers and accurately predict cognitive assessment are crucial for early detection and diagnosis Alzheimer's disease. Regression models are widely used to predict the relationship between imaging biomarkers and cognitive assessment, and identify discriminative neuroimaging markers. Most existing methods use different matrix norms as the similarity measures of the empirical loss or regularization to improve the prediction performance, but ignore the inherent geometry of the cognitive data. To tackle this issue, in this paper we propose a novel robust matrix regression model with imposing Wasserstein distances on both loss function and regularization. It successfully integrate Wasserstein distance into the regression model, which can excavate the latent geometry of cognitive data. We introduce an efficient algorithm to solve the proposed new model with convergence analysis. Empirical results on cognitive data of the ADNI cohort demonstrate the great effectiveness of the proposed method for clinical cognitive predication.


INTRODUCTION
Alzheimer's disease (AD), the most common form of dementia, is a Central Nervous System (CNS) chronic neurodegenerative disorder with progressive impairment of learning, memory and other cognitive function. As an incurable disease which severely impacts human thinking and behavior, Alzheimer's disease is the 6th cause of death in the United States (Alzheimer's Association, 2018). Along with the rapid progress in high-throughput genotype and brain image techniques, neuroimaging has been developed to effectively predict the progression of AD or cognitive performance in plentiful research (Ewers et al., 2011;Wang et al., 2011b), which benefits for early diagnosis and explorition of brain function associated with AD (Petrella et al., 2003;Avramopoulos, 2009). The Alzheimer's Disease Neuroimaging Initiative (ADNI) (Mueller et al., 2005;Jack et al., 2008) provides neuroimaging and cognitive measurement of normal aging, mild cognitive impairment as well as AD samples, which provides a wealth of resources for the study of Alzheimer's diagnosis, treatment and prevention.
Until now, numerous studies (Eskildsen et al., 2013;Moradi et al., 2015) have utilized neuroimaging techniques to detect pathology associated with AD. Among them, structural magnetic resonance imaging (MRI) is the most extensively used imaging modality in AD related studies because of its completely noninvasive nature, high spatial resolution, and high availability. Thus, researchers have extracted plentiful MRI boimarkers in classifying AD patients in different disease over the past few years (Duchesne et al., 2008;Eskildsen et al., 2013;Guerrero et al., 2014). And these abundant MRI boimarkers have been used to many AD related studies, such as AD status prediction and MCI-to-AD conversion prediction. Despite of great efforts, we still cannot identify informative AD-specific biomarkers for the early diagnosis and prediction of disease progression. The reason for this is that the number of clinical status of AD is small, which makes it difficult to observe and understand the cognitive progression.
Consequently, many studies use clinical cognitive tests to measure cognitive assessment. Recently, several clinical tests have been presented to access individual's cognitive level, such as Trail making test (TRAILS) and and Rey Auditory Verbal Learning Test (RAVLT) (Schmidt, 1996). Through predicting the cognitive scores with MRI biomarks, we can explore the association between imaging biomarkers and AD and find informative ADspecific biomarkers. Therefore, a wide range of machine learning approaches have been proposed to predict the cognitive scores and uncover the pathology associated with AD (Wang et al., 2011a(Wang et al., , 2016Moradi et al., 2017).
In the current study of predicting cognitive scores with longitudinal phenotypic markers extracted from MRI data, regression method has been demonstrated as a effective way to excavate the correlation between cognitive measures. To modify the traditional regression model, recent methods proposed to integrate novel regularization term (such as sparse regularization and low-rank regularization) into the traditional regression model (Obozinski et al., 2010;Jie et al., 2015;Moradi et al., 2017). In fact, the intrinsic idea of the study mentioned above is utilizing different matrix norm or the combination of matrix norms as the similarity measures of the empirical loss or regularization to fit the prior assumption of neuroimaging markers. Though the effectiveness of specific matrix norm as regularization, these matrix norms simply meet the assumption rather than make full use of the inherent geometry of the data. Thus, it is easy to achieve a suboptimal solution for these models.
To tackle this problem, in this paper we consider Wasserstein distance as distance metric for regression model. Different from L p distances (p ≥ 0) (Luo et al., 2017) or Kullback-Leibler (Csiszár and Shields, 2004) and other f -divergences (Ali and Silvey, 1966), Wasserstein distance is well-defined between any pair of probability distributions over a sample space equipped with a metric. Thus, it provides a meaningful notion of distance for distributions supported on non-overlapping low dimensional manifolds. For better performance of cognitive score predication, we propose to substitute Wasserstein distance for matrix norm.
Although successfully applied to image retrieval (Rubner et al., 2000), contour matching (Grauman and Darrell, 2004), cancer detection (Ozolek et al., 2014), super-resolution (Kolouri and Rohde, 2015), and many other problems, there is an intrinsic limitation of Wasserstein distances. In fact, Wasserstein distances are defined only between measures having the same mass, which makes it difficult to applied Wasserstein distance into cognitive score prediction. To overcome such a limitation, many existed study Rossi, 2014, 2016;Kondratyev et al., 2016), have been proposed. However, these methods are all based on distributions or histogram features of data. As we know, in cognitive score prediction, we usually use the original features rather histogram features to learn the regression model parameters. Additionally, most of these methods use traditional matrix norm to characterize model parameters in Wasserstein distance loss minimization problem. This often leads to suboptimal results since matrix norm is usually sensitive to real noise.
To perfectly integrate Wassterstein distance into regression model for better performance of cognitive score prediction, in this paper we propose a novel efficient and robust Matrix Regression method to employ Joint Wasserstein distances minimization on both loss function and regularization (JWMR for short). Different from the existing methods, which need to extract histogram features of data in the preprocessing stage and then calculate Wasserstein distances based on them, our method considers histogram operator as an important component of objective function and uses it to constrain loss term and the estimated model parameters which are generated by original data features. This is the first time for exploiting Wasserstein distance as loss and regularization terms. As a result, our method is more reliable and applicable than traditional regression method using ℓ p -norm regularizer. We derive an efficient algorithm based on a relaxed formulation of optimal transport, which iterates through applications of alternating optimization. We provide the convergence analysis of our algorithm and describe a statistical bound for the proposed new model. We apply our method on cognitive data of the ADNI cohort and obtain promising results.
Our main contributions are three-fold: (1) The proposed robust matrix regression via joint Wasserstein distances minimization to circumvent the natural limitation of matrix norms in regression model; (2) The proposed model is suitable for revealing the relationship between cognitive measures and neuroimaging markers; (3) Because our method not only includes composition of W(·, ·), but also the computations of Wasserstein distances with regard to different terms, we derive an efficient algorithm to solve this problem with convergence analysis.

Notations
We summarize the notations and definitions used in this paper. Matrices are written as boldface uppercase letters. · F and · * denote Frobenius norm and nuclear norm, respectively. ·, · is the inner product operation. e ∈ R m is a column vector of ones. 0 ∈ R m is a column vector of zeros. For vector m ∈ R m , its i-th element is denoted by m (i) . For matrix M ∈ R n×m , its i-th row, j-th column and (i, j)-th element are denoted by m i , m j , and m ij . The ℓ 2,1 -norm of M is defined as where m i 2 denotes the ℓ 2 -norm of the vector m i . We define the Kullback-Leibler (KL) divergence between two positive vectors by where / denotes the element-wise division.

Matrix Regression for Cognitive Score Prediction
In the association study of predicting cognitive scores from imaging markers, a wide range of work has employed regression models to uncover the relationship between neuroimaging data and cognitive test scores and predict cognitive score. Given the imaging feature matrix A ∈ R m×l and the cognitive score matrix Y ∈ R l×n , a common paradigm for regression to predict cognitive score is to minimize the penalized empirical loss: where λ > 0 is the balance parameter, Z ∈ R m×n is the weight matrix, which is estimated from the imaging feature matrix A and the cognitive score matrix Y to capture the relevant features for predicting the cognitive scores, L(Y − A T Z) is the empirical loss on the training set, and (Z) is the regularization term that encodes imaging feature relatedness. Different assumptions on the loss L(Y − A T Z) and variate Z lead to different models. The representative model include: Least Squares Regression (LSR) (Lu et al., 2012): Low Rank Representation (LRR) (Liu et al., 2010): Feature Selection Based on ℓ 2,1 -norm (Nie et al., 2010):

Feature Selection for Informative Imaging Marker Identification
Due to the progress and prosperity of brain imaging and highthroughput genotyping techniques, a large amount of brain imaging data is available and a great quantity of imaging markers is alternative to predict cognitive score. However, not all of them are related to the pathological changes specific to AD, namely some imaging markers are redundancy for the prediction task.
A forthright method to tackle this problem is to perform feature selection, which aims to choose a subset of informative features for improving prediction.
Feature selection has been demonstrated as a efficient way to reflect the correlation between cognitive measures after removing the non-distinctive neuroimaging markers. Regression techniques with specific regularization can also used to identify discriminative imaging markers. For instance, sparse regression models have been extensively utilized to select discriminative voxels for AD study in previous works (Guerrero et al., 2014;Liu et al., 2014;Xu et al., 2017). Many sparse-inducing norm have been iterated into the spare regression model: ℓ 1 shrinkage methods such as LASSO can identify informative longitudinal phenotypic markers in the brain that are related to pathological changes of AD (Liu et al., 2014); group LASSO with a ℓ 2,1 -norm can select the most informative imaging markers related to all participants including AD, mild cognitive impairment (MCI) and healthy control (HC) by imposing structured sparsity on parameter matrix (Jie et al., 2015); ℓ 1,1 -norm regularization term can achieve both structured and flat sparsity (Wang et al., 2011a).
Nevertheless, matrix norms such as ℓ 1 -norm, ℓ 2,1 -norm, and ℓ 1,1 -norm have the natural limitation that they can not take the inherent geometry of the data into account. On this account, we need to select a new distance metric to measure the empirical loss and regularization term. In this paper, we choose the smoothed Wassersetein distance as the distance metric.

Smoothed Wasserstein Distance
Wasserstein distance, originally introduced in Monge (1781), is a powerful geometrical tool for comparing probability distributions. It is derived form the optimal transport theory and is intrinsically the optimal solution of transportation problem in linear programming (Villani, 2008).
In a more formal way, given access to two sets of points , we construct two empirical probability distributions as followŝ where p S i and p T i are probabilities associated to x S i and x T i , respectively, and δ x is a Dirac measure that can be interpreted as an indicator function taking value 1 as the position of x and 0 elsewhere. For these two distribution, the polytope of transportation plans between X S and X T is defined as follows: Given a ground metric matrix C ∈ R N S ×N T + , the optimal transport consists in finding a probabilistic coupling defined as a joint probability measure over X S × X T with marginalsμ S andμ T that minimize the cost of transport min P∈Uμ S ,μ T C, P , where P = p(i, j), i = 1, · · · , N S , j = 1, · · · , N T is the flownetwork matrix, and p(i, j) denotes the amount of earth moved from the source X S to the target X T . This problem admits a unique solution P * and defines a metric on the space of probability measures (called the Wasserstein distance) as follows: Optimizing Wasserstein distance problem requires several costly optimal transport problems. Specialized algorithm can solve it (Orlin, 1993). To solving the computational problem, recent works have proposed novel method to accelerate the calculation procedure. Furthermore, as a minimum of affine functions, the Wasserstein distance itself is not a smooth function of its arguments. To overcome the above problems, Cuturi (2013) proposed to smooth the optimal transport problem with an entropy term: where γ > 0 and e(·) is the entropy function: e(P) = − P, log(P) .
With the entropy term, we can use Sinkhorn-Knopp matrix scaling algorithm to solve the optimal transport problem (Sinkhorn and Knopp, 1967).

MATRIX REGRESSION BASED ON JOINT WASSERSTEIN DISTANCE
In the above formulations, the loss term and estimated variate are characterized via the simple matrix norm. Thus, these models can be easily solved by conventional convex optimization methods [e.g., ADMM (Liu et al., 2010), gradient based methods (Bubeck et al., 2015), and reweighted iterative methods (Nie et al., 2010)]. However, they do not take into account the geometry of the data through the pairwise distances between the distributions' points. Accordingly, these models often achieve the suboptimal results in cognitive score predication.

Joint Wasserstein Matrix Regression
Comparing with matrix norm, Wasserstein distance can circumvent the above limitation. Therefore, in this paper we propose to use Wasserstein distance to jointly characterize loss term and estimated variate Z, which is formulated as where h(·) and Y i denote the histogram operator and ith row of matrix Y, respectively. It should be noted that we use the histogram operator to constrain each variable in model (13).
Due to the relax operation in (14), we can straightly utilize the original data A T Z, Y, and Z in model (16). Thus, we do not need to extract the histogram features of data in the preprocessing stage, which makes it suitable for the prediction task in neuroimaging data. Strong convexity of model (16) is given by the entropy terms KL(P|K). Thus, we propose to solve (16) by block coordinate descent, alternating the minimization with respect to the parameters {P (1) , · · · , P (l) ,P (1) , · · · ,P (m) } and each Z i , which can be updated independently and therefore in parallel. This is summarized in Algorithm 1. We now detail the two steps of the procedure.
Updating coefficient matrix Z. Minimizing with respect to one Z i while keeping all other variables fixed to their current estimate yields the following problem Recalling the definition (2), it is easy to calculate the gradient of objective (17) with regard to each Z i . Thus, we can use accelerated gradient descent (Bubeck et al., 2015) to optimize problem (17). Updating parameter set {P (1) , · · · , P (l) ,P (1) , · · · ,P (m) }. For fixed Z, the update of each P (i) andP (i) boils down to an OT problem, which can be solved via Sinkhorn iteration (Cuturi, 2013). These steps are summarized in Algorithm 2, where we list the detailed iteration process for each P (i) . For eachP (i) , we need to replace (A T Z) i and Y i with Z i and 0.

Convergence Analysis
Following Sandler and Lindenbaum (2011), we can derive the theorem as follow.
Theorem 1. Algorithm 1 converges to a local minimum.
Proof: Algorithm 1 is the alternative iteration with two iteration stage. In the first stage, we can use gradient descent to solve the convex problem (17). Thus it is obvious that it has a feasible solution. And in the second stage, the problem is a sequence of linear programming processes. As shown in (Sandler and Lindenbaum, 2011), there is a feasible solution for every one of them. To sum up, a feasible solution for (16) exists.
J(Z; P (1) , · · · , P (l) ,P (1) , · · ·P (m) ) is convex, so applying (17) can derive globally optimal Z k when given a {P (1) , · · · , P (l) ,P (1) , · · ·P (m) } k−1 , where k denotes the iteration time. Besides, linear programming minimizes the flow-network matrix P andP. Thus, we can find global optimal P k andP k for a give Z k−1 . Furthermore, the accelerated gradient descent used The bold values indicate the minimal value in the raw (i.e., the best performance among these methods). The bold values indicate the minimal value in the raw (i.e., the best performance among these methods).
to update Z and the Sinkhorn Iteration used to update P,P both have been proven converge. Since the objective in these two stage is the same, In above, every iteration of Algorithm 1 monotonically decreases J(Z; P 1 , · · · , P (l) ,P 1 , · · ·P (m) ). This objective is lower bounded, and therefore the algorithm converges.

EXPERIMENTAL RESULTS
In this section, we evaluate the prediction performance of our proposed method by applying it to the Alzheimer's Disease Neuroimaging Initiative (ANDI) database (adni.loni.usc.edu), where a plenty of imaging markers measured over a period of  v i ← ((Y i ) T /K T u i ) 6: until convergence 7: P (i) ← (p (i)jt ) n×n , where the (j, t)-th element of P (i) is p (i)jt = u i(j) k (i)jt v i(t) 8: end for 2 years are examined and associated to cognitive scores that are relevant to AD.

Data Description
The data used in the preparation of our work were obtained from the ADNI cohort. As we know, two widely employed automated MRI analysis techniques were used to process and extract imaging phenotypes form scans of ADNI participants (Shen et al., 2010). One is Voxel-Based Morphometry (VBM) (Ashburner and Friston, 2000), which was performed to define global gray matter (GM) density maps and extract local gray matter density values for 90 target regions. The other one is automated parcellation via FreeSurfer V4 (Fischl et al., 2002), which was conducted to define volumetric total intracranial volume (ICV). All these measures were adjusted for the baseline ICV using the regression weights derived from the healthy control (HC) participants. In this study, there are 805 participants, including 186 AD samples, progressive mild cognitive impairment (pMCI) samples, 167 stable mild cognitive impairment (sMCI) samples and 226 health control (HC) samples. In our work, we adopt FressSurfer markers and VBM markers as imaging phenotypes. Furthermore, the longitudinal scores were downloaded form three independent cognitive assessments including Fluency Test, RAVLT, and TRAILS. The details of these cognitive assessments can be found in the ADNI procedure manuals. The detailed information are shown in Table 1.

Performance Comparison on the ADNI Cohort
To evaluate the performance of our model, we compare it with the following related methods: RR (multivariate ridge regression), ℓ 2,1 (robust feature selection based on ℓ 2,1 -norm), RSR (Regularized Self-Representation) (Zhu et al., 2015), and RLRSS (Robust Low-Rank Structured Sparse Model) (Xu et al., 2017). These comparing methods are all widely used in statistical learning and brain image analysis.
In the experiments, we use ridge regression for the prediction experiment after selecting the top related imaging markers. We tune the hyper-parameter of all models in the range of {10 −4 , 10 −3 , · · · , 10 4 } via nested five-fold cross-validation strategy, and report the best result of each method. To measure prediction performance, we compute the root mean square error (RMSE) between the predicted score and the ground truth.
The average results for each method are reported in Figure 1, while, we also list the RMSE using the top 10 and 30 imaging markers and reported in Tables 2, 3. It can be seen that our ap proach obviously outperforms most of methods significantly. Different matrix norms fit different assumption of the cognitive measures, which makes it enable to uncover part of the correlation of cognitive measures. However, due to the natural limitation of the matrix norms, they fails to uncover the inherent geometry of the cognitive data. As for our proposed method, with the effectiveness of Wasserstein distance, it can well utilize the inherent geometry to reveal the underlying relationship between cognitive measures and neuroimaging markers.

Identification of Informative Markers
The primary goal of the proposed method is to identify the discriminative AD-specific imaging biomarkers which is crucial for early detection, diagnosis and prediction of AD. Therefore, we examine the neuroimaging markers selected by our method and show it in Figure 2. Visualizing the parameter weights shown in Figure 2 can help us locate the informative markers which play important roles in the corresponding cognitive prediction tasks. As the heat map in Figure 2 shows, different coefficient values are represented in different colors. The yellow polar means a significant effect of corresponding markers on cognitive score performance.
As the Figure 2 shows, the extracted informative imaging biomarks are highly AD-specific and effective for related studies of AD, since it actually meets with the existing research findings. For example, among the top selected features, we found that hippocampal volume (HippVol) and middle temporal gyrus thickness (MidTemporal) are on the top, whose impact on AD have already been proved in the previous papers (Braak and Braak, 1991;West et al., 1994). Furthermore, it also confirms the important significance of the selected neuroimaging cognitive associations to uncover the relationships between MRI measures and cognitive levels.

Visualization of Top Identified Imaging Markers
As shown in Figure 3, we also visualize the top ten selected features for RAVLT memory score prediction on brain map as a demonstration. In the brainmap for FreeSurfer, the top 15 brain regions are (in descending order according to the ℓ 2 -norm of feature weights): LPrecuneus, RCerebellWM, LHippVol, RCerebellCtx, RMedOrbFrontal, RLatVent, RCerebWM, RPrecuneus, LParahipp, LMidTemporal, LInfTemporal, RParacentral, LLingual, LPutamVol, RBanksSTS. In the brainmap for VBM, the top 15 brain regions are (in descending order according to the ℓ 2 -norm of feature

CONCLUSION
To reveal relationship between neuroimaging data and cognitive test scores and predict cognitive score, we proposed a novel efficient matrix regression model which employs joint Wasserstein distances minimization on both loss function and regularization. To eliminate the natural limitation of the matrix norm in regression model, we utilize Wasserstein distance as distance metric. Wasserstein based regularizer can promote parameters that are close, according the OT geometry, which take into account a prior geometric knowledge on the regressor variables. Thus, our proposed method Furthermore, we provide an efficient algorithm to solve the proposed model. Extensive empirical studies on ADNI cohort demonstrate the effectiveness of our method.

AUTHOR CONTRIBUTIONS
JY, LL, CD, and HH designed the regression framework and implemented algorithm. JY wrote the manuscript and made the experiment. XW processed the data. JY and XW prepared figures and tables. CD, LL, XY, HH, and LS supervised study and revised the manuscript. All the authors approved the final version of the manuscript.