
Edited by: Rodrigo Orlando Kuljiš, Zdrav Mozak Limitada, Chile

Reviewed by: Jieping Ye, Arizona State University, USA; Heng Huang, University of Texas at Arlington, USA

*Correspondence: Dinggang Shen, Department of Radiology, Biomedical Research Imaging Center, University of North Carolina at Chapel Hill, 130 Mason Farm Road, Chapel Hill, NC 27599, USA e-mail:

This article was submitted to the journal Frontiers in Aging Neuroscience.

This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

In this work, we propose a novel subclass-based multi-task learning method for feature selection in computer-aided Alzheimer's Disease (AD) or Mild Cognitive Impairment (MCI) diagnosis. Unlike previous methods that often assumed a unimodal data distribution, we take into account the underlying multipeak^{1} data distribution of each class: we cluster the training samples of each class into subclasses and encode them with discriminative sparse codes, which serve as the target responses of an ℓ_{2,1}-penalized regression framework, through which we finally select features for classification. In our experiments on the ADNI dataset, we validated the effectiveness of the proposed method, improving the classification accuracies by 1% (AD vs. Normal Control: NC), 3.25% (MCI vs. NC), 5.34% (AD vs. MCI), and 7.4% (MCI Converter: MCI-C vs. MCI Non-Converter: MCI-NC) compared to the competing single-task learning method. The performance improvement is especially notable in MCI-C vs. MCI-NC classification, which is the most important for early diagnosis and treatment. It is also noteworthy that with the strategy of modality-adaptive weights by means of a multi-kernel support vector machine, we achieved maximal classification accuracies of 96.18% (AD vs. NC), 81.45% (MCI vs. NC), 73.21% (AD vs. MCI), and 74.04% (MCI-C vs. MCI-NC).

As the population ages, brain disorders under the broad category of dementia, such as Alzheimer's Disease (AD) and Parkinson's disease, have become a great concern around the world. In particular, AD, characterized by progressive impairment of cognitive and memory functions, is the most prevalent cause of dementia in elderly people. According to a recent report by the Alzheimer's Association, the number of AD patients is increasing significantly every year, and 10–20 percent of people aged 65 or older have Mild Cognitive Impairment (MCI), a prodromal stage of AD (Alzheimer's Association,

To this end, there have been a lot of studies to discover biomarkers and to develop a computer-aided diagnosis system with the help of neuroimaging such as Magnetic Resonance Imaging (MRI) (Cuingnet et al.,

However, from a computational modeling perspective, while the feature dimension of such neuroimaging data is high in nature, we have only a very limited number of observations/samples available. This so-called "small-sample-size" problem makes it challenging to build a robust predictive model.

In general, we can broadly categorize the approaches in the literature that aim at lowering the feature dimensionality into feature-dimension reduction and feature selection. The methods of feature-dimension reduction find a mapping function that transforms the original feature space into a new low-dimensional space. Principal Component Analysis (PCA) and Linear Discriminant Analysis (LDA) (Martinez and Kak, ) are representative methods in this category.

Meanwhile, the feature selection approach, which includes filter, wrapper, and embedded methods, selects target-related features in the original feature space based on some criteria (Guyon and Elisseeff, ). A representative embedded method is the ℓ_{1}-penalized linear regression model (Tibshirani, ). In the ℓ_{1}-penalized regression model, with a sparsity constraint using the ℓ_{1}-norm, many elements of the weight coefficient vector become zero, and thus the corresponding features can be removed. From a machine learning point of view, since the ℓ_{1}-penalized linear regression model finds one weight coefficient vector that best regresses a target response vector, it is considered a single-task learning. Hereafter, we use the terms ℓ_{1}-penalized regression model and single-task learning interchangeably.

The main limitation of the previous methods of PCA, LDA, and the ℓ_{1}-penalized regression model is that they consider a single mapping or a single weight coefficient vector in reducing the dimensionality. Here, if the underlying data distribution is not unimodal, e.g., a mixture of Gaussians, then these methods would fail to find the proper mapping or weighting functions, and thus result in performance degradation. In this regard, Zhu and Martinez proposed a Subclass Discriminant Analysis (SDA) method (Zhu and Martinez,

With respect to neuroimaging data, it is highly likely for the underlying data distribution to have multiple peaks due to inter-subject variability (Fotenos et al., ). Motivated by this observation, we propose a novel subclass-based multi-task learning method in an ℓ_{2,1}-penalized regression framework that takes into account the multipeak data distributions, and thus helps enhance the diagnostic performance.

In this work, we use the ADNI dataset publicly available on the web^{2}. We selected subjects for whom the baseline MRI, FDG-PET, and CSF data were all available^{3}.

| | AD | MCI-C | MCI-NC | NC |
| Female/male | 18/33 | 15/28 | 17/39 | 18/34 |
| Age (Mean ± SD [range]) | 75.2 ± 7.4 [59–88] | 75.7 ± 6.9 [58–88] | 75.0 ± 7.1 [55–89] | 75.3 ± 5.2 [62–85] |
| Education (Mean ± SD [range]) | 14.7 ± 3.6 [4–20] | 15.4 ± 2.7 [10–20] | 14.9 ± 3.3 [8–20] | 15.8 ± 3.2 [8–20] |
| MMSE (Mean ± SD [range]) | 23.8 ± 2.0 [20–26] | 26.9 ± 2.7 [20–30] | 27.0 ± 3.2 [18–30] | 29 ± 1.2 [25–30] |
| CDR (Mean ± SD [range]) | 0.7 ± 0.3 [0.5–1] | 0.5 ± 0 [0.5–0.5] | 0.5 ± 0 [0.5–0.5] | 0 ± 0 [0–0] |

With regard to the general eligibility criteria in ADNI, subjects were aged between 55 and 90 and had a study partner able to provide an independent evaluation of functioning. The general inclusion/exclusion criteria are described on the ADNI website^{4}.

The structural MR images were acquired from 1.5T scanners. We downloaded data in Neuroimaging Informatics Technology Initiative (NIfTI) format, which had been pre-processed for spatial distortion correction caused by gradient non-linearity and B1 field inhomogeneity. The FDG-PET images were acquired 30–60 min post-injection, averaged, spatially aligned, interpolated to a standard voxel size, normalized in intensity, and smoothed to a common resolution of 8 mm Full Width at Half Maximum (FWHM).

The MR images were preprocessed by applying the typical procedures of Anterior Commissure (AC)-Posterior Commissure (PC) correction, skull-stripping, and cerebellum removal. Specifically, we used MIPAV software^{5} and other publicly available tools^{6} for these procedures.

For each ROI, we used the GM tissue volume from MRI and the mean intensity from FDG-PET as features^{7}. In addition, we used three CSF biomarkers, namely Aβ_{42}, t-tau, and p-tau, as features.

In this section, we first briefly introduce the mathematical background of single-task and multi-task learning, and then describe a novel subclass-based multi-task learning method for feature selection in AD/MCI diagnosis.

Throughout the paper, we denote matrices as boldface uppercase letters, vectors as boldface lowercase letters, and scalars as normal italic letters, respectively. For a matrix X = [x_{ij}], its i-th row and j-th column are denoted as x^{i} and x_{j}, respectively. We further denote the Frobenius norm of a matrix X as ‖X‖_{F} = (∑_{i}∑_{j} x_{ij}^{2})^{1/2}, the ℓ_{2,1}-norm of a matrix X as ‖X‖_{2,1} = ∑_{i}‖x^{i}‖_{2}, and the ℓ_{1}-norm of a vector x as ‖x‖_{1} = ∑_{i}|x_{i}|.

Let X ∈ ℝ^{N × D} and y ∈ ℝ^{N} denote, respectively, the feature matrix and the target response vector of N training samples^{8}. A linear regression model is then defined as

y = Xw + ε    (1)

where w ∈ ℝ^{D} is a weight coefficient vector and ε ∈ ℝ^{N} denotes a residual term. For feature selection, the ℓ_{1}-penalized linear regression model has been widely and successfully used in the literature (Varoquaux et al., ), whose objective function is given as

min_{w} (1/2)‖y − Xw‖^{2}_{2} + λ_{1}‖w‖_{1}    (2)

where λ_{1} denotes a sparsity control parameter. Since the method finds a single optimal weight coefficient vector w that best regresses the target response vector y, it is considered a single-task learning.
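To make the single-task formulation concrete, the ℓ_{1}-penalized objective in Equation (2) can be solved by cyclic coordinate descent with a soft-thresholding update. The following is a minimal NumPy sketch; the synthetic data, the regularization value, and the function names are illustrative and not part of the original experiments:

```python
import numpy as np

def soft_threshold(z, t):
    # scalar soft-thresholding operator: S(z, t) = sign(z) * max(|z| - t, 0)
    return np.sign(z) * max(abs(z) - t, 0.0)

def lasso_cd(X, y, lam, n_iter=200):
    # minimizes 0.5 * ||y - X w||_2^2 + lam * ||w||_1 by cyclic coordinate descent
    N, D = X.shape
    w = np.zeros(D)
    col_sq = (X ** 2).sum(axis=0)   # per-feature squared norms ||x_j||^2
    r = y.copy()                    # residual y - X w (w starts at zero)
    for _ in range(n_iter):
        for j in range(D):
            r += X[:, j] * w[j]     # remove feature j's current contribution
            w[j] = soft_threshold(X[:, j] @ r, lam) / col_sq[j]
            r -= X[:, j] * w[j]     # add it back with the updated weight
    return w

# synthetic example: only the first 5 of 30 features carry signal
rng = np.random.default_rng(0)
X = rng.standard_normal((100, 30))
w_true = np.zeros(30)
w_true[:5] = [2.0, -1.5, 1.0, 3.0, -2.5]
y = X @ w_true + 0.01 * rng.standard_normal(100)

w_hat = lasso_cd(X, y, lam=5.0)
selected = np.flatnonzero(np.abs(w_hat) > 1e-6)  # indices of retained features
```

Features whose coefficients are driven exactly to zero are discarded, which is the "embedded" feature selection the text describes.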

If there exists additional class-related information, then we can further extend the ℓ_{1}-penalized linear regression model into a more generalized ℓ_{2,1}-penalized one (Figure ), whose objective function is given as

min_{W} (1/2)‖Y − XW‖^{2}_{F} + λ_{2}‖W‖_{2,1}    (3)

where Y ∈ ℝ^{N×S} is a target response matrix, W ∈ ℝ^{D×S} is a weight coefficient matrix, and λ_{2} denotes a group sparsity control parameter. In machine learning, this framework is classified as a multi-task learning since it needs to find a set of weight coefficient vectors {w_{1}, …, w_{S}} by regressing the multiple response vectors y_{1}, …, y_{S} simultaneously^{9}.
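A simple way to solve the ℓ_{2,1}-penalized problem in Equation (3) is proximal gradient descent, where the proximal operator of the ℓ_{2,1}-norm shrinks each row of W toward zero as a group. Below is a hedged NumPy sketch on synthetic data (the experiments in this paper used the SLEP toolbox, not this code):

```python
import numpy as np

def prox_l21(W, t):
    # proximal operator of t * ||W||_{2,1}: row-wise group soft-thresholding
    norms = np.linalg.norm(W, axis=1, keepdims=True)
    return W * np.maximum(0.0, 1.0 - t / np.maximum(norms, 1e-12))

def l21_regression(X, Y, lam, n_iter=1000):
    # minimizes 0.5 * ||Y - X W||_F^2 + lam * ||W||_{2,1} by proximal gradient descent
    W = np.zeros((X.shape[1], Y.shape[1]))
    L = np.linalg.norm(X, ord=2) ** 2  # Lipschitz constant of the smooth part
    for _ in range(n_iter):
        G = X.T @ (X @ W - Y)          # gradient of the squared-error term
        W = prox_l21(W - G / L, lam / L)
    return W

# synthetic multi-task data: only the first 4 of 40 features are relevant
rng = np.random.default_rng(1)
X = rng.standard_normal((120, 40))
W_true = np.zeros((40, 3))
W_true[:4] = [[1.5, -2.0, 1.0], [2.0, 1.0, -1.5], [-1.0, 1.5, 2.0], [1.0, -1.0, 1.0]]
Y = X @ W_true + 0.01 * rng.standard_normal((120, 3))

W_hat = l21_regression(X, Y, lam=8.0)
row_norms = np.linalg.norm(W_hat, axis=1)
# features whose row satisfies ||w^i||_2 > 0 are selected, exactly as in the text
```

The group penalty zeroes out entire rows of W, so a feature is kept or discarded jointly across all S tasks.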

We illustrate the proposed framework in Figure

As stated in section 1, it is likely for neuroimaging data to have multiple peaks in distribution due to the inter-subject variability (Fotenos et al.,

To divide the training samples of each class into subclasses, we use a clustering technique. Specifically, thanks to its simplicity and computational efficiency, especially in a high-dimensional space, we apply a K-means clustering algorithm. Let {C_{k}}^{K}_{k = 1} denote a set of K clusters and {m_{k}}^{K}_{k = 1} be the centers of the clusters (represented by row vectors). Given a set of training samples, the goal of K-means clustering is to minimize the within-cluster sum of squared distances.

The main steps of K-means clustering are as follows:

(1) Initialize a set of K cluster means m^{(0)}_{1}, …, m^{(0)}_{K}.

(2) Assignment step: for each of the training samples {x^{i}}^{N}_{i = 1}, find the cluster γ^{(t)}_{i} whose mean yields the least Euclidean distance to the sample:

γ^{(t)}_{i} = argmin_{k} ‖x^{i} − m^{(t)}_{k}‖^{2}_{2}

(3) Update step: for every cluster {C_{k}}^{K}_{k = 1}, compute the new mean from the samples assigned to the cluster:

m^{(t+1)}_{k} = (1 / |C^{(t)}_{k}|) ∑_{x^{i} ∈ C^{(t)}_{k}} x^{i}

where |C^{(t)}_{k}| denotes the number of samples assigned to the cluster C_{k} at iteration t.

(4) Repeat (2) and (3) until convergence.
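The initialize/assign/update loop above can be sketched directly in NumPy; the toy data and function names below are illustrative only:

```python
import numpy as np

def kmeans(X, K, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # (1) initialize the K means with randomly chosen samples
    means = X[rng.choice(len(X), size=K, replace=False)]
    for _ in range(n_iter):
        # (2) assignment step: each sample goes to the nearest mean
        d2 = ((X[:, None, :] - means[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # (3) update step: each mean becomes the average of its assigned samples
        new_means = np.array([X[labels == k].mean(axis=0) if np.any(labels == k)
                              else means[k] for k in range(K)])
        if np.allclose(new_means, means):
            break                      # (4) stop when the means no longer move
        means = new_means
    return labels, means

# two well-separated blobs should be recovered as two clusters
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.1, (20, 2)), rng.normal(5.0, 0.1, (20, 2))])
labels, means = kmeans(X, K=2)
```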

After clustering the samples in each class independently, we divide the original classes into their respective subclasses by regarding each cluster as a subclass. We then encode the subclasses with their unique labels, for which we use "discriminative" sparse codes to enhance classification performance. Let K_{(+)} and K_{(−)} denote, respectively, the number of clusters/subclasses for the original classes of "+" and "−." Without loss of generality, we define the sparse codes for the subclasses of the original classes of "+" and "−" as follows:

c^{(+)}_{l} = [+1, e^{(+)}_{l}, 0_{K(−)}],  c^{(−)}_{m} = [−1, 0_{K(+)}, e^{(−)}_{m}]

where l ∈ {1, …, K_{(+)}}, m ∈ {1, …, K_{(−)}}, 0_{K(+)} and 0_{K(−)} denote, respectively, zero row vectors with K_{(+)} and K_{(−)} elements, and e^{(+)}_{l} ∈ {0, 1}^{K(+)} and e^{(−)}_{m} ∈ {0, −1}^{K(−)} denote, respectively, indicator row vectors in which only the l-th (respectively, m-th) element is non-zero.

For example, assume that we have three and two clusters for the "+" and "−" classes, respectively. Then the code set is defined as follows:

c^{(+)}_{1} = [+1, 1, 0, 0, 0, 0], c^{(+)}_{2} = [+1, 0, 1, 0, 0, 0], c^{(+)}_{3} = [+1, 0, 0, 1, 0, 0],
c^{(−)}_{1} = [−1, 0, 0, 0, −1, 0], c^{(−)}_{2} = [−1, 0, 0, 0, 0, −1]    (10)

It is noteworthy that in our sparse code set, we reflect the original label information to our new codes by setting the first element of the sparse codes with their original label. Furthermore, by setting the indicator vectors {^{(−)}_{m}}^{K(−)}_{m = 1} to be negative, the distances become close among the subclasses of the same original class and distant among the subclasses of the different original classes. That is, in the code set of Equation (10), the squared Euclidean distance between subclasses of the same original class is 2, but that between subclasses of different original classes is 6.

Using the newly defined sparse codes, we assign a new label vector c^{i} to a training sample x^{i} as follows:

c^{i} = c^{(y_{i})}_{γ_{i}}

where y_{i} ∈ {+, −} is the original label of the sample x^{i}, and γ_{i} denotes the cluster to which the sample x^{i} was assigned in the K-means clustering step.

Thanks to our new sparse codes, it becomes natural to convert the single-task learning in Equation (2) into the multi-task learning in Equation (3) by replacing the original label vector y with a new label matrix C = [c^{i}]^{N}_{i = 1} ∈ {−1, 0, 1}^{N × (1+K(+)+K(−))}, where K_{(+)} and K_{(−)} denote the number of clusters in the original classes of "+" and "−," respectively. Figure illustrates the resulting (1 + K_{(+)} + K_{(−)}) tasks. Note that the task of regressing the first column response vector c_{1} corresponds to our binary classification problem between the original classes of "+" and "−." Meanwhile, the tasks of regressing the remaining column vectors {c_{i}}^{1+K(+)+K(−)}_{i = 2} formulate new binary classification problems between one subclass and all the other subclasses. It should be noted that unlike the single-task learning that finds a single mapping w, the proposed method finds multiple mappings {w_{1}, …, w_{(1+K(+)+K(−))}}, and thus allows us to efficiently use the underlying multipeak data distribution in feature selection.
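As a sketch, the code set of Equation (10) and the label matrix C can be constructed as follows; the helper names are hypothetical, and the layout is [original label | "+"-subclass indicators | "−"-subclass indicators]:

```python
import numpy as np

def subclass_codes(K_pos, K_neg):
    # one sparse code per subclass: [label, e^{(+)}_l or zeros, e^{(-)}_m or zeros]
    codes = {}
    for l in range(K_pos):
        c = np.zeros(1 + K_pos + K_neg)
        c[0], c[1 + l] = 1.0, 1.0              # "+" class, l-th subclass
        codes[('+', l)] = c
    for m in range(K_neg):
        c = np.zeros(1 + K_pos + K_neg)
        c[0], c[1 + K_pos + m] = -1.0, -1.0    # "-" class, m-th subclass
        codes[('-', m)] = c
    return codes

codes = subclass_codes(3, 2)   # the example code set of Equation (10)

# squared Euclidean distances behave as stated in the text
d_same = np.sum((codes[('+', 0)] - codes[('+', 1)]) ** 2)  # same original class
d_diff = np.sum((codes[('+', 0)] - codes[('-', 0)]) ** 2)  # different classes

# the label matrix C stacks the code of each sample's (class, cluster) pair
y, gamma = ['+', '+', '-'], [0, 2, 1]
C = np.vstack([codes[(yi, gi)] for yi, gi in zip(y, gamma)])
```

The distance check reproduces the property described above: subclasses of the same original class are closer (squared distance 2) than subclasses of different classes (squared distance 6).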

Because of the ℓ_{2,1}-norm regularizer in our objective function of Equation (3), after finding the optimal solution, we have some zero row-vectors in W, and we select only the features whose corresponding row vectors satisfy ‖w^{i}‖_{2} > 0. With the selected features, we then train a linear SVM, which has been successfully used in many applications (Zhang and Shen,

We considered four binary classification problems: AD vs. NC, MCI vs. NC, AD vs. MCI, and MCI-C vs. MCI-NC. In the classifications of MCI vs. NC and AD vs. MCI, we labeled both MCI-C and MCI-NC as MCI. Due to the limited number of samples, we applied a 10-fold cross-validation technique in each binary classification problem. Specifically, we randomly partitioned the samples of each class into 10 subsets with approximately equal size without replacement. We then used 9 out of 10 subsets for training and the remaining one for testing. We reported the performances by averaging the results of 10 cross-validations.
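The fold construction described above can be sketched as a class-wise partition into 10 roughly equal subsets without replacement (an illustrative helper, not the authors' code):

```python
import numpy as np

def stratified_folds(labels, n_folds=10, seed=0):
    # partition sample indices into n_folds subsets of roughly equal size,
    # splitting each class separately so class proportions are preserved
    rng = np.random.default_rng(seed)
    folds = [[] for _ in range(n_folds)]
    for cls in np.unique(labels):
        idx = rng.permutation(np.flatnonzero(labels == cls))
        for i, s in enumerate(idx):
            folds[i % n_folds].append(int(s))
    return [sorted(f) for f in folds]

# e.g., AD (51 samples) vs. NC (52 samples), as in the dataset
labels = np.array([0] * 51 + [1] * 52)
folds = stratified_folds(labels, n_folds=10)
```

Each round then uses 9 folds for training and the held-out fold for testing, and the reported numbers average the 10 rounds.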

For model selection, i.e., the number of clusters K, the sparsity control parameters λ_{1} in Equation (2) and λ_{2} in Equation (3), and the soft margin parameter C of the SVM, we performed a nested cross-validation on the training set with the search spaces of C ∈ {2^{−10}, …, 2^{5}}, λ_{1} ∈ {0.001, 0.005, 0.01, 0.05, 0.1, 0.15, 0.2, 0.3, 0.5}, and λ_{2} ∈ {0.001, 0.005, 0.01, 0.05, 0.1, 0.15, 0.2, 0.3, 0.5}. The parameters that achieved the best classification accuracy in the inner cross-validation were finally used in testing. In our implementation, we used the SLEP toolbox^{10} to solve the sparse regression problems and a publicly available SVM library^{11} to train the classifiers.

To validate the effectiveness of the proposed Subclass-based Multi-Task Learning (SMTL) method, we compared it to the Single-Task Learning (STL) method that used only the original class label as the target response vector in Equation (2). For each set of experiments, we used 93 MRI features, 93 PET features, and/or 3 CSF features as regressors in the respective least square regression models. Regarding the multimodal neuroimaging fusion, e.g., MRI+PET (MP) and MRI+PET+CSF (MPC), we constructed a long feature vector by concatenating features of the modalities. It should be noted that the only difference between the proposed SMTL method and the competing STL method lies in the way of selecting features.

We visualized the data distributions of our dataset in Figure . We also performed statistical tests on the features of each modality and report the resulting p-values below, where (R) and (A) denote, respectively, rejection and acceptance of the null hypothesis:

| MRI | 0.0005 (R) | 0.0004 (R) | 0.6967 (A) |
| PET | 0.4273 (A) | 0.0239 (R) | 0.3150 (A) |
| CSF | 0.0049 (R) | <0.0001 (R) | <0.0001 (R) |

Let TP, TN, FP, and FN denote, respectively, True Positive, True Negative, False Positive, and False Negative. In our experiments, we considered the following five metrics:

ACCuracy (ACC) = (TP+TN) / (TP+TN+FP+FN).

SENsitivity (SEN) = TP / (TP+FN).

SPECificity (SPEC) = TN / (TN+FP).

Balanced ACcuracy (BAC) = (SEN+SPEC) / 2.

Area Under the receiver operating characteristic Curve (AUC).

The accuracy, which counts the number of correctly classified samples in a test set, is the most direct metric for comparing methods. Regarding the sensitivity and specificity, the higher the values of these metrics, the lower the chance of mis-diagnosis. Note that our dataset is highly imbalanced in terms of the number of samples available for each class, i.e., AD (51), MCI (99), and NC (52). Therefore, it is likely to have inflated performance estimates for the classifications of MCI vs. NC and AD vs. MCI. For this reason, we also consider the balanced accuracy, which accounts for the imbalance of a test set. Lastly, one of the most effective measurements for evaluating the performance of diagnostic tests in brain disease, as well as in other medical areas, is the Area Under the receiver operating characteristic Curve^{12}.
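The first four metrics follow directly from the confusion counts; a small sketch with an illustrative imbalanced example:

```python
def classification_metrics(tp, tn, fp, fn):
    acc = (tp + tn) / (tp + tn + fp + fn)  # overall accuracy
    sen = tp / (tp + fn)                   # sensitivity (true positive rate)
    spec = tn / (tn + fp)                  # specificity (true negative rate)
    bac = (sen + spec) / 2.0               # balanced accuracy
    return acc, sen, spec, bac

# imbalanced example: 99 positives vs. 52 negatives (as in MCI vs. NC);
# a positive-leaning classifier inflates ACC relative to BAC
acc, sen, spec, bac = classification_metrics(tp=90, tn=26, fp=26, fn=9)
```

In this hypothetical example the classifier reaches a higher ACC than BAC purely by favoring the majority class, which is why BAC is also reported.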

We summarized the performances of the competing methods with various modalities for AD and NC classification in Table

STL
MRI | 90.45±6.08 | 82.67 | 90.50 | 93.55 |
PET | 86.27±8.59 | 82.00 | 90.33 | 86.17 | 90.12 |
MP | 92.27±5.93 | 90.00 | 94.67 | 92.33 | 94.91 |
MPC | 94.27±6.54 | 94.33 | 94.17 | 95.74 |
SMTL
MRI | 93.27±6.33 | 88.33 | 93.33 | 94.19 |
PET | 89.27±7.43 | 90.00 | 88.33 | 89.17 | 91.67 |
MP | 95.18±6.65 | 96.33 | 96.15 |
MPC | 96.33 |

In the discrimination of MCI from NC, as reported in Table

STL
MRI | 74.85±5.92 | 80.67 | 64.00 | 72.33 | 76.55 |
PET | 69.51±10.11 | 74.78 | 59.67 | 67.22 | 73.54 |
MP | 74.85±3.91 | 84.78 | 56.00 | 70.39 | 78.79 |
MPC | 76.82±7.15 | 85.89 | 59.33 | 72.61 | 79.25 |
SMTL
MRI | 76.82±7.15 | 85.78 | 59.67 | 72.72 | 77.84 |
PET | 74.18±7.18 | 81.89 | 59.67 | 70.78 | 72.73 |
MP | 79.52±5.39 | 62.00 | 75.44 | 77.91 |
MPC | 86.78 |

From a clinical point of view, establishing the boundaries between preclinical AD and mild AD, i.e., MCI, has practical and economical implications. To this end, we also performed experiments on AD vs. MCI classification and summarized the results in Table

STL
MRI | 62.68±7.01 | 4.00 | 93.00 | 48.50 | 59.16 |
PET | 72.02±6.73 | 31.33 | 93.00 | 62.17 | 69.50 |
MP | 69.26±8.66 | 51.00 | 78.56 | 64.78 | 71.40 |
MPC | 68.40±14.48 | 41.33 | 82.44 | 61.89 | 70.19 |
SMTL
MRI | 70.60±5.97 | 39.00 | 86.67 | 62.83 | 66.90 |
PET | 73.31±3.25 | 33.00 | 63.50 | 67.78 |
MP | 89.00 |
MPC | 72.60±9.88 | 37.33 | 64.17 | 71.74 |

Lastly, we conducted experiments of MCI-C and MCI-NC classification, and compared the results in Table

STL
MRI | 56.98±20.61 | 51.00 | 60.67 | 55.83 | 58.85 |
PET | 61.58±17.79 | 55.00 | 66.00 | 60.50 | 60.63 |
MP | 64.62±14.04 | 62.50 | 66.00 | 64.25 | 63.87 |
MPC | 62.89±12.29 | 58.50 | 66.00 | 62.25 | 58.31 |
SMTL
MRI | 61.60±13.12 | 44.00 | 75.67 | 59.83 | 60.76 |
PET | 66.73±11.32 | 39.00 | 63.50 | 65.57 |
MP | 58.00 | 82.67 |
MPC | 70.11±14.21 | 78.67 | 68.83 | 67.36 |

In order to further verify the superiority of the proposed SMTL method compared to the STL method, we also performed a statistical significance test, by means of a paired t-test, to assess whether the differences in classification ACCs between the methods are statistically significant on our dataset.

In the classifications of AD vs. MCI and MCI-C vs. MCI-NC, the proposed SMTL method achieved the best performances with MP rather than with MPC. That is, although we used richer information with MPC, i.e., additional CSF features, the performances with MPC were lower than those with MP in these classification problems. Based on the results, fusing the CSF features with the other modalities turned out to be a confounding factor in the classifications of AD vs. MCI and MCI-C vs. MCI-NC. Furthermore, in the experiments above, the selected features were fed into an SVM classifier, and at this stage the features of different modalities had equal weights in the decision, which can be a potential problem degrading the performances. To this end, we additionally performed experiments by replacing the Single-Kernel linear SVM (SK-SVM) with a Multi-Kernel linear SVM (MK-SVM) (Gönen and Alpaydin,
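In an MK-SVM, each modality contributes its own kernel and the decision uses a convex combination ∑_{v} β_{v}K_{v} with modality weights β_{v} chosen on the training set. For linear kernels this is equivalent to concatenating the features scaled by √β_{v}, as the sketch below illustrates (the dimensions and weights are illustrative):

```python
import numpy as np

def combined_linear_kernel(modality_feats, betas):
    # convex combination of per-modality linear kernels: sum_v beta_v * X_v X_v^T
    assert abs(sum(betas) - 1.0) < 1e-9 and all(b >= 0 for b in betas)
    return sum(b * (F @ F.T) for b, F in zip(betas, modality_feats))

# e.g., 93 MRI features, 93 PET features, 3 CSF features for 10 subjects
rng = np.random.default_rng(2)
betas = (0.5, 0.3, 0.2)
feats = [rng.standard_normal((10, d)) for d in (93, 93, 3)]
K = combined_linear_kernel(feats, betas)

# equivalent view: concatenate features scaled by sqrt(beta_v)
X_cat = np.hstack([np.sqrt(b) * F for b, F in zip(betas, feats)])
```

The combined kernel K remains symmetric positive semi-definite and can be passed to any kernel SVM solver as a precomputed kernel.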

In Table , we also compared the performance of the proposed method with those of state-of-the-art methods in the literature that used multimodal data.

| Method | Subjects (AD/MCI/NC) | Modalities | AD vs. NC (ACC, %) | MCI vs. NC (ACC, %) |
| Kohannim et al., | 40/83/43 | MRI+PET+CSF | 90.7 | 75.8 |
| Hinrichs et al., | 48/119/66 | MRI+PET | 92.4 | n/a |
| Zhang et al., | 51/99/52 | MRI+PET+CSF | 93.2 | 76.4 |
| Westman et al., | 96/162/111 | MRI+CSF | 91.8 | 77.6 |
| Liu et al., | 51/99/52 | MRI+PET | 94.37 | 78.80 |
| Proposed method | 51/99/52 | MRI+PET+CSF | 96.18 | 81.45 |

Regarding the interpretation of the selected ROIs: due to the involvement of cross-validation, multimodal neuroimaging fusion, and multiple binary classifications in our experiments, it was not straightforward to analyze the selected ROIs. In this work, we first built a histogram of the frequency of the selected ROIs of MRI and PET over the cross-validation folds per binary classification, and normalized it, considering only the ROIs whose frequency was larger than the mean frequency and setting the frequency of the disregarded ROIs to zero (Figure ).

In this paper, we proposed a novel method that formulates feature selection as a subclass-based multi-task learning. Specifically, to take into account the underlying multipeak data distribution of the original classes, we applied a clustering method to partition each class into multiple clusters, which were further considered as subclasses. Here, we can think of one cluster, i.e., subclass, as representing one peak in the distribution. The respective subclasses were encoded with unique codes, designed so that subclasses of the same original class lie close to each other and those of different original classes lie distant from each other. We assigned the newly defined codes to our training samples as new label vectors and applied an ℓ_{2,1}-norm regularizer in a linear regression framework, thus formulating a multi-task learning problem. We finally selected features based on the optimal weight coefficients. It is noteworthy that unlike the previous methods of PCA, LDA, and other embedded methods for dimensionality reduction, the proposed method considers multiple mapping functions to reflect the underlying multipeak data distributions, and thus enhances performance in AD/MCI diagnosis. In our experimental results on the publicly available ADNI dataset, we validated the proposed method by outperforming the competing methods in the four binary classifications of AD vs. NC, MCI vs. NC, AD vs. MCI, and MCI-C vs. MCI-NC.

In the context of the practical application of the proposed method, one should consider how to determine the optimal number of clusters, i.e., K_{(+)} and K_{(−)}, which we determined via cross-validation in this work.

The Reviewer Dr. Heng Huang declares that, despite having collaborated with the authors, the review process was handled objectively and no conflict of interest exists. The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

This work was supported in part by NIH grants EB006733, EB008374, EB009634, AG041721, MH100217, and AG042599, and also by the National Research Foundation grant (No. 2012-005741) funded by the Korean government.

^{1}Even though the term of “multimodal distribution” is generally used in the literature, in order to avoid the confusion with the “multimodal” neuroimaging, we use the term of “multipeak distribution” throughout the paper.

^{2}Available online at “

^{3}Although there exist in total more than 800 subjects in ADNI database, only 202 subjects have the baseline data including all the modalities of MRI, FDG-PET, and CSF.

^{4}Refer to “

^{5}Available online at “

^{6}Available online at “

^{7}While the most intuitive feature should be the voxel in MRI and FDG-PET, due to their extremely high dimensionality, in this paper, we take a ROI-based approach and consider the GM tissue volumes and the mean intensity for each ROI from MRI and FDG-PET, respectively, as the features. Furthermore, by using the ROI-based features for our classification, the performances can be less affected by the partial volume effect in PET imaging (Aston et al.,

^{8}In this work, we have one sample per subject and consider a binary classification.

^{9}Regressing each response vector is considered as one task.

^{10}Available online at “

^{11}Available online at “

^{12}The receiver operating characteristic curve is defined as a plot of test true positive rate vs. its false positive rate.