Multi-modal, Multi-measure, and Multi-class Discrimination of ADHD with Hierarchical Feature Extraction and Extreme Learning Machine Using Structural and Functional Brain MRI

Structural and functional MRI unveil many hidden properties of the human brain. We performed this multi-class classification study on selected subjects from the publically available attention deficit hyperactivity disorder ADHD-200 dataset of patients and healthy children. The dataset has three groups, namely, ADHD inattentive, ADHD combined, and typically developing. We calculated the global averaged functional connectivity maps across the whole cortex to extract anatomical atlas parcellation based features from the resting-state fMRI (rs-fMRI) data and cortical parcellation based features from the structural MRI (sMRI) data. In addition, the preprocessed image volumes from both of these modalities followed an ANOVA analysis separately using all the voxels. This study utilized the average measure from the most significant regions acquired from ANOVA as features for classification in addition to the multi-modal and multi-measure features of structural and functional MRI data. We extracted most discriminative features by hierarchical sparse feature elimination and selection algorithm. These features include cortical thickness, image intensity, volume, cortical thickness standard deviation, surface area, and ANOVA based features respectively. An extreme learning machine performed both the binary and multi-class classifications in comparison with support vector machines. This article reports prediction accuracy of both unimodal and multi-modal features from test data. We achieved 76.190% (p < 0.0001) classification accuracy in multi-class settings as well as 92.857% (p < 0.0001) classification accuracy in binary settings. In addition, we found ANOVA-based significant regions of the brain that also play a vital role in the classification of ADHD. Thus, from a clinical perspective, this multi-modal group analysis approach with multi-measure features may improve the accuracy of the ADHD differential diagnosis.


INTRODUCTION
The neurodevelopmental disease of attention deficit hyperactivity disorder (ADHD) is among the major health problems both in developing and developed countries of the world. There are no standard biological tests available to diagnose ADHD except behavioral symptoms investigated with psychiatric measures (Arbabshirani et al., 2017). The high (3.4%) prevalence of ADHD among children and adolescents (Polanczyk et al., 2015) makes the automated diagnosis very important. The burden of ADHD on patients, their family, and the societies where they belong is very significant. Typically, the disorder begins to affect the patients from an early age to the entire life span, and without appropriate treatments, the illness leads to poor prognosis. Thus, the early and precise diagnosis of ADHD is very important. Children affected by this disorder have characteristic symptoms such as attention deficit, hyperactivity, and impulsiveness. Currently, it is believed that this characteristic manifestation of symptoms originates from the dysfunction of related cognitive processes (Diamond, 2005). In addition, the underlying mechanisms of ADHD seem to be associated with delayed cortical development (Shaw et al., 2007). The root cause of ADHD is still unknown (Arbabshirani et al., 2017). On the other hand, according to the Diagnostic and Statistical Manual of Mental Disorders, Fifth edition (DSM-5), there are three ADHD subtypes, based on the predominant symptoms: (1) predominantly inattentive presentation, (2) predominantly hyperactive-impulsive presentation, and (3) combined presentation (Association AP, 2013). Over recent years, a rapidly growing number of studies have been published that aim at complementing and improving clinical decision making on the basis of biomarkers derived from different types of data, such as magnetic resonance imaging (MRI) and genomic sequencing. Pattern recognition techniques have shown promising results to detect biomarkers from neuroimaging data. These techniques hold the potential to combine complementary information across different sources in an efficient way (Wolfers et al., 2015). In addition, many previous machine learning based studies investigated these subtypes of ADHD (Solanto, 2000;Castellanos and Tannock, 2002;Colby et al., 2012;Dai et al., 2012;Fair et al., 2012;Igual et al., 2012;Willcutt et al., 2012;Lim et al., 2013;Peng et al., 2013;Johnston et al., 2014;Deshpande et al., 2015;Hammer et al., 2015;Iannaccone et al., 2015;Xiao et al., 2016). However, the outcomes from a small number of studies that attempted to distinguish the features of these subtypes with neuroimaging methods were inconclusive (Pineda et al., 2002;Miller et al., 2006). In addition, very few studies including  were conducted on ADHD differential diagnosis (Arbabshirani et al., 2017).
These s/fMRI studies investigated the altered brain activation patterns in ADHD patients and healthy controls. However, the findings across different studies were inconsistent, and the different neural mechanisms between adults and children with ADHD remain unclear (Lei et al., 2015). In fact, the clinical experience of the psychiatrist, a detailed history taking of the patient, and other information resulting from interviews remain important. Pattern recognition techniques deployed in MRIbased neuroimaging studies dates back approximately a decade with the goal of classifying and thereby separating psychiatric patients from controls. However, despite many subsequent efforts, those promising results are not ready for clinical trials beyond research settings yet (Wolfers et al., 2015). However, in the past decade, not only in the psychiatry but also in other medical fields, interest in machine learning increased rapidly. There have been many recent studies using neuroimaging methods in the psychiatric research field (Klöppel et al., 2012;Castro et al., 2014;Schnack et al., 2014). Consequently, with the familiarity and understanding of neuroimaging methods, a pool of candidate features such as cortical thickness, volumetric data, functional connectivity measures, white matter volume, gray matter density, demographic information, and other fMRI data-based features was utilized for machine learning-based classification. Those studies serve as a bridge for the integration of machine learning and neuroscience. The comparatively higher resolution (usually one cubic millimeter or less) of structural MRI data (Arbabshirani et al., 2017) makes it a better choice for the use in classification experiments either as a standalone or as a co-modality in multi-modal experimental settings. The most abundantly used and popular machine learning tool among the neuroimaging community is support vector machine (SVM) (Tenev et al., 2014;Arbabshirani et al., 2017). In this comparative study, we compared an extreme learning machine (ELM) based classification framework with SVMs.
In this study, we focused on the machine learningbased differential diagnosis of the subtypes of ADHD. We utilized an ELM classifier to distinguish ADHD-combined-type (ADHDC) and ADHD-inattentive-type (ADHDI) patients from normally/typically developing children (TDC). In addition, to achieve better classification accuracy, we proposed a modified recursive feature elimination RFE algorithm . This algorithm treats cortical and sub-cortical measurement data as an input. The cortical measures include mean cortical thickness, standard deviation of the cortical thickness, surface area, curvature and volume of predefined segments, or anatomical regions (Wee et al., 2013(Wee et al., , 2014Gori et al., 2015). In addition, we also used the sub-cortical measures of structural MRI data including image intensity, white matter volume, and mean volume of the sub-cortical regions. These subcortical measures were not used in our previous study . Global functional connectivity measures from atlas-based cortical regions served as features of the functional MRI data. Moreover, we performed the ANOVA test on both the structural and the functional MRI data and the corresponding significant regional measures served as ANOVAbased features for classification. These ANOVA-based features were acquired in a different manner from the atlas ROI-based features, therefore, we were interested to observe their effect on the classification accuracy. A fully cross-validated, ROI-based data-driven analysis was used for classification.
Several previous studies identified certain numbers of anatomical regions to be affected in ADHD. However, there were very few studies (Arbabshirani et al., 2017) that investigated the broadly different anatomical regions that could play a vital role in the classification of ADHD sub-groups.

Dataset
We obtained neuroimaging data of ADHD patients from the ADHD-200 MRI dataset (Biswal et al., 2010), which is publicly available at the 1,000 Functional Connectomes Project website http://fcon_1000.projects.nitrc.org/indi/adhd200. These T1-weighted structural and resting state functional MRI scans were acquired at six different institutes. The ages of the participants ranged from 7 to 14 years. The sites were Brown University (BU), New York University Child Study Center (NYU), Beijing Normal University (BNU), Kennedy Krieger Institute (KKI), Oregon Health, and Science University (OHSU), and Washington University in St. Louis (WU). All participants have been scanned under 3.0-Tesla scanners. The other technical details about the scanner parameters from each participating site are available at the above-mentioned URL of the ADHD-200 global competition. In addition, all the sites contributing to ADHD-200 had been approved by their local institutional review board (IRB) and complied with local IRB protocols .

Subjects
There were approximately 1,000 subjects in the ADHD-200 dataset. However, we followed a balanced design approach and chose 53 subjects exactly as described in  for training from each of the three groups: typically developing children (TDC), ADHD inattentive type (ADHDI), and ADHD combined type (ADHDC). All the training subjects were selected from the training part of the ADHD-200 dataset that consists of 776 subjects. Each group contained nine female and 44 male subjects. There may have existed some intrinsic bias in the data regarding batch effects as well as hardware bias due to multisite data collection. However, we retained it as default for the criteria of subjects selection for this study similarly to . More details regarding the demographic features of the dataset can be acquired from the same resources listed above. Testing data were separately gathered with matched gender, age, and IQ information from ADHD-200 testing dataset, which corresponded to a pool of 197 subjects. Table 1 summarizes the demographic characteristics of our selected training and testing subjects.

Preprocessing of Structural MRI Data
Cortical reconstruction and volumetric segmentation were performed with the FreeSurfer v5.3.0 image analysis suite, which is documented and freely available for download online at http://surfer.nmr.mgh.harvard.edu/. The technical details of these procedures have been described elsewhere (Dale and Sereno, 1993;Dale et al., 1999;Fischl et al., 1999aFischl et al., ,b, 2001Fischl et al., , 2002Fischl et al., , 2004aFischl and Dale, 2000;Ségonne et al., 2004;Han et al., 2006;Jovicich et al., 2006;Reuter et al., 2010Reuter et al., , 2012. Briefly, the preprocessing procedure includes motion correction of volumetric T1-weighted images, removal of non-brain tissue using a hybrid watershed/surface deformation procedure (Ségonne et al., 2004), and automated Talairach transformation. Moreover, the procedure includes the segmentation of the subcortical white matter and deep gray matter volumetric structures including hippocampus, amygdala, caudate, putamen, ventricles (Fischl et al., 2002(Fischl et al., , 2004a, intensity normalization (Sled et al., 1998), tessellation of the gray matter-white matter boundary, and automated topology correction (Fischl et al., 2001;Ségonne et al., 2007). In addition, surface deformation following intensity gradients was performed to optimally place the gray/white and gray/cerebrospinal fluid borders at the location where the greatest shift in intensity defines the transition to the other tissue class (Dale and Sereno, 1993;Dale et al., 1999;Fischl and Dale, 2000). Once the cortical models are complete, a number of deformable procedures can be performed for further data processing and analysis. These procedures include surface inflation (Fischl et al., 1999a), registration to a spherical atlas that is based on individual cortical folding patterns to match the cortical geometry across subjects (Fischl et al., 1999b). Moreover, parcellation of the cerebral cortex into units with respect to gyral and sulcal structure (Fischl et al., 2004b;Desikan et al., 2006) and creation of a variety of surface based data including maps of curvature and sulcal depth are part of the procedure. This method uses both intensity and continuity information from the entire three-dimensional MR volume in segmentation and deformation procedures to produce representations of cortical thickness, calculated as the closest distance from the gray/white boundary to the gray/CSF boundary at each vertex on the tessellated surface (Fischl and Dale, 2000). The maps are created using spatial intensity gradients across tissue classes rather than simply relying on absolute signal intensity. The maps produced are not restricted to the voxel resolution of the original data, and are thus capable of detecting submillimeter differences between groups. Procedures for the measurement of cortical thickness have been validated against histological analysis (Rosas et al., 2002) and manual measurements (Kuperberg et al., 2003;Salat et al., 2004). FreeSurfer morphometric procedures have been demonstrated to show good test-retest reliability across scanner manufacturers and across field strengths (Han et al., 2006;Reuter et al., 2012). A cortical surface-based Desikan-Killiany-Tourville (DKT) atlas Klein and Tourville, 2012) was mapped to a sphere aligning the cortical folding patterns, which provided accurate matching of the morphologically homologous cortical locations across subjects. For each of the DKT31 protocol-based segments, FreeSurfer calculated the nine different measures, including number of vertices, surface area, gray matter volume, average cortical thickness, cortical thickness standard deviation, cortical mean curvature, cortical Gaussian curvature, cortical folding index, and cortical curvature indices (Colby et al., 2012). For the subcortical regions, FreeSurfer calculated the area and volume of the whole segment, white matter volume, intensity and overall volume of the whole brain divisions including cerebrospinal fluid (CSF), intracranial volume (ICV), gray matter (GM), and white matter (WM). Two of the selected measures are the most common features in the structural studies (Arbabshirani et al., 2017). The surface area was calculated by computing the area of every triangle in a standardized spherical surface tessellation. The local curvature was computed using the registration surface based on the folding patterns .

Cortical and Subcortical Features
We used nine of the above-mentioned measures including cortical thickness, surface area, mean curvature, volume, and cortical thickness standard deviation. Measures used as the structural ROI features included whole segment volume, white matter volume and intensity from subcortical regions, and overall brain volumes. In addition, After the preprocessing, FreeSurfer's QA Tools were used for the detection and removal of the outliers and negative features.

Preprocessing of Resting State Functional MRI Data
Preprocessing of functional MRI data was based on Analysis of Functional Neuroimages AFNI software; http://afni.nimh.nih. gov/afni/ (Cox, 1996). Every single echo planar image (EPI) volume was co-registered to the corresponding anatomical image of the subject and mapped to Talairach coordinates space with the TT_N27+tlrc template. We excluded the first six images from each EPI volume to achieve the MR steady state. In addition, slice-timing correction was performed. We censored and cut out time points based on their number of outliers and head motion magnitude. Same number of slices were cut out for all subjects. Slice alignment was applied by using the local Pearson's correlation (LPC) cost function. The correction of head motion along with averaging the EPI volumes was performed to obtain a mean functional image. The dataset was already reoriented to right posterior inferior (RPI) in the ADHD-200 dataset repository. Each EPI volume underwent the linear multiple regression to regress the motion derivatives and effects of the white matter and cerebrospinal fluids. Spatial smoothing was performed by using a Gaussian kernel with a blur size of 6-mm full width at half maximum (FWHM). A polynomial detrending was applied. AFNI program 3dmaskave was used to calculate the beta values for the correlation analysis. A run of 10,000 Monte Carlo simulations was conducted with AlphaSim program. The cluster size of 10 voxels was determined at a familywise error rate corrected with p < 0.000001 to avoid the problem of multiple comparisons. As a result, the clusters consisting of lower than 10 voxels were discarded from the analysis.

Global Connectivity Features
Global connectivity measure was used to calculate the average brain-wise correlation coefficients (GCOR) of all the possible combinations of voxel time series. The GCOR estimation of the cortical regions is a computationally expensive process. It involves the calculation of M (M-1)/2 correlation estimates for an M voxels volume (Saad et al., 2013). AFNI simplifies this problem by taking de-meaned time series of each voxel and scaling it by its Euclidean norm. In addition, it averages the scaled time series over the whole brain mask and finally the length (l 2norm) of this averaged series represents the GCOR. ADHD is considered as an age-related neurodevelopmental disorder and the symptoms might reflect age-related differences in the cortical and subcortical maturation that characterize ADHD (Xiao et al., 2016). Therefore, we first made a gray matter mask of 5-mm thickness created by using AFNI program 3dSeg and 3dcalc to acquire the functional connectivity maps of only the cortical regions of the brain. The clean EPI data were resampled with the gray matter mask to obtain the data only within the cortical regions. We used AFNI program 3dTcorrMap to get the global functional connectivity maps. This involved the calculation of correlation (Pearson's r) of the residual time series in each voxel with every other voxel in their brain mask and recording the mean correlation back in the voxel. These connectedness values were further transformed with Fisher's z-transformation to yield normally distributed measures (Gotts et al., 2012). We used DKD_ Desai_PM Atlas (Desikan et al., 2006) of AFNI package to acquire the ROI measures from the 102 cortical regions across the whole brain. Figure 1 depicts a typical global functional connectivity map for a single ADHD patient. A summary of all the atlas-based structural and functional ROI features is presented in Table 2.

Statistical Analysis
After preprocessing, the structural data were converted into surface maps for ANOVA analysis. Heat kernel smoothing was applied on the dataset after conversion. An AFNI program 3dANOVA2 (Cox, 1996) was used for this analysis. First, to examine the differences among the three groups, the multiple comparisons test was performed using the AFNI program 3drefit with the FDR correction on both the structural and the functional data. In addition, for the structural data, the AFNI program SurfClust was used to find regions that have significant differences between groups with an area of more than 50 mm 2 . The p-value threshold was set to be 0.001 . This was followed by a post-hoc t-test to evaluate the actual differences among the three groups. For the functional data, we used AFNI program 3dclust to generate the mask of significant ROIs of ANOVA analysis. In addition, 3dROIstats was used with this mask to acquire the average connectivity value from each subject. Since these data were very significant, we set the p-value at a very low level of 0.000001. Finally, we extracted 12 significant FIGURE 1 | A typical global functional connectivity map for a single ADHD patient. The left column shows the axial view; middle column shows the sagittal while the rightmost column shows the coronal view. The color temperature in the connectivity maps represents the strength of the connectivity measure between different resting-state networks in the cortical region. ROI, Region of interest; ANOVA, analysis of variance; SD, standard deviation. The atlas-based ROI feature count for structural data was obtained by following the same procedure as used in . The functional data feature count was based on the number of ROI obtained by DKD_Desai_PM atlas (Desikan et al., 2006) from AFNI software package. Table S1. (Supporting Information) describes the features name and count in detail.
regions from the structural data and nine significant regions from the functional data that served as the ANOVA-based features for classification. It is worth noting that the ANOVA analysis was performed only on the training dataset to avoid the potential bias (Arbabshirani et al., 2017) of the overall accuracy of the classifier.

ANOVA ROI Based Features
The mean measures of the significant regions acquired from the ANOVA analysis was used as the features for the classification.

Pearson Correlation Coefficient of ANOVA-Based Significant Features
The Pearson correlation coefficients were calculated between structural and functional ANOVA-based significant ROIs and age separately. For the structural data, we extracted the average cortical thickness of each ROI for each individual subject using the AFNI program 3dmaskave. The p-value of the correlation coefficients was calculated based on our sample size (Rosenhead, 1963). For the functional data, we calculated the average beta value of each ANOVA-based ROI of each individual subject using AFNI program 3dmaskave.

Hierarchical Feature Selection and Ranking
In neuroimaging, pattern recognition is sometimes referred to as multi-voxel pattern analysis (MVPA), as voxels are often used directly as features. The features can be categorized as voxel, region, or network-derived set of values. In a voxelbased feature set, features are extracted on the voxel level. In a region-based feature set, features are derived by parcellation of brain images from predefined regions based on anatomical or functional brain atlas. In a network-based feature set, features are extracted by combining voxels across networks, like the functional connectivity based features (Wolfers et al., 2015). We used the last two approaches (region and network driven features) in this study in addition to an uncommon approach of statistical measurement (ANOVA) based feature extraction. The region-based feature selection is the most common approach used in the fMRI studies (Arbabshirani et al., 2017). In this study, the multi-measure features were acquired from the significant regions obtained by ANOVA analysis and atlasbased ROIs separately for both structural and functional MRI datasets. Finally, we gathered a big pool of features as shown in Table 2. Therefore, optimal feature selection was necessary to get a high classification accuracy.

Feature Optimization
Optimal feature selection provides a subset of features that leads toward the optimal classification accuracy . We used a hierarchical approach to optimize the features for the best classification accuracy. Both the filter-and wrapper-based feature selection models were used in this study. Specifically, this study utilizes a combination of two feature selection methods, namely the revised recursive feature elimination (rRFE) proposed by  and the least absolute shrinkage and selection operator (LASSO).

LASSO
A feature selection method that utilizes the objective optimization can be conveniently formulated as convex optimization problems with global optimal solutions. A good example of an objective function with an error term and a regularization term is the least absolute shrinkage and selection operator (LASSO) based regularized regression for features selection. We used Matlab implementation of lasso regularization with the regularization coefficient λ in this study. The value of λ was optimized by grid search method. For a given value of λ which is a non-negative parameter, lasso solves the problem: Where n is the number of observations. u i is the response at observation i. x i is input data, a vector of q values at observation i. λ is a non-negative regularization parameter. The parameters γ o and γ are scalar and q -vector respectively. As λ increases, the number of nonzero components of γ decreases. The lasso problem involves the l 1 norm of γ (Tibshirani, 1996;Hastie et al., 2009;Friedman et al., 2010). After selection based on the cross-validated LASSO, selected features were finally fed to the rRFE algorithm for final ranking.

Revised Recursive Feature Elimination (rRFE)
Recursive features elimination based on linear SVM (RFE-SVM) is a well-known wrapper model for feature optimization. A wrapper model usually consists of two steps, (1) feature subset selection using the accuracy of the base classifier, and (2) learning and testing with the best feature subset. The wrapper approach utilizes the prediction performance of a base classifier. It performs selection by using the classifier as a black box and rank the subset of features by their predictive power (Yusta, 2009). The preselected features further fed into a wrapper model rRFE to generate the feature-ranking list according to the order of significance. In this study, we used an rRFE algorithm , a recently proposed modified version of the original RFE-SVM. A more detailed justification for choosing this method for feature selection can be found in .

Classification
Two classifiers were used in this study, namely, linear extreme learning machine (ELM) and support vector machine (SVM) with linear and radial basis function (RBF) kernels. The classifiers were used in both binary and multi-class settings. In the case of multi-class, we used the one versus all (OVA) approach. In addition, we also mentioned the ELM-based classification results without applying the feature selection to validate the significance of the proposed feature selection framework. A brief description of both classifiers used in this study is as following.

Extreme Learning Machine Classifier
Extreme learning machine originally proposed by Huang et al. (2006) has been adopted in many previous neuroimaging studies (Termenon et al., 2013;Zhang et al., 2015) in the binary and multi-class settings (Huang et al., 2012). However, to the best of our knowledge, the present study is the first in this domain that utilized sigmoid activation function based ELM for the multi-class, multi-measure, and multi-modal classification of neuroimaging data. ELM randomly assigns the weights and bias to the input data to compute the output weight matrix. This random assignment of weights makes the ELM algorithm very fast as compared to SVMs. A more detailed discussion of the classifier can be found elsewhere (Huang et al., 2006;. The ELM classifier requires the "number of nodes" as the only hyper-parameter that needs to be tuned for achieving the maximum performance in terms of accuracy. In this study, we used the Matlab implementation of the ELM. A greedy search method was used to tune this parameter for achieving the maximum test accuracy. In this study, the search scale for selecting this parameter was set to N = [1, 2, ..., 200]. All the features were normalized and scaled to the values between -1 and 1 for the better performance of ELM classifier .

SVM Classifier
SVM, which was originally proposed by Cortes and Vapnik (1995), considered as one of the most popular machine-learning tools in the neuroscience domain in the last decade. It is a supervised classification algorithm. We used this algorithm in both binary and multi-class (more commonly known as one-versus-all or OVA) settings. It maps features in higher dimensional space using linear and nonlinear functions known as kernels. In this study we used both linear and nonlinear radial basis function (RBF) kernels . The only parameter that needs to be tuned in SVM is the value of regularization parameter "σ " while utilizing the radial basis function (RBF) kernel. We used a greedy search method to tune this parameter for achieving the maximum test accuracy. In this study, the search scale for selecting this parameter was set to σ = [0.1, 0.2, ..., 2.0].

Performance Evaluation and Significance Testing Methods
The classifier performance was measured in terms of classification accuracy and Kappa score for multi-class classification. Classification accuracy was determined as Where TP is true positive, TN is true negative, FP is false positive and FN is false negative. These values were obtained by computing the confusion matrix . Kappa score is considered as a robust statistical measure for the assessment of multi-class classifiers. It is defined by Cohen (1960)  Where Pr (a) is the relative observer agreement between raters and Pr(e) is hypothetical probability of chance agreement. The value of k ranges between +1 and −1, where k = 1 represents perfect classification, k = 0 represents chance level, and k = −1 represents completely erroneous classification . A Permutation test was used to assess the statistical significance of the ELM classifier performance (Golland and Fischl, 2003). Briefly, it works as follows. First, we choose the actual test accuracy as the test statistic of the classifier, the class labels for testing dataset permuted randomly and given to the classifier and check for the cross validation. Generally, the lower p-value of the permuted prediction rate against the prediction rate with the original data labels indicates the higher significance of the classifier performance. There is no fixed rule for setting the number of permutations. We have permuted the data 10,000 times in the current study . Figure 2 depicts the overview of the classification framework of this study.

RESULTS
We used ANOVA analysis to extract the features for classification in this study along with the ROI based features as proposed in a recent study . These ANOVA-based features were gathered separately from both structural and functional MRI datasets. Classification accuracy was measured for both unimodal and multi-modal features. The following explanation of the ANOVA results of each modality describes those ANOVAbased features briefly.

Structural ANOVA
ANOVA analysis was applied to the cortical thickness measures acquired by FreeSurfer and further smoothed with Heat kernel by AFNI preprocessing pipelines. We used only cortical thickness measures for the ANOVA analysis as it is the most important structural imaging biomarker (Shaw et al., 2007;Narr et al., 2009). A basic description of this results was presented briefly in a recent previous work . In the present study, we are using the ANOVA results for extracting the features for classification. The ANOVA tests showed (see Figure 3) 10 areas of significant thickness change in the left hemisphere and two areas in the right hemisphere (corrected at p < 0.001, cluster size 50 mm 2 ). AlphaSim was applied on the result for correction. Table 3 summarizes our findings.

Functional ANOVA
The ANOVA tests of resting state fMRI data showed nine areas of significant change (corrected at p < 0.000001, cluster size 11 mm 2 ) (see Figure 4) from four significant functional networks of the human brain including the ventral attention network, sensorimotor network, visual network, and cerebellar network. Table 4 summarizes the details of the significant regions. Our classification results confirm that these regions can be used as potential biomarkers/features for the multi-class classification of ADHD s/fMRI dataset. The ANOVA-based features gave high classification accuracy both as standalone modality and in combination with ROI-based features.

Multi-class Classification
Our results suggest that the optimized features are robust and yield higher classification accuracies both in binary and multiclass classification settings. We achieved a maximum of 76.190% accuracy (p < 0.0001) on the test data by using the proposed hierarchical feature extraction framework with (ELM) in multiclass settings. Table 5 summarizes the overall multi-class (one vs. all settings) accuracies, kappa score, and p-values of structural, functional, multi-modal, and multi-measure data.

Binary Classification Results
Binary classification results are presented in the current study to demonstrate the fact that, ideally, multi-class classification results show equaling and generally lower accuracy than results from the binary classification of the same dataset .
The major result of the current study is the highest prediction accuracy of the ADHD dataset using all the measures and modalities together as features; therefore, we performed the binary classification on the combined measures only. In binary classification settings, we achieved a maximum of 92.857% (p < 0.0001) prediction accuracy. It is interesting to note that the proposed feature selection framework selected only structural features from multi-modal input to acquire this accuracy. Most significant feature measures were mean cortical thickness, volume, area, and standard deviation of cortical thickness respectively. Even though the functional features were not selected in the final selection but they assist the selection algorithm to extract the most discriminative structural features (both ROI and ANOVA-based) from the pool. When we used only the structural features for binary classification we could not reach the same accuracy as we did while using the multimodal features. In addition, we were not able to achieve this high accuracy in the previous work  because all of the cortical feature measure types utilized in the present study were not used in it. We also learned from the previous study ) that many of the feature measures do not have any predictive power including folding and curvature indices, therefore we did not include those measures in this study. Briefly, we achieved the following binary accuracies with each individual modality. With only structural ANOVA features 89.29%; with only functional ANOVA features 82.14%; with only functional ROIs 85.17%, and with only structural ROIs 85.30%. Table 6 summarizes the classification scores of all the disease groups along-with sensitivity, specificity, F1-score, recall, and precision measures for all the multi-modal and multi-measure features.
It is important to note that two of the three ELM classifierbased binary classification scores (ADHDC-TDC and ADHDI-TDC) without feature selection overlap exactly with the accuracy of the linear SVM with the most discriminative features. In other words, we can say that the lowest accuracy of the ELM classifier overlaps with the highest accuracy of the SVM classifier.

Correlation Coefficient: Structural Data
After applying the Bonferroni correction method with p < 0.000463, we found two regions that were negatively correlated with age in TDC group (Table 7). These regions were left middle temporal gyrus and left superior frontal gyrus. On the other hand, some individual IQ and symptoms severity score data were not included in the original database. Therefore, we only use the subjects with available IQ and symptoms severity score information to perform the calculation of the correlation coefficients. Those subjects with missing IQ information included nine from the TDC group and 13 from the ADHDC group. The subjects with missing symptoms severity score were included 29 from ADHDI group, 11 from ADHDC group and 36 from TDC group. In contrast, there was no missing age information.

Correlation Coefficient: Functional Data
For the functional data, the beta value for each of the ANOVAbased significant cluster was calculated by using AFNI program 3dmaskave. The Pearson correlation coefficients were calculated between the beta values and age, IQ, and symptoms severity scores. 4 | Nine regions with significant differences, based on functional ANOVA (corrected at P < 0.000001, and cluster size 11 mm 2 ).

Atlas regions
Cluster size (mm 2 ) coordinates After applying the Bonferroni correction method with p < 0.000617, no significant region was found.

Significant Features
Highest subgroup classification accuracies were obtained by using the significant features as shown in Table S2 (Supplementary Material).

DISCUSSION
This study reports the multi-modal, multi-measure, and multiclass classification results on the ADHD-200 dataset. We calculated and used the global connectivity maps from the functional part of the dataset as classification features. Atlasbased anatomical measures from the structural part of the dataset were also used as features for classification. A hierarchical feature extraction framework is proposed in this study to extract the features with high prediction score. An ELM classifier was used to classify the data. The ELM classifier in combination with the proposed feature extraction framework outperforms the ELM without feature extraction and two SVMs with the proposed feature extraction framework.
The proposed hierarchical feature extraction method enables us to achieve high and significantly accurate classification accuracies. As a result, we achieved as high as 76.190% (p < 0.0001) accuracy in multi-class settings. To the best of our knowledge, this is the highest test accuracy reported on the ADHD-200 dataset. Therefore, the multi-modal, multi-measure and multi-class classification was an effective approach to achieve high discrimination scores.
The ELM classifier in combination with the sparse feature autoencoder works very well for the classification of neuroimaging data such as in a recent previous work  based on hierarchical extreme learning machine (H-ELM) (Tang et al., 2016) classification framework. However, the current study proposed a more simplified sparse feature extraction framework that enables us to acquire higher classification accuracies.

Comparison with Previous Work
In recent years, many studies about classification-based diagnosis of ADHD have been reported. Most of these studies were binary and the classification scores were ranging from about 72 to 95% (Igual et al., 2012;Lim et al., 2013;Peng et al., 2013;Johnston et al., 2014;Deshpande et al., 2015;Hammer et al., 2015;Iannaccone et al., 2015;Wolfers et al., 2015;Xiao et al., 2016). Except for one study, all used feature selection methods prior to classification (Wolfers et al., 2015). In Xiao et al. (2016), a different ADHD dataset was used with only 47 subjects, compared to the current study. Although the theme of this study by Xiao et al. (2016) is similar to our present work, the results are not comparable due to dataset differences and their use of only cortical thickness ROIs. In addition, our main focus was the multi-class, multi-measure, and multi-modal classification of the ADHD-200 dataset. Previously, a very few multi-class (Fair et al., 2012; and multi-modal (Colby et al., 2012;Dai et al., 2012) classification studies have been published with ADHD-200 dataset. However, to the best of our knowledge, the present study is the first multimodal, multi-measure, and multi-class classification study of the ADHD-200 dataset. In addition, the multi-class classification accuracy of 76.190% (p < 0.0001) is the highest reported accuracy on the ADHD-200 dataset in multi-class settings. However, all the aforementioned studies, except a recent study , used different subject selection compared to us. Therefore, the present results are not comparable to the previous studies. The multi-class classification score of the structural part 73.810% (p < 0.0001) of the present study is much higher than the results (60.78%) in , which utilized the exact same subject selection for classification. The present study is also comparable to  in that it also utilized sparse hierarchical feature extraction framework with basic ELM, while  used H-ELM classification framework, which is a similar approach in principal. However, the method of the present study outperformed all the results in  because the proposed hierarchical feature extraction framework can extract features with the higher predictive power as compared to the sparse autoencoder. In addition, in the present study we utilized many additional feature measures that were not used in . We believe that these measures played a vital role in increasing the classification accuracy. However, the exact causes of the boosted accuracy are not yet determined.

Correlation Analysis of the Significant Regions
We two significant negative Pearson correlation coefficients between the cortical thickness of the regions and age. During adolescence, normal children undergo the cortical thinning phase (Sowell et al., 2004). Consistent with these findings, the two cortical regions in our results showed a decreasing tendency of cortical thickness with age in the TDC group. However, this pattern seemed to be weaker in the ADHDC and ADHDI groups: no regions in each had negative correlation coefficients. In other words, the patients exhibited some deficits in cortical thinning in the middle temporal gyrus and superior frontal gyrus. While considering the middle temporal gyrus; it can be included in ventral attention networks and superior frontal gyrus is related to the dorsal attention networks, therefore, the structural thinning deficits might be used as important markers to discriminate patients and controls (Aboitiz et al., 2014). Thus, we postulate that normal cortical thinning development was interrupted in the ADHD groups. Based on previous findings suggesting that ADHD is related to cortical thinning, the patients might experience abrupt thinning of some brain regions related to attention and motor functions (Makris et al., 2007;Narr et al., 2009). Moreover, it is evident from our classification results that those significant statistical differences may underlie the high classification accuracy. The results in Vaidya (2011), McLeod et al. (2014, Tegelbeckers et al. (2015) further endorse our results of the correlation analysis of ANOVA-based significant regions with age.

Clinical Relevance of Statistical Results
The significant regions from both structural and functional ANOVA results play an important role in acquiring highly accurate differential diagnosis when used in combination with the ROI based significant features that did not form part of the outcome of the ANOVA analysis. The following description of the regions obtained from our ANOVA analysis may have clinical significance: Most of the neuroimaging studies addressing ADHD emphasize that the selected regions in the present study has a strong connection with the neurodevelopmental basis of this disorder. Specifically, the cortical thinning problem and the attention and motor network hypoconnectivity have been suggested to be involved in ADHD (McCarthy et al., 2014;McLeod et al., 2014). These regions showed significant differences between the ADHD and control groups . The functional ANOVA results again endorse the significance of our noteworthy results. A few of the significant regions from our functional ANOVA results overlap with those of structural ANOVA results. They include the left precentral gyrus, the right precentral gyrus, the left middle frontal gyrus, and the right superior temporal gyrus, which are all related to the motor or attention networks.
Our significant feature results showed that structural information is more crucial for the classification. It means that combined gross classification by sMRI and supporting information from fMRI can increase accuracy. In our structural ANOVA results, multiple areas of the frontal cortex, including the middle frontal gyrus (MFG), the precentral gyrus, and the postcentral gyrus in the parietal cortex, showed significantly higher cortical thickness in the TDC group. Our findings are in accordance with previous results (Durston, 2003;Shaw et al., 2007;Valera et al., 2007), and this might contribute to the high classification accuracy of our model. On the other hand, although not much is known about the temporal lobe (Lorberboym et al., 2004), it is evident that the cortical thickness differs significantly in ADHD patients as compared to the normal control group (Fernández-Jaén et al., 2014). Cortical thickness changes over time, therefore we focused on the dynamic changes of thickness according to the ages. Generally, ADHD has cortical thinning compared to controls after development. In Narr et al. (2009) the mean age was of 11.7 years. However, ADHD has cortical thinning problems during the development which are related to the delayed cortical maturation. The mean age of peak brain thickness is 10.5 years for ADHD and 7.5 years for controls. It means after 10.5 years of age, ADHD experiences fast decreases of cortical thickness while controls have enough time to maturate the brain. In line with these evidences, we can say that ADHD has delayed cortical thinning, but finally they can get thinner cortex than controls. In addition, many brain areas can maturate with different speeds, therefore that the results were confusing for us. Our results from the functional ANOVA reconfirm this finding. The middle temporal gyrus is associated with language abilities, visual perception, multi-modal sensory integration, and semantic memory processing . Our findings from the functional ANOVA result are consistent with these results. The bilateral postcentral gyrus (Martinussen et al., 2005), right declive (Ortiz et al., 2015), and Left middle occipital gyrus were reported to be affected in ADHD. Visual perception ability is also related to the visuospatial working memory which is the common symptom of ADHD (Valera et al., 2010;Vaidya, 2011;McLeod et al., 2014). ADHD is related to the semantic memory and language problems (Felton et al., 1987). Our functional ANOVA results endorse these findings, thus making our observations highly beneficial to clinicians.

CONCLUSIONS
This study reports the correlation between both structural and functional measures as well as age IQ and symptom severity score. However, IQ and symptoms severity score information, which is also a very important parameter to determine the correlation among different groups, was missing for a few subjects in the original dataset. In addition, the exact cause of the boosted accuracy was unknown. This is a major limitation of this study.
In conclusion, we found that the proposed hierarchical feature extraction model in combination with ELM serves very well both the binary and the multi-class classification of ADHD. In the future, we will perform linear mixed effect modeling and multi-variate modeling on the selected data of the ADHD-200 dataset in order to acquire the significant regions using both neuroimaging (structural and functional) data and age information.

AUTHOR CONTRIBUTIONS
MQ selected the subjects for balanced experiment design from the publicly available ADHD-200 database, pre-processed the data, developed, and applied the feature selection methods, classified the data, and found the significant clinical biomarkers. JO, BM, HJ, and BL helped in drafting the introduction and discussion parts of the manuscript with MQ. BL supervised the entire research process and revised the manuscript for publication. All authors contributed to the research design, results interpretation, and proofreading of the final manuscript.

ACKNOWLEDGMENTS
Authors thankfully acknowledge the ADHD-200 consortium for the released dataset.

SUPPLEMENTARY MATERIAL
The Supplementary Material for this article can be found online at: http://journal.frontiersin.org/article/10.3389/fnhum. 2017.00157/full#supplementary-material