Exploiting the brain's network structure in identifying ADHD subjects

Attention Deficit Hyperactivity Disorder (ADHD) is a common behavioral problem affecting children. In this work, we investigate the automatic classification of ADHD subjects using resting state functional magnetic resonance imaging (fMRI) sequences of the brain. We show that the brain can be modeled as a functional network, and that certain properties of the networks differ between ADHD subjects and control subjects. We compute the pairwise correlation of brain voxels' activity over the time frame of the experimental protocol, which helps to model the function of a brain as a network. Different network features are computed for each of the voxels constructing the network. The concatenation of the network features of all the voxels in a brain serves as the feature vector. Feature vectors from a set of subjects are then used to train a PCA-LDA (principal component analysis-linear discriminant analysis) based classifier. We hypothesized that ADHD related differences lie in some specific regions of the brain and that using features only from those regions is sufficient to discriminate ADHD and control subjects. We propose a method to create a brain mask which includes the useful regions only and demonstrate that using the features from the masked regions improves classification accuracy on the test data set. We train our classifier with 776 subjects, and test on 171 subjects provided by the Neuro Bureau for the ADHD-200 challenge. We demonstrate the utility of graph-motif features, specifically the maps that represent the frequency of participation of voxels in network cycles of length 3. The best classification performance (69.59%) is achieved using 3-cycle map features with masking. Our proposed approach holds promise in being able to diagnose and understand the disorder.


INTRODUCTION
Attention Deficit Hyperactivity Disorder (ADHD) is a common behavioral disorder affecting children. Approximately 3-5% of school aged children are diagnosed with ADHD. Currently, no well-known biological measure exists to diagnose ADHD. Instead, doctors rely on behavioral symptoms to identify it. To understand the cause of the disorder more fundamentally, researchers are using new structural and functional imaging tools like MRI and functional magnetic resonance imaging (fMRI). fMRI has been widely used to study the functioning of the brain. It provides high quality visualization of spatio-temporal activity within a brain, which can be used to compare the functioning of normal brains against those with disorders. Some researchers have used task-related fMRI data, in which the test subjects perform conscious tasks depending on the input stimuli. Others have used resting state brain fMRI data. The brain remains active even during rest, when it is not engaged in an attentive task. Raichle et al. (2001) identified several brain areas, such as the medial prefrontal cortex (MPFC), posterior cingulate cortex (PCC), and precuneus, that are active during rest. These areas form part of a functional network known as the resting-state network or default mode network (DMN) (Greicius et al., 2004; Damoiseaux et al., 2006). The literature (Greicius et al., 2004; Cherkassky et al., 2006; Damoiseaux et al., 2006) tends to use interchangeably the concepts of resting state brain networks and the DMN as defined by Raichle et al. (2001). We compare the brain regions that we have found in the current ADHD data set with the components of the DMN described by Raichle et al. (2001). It is believed that the DMN may be responsible for synchronizing all parts of the brain's activity; disruptions to this network may cause a number of complex brain disorders (Raichle, 2010).
Researchers have studied neural substrates relevant to ADHD related behaviors, such as attention lapses, and identified the DMN as a key area for better understanding the problem (Weissman et al., 2006). In this study we use resting state brain fMRI data and hypothesize that the differences between ADHD conditioned and control brains lie in the variation of the functional connections of the DMN.
Many studies have been performed to identify functional differences related to ADHD. Most of the approaches use group level analysis to deduce the statistical differences between ADHD conditioned and control groups. Structural MRI analysis suggests that there are abnormalities in ADHD brains, specifically in the frontal lobes, basal ganglia, parietal lobe, occipital lobe, and cerebellum (Castellanos et al., 1996; Overmeyer et al., 2001; Sowell et al., 2003; Seidman et al., 2006). In another set of studies, ADHD brains were analyzed using task-related fMRI data. Bush et al. (1999) found significantly low activity in the anterior cingulate cortex when ADHD subjects were asked to perform the Counting Stroop during fMRI. Durston (2003) showed that ADHD conditioned children have difficulty performing the go/nogo task and display decreased activity in the frontostriatal regions. Teicher et al. (2000) demonstrated that boys with ADHD have higher T2 relaxation time in the putamen, which is directly connected to a child's capacity to sit still. A third set of work was done using resting state brain fMRI to locate any abnormalities in the DMN. Castellanos et al. (2008) performed Generalized Linear Model based regression analysis on the whole brain with respect to three frontal foci of the DMN, and found low negative correlated activity in the precuneus/anterior cingulate cortex in ADHD subjects. Tian et al. (2006) found functional abnormalities in the dorsal anterior cingulate cortex; Cao et al. (2006) showed decreased regional homogeneity in the frontal-striatal-cerebellar circuits, but increased regional homogeneity in the occipital cortex among boys with ADHD. Zang et al. (2007) verified decreased amplitude of low-frequency fluctuation (ALFF) in the right inferior frontal cortex, left sensorimotor cortex, bilateral cerebellum, and the vermis, as well as increased ALFF in the right anterior cingulate cortex, left sensorimotor cortex, and bilateral brainstem.
While group level analysis can suggest statistical differences between two groups, it may not be that useful for clinical diagnosis at the individual level. There have been relatively few investigations of the classification of ADHD subjects at the individual level. One such study was performed by Zhu et al. (2008), who used a PCA-LDA (principal component analysis-linear discriminant analysis) based classifier to separate ADHD and control subjects at the individual level. Unlike our network connectivity features, which can connect all the synchronous regions of the whole brain, they used a regional homogeneity based feature for classification. Also, the experiments were performed on only 20 subjects, which is too few to be conclusive.
Our algorithm exploits the topological differences between the functional networks of ADHD and control brains. The different steps of our approach are described in Figure 1. The input to our algorithm is the brain fMRI sequences of the subjects. fMRI data can be viewed as a 4-D video such that the 3-D volume of the brain is divided into small voxels and imaged for a certain duration. The data can also be viewed as a time series of intensity values for each of the voxels. The correlation of these intensity time series indicates how synchronous the activities of two voxels are; higher correlation values suggest that two voxels are working in synchronization. A functional network structure is generated for the brain of each of the subjects under study by computing the correlations for all possible pairs of voxels and establishing a connection between any pair of voxels whose correlation value is sufficiently high. Different network features, such as degree maps, cycle maps, and weight maps, are computed from the network to capture topological differences between ADHD and control subjects. We provide a detailed description of all the network features in the later sections of the article. A brain mask is computed that includes only the regions with useful information to classify ADHD and control subjects. For the rest of the article, we refer to this mask as a "useful region mask." The details of the useful region mask computation procedure are described in section 2.3. Finally, the network features from the voxels within the useful region mask are extracted to train a PCA-LDA based classifier. We have tested the performance of each of the network features computed on the training data set from the Kennedy Krieger Institute (KKI). We selected two different kinds of network features, the degree map and the 3-cycle map, for the experiments on the full data set.
In our work, we have performed experiments on a large challenging data set which includes subjects from different races, age groups, and data capturing sites. We propose a new approach for the automatic classification of ADHD subjects, and believe that our work will be helpful to the medical imaging community.

DATA
We use the data provided by the Neuro Bureau for the ADHD-200 competition, which consists of 776 training subjects and 197 test subjects. Eight different centers contributed to the compilation of the whole data set, which makes the data diverse as well as complex. Different phenotypic information, such as age, gender, handedness, and IQ, is also provided for each subject. See Table 1 for an overview of the data set. All research conducted by ADHD-200 data contributing sites was conducted with local IRB approval, and contributed in compliance with local IRB protocols. In compliance with HIPAA Privacy Rules, all data used for the experiments of this article are fully anonymized. The competition organizers made sure that the 18 patient identifiers are removed, as well as face information. For all our experiments we have used preprocessed resting state fMRI data registered in a 4 × 4 × 4 mm voxel resolution Montreal Neurological Institute (MNI) space, with nuisance variance removed, filtered using a bandpass filter (0.009 Hz < f < 0.08 Hz) and blurred with a 6-mm FWHM Gaussian filter. All the fMRI scans are motion corrected to the first image of the time series. We have used a binary mask, provided with each of the subjects, to identify the voxels inside the brain volume. All the fMRI data volumes are of size 49 × 58 × 47 voxels, but the number of samples over time varies with the center where the data were captured. Further information regarding the data and the preprocessing steps is provided in NITRC (2011).
FIGURE 1 | Overview of our approach. Compute an N × N correlation matrix (N is the number of voxels) using fMRI data; compute the adjacency matrix by thresholding the low correlation values to generate a network; compute network features such as node degree and cycle count for each node of the network; generate a mask for the brain regions which are believed to be most effective for classification; extract feature values within the generated brain mask and classify subjects using the PCA-LDA classifier.
Frontiers in Systems Neuroscience | www.frontiersin.org | November 2012 | Volume 6 | Article 75
Though no quality control is performed on the data, a quality score is provided with each image file of all the subjects. The voxel-wise z-scores are thresholded and summed over all the voxels to compute the quality score of an image file. Images with low scores are considered to be better. We have not considered the quality scores for our study.

METHOD
Network motifs such as node degree distribution, cycles, etc. are analyzed in different disciplines of science to understand the systems being studied, and neuroscience is no exception (Milo et al., 2002; Ma'ayan et al., 2008; Sporns, 2002). We used different graph theoretic concepts for our study. We assume that the activity of a brain can be modeled as a functional network where the voxels are considered as the nodes, which are connected with each other based on the similarity of their activity over the time domain. In this article we use the terms voxel and node interchangeably. The time series of a node is denoted in bold face. As the first step of the algorithm, we extract the time series for all the voxels and reorganize them as a separate 2-D matrix for each of the subjects in the data set. This is illustrated in the second step of Figure 1. Next, the correlation between all possible voxel pairs is computed. If a subject contains N voxels, a correlation matrix of size N × N is constructed, where the ith row of the matrix corresponds to the pairwise correlation values of the ith voxel with all other voxels within the anatomical mask of the subject.
For any two voxels, if the time series are u and v, respectively, the correlation can be computed as

$$ r(\mathbf{u}, \mathbf{v}) = \frac{\sum_{t=1}^{T} (u_t - \bar{u})(v_t - \bar{v})}{\sqrt{\sum_{t=1}^{T} (u_t - \bar{u})^2}\,\sqrt{\sum_{t=1}^{T} (v_t - \bar{v})^2}}, $$

where T is the length of the time series, and \bar{u} and \bar{v} are the means of u and v over time. We normalize all the time series between [−1, 1] before correlation computation. Next, we threshold all the values of the correlation matrix to get a binary map of zeros and ones. This binary map can be considered as the adjacency matrix of a graph where the ith voxel is connected to all the voxels for which nonzero values are present in the ith row of the matrix. Note that we can consider two voxels to be connected by an edge when the correlation is highly positive, highly negative, or simply high in absolute value. We have computed three different networks considering high positive, high negative, and high absolute correlation values, respectively.
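The network construction step can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the function names `correlation_matrix` and `adjacency`, the strict `>` comparison against the threshold, and the removal of self-connections are our choices for the example, not details given in the article. Note that Pearson correlation is unchanged by shifting or positively rescaling a time series, so the sketch does not repeat the [−1, 1] normalization.

```python
import numpy as np

def correlation_matrix(ts):
    """Pearson correlation between all rows of ts, shape (N voxels, T samples)."""
    centered = ts - ts.mean(axis=1, keepdims=True)
    norms = np.linalg.norm(centered, axis=1, keepdims=True)
    norms[norms == 0] = 1.0  # assumption: a constant series gets correlation 0, not NaN
    unit = centered / norms
    return unit @ unit.T

def adjacency(corr, threshold=0.8, mode="positive"):
    """Binary adjacency matrix from a correlation matrix (illustrative helper)."""
    if mode == "positive":
        adj = corr > threshold
    elif mode == "negative":
        adj = corr < -threshold
    elif mode == "absolute":
        adj = np.abs(corr) > threshold
    else:
        raise ValueError("mode must be 'positive', 'negative' or 'absolute'")
    adj = adj.copy()
    np.fill_diagonal(adj, False)  # assumption: a voxel is not connected to itself
    return adj.astype(np.uint8)
```

The three `mode` values correspond to the three networks described above (high positive, high negative, and high absolute correlation).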

Network feature computation
Once the graphs are constructed for each subject of the data set, we compute different network features which can capture functional differences between the activity patterns of the brains of ADHD and control subjects. The feature values from all the voxels of a network form a feature map, such as a degree map or a cycle map. The different network features computed are described below.

Degree.
For each node in a network, the degree is the count of the other nodes it is connected to. In other words, the degree of a node is the number of edges attached to it.

Varying distance degree.
Instead of considering the count of all the edges of a node as its degree, we group the edges based on their physical length and compute a separate degree for each of the groups. So, if we have n threshold values for edge length, say {l_1, l_2, ..., l_n}, we can compute n degrees, one for each edge-length bin.
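As a minimal sketch of the two degree features, the following pure-Python functions compute the plain degree and the varying distance degree. The names `degrees` and `varying_distance_degrees` are hypothetical, and two details are assumptions made for the example: each bin k counts neighbours whose Euclidean distance falls in (l_{k-1}, l_k], and edges longer than the last threshold are ignored.

```python
import math

def degrees(adj):
    """Degree of every node; adj is a square 0/1 adjacency matrix (nested lists)."""
    return [sum(row) for row in adj]

def varying_distance_degrees(adj, coords, thresholds):
    """Per-node degree split into edge-length bins.

    coords[i] is the physical position of voxel i; thresholds is an ascending
    list [l_1, ..., l_n]. An edge is counted in the first bin whose threshold
    is not smaller than the edge length (assumption).
    """
    result = [[0] * len(thresholds) for _ in adj]
    for i, row in enumerate(adj):
        for j, connected in enumerate(row):
            if not connected or i == j:
                continue
            length = math.dist(coords[i], coords[j])
            for k, limit in enumerate(thresholds):
                if length <= limit:
                    result[i][k] += 1
                    break
    return result
```

With this binning, the per-bin counts of a node add up to its ordinary degree whenever no edge is longer than the last threshold.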

L-cycle count.
A path in a network is a sequence of distinct nodes which can be traversed in the given order using the connecting edges. A cycle, on the other hand, is a closed path in the network where the starting and ending node is the same and all other nodes are distinct. The L-cycle count of a node is the number of all possible distinct L length cycles containing the node. Figure 2 illustrates this idea. The L-cycle count for a node is calculated by traversing all the L-length paths starting from the node and counting the paths that lead back to the starting node. The traversal can be performed using the breadth first search algorithm. We have used different cycle lengths for our experiments.
FIGURE 2 | (A) The degree of the node, highlighted in yellow, is the count of all the green nodes connected to it (i.e., 8), while the varying distance degree is the count of the connected nodes in each of the bins defined by the three edge length thresholds (l_1, l_2, l_3) marked in blue. In this example the varying distance degrees of the yellow node are {4, 2, 2}. (B) All the distinct 3-cycles that contain node 3.
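The L-cycle count can be illustrated with a small recursive enumeration. This sketch is not the article's implementation: it walks simple paths depth first rather than breadth first, the name `cycle_count` and the dictionary-of-sets graph representation are assumptions, and it halves the raw count because every undirected cycle is reached once in each direction.

```python
def cycle_count(neighbors, start, length):
    """Number of distinct simple cycles with `length` edges that contain `start`.

    neighbors maps each node to the set of nodes it is connected to
    (undirected graph, no self-loops).
    """
    if length < 3:
        return 0

    def extend(node, visited, remaining):
        # `remaining` is the number of edges still to traverse.
        if remaining == 1:
            # Close the cycle only if the last node links back to the start.
            return 1 if start in neighbors[node] else 0
        total = 0
        for nxt in neighbors[node]:
            if nxt not in visited:
                total += extend(nxt, visited | {nxt}, remaining - 1)
        return total

    # Each cycle is found twice, once per traversal direction.
    return extend(start, {start}, length) // 2
```

A 3-cycle map would then be obtained by calling `cycle_count(neighbors, v, 3)` for every voxel v. The number of paths explored grows quickly with the cycle length, so this direct enumeration is practical only for short cycles.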

Weight sum.
Instead of constructing an adjacency matrix using a threshold on the correlation values, we assume every node is connected to all other nodes by weighted edges. The weight of the edge connecting a node pair is their correlation value. As the correlation values can be positive or negative, we separately add up all the positive, negative, and absolute edge weights of a node to get its sums of positive, negative, and absolute weights.
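A minimal sketch of the weight sum feature, assuming the correlation matrix is available as nested lists and that a node's correlation with itself (the diagonal) is excluded; the function name `weight_sums` is hypothetical.

```python
def weight_sums(corr):
    """Return (positive, negative, absolute) weight sums for every node.

    corr is a square correlation matrix given as nested lists; the diagonal
    entry of each row is skipped (assumption).
    """
    sums = []
    for i, row in enumerate(corr):
        others = [w for j, w in enumerate(row) if j != i]
        positive = sum(w for w in others if w > 0)
        negative = sum(w for w in others if w < 0)
        absolute = sum(abs(w) for w in others)
        sums.append((positive, negative, absolute))
    return sums
```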

PCA-LDA classification
Once we finish the computation of the network features, we extract the features from all of the voxels within the useful region mask. The mask generation algorithm is described in the next subsection. Concatenation of the feature values extracted from all the voxels generates a feature vector per subject. A PCA-LDA based classifier is trained separately on the different sets of feature vectors computed for the different types of network features. Finally, the classifier is used for automatic classification of the ADHD subjects.
It is expected that the characteristics of the networks computed are represented by their feature vectors. A feature vector of a network represents a point in the feature space, where the dimensionality of the space is the same as the length of the vector. If the feature vectors of ADHD and control subjects are separable, then their corresponding points in the feature space should cluster in different locations. When a classifier is trained, it learns to partition the feature space in such a way that the feature vectors from each of the groups are ideally clustered in separate segments. Given the feature vector of a test example, the classifier can identify which segment of the feature space it belongs to and classify the test subject accordingly.
LDA is a widely used data classification technique which maximizes the ratio of between-class variance to within-class variance to produce maximal separability. Mathematically, the objective is to maximize the following function:

$$ J(\mathbf{w}) = \frac{\mathbf{w}^T S_B \mathbf{w}}{\mathbf{w}^T S_W \mathbf{w}}, $$

where w is the projection direction and S_B and S_W are the between class and within class scatter matrices, which can be formulated as follows:

$$ S_B = (\mu^{(A)} - \mu^{(C)})(\mu^{(A)} - \mu^{(C)})^T, $$

$$ S_W = \sum_{i=1}^{n_A} (x_i^A - \mu^{(A)})(x_i^A - \mu^{(A)})^T + \sum_{i=1}^{n_C} (x_i^C - \mu^{(C)})(x_i^C - \mu^{(C)})^T. $$

Here n_A and n_C are the number of subjects, µ^(A) and µ^(C) are the mean feature vectors, and x_i^A and x_i^C are the ith feature vectors of the ADHD and control group, respectively.
In many cases, the dimension of the feature space becomes so high that the proper partitioning of the space is difficult. For example, in our case, the dimension of the feature space is the number of voxels within the useful region mask, which is several thousand. Moreover, most of the dimensions do not contain any significant data variance. PCA is a procedure to find a set of orthogonal directions, called principal components, along which the variance of the data is maximum. It then projects the data into the lower dimensional subspace spanned by the principal components. The classifier can work efficiently on this subspace, which is significantly smaller in dimension than the original feature space.
We use the first 40 and the first 100 principal components for the experiments on the KKI and full data sets, respectively, as they cover more than 98% of the data variance. We have included a plot of principal component vs. percent of data variance in the supplementary materials. Refer to Abdi and Williams (2010) for details about PCA.
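The PCA-LDA pipeline can be sketched with NumPy as follows. This is a simplified two-class illustration, not the article's code: the class name `PcaLda`, the label convention (1 for ADHD, 0 for control), the use of a pseudo-inverse of S_W, and the decision threshold placed midway between the projected class means are all assumptions. It uses the standard result that the ratio of between-class to within-class scatter is maximized by a direction proportional to S_W^{-1}(µ^(A) − µ^(C)).

```python
import numpy as np

class PcaLda:
    """PCA for dimensionality reduction followed by two-class Fisher LDA."""

    def __init__(self, n_components):
        self.n_components = n_components

    def _project(self, X):
        return (np.asarray(X, dtype=float) - self.mean_) @ self.components_.T

    def fit(self, X, y):
        X = np.asarray(X, dtype=float)
        y = np.asarray(y)
        self.mean_ = X.mean(axis=0)
        # Rows of vt are the principal directions, ordered by decreasing variance.
        _, _, vt = np.linalg.svd(X - self.mean_, full_matrices=False)
        self.components_ = vt[: self.n_components]
        Z = self._project(X)
        za, zc = Z[y == 1], Z[y == 0]
        mu_a, mu_c = za.mean(axis=0), zc.mean(axis=0)
        # Within-class scatter in the PCA subspace.
        s_w = (za - mu_a).T @ (za - mu_a) + (zc - mu_c).T @ (zc - mu_c)
        # Fisher direction; pinv guards against a singular S_W (assumption).
        self.w_ = np.linalg.pinv(s_w) @ (mu_a - mu_c)
        # Assumption: threshold halfway between the projected class means.
        self.threshold_ = 0.5 * (mu_a + mu_c) @ self.w_
        return self

    def predict(self, X):
        return (self._project(X) @ self.w_ > self.threshold_).astype(int)
```

For example, `PcaLda(n_components=40).fit(train_X, train_y).predict(test_X)` would mirror the 40-component setting used on the KKI subset, with `train_X` holding one masked feature vector per subject.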

USEFUL REGION MASK
Different research studies have proposed several regions of interest (ROIs) for fMRI analysis. These ROIs vary in size and number. In some studies they are identified based on the anatomical structure of the brain and in other studies they depend on the functional responsibility. Tzourio-Mazoyer et al. (2002) identified the ROIs based on similar functional responses in the brain. Craddock et al. (2011) generated a homogeneous functional connectivity map from resting state fMRI data. Smith et al. (2009) identified several co-varying functional subnetworks in the resting state brain. However, it is still unclear which ROIs are the best for resting state functional connectivity analysis. It is also not known whether all the ROIs detected by one method are required for ADHD classification or whether the use of a subset of ROIs is more efficient. To answer these questions, we use a novel method to identify the useful region mask for the classification of ADHD and control subjects. The algorithm for the useful region mask generation is as follows:
Step 1 For each of the subjects used for the mask generation algorithm, we do the following:
• Divide the brain volume into small cube-shaped regions. Each of the regions is typically 5 × 5 × 5 voxels, except the regions at the boundary of the brain volume.
• Select a random subset of the regions. We include each region in the subset with probability p.
• Generate the degree map by extracting the degrees for the voxels within the selected subset of regions.
Step 2 Train the PCA-LDA based classifier and calculate the detection accuracy on the test data set.
Step 3 Repeat step 1 and step 2 m times, each time generating a different random subset, calculating the detection accuracy, and recording it.
Step 4 Choose the random subsets corresponding to the top 10% of the detection accuracies as the candidates for generating the useful region mask. We count the occurrences of each of the regions in all of the candidate subsets and normalize the counts between 0 and 1 by dividing them by the number of candidate subsets. This gives us the probability of inclusion of each of the regions in the mask.
Step 5 Finally the useful region mask is generated using a threshold th to prune the regions with low probability.
We experimentally verified that the highest detection rate is achieved when p is 0.40 and th is 0.60. The experimental results are included in the supplementary materials. The value of m was set to 500 so that the number of iterations is large but still computationally feasible. Figure 3A is an illustration of the proposed algorithm on a cartoon 2-D slice of a brain, while Figure 3B is the flowchart for the mask generation algorithm. Note that other network features may also be used in the algorithm, but we simply use the degree map feature. We assume that the regions which are useful for identifying ADHD conditioned brains should not vary depending on the feature used for computing the mask. We have also tested this idea by computing the useful region mask using the 3-cycle map feature, and found that the final detection rates are very similar (see the supplementary materials).
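The five steps of the mask generation algorithm can be sketched as below. The sketch is deliberately generic: `score_subset` is a hypothetical callback standing in for "train the PCA-LDA classifier on degree maps restricted to these regions and return the detection accuracy", and the function name, the `seed` argument, and the `>=` comparison against th are assumptions made for the example.

```python
import random

def useful_region_mask(region_ids, score_subset, p=0.4, th=0.6, m=500,
                       top_fraction=0.1, seed=0):
    """Return (mask, inclusion_probability) for the given regions.

    score_subset(subset) must return the detection accuracy obtained when
    only the regions in `subset` contribute features (hypothetical callback).
    """
    rng = random.Random(seed)
    trials = []
    for _ in range(m):
        # Steps 1-3: draw a random subset of regions and record its accuracy.
        subset = frozenset(r for r in region_ids if rng.random() < p)
        trials.append((score_subset(subset), subset))
    # Step 4: keep the subsets with the top 10% of accuracies.
    trials.sort(key=lambda trial: trial[0], reverse=True)
    n_top = max(1, int(round(top_fraction * m)))
    candidates = [subset for _, subset in trials[:n_top]]
    prob = {r: sum(r in s for s in candidates) / len(candidates)
            for r in region_ids}
    # Step 5: prune regions whose inclusion probability is below th.
    mask = {r for r in region_ids if prob[r] >= th}
    return mask, prob
```

Because the scoring function is called m times and each call involves training a classifier, the cost of the procedure is dominated by `score_subset`, which is why m has to stay computationally feasible.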

EXPERIMENTS AND RESULTS
First, we verified the performance of each of the network features computed on a subset of the training data. We used fMRI data of 83 subjects from the KKI data set. Among the 83 subjects, the first 44 subjects are used for training and the remaining 39 for testing. The performance of each of the features is computed with and without using the useful region mask. The mask is generated on the KKI training set, comprising the first 44 subjects of the KKI subset, using the algorithm described in section 2.3. Each time a random subset of regions is selected, the classification performance is measured by leave-one-out cross validation, i.e., take 43 subjects for training and test on the remaining one subject, repeat the process 44 times, testing each of the 44 subjects one at a time, and average the correct detection count. Figure 4 shows the computed mask on different slices of the brain. Table 2 lists the information of the different clusters found in the useful region mask and the ROIs they overlap with. To empirically select the correlation threshold to be used for our experiments, we varied it from 0.4 to 0.8 with an increment of 0.1 in every step. In each step, detection rates for different network features are computed on the KKI test set of 39 subjects. The plots for correlation threshold vs. detection rate are shown in Figure 5. To generate the plot for the weight map, we compute the sum of the edge weights considering only the edges which have weights greater than the correlation threshold used within that step. Note that the detection rate for each feature is measured for positive, negative, and absolute correlation values. However, the features computed from the positive correlation values have always outperformed the other two cases. Hence, we have not reported the other two cases in the paper.
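The leave-one-out procedure used to score each random subset of regions can be written generically. In this sketch `fit_predict` is a hypothetical callback that trains any classifier on the held-in subjects and returns the predicted label for the held-out subject; the function name and the list-based inputs are assumptions.

```python
def leave_one_out_accuracy(features, labels, fit_predict):
    """Fraction of subjects classified correctly when each is held out once.

    features: list of feature vectors, one per subject.
    labels:   list of class labels, same length as features.
    fit_predict(train_X, train_y, test_x) -> predicted label (hypothetical).
    """
    correct = 0
    for i in range(len(features)):
        # Train on every subject except the ith one.
        train_X = features[:i] + features[i + 1:]
        train_y = labels[:i] + labels[i + 1:]
        if fit_predict(train_X, train_y, features[i]) == labels[i]:
            correct += 1
    return correct / len(features)
```

With 44 training subjects this performs 44 fits, each on 43 subjects, matching the procedure described above.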
Since, for all the network features other than the 4-cycle map, the best performance is consistently achieved when the correlation threshold is 0.80, we chose this value for all the experiments on the full data set. Table 3 summarizes the best performance obtained for each of the network features and the corresponding correlation threshold values. The performance in the table is the percentage of correct detections (control and ADHD) among the total number of test subjects. Note that for all the features, the performance without the useful region mask is lower than the performance with the mask. This demonstrates the utility of voxel selection through the generated mask. In a recent study, Solmaz et al. (2012) used Bag of Words features for automatic classification of the ADHD subjects. We used their method for the purpose of comparing its performance with that of our method. For our experiments using the Bag of Words feature, each subject is represented by 75 and 100 bin histograms when we used raw time series and degree map features, respectively. A third kind of experiment was performed by representing each of the subjects as a concatenation of the two types of histograms, resulting in a 175 bin histogram. The details of the Bag of Words method are provided in the supplementary materials. We performed thorough experiments on the full data set using positive degree map and positive 3-cycle map features. We trained our classifier with the full training data, which has 776 subjects from 7 different centers, and tested on the 171 subjects from 6 centers released for the ADHD-200 competition. Again, we compared the performance with and without using the useful region mask. We reused the same mask generated using the first 44 subjects of KKI. It is worth mentioning that the mask selects 6916 voxels from which features are extracted. The correct detection rate, specificity, and sensitivity for each of the test centers and for all centers combined are reported in Table 4. Since the subject labels of the Brown University test set have not yet been released, we cannot compute the performance measures on that subset.

Comparison of the performances is shown when the useful region mask is used and not used for the degree map and 3-cycle map features.
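The three performance measures reported per test center can be computed as in the short sketch below. It assumes ADHD is treated as the positive class (label 1) and control as the negative class (label 0), so that sensitivity is the fraction of ADHD subjects detected and specificity is the fraction of control subjects detected; the function name `detection_metrics` is hypothetical.

```python
def detection_metrics(y_true, y_pred):
    """Detection rate, sensitivity and specificity for binary labels (1 = ADHD)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return {
        # Correct detections (control and ADHD) over all test subjects.
        "detection_rate": (tp + tn) / len(pairs),
        "sensitivity": tp / (tp + fn) if tp + fn else float("nan"),
        "specificity": tn / (tn + fp) if tn + fp else float("nan"),
    }
```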

DISCUSSION
We have modeled the brain as a functional network which is expected to represent the interaction of the different active regions of the brain. We assumed that ADHD is a problem caused by the partial failure of the brain's communication network, and that the affected subjects can be distinguished from control subjects using the topological differences of their respective functional networks. To verify the idea, we have extracted different network features to train a PCA-LDA based automatic classifier. Figure 6 shows that the average degree map, computed for the ADHD and control subjects of the KKI data set, is able to capture some difference of connectivity in the Cingulate Gyrus and the Paracingulate Gyrus regions of the brain. We also proposed that the features from the whole brain are not required for the classification, but that some key areas hold useful information. Our results show that the inclusion of features from the whole brain can negatively impact the classification accuracy. This resulted in a novel algorithm to compute the useful region mask, which helped to improve the classification performance.
The different network features computed are expected to capture different characteristics of the functional network. The degree map and the weight map can capture how densely the nodes of the network are connected. This can give us a measure of how synchronously different regions of a brain are interacting. The varying distance degree map, on the other hand, can also reveal how the synchronous regions are distributed over the brain. While the degree map captures pairwise interactions of voxels, it ignores higher-order interactions, such as those among three voxels simultaneously. We know from brain anatomy that there are such multiply connected brain regions. Hence, cycle maps offer a different perspective from which a given network may be viewed. The utility of using network motifs such as cycles to describe networks has been described in Milo et al. (2002). Figure 4 and Table 2 present the ROIs found through our adaptive labeling technique described in section 2.3. These ROIs were used in the classification and include regions such as the cingulate and precuneus, which is consistent with the findings of Castellanos et al. (2008). The cingulate and precuneus regions are known to be part of the DMN (Damoiseaux et al., 2006). Many regions in Table 2 have also been identified by Assaf et al. (2010), such as the precuneus, temporal pole, superior temporal gyrus, and pre-central gyrus. Regions in Table 2 that are consistent with those reported by Uddin et al. (2009) include the inferior temporal gyrus and lingual gyrus.
FIGURE 6 | The average difference is calculated using the 83 subjects of the KKI training set. The dark red to white color map is used to represent higher degree in control subjects and the blue to green color map is used to show the opposite. The control group shows higher connectivity in the Cingulate Gyrus region on slices with Z coordinates 10 and 15 and in the Paracingulate Gyrus region on slices with Z coordinates 19 and 23.
Interestingly, Table 2 identifies the right amygdala, which did not show up in the analysis of Castellanos et al. (2008), Assaf et al. (2010), or Uddin et al. (2009). The limbic system is known to play a role in ADHD, and a study by Plessen et al. (2006) reported disrupted connectivity between the amygdala and OFC in children with ADHD. Hence the value of our technique is that it provides an independent and automatic source of hypotheses about the brain regions that are implicated in the diagnosis and classification of ADHD. In this sense, our technique for ROI identification can be considered to be a model-free method. Furthermore, our classifier is agnostic to any particular theory of ADHD, and works strictly on a machine-learning approach to separating ADHD patients from controls by utilizing labeled data. Hence the technique described in this paper is applicable to other types of brain disorders where one can create labeled data for the accompanying brain scans. The curves in Figure 5 show that for all the network features, high performance is achieved when a correlation threshold of 0.80 is used to construct the network. In four out of seven cases the performance is the highest, in two other cases it is among the highest, and in one case it is slightly lower than the highest. The results are not surprising, since they indicate that the difference of connection structure for highly correlated voxels matters the most for classification.
Considering the results in Table 4, we observe that in five out of six data sets, the 3-cycle maps with voxel selection give the best detection rate. Only on one data set, the Peking data set, the 3-cycle map with voxel selection gives marginally worse performance than the degree map with voxel selection. To the best of our knowledge, this is the first time that the utility of cycle-related features has been demonstrated in the fMRI imaging literature. The study in Ma'ayan et al. (2008) showed that cycle-related features are useful in discriminating biological networks from man-made networks, but did not investigate various types of fMRI-derived networks.
We note that calculating cycle-related features is more computationally intensive than the degree map, and the computation increases exponentially with cycle length. The use of GPUs can reduce the cost of computation, as earlier studies with fMRI images have shown Rao et al. (2011). If standardized libraries for cycle computation become available on GPU platforms, it will promote the use of such features in fMRI research.
The use of the degree map provides a good compromise between classification performance and computational cost. It is easy to compute, and provides classification performance that is only marginally worse than that of the 3-cycle maps in most cases. One limitation of our study is that we have not applied any specific measure to compensate for differences in signal-to-noise ratio that may be introduced into the data by differences in experimental setup among the sites. Also, some recent studies (Power et al., 2012;Van Dijk et al., 2012) indicate that the correlations between brain regions are sensitive to head motion even when the data are preprocessed for motion correction. We have not performed any explicit step to counter this problem. Finally, we note that we used a single classifier, the PCA-LDA method, to investigate the utility of different network features. It is possible that other classifiers, such as neural networks or support vector machines, may give better performance. Such investigations need to be carried out in the future.

ACKNOWLEDGMENTS
The project described was supported by Award Number R21CA129263 from the National Cancer Institute. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Cancer Institute or the National Institutes of Health.
Special thanks to the Neuro Bureau and all the data contributing sites for their efforts in compiling the large data set and making it publicly available. The goal of this project is that different disciplines of science may help to better understand the neural basis of ADHD.

APPENDIX
Experiments for Finding Best p and th Values
We varied the probability p of including a region in the random subset and the final threshold th applied to the occurrence probability map of the regions, in order to generate different useful region masks. The useful region mask generation algorithm in section 2.3 of the main article gives the details of p and th. For each pair of values of p and th, we compute a different useful region mask, which yields a different detection rate on the KKI data set. The detection rates are reported in Figure A1. The best performance is achieved when p = 0.4 and th = 0.6. We used these values for generating the final useful region mask.

The Map of Region Occurrence Probability
In the useful region mask computation algorithm in section 2.3 of the main article, we used the top 10% of the generated random subsets as the candidate subsets for generating the final mask. We count the number of occurrences of each region in the candidate subsets and divide it by the total number of candidate subsets to generate the region occurrence probability map. This map is reported in Figure A2, at the request of Reviewer 2.
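The mask generation steps described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the exact algorithm of section 2.3: the scoring callback `score_subset` (which in our experiments would correspond to a detection rate on the KKI data set), the number of random subsets, and the use of "greater than or equal to" when applying th are all assumptions introduced here, and the function name `useful_region_mask` is hypothetical.

```python
import numpy as np

def useful_region_mask(n_regions, score_subset, p=0.4, th=0.6,
                       n_subsets=1000, top_fraction=0.10, seed=0):
    """Return (mask, occurrence_probability) over regions.

    score_subset: hypothetical callable mapping a boolean region subset
                  to a score (higher is better).
    p:            probability of including each region in a random subset.
    th:           threshold applied to the occurrence probability map.
    """
    rng = np.random.default_rng(seed)
    # Each row is one random subset; a region is included with probability p.
    subsets = rng.random((n_subsets, n_regions)) < p
    scores = np.array([score_subset(s) for s in subsets])
    # Keep the best-scoring fraction of subsets as candidates.
    n_candidates = max(1, int(round(top_fraction * n_subsets)))
    candidates = subsets[np.argsort(scores)[::-1][:n_candidates]]
    # Occurrences of each region divided by the number of candidates.
    occurrence_probability = candidates.mean(axis=0)
    return occurrence_probability >= th, occurrence_probability

# Toy usage with a made-up score: regions 0 and 1 help, region 2 hurts.
def toy_score(subset):
    return int(subset[0]) + int(subset[1]) - int(subset[2])

toy_mask, toy_prob = useful_region_mask(6, toy_score)
```

Regions that appear consistently in the best-scoring subsets receive a high occurrence probability and survive the threshold, whereas regions that are included only by chance stay close to p.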

Useful Region Mask Using 3-Cycle Map Feature
We assume that the regions that are useful for identifying ADHD-conditioned brains should not vary depending on the feature used to generate the mask. To justify our assumption, we generate another useful region mask on the KKI data set using 3-cycle map features. The generated mask is used to verify the detection rates on the test data sets released for the ADHD-200 competition. The experimental results are reported in Table A1. The detection rates obtained using the mask generated with 3-cycle map features and using the mask generated with positive degree map features are almost the same. These matching results support our initial assumption. Figure A3 shows the mask plotted on different slices of a brain. The useful region mask, generated using 3-cycle map features, is used to identify the regions from which the features are extracted.

Principal Component and Data Variance Analysis
Figure A4 shows plots of the number of principal components vs. the percentage of the total data variance captured. For the KKI training data set, the first 40 principal components capture 99.8% of the total data variance, while the first 100 principal components of the full training data set capture 98% of the total data variance.
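A curve of this kind can be computed from the singular values of the centered feature matrix. The sketch below is a minimal illustration that assumes one feature vector per row; the function names `cumulative_variance_ratio` and `components_needed` are hypothetical, and the random data stand in for real feature vectors.

```python
import numpy as np

def cumulative_variance_ratio(features):
    """Fraction of total variance captured by the first k principal components.

    features: array of shape (n_subjects, n_features).
    Entry k-1 of the result is the fraction captured by the first k components.
    """
    centered = features - features.mean(axis=0)
    # Squared singular values of the centered data are proportional to the
    # variances along the principal components.
    singular_values = np.linalg.svd(centered, compute_uv=False)
    variances = singular_values ** 2
    return np.cumsum(variances) / variances.sum()

def components_needed(features, target=0.98):
    """Smallest number of components whose cumulative ratio reaches target."""
    ratio = cumulative_variance_ratio(features)
    k = int(np.searchsorted(ratio, target)) + 1
    return min(k, len(ratio))   # guard against round-off at target = 1.0

# Toy usage on random data (placeholder for real feature vectors).
rng = np.random.default_rng(0)
toy_features = rng.normal(size=(30, 200))
toy_curve = cumulative_variance_ratio(toy_features)
```

Because the number of subjects is far smaller than the number of features, at most (number of subjects - 1) components carry non-zero variance, which is why a few tens of components can capture nearly all of the variance.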

Bag of Words for ADHD Classification
The Bag of Words (BoW) model was first introduced in natural language processing. The main idea of BoW is that a document can be represented by the histogram of the counts of the different words that make up the document. The order of the words and the grammar of the language are immaterial. Different documents on the same topic should share a similar histogram pattern, while the histograms of documents on different topics should differ. A similar idea is used by Solmaz et al. (2012) for the classification of ADHD subjects. To construct a vocabulary for the resting state fMRI data, the authors cluster the time series of the voxels of all the subjects in the training set. This step places similar time series in the same group. The mean time series of the groups constitute the vocabulary of the fMRI data. Hence, the number of clusters formed is the number of different words that the resting state fMRI data can have. Given any time series of a voxel, it can be labeled with the number of the closest group. Hence, each subject can be represented as a histogram of word counts based on how many voxels of the subject belong to each group. The histograms of the subjects serve as their feature vectors. Given a training and a test data set, a classifier can be trained on the histograms of the training subjects and used to classify the test subjects.
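The following sketch illustrates the BoW representation described above. It is not the implementation of Solmaz et al. (2012): the clustering method (a few iterations of k-means), the Euclidean distance used to find the closest group, and all function names are assumptions made for illustration.

```python
import numpy as np

def nearest_word(series, vocabulary):
    """Index of the closest vocabulary entry for each time series.

    series: (n_voxels, n_timepoints); vocabulary: (n_words, n_timepoints).
    Closeness is measured by Euclidean distance (an assumption).
    """
    diffs = series[:, None, :] - vocabulary[None, :, :]
    return (diffs ** 2).sum(axis=2).argmin(axis=1)

def build_vocabulary(all_series, n_words, n_iter=20, seed=0):
    """Cluster training time series; the cluster means form the vocabulary."""
    rng = np.random.default_rng(seed)
    start = rng.choice(len(all_series), size=n_words, replace=False)
    centers = all_series[start].astype(float)
    for _ in range(n_iter):
        labels = nearest_word(all_series, centers)
        for k in range(n_words):
            members = all_series[labels == k]
            if len(members):                 # keep the old center if a cluster empties
                centers[k] = members.mean(axis=0)
    return centers

def subject_histogram(series, vocabulary):
    """Word-count histogram of one subject: voxels per vocabulary entry."""
    labels = nearest_word(series, vocabulary)
    return np.bincount(labels, minlength=len(vocabulary))

# Toy usage: a two-word vocabulary and one subject with four voxels.
toy_vocabulary = np.array([[0.0, 0.0, 0.0], [10.0, 10.0, 10.0]])
toy_subject = np.array([[0.0, 1.0, 0.0], [9.0, 10.0, 11.0],
                        [10.0, 10.0, 9.0], [1.0, 0.0, 0.0]])
toy_histogram = subject_histogram(toy_subject, toy_vocabulary)
```

The distance computation above materializes all voxel-word differences at once, so for whole-brain data it would have to be applied in batches of voxels.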

p-Values
Figure A5 shows the p-values corresponding to the detection results shown in the Figure A4 of the paper. The p-value of a classification can be interpreted as the probability that the observed classification accuracy, or a better one, would be achieved by a classifier that guesses at random. For example, if m subjects are correctly classified among n test subjects, then the p-value for the classification is the probability of m or more correct detections when the classifier decides by random chance. The lower the p-value, the lower the chance that the classification result is due to random guessing.
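The definition above corresponds to the upper tail of a binomial distribution. The sketch below is a minimal illustration that assumes the random classifier labels each subject correctly with a fixed probability `chance`, set to 0.5 by default; both the function name `classification_p_value` and the chance level are assumptions, since the chance level depends on how the random classifier is defined.

```python
from math import comb

def classification_p_value(m, n, chance=0.5):
    """Probability of m or more correct detections among n test subjects
    when each detection is correct with probability `chance` (assumed)."""
    return sum(
        comb(n, k) * chance ** k * (1 - chance) ** (n - k)
        for k in range(m, n + 1)
    )

# Usage: 119 correct out of 171 test subjects corresponds to the
# 69.59% detection rate reported for the 3-cycle map with masking.
p_best = classification_p_value(119, 171)
```

A classifier that is correct on barely more than half of the subjects yields a large p-value, whereas the same accuracy on a much larger test set yields a smaller one, which is why the p-value complements the raw detection rate.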