The Importance of Anti-correlations in Graph Theory Based Classification of Autism Spectrum Disorder

With the release of the multi-site Autism Brain Imaging Data Exchange, many researchers have applied machine learning methods to distinguish between healthy subjects and autistic individuals by using features extracted from resting state functional MRI data. An important part of applying machine learning to this problem is extracting these features. Specifically, whether to include negative correlations between brain region activities as relevant features and how best to define these features. For the second question, the graph theoretical properties of the brain network may provide a reasonable answer. In this study, we investigated the first issue by comparing three different approaches. These included using the positive correlation matrix (comprising only the positive values of the original correlation matrix), the absolute value of the correlation matrix, or the anticorrelation matrix (comprising only the negative correlation values) as the starting point for extracting relevant features using graph theory. We then trained a multi-layer perceptron in a leave-one-site out manner in which the data from a single site was left out as testing data and the model was trained on the data from the other sites. Our results show that on average, using graph features extracted from the anti-correlation matrix led to the highest accuracy and AUC scores. This suggests that anti-correlations should not simply be discarded as they may include useful information that would aid the classification task. We also show that adding the PCA transformation of the original correlation matrix to the feature space leads to an increase in accuracy.


INTRODUCTION
Autism Spectrum Disorder (ASD) is a neurodevelopmental condition that is growing in prevalence in recent years (Zablotsky et al., 2015). While it is usually diagnosed by carefully monitoring a child's behavioral development (Barbaresi et al., 2006), recent studies have shown that brain imaging can also be used to aid in that diagnosis by identifying underlying differences between the ASD and Healthy Control (HC) Brain (Dichter, 2012;Bos et al., 2014).
Functional Magnetic Resonance Imaging (fMRI) is a widely used tool for such studies due to its high spatial resolution. It monitors the changes in the Blood Oxygen Level Dependent (BOLD) signal which indirectly measures the neuronal activity (Sotero and Trujillo-Barreto, 2007). By computing functional connectivity measures between BOLD signals from different brain areas, researchers can examine the human brain at a network level. A variant of the technique called Resting State fMRI (rs-fMRI) has been widely used to examine brain networks while subjects are at rest (Fox and Greicius, 2010;van den Heuvel and Hulshoff Pol, 2010). Graph theory is one of the more novel methods being used for the network-level analysis of the brain. It provides a mathematical framework for quantifying network characteristics and quantitatively analyzing the differences between different brain networks (Bullmore and Sporns, 2009;Rubinov and Sporns, 2010). Machine learning is another relatively new technique that is being applied to rs-fMRI data to extract insights such as important biomarkers as well as to develop novel algorithms with the hope of automatically diagnosing brain disorders from medical imaging data (Pereira et al., 2009;Lee et al., 2013). The network characteristics that is provided by graph theory has been used as the input to machine learning algorithms to identify diseases such as Alzheimer's (Khazaee et al., 2016), Parkinson's (Kazeminejad et al., 2017), and Autism (Kazeminejad and Sotero, 2019b).
One of the variables that can affect the results of the mentioned analysis is how the graph construction step was performed. After the initial preprocessing of the rs-fMRI data, a correlation matrix is created by calculating the correlation between the activation time-series of different brain regions. The correlation matrix is then transformed into a sparse binary matrix representing the existence connections between different regions. This transformation is usually performed by a thresholding step in which only a percentage of the strongest correlations are kept (Bullmore and Sporns, 2009;Rubinov and Sporns, 2010). However, this step ignores the anti-correlations which may be biologically relevant (Fox et al., 2005). This problem will be even more severe if the preprocessing pipeline includes the regression of the global mean signal (GSR) from the time-series, known to result in the removal of motion, cardiac and respiratory signals, which may result in the presence of more anticorrelations in the correlation matrix (Murphy and Fox, 2017). The use of GSR is a controversial matter in the field. Although there is evidence suggesting that the anticorrelations introduced through GSR have no biological basis (Murphy et al., 2009), a recent study has shown using GSR in the preprocessing pipeline for rs-fMRI leads to better prediction accuracies of behavioral measures (Li et al., 2019).
There has been evidence of network-level changes in the ASD brain compared to a HC brain (Redcay et al., 2013;Rudie et al., 2013;Bos et al., 2014). Therefore, using graph theory to extract features for classification of ASD is likely to provide good results. This claim was previously examined by training a SVM classifier on graph theoretical features to classify between ASD and HC (Kazeminejad and Sotero, 2019b). However, the methodology of that paper was limited to using only the positive correlations in order to construct the brain graph, potentially ignoring some informative connections.
The release of the Autism Brain Imaging Data Exchange I (ABIDE I) dataset (Di Martino et al., 2014) allowed researchers to examine ASD in large sample sizes. Nielsen et al. (2013) conducted one of the earliest classification studies on ABIDE and were able to achieve an accuracy of 60% over all samples.
More recently, Heinsfeld et al. (2018) were able to achieve an accuracy of 70%, evaluated by 10 fold cross-validation, on the entire dataset by utilizing neural networks and transfer learning. They also reported the leave-one-site out performance of their model averaging at 65% accuracy. Plitt et al. (2015) achieved a higher accuracy, 69.71% when using only 178 subjects from the ABIDE I dataset. Another study by Khosla et al. (2018) used 3D convolutional neural networks to achieve cross validated accuracies as high as 71.7% on a subset of the ABIDE I dataset.
A recent study by Hallquist and Hillary (2019), examining 106 papers (with only 2 papers focusing on ASD) using graph theoretical measures, shows that 79.2% of graph theoretical studies either did not specify how they handled negative correlations or discarded them. Another 9.4% used absolute values of the correlation matrix. However, this is mostly an arbitrary choice.
In this paper, we study the effect of different approaches to handling anti-correlations for the classification accuracy of a machine learning model on the ABIDE dataset. We trained a regularized deep learning neural network using features extracted from transforming the correlation matrix using three pipelines: The positive correlation pipeline which does not change the matrix, the anti-correlation pipeline which prioritizes negative correlations, and the absolute value pipeline which disregards the sign of the correlations. These pipelines are explained further in Figure 1. Our model was evaluated using a leave-one-site out approach. Our results suggest that on average, the anticorrelation pipeline results in better accuracy and area under curve (AUC) score with a small loss in sensitivity. Interestingly, adding the PCA transformation of the original correlation matrix increased the accuracy, sensitivity, specificity and AUC score for all pipelines.

Data and Preprocessing
This study used a publicly available dataset from the ABIDE Preprocessed Initiative (Cameron et al., 2013a). To ensure that our results were not affected by any custom preprocessing pipeline, we used the preprocessed data provided by ABIDE in the C-PAC (Cameron et al., 2013b) pipeline. The preprocessing included the following steps. The Analysis of Functional Neuro Images (AFNI) (Cox, 1996) software was used for removing the skull from the images. The brain was segmented into three tissues using FMRIB Software Library (FSL) (Smith et al., 2004). The images were then normalized to the MNI 152 stereotactic space (Mazziotta et al., 2001;Grabner et al., 2006) using Advanced Normalization Tools (ANTs) (Avants et al., 2011). Functional preprocessing included motion and slice-timing correction as well as the normalization of voxel intensity. Nuisance signal regression included 24 parameters for head motion, CompCor (Behzadi et al., 2007) with 5 principal components for tissue signal in Cerebrospinal fluid and white matter, linear and quadratic trends for Low-frequency drifts and a global bandpass filter (0.01-0.1 Hz). GSR was also applied to remove the global mean from the signals. These images where then co-registered to FIGURE 1 | Graphical framework of the experiment. (A) By averaging the BOLD activity in each ROI in the parcellation atlas, a time series is extracted representing brain activity in that region; (B) Using Pearson's correlation, a connectivity matrix is generated from the ROI time series quantifying the connectivity level between individual ROIs; The connectivity matrix is transformed using three pipelines: Pos. No change to the connectivity matrix, Neg: Multiplying the connectivity matrix by −1, Abs: calculating the absolute value of the matrix. Then, by treating the ROIs as graph nodes and the connectivity matrix as graph weights the brain network is expressed in graph form; (C) A threshold is applied to keep only the 20-50% strongest connections in 2% increments; (D) Graph theoretical analysis is applied to the resulting graph from to obtain a feature vector for each subject; (E) Feature matrix for all subjects in the training fold; (F) MLP architecture. First ReLU layer is l1-regularized and second ReLU layer is l2-regularized. A dropout of 0.7 was applied between the first and second ReLU as well as the second ReLU and output; (G) The model is tested on a previously unseen test set from a different site in the ABIDE dataset. The evaluation metrics are: Accuracy, Sensitivity, Specificity, AUC score. their anatomical counterpart by FSL. They were then normalized to the MNI 152 space using ANTs. The average voxel activity in each Region of Interest (ROI) of the Craddock 200 atlas (Craddock et al., 2012) was then extracted as the time-series for that region. Furthermore, in order to replicate our results on the Craddock 200 atlas, we reran the pipelines on the anatomical automatic labeling (AAL) (Tzourio-Mazoyer et al., 2002) atlas. Overall, this study had 1,035 subjects, the demographics of whom are outlined in Table 1.

Network Construction, Graph Extraction and Feature Extraction
To construct the brain network, the timeseries for each atlas ROI where correlated, using Pearson correlation, with the other regions. The strengths of these correlations were used as the strengths of the connection between different ROIs. In graph terms, this represents a fully connected graph with each of the nodes being in the center of the corresponding ROIs and each edge weight is the correlation between the two nodes on the opposite ends of the edge. Three different pipelines were obtained based on selecting only the positive values, only the negative values, or taking the absolute value of the correlation matrix. These graphs were then subjected to a thresholding step in which only the 20-50% strongest connections were set to one and the other edges were discarded. This threshold was incremented in 2% steps. This resulted in sparse binary graphs. This step was done because binary graphs have been shown to have more easily defined null models and are more easily characterizable (Rubinov and Sporns, 2010). Several measures of integration (characteristic path length and efficiency), segregation (clustering coefficient and transitivity), centrality (betweenness centrality, eigenvector centrality, participation coefficient and within module z-score) and resilience (assortativity) (Bullmore and Sporns, 2009) were M: Male, F: Female, ASD: Autism spectrum disorder, HC: Healthy control, SD: Standard deviation. Some sites did not have any female individuals, these have been identified with a "-." The term "X" for SD means that there were not enough samples to calculate SD.
then extracted from the binary brain network graph. These steps were done using the Python libraries brainconn ( 1 FIU-Neuro/brainconn) and network 2 . This resulted in 3 feature sets of 1,404 features for each group. A breakdown of these features can be found in section 6 of the Supplementary Material. The main body of this paper focuses mainly on only using the graph features in the analysis. In the Supplementary Material, we report the results of adding the PCA transformation (Bartholomew, 2010) of the unaltered correlation matrix to the graph features. These principle components were calculated for each individual and we have 600 components per individual.

Leave-One-Site-Out Cross-Validation
The structure of the ABIDE I dataset allows for an interesting cross-validation approach that captures the multi-site nature of it. Our data was divided into 17 cross-validation sets. In each one, one site was used as the test set and the other 16 sites were used as the training set.

Model Training and Evaluation
In this study, we used a multilayer perceptron with two hidden layers. The model was implemented with Keras and using the Tensorflow 1.13 backend (Khosla et al., 2018). This model was chosen due to its ability to construct data driven features before using them for the final classification task. Both layers consisted of 512 rectified linear neurons. Due to the limited number of samples, the model was heavily regularized. The first hidden layer is subjected to L1 regularization in order to force the network to have some feature selection capabilities. A dropout layer is then applied before the second hidden layer. L2 regularization was used in the second hidden layer. Finally, a dropout layer was included before the output layer. The output layer is a single neuron which is activated by a sigmoid function.
The model training process is as follows. First, each feature was standardized by removing the mean and scaling to unit variance. Then the training data was ran through the neural network. The model used binary cross-entropy as its loss function and an Adam optimizer with β 1 = 0.99, β 2 = 0.01, and an initial learning rate of l = 0.0002. The learning rate was decayed by a factor of 0.5 base on the validation loss calculated on a random 10% of the training data withheld during the process. The model parameters were tuned on only one of the left out sites (PITT) to ensure low information leak. The trained model was then evaluated on the test set. We repeated the above steps 5 times for each Leave-One-Site-Out (LOSO) fold and report the average accuracy, sensitivity, specificity and AUC score over the 5 repetitions. It is worth noting that the healthy subjects were given a label of 1 and the ASD subjects were given label 0. Thus, sensitivity should be interpreted as the percentage of HCs correctly identified. Likewise, specificity is the percentage of ASDs correctly identified.
We used two approaches to aggregate the results of each left out site. First, we report the overall average metric for each of these sites as "average." We also report a weighted average by using the sample size of each left out site as the weight of that site.

Statistical Evaluation
We performed two statistical analyses to compare model results for each left out site between different pipelines. First, we compared the average of the 5 trials performance of the different models for each pipeline for each site against each other using a Welch's t-test (Welch, 1947). The average was calculated by computing the mean of each metric for each trial over the different thresholds. This was corrected for multiple comparisons using the Holms-Sidak method (Cardillo, 2006). We also compared model performance in a threshold-wise manner, comparing each threshold results for each pipeline against the same threshold results in another pipeline using the same test. Furthermore, we also report the p-value of the mean of each performance metrics between the three pipelines by conducting a paired t-test based on the results of individual sites. The significance level is set to p < 0.05 and all tests are two-tailed.

RESULTS
In order to manage the different thresholds, we report the average performance measures and their standard deviation across the different thresholds. This applies to all following paragraphs unless explicitly specified. The effects of this choice are discussed further in section "Discussion." Our results in Table 2 show that, on average, the negative correlation pipeline can achieve higher accuracies and AUC score than the other two pipelines. The average sensitivity of the model remains comparable over the pipelines while being slightly lower for the absolute value pipeline. The model was able to achieve a higher specificity, increasing its ability to correctly identify ASDs. This is interesting because it suggests different feature extraction pipelines will allow some flexibility between interchanging sensitivity and specificity.
The same results held when adding the PCA transformation of the original correlation matrix as input features. However, the overall performance of the model was improved. Suggesting that information from the original correlation matrix supplements the information available in the graph features (Supplementary Table 1). To test whether these observations were dependent on the choice of the atlas, we applied the same methodology on a different atlas When using the AAL parcellation, the negative pipeline performed the best as well. Thus, strengthening the hypothesis that this result holds across different parcellation atlases (Supplementary Table 2). Another variable in the dataset was gender. The negative correlation pipeline was still the best performing pipeline on average when only modeling the male subjects. However, the difference between the absolute correlation and negative correlation pipelines where negligible (AUC of 0.59 vs 0.592). Interestingly, overall model performance was lowered when the training and test data only included from male subjects (Supplementary Table 3).
As evident in the performance tables, there is a high variability in model performance between sites. One possible explanation for this may be related to the age of the subjects in that site. To investigate this, we plotted the average AUC scores of the 5 trials for each site against their age means. A quadratic regression was applied on this data and is shown in Figure 2. In the case of added PCA features, the sites at opposing ends of the age mean scale performed worse than the sites closer to the mean. This phenomenon was less pronounced in the case of using only graph features with it being almost non-existent in the case of the positive correlation pipeline. Furthermore, we report the R-squared values and F-test p-values of these plots. While most regression curves show statistical significance, the quadratic regression is shown to be a poor fit for the results from the positive pipeline when using only graph features.
Our results show that using different pipelines may play a significant role in the classification outcome for certain sites. Figure 3 shows how these pipelines compare across different site when the same threshold is applied. We report statistically significant differences between the specificity of the Yale site model between the absolute value and positive pipelines as well as between the anticorrelation and absolute value pipeline. Such significant differences can be spotted elsewhere for the PITT site in the case of absolute value vs anticorrelation pipeline as well as to a more limited scale across different sites and pipelines. This suggests that between site differences such as subject demographics and equipment differences could be used to inform the choice of which pipeline to use.
Supplementary Tables 4.1 and 4.2 show the p-values of between pipeline differences compared to the anti-correlation pipeline for the graph-features-only method. When comparing the best models, based on what threshold led to the highest AUC score, trained with only graph features for each site, the negative pipeline achieved a higher AUC than the other pipelines except in the following sites. OLIN, OHSU, NYU and SBL for the unchanged pipeline and OHSU and NYU for the absolute value pipeline. Of these sites, none were significantly different (p < 0.05) between the unchanged and anticorrelation pipelines. For accuracy, the same comparison holds between the anticorrelation and unchanged pipelines. OSHU (p = 0.042) and NYU (p = 0.043) results were the only significantly different sites. The absolute value pipeline was able to achieve better accuracies than the negative pipeline for the following site: OLIN, OHSU, NYU. Likewise, the AAL atlas results did not show many statistical significances in the site accuracies between different pipelines.
As it may not be fair to compare different thresholds for each site, we also performed threshold matched comparisons for the graph-only craddock-200 atlas results. A paired t-test was performed on the performance metrics of each pipeline pair for each site. Figure 3 shows these comparisons in a heatmap style plot. Interestingly, no site showed consistent difference in performance (AUC) over the tested threshold range, suggesting that this value should be treated as a parameter than can affect the accuracy of machine learning predictions using graph theory and tuned for the task at hand.
Another paired t-test was performed comparing the out of site fold performance of the three pipelines with just graph features ( Table 3). The negative correlation pipeline had statistically different accuracy and AUC metrics when compared to the absolute value pipeline. No other significant differences were found between the pipelines.  Top row plots the results from using both PCA and graph features while bottom row shows result from using only graph features. The sites that have extreme age ranges (either lowest or highest) perform worse than those closer to the population median. An inverted U-shape is observed in all pipelines other than the Pos pipeline in the case of using only graph features. Abbreviations: Pos, Positive pipeline; Neg, negative pipeline; Abs, Absolute value pipeline.
Although not the main goal of this study, our model was able to, based on the weighted accuracy, perform on-par with the deep learning model described in Heinsfeld et al. (2018) when using the graph and PCA features. The sites CALTECH, KKI, MAX_MUN, OHSU, SBL, STANFORD and TRINITY showcased lower accuracy in our model. Section 4 of the Supplementary Material reports the LOSO results of 3 other models: Logistic regression, random forest classification and gaussian support vector classifiers. These tables show that our neural network model was able to outperform these 3 classical machine learning algorithms in most sites.

DISCUSSION
Our results consistently showed that, on average, using graph theoretical features from the negative pipeline increases the MLP model accuracy in distinguishing between HC and ASD. This suggests that anti-correlations in the brain network contain important information that helps our model in distinguishing between ASD and HC. One possible reason for the higher performance of the anti-correlation method could be that contrary to some previous studies (Murphy et al., 2009;Murphy and Fox, 2017), GSR introduced anti-correlations may in fact stem from real neurological basis and are not spurious (Khosla et al., 2018). Unfortunately, as we do not know of any gold standards to test this hypothesis in real data, it should only be interpreted as a speculation made based on the results of this study.
Previous studies have concluded that the use of absolute value graph metrics in the presence of GSR may be compromised (Bartholomew, 2010). This was attributed to the fact that in the presence of GSR, the topologies of the anti-correlation and correlation matrix will be mixed when using an absolute value pipeline as the anti-correlations have comparable magnitude to the correlations. However, our model trained with the absolute value graph features was able to perform on-par with the positive correlation pipeline.
The low number of significantly different within-site metrics between the three pipelines may be attributed to the way they were calculated. Each LOSO fold was run 5 times in order to offset any effects of random weight initialization and random training-validation set splits. Increasing the number of trials should add to the strength of this test by capturing the metric distributions more accurately. Another possible explanation is that there is enough information available in all pipelines for most of these sites.
In the case of added PCA, plotting the accuracies of the negative pipeline against number of participants did not provide further insight into why some sites performed worst. Plotting the same against the age of the participants revealed that there may be a link between age and the ability of our model to make accurate predictions. Age has been previously shown to affect network properties of subject within the ABIDE dataset (Henry et al., 2018). KKI and STANFORD are the two sites with the lowest average ASD and HC age. Of the 4 sites with the highest average ASD age, 3 of them, CALTECH, MAX_MUN and SBL showed lower accuracies. Interestingly, CMU, did not follow this behavior. However, the standard deviation of the age of CMU's participants was lower than those of the other 3. The larger age SD may have hindered the algorithm's ability to perform well for the sites with high average age. Interestingly, the model trained only on the graph features was able to perform relatively well on SBL despite that site having the highest age mean and performed worst on OHSU, the site with the least number of subjects. While this could have happened purely by chance, it is an interesting avenue for further insights about the nature of our model. Furthermore, these models were more robust to the effects of age as shown in Figure 2 suggesting age-related information is better captured in the graph theoretical features than the PCA transformation of the correlation matrix. However, this conclusion assumes that a quadratic regression fits the performance results of the graph only pipelines. The positive pipeline does not show a significant f-test result, thus perhaps a higher order regression would fit its Another intriguing observation about our results is that some thresholds and pipelines were better suited for classifying specific sites. This suggest that an ensemble model, such as training different models on each of these combinations and assigning the most selected label for each subject between these models, may achieve better performance on this dataset.
In this study we utilized a leave-one-site out cross validation approach in order to capture the multi-site nature of the ABIDE I dataset. We believe this approach leads to a better estimate of how this model will perform in the real world as it allows for better generalization across previously unseen imaging sites and protocols.
Using graph theoretical measures to analyze and classify neurological and neurodevelopmental diseases is not a novel idea. However, the practice of thresholding the connectivity matrix in order to produce a binary graph introduces an artificial bias toward some of the correlations. While this bias can be avoided by using the absolute value of the correlation matrix, this may disregard the important information that the positive correlation and anti-correlation networks hold. We have shown that, after running a GSR pipeline, focusing on the anti-correlation network may lead to better classification performance in the case of ASD vs HC. We also showed that by adding non-graph features to the anti-correlation graph features, classification metrics are improved suggesting that a different process may be needed to accurately capture the graph properties of both the positive and negative correlation networks.
One limitation in the present study may be the use of multiple threshold values in constructing the functional graph. Here we reported the average the results of all thresholds for each site. Another approach may be to report results from the threshold resulting in the highest AUC for each site. This would lead to a significant increase in model performance (achieving 70% weighted accuracy when using graph and PCA features). However, the negative correlation pipeline still outperformed the other pipeline.
Another limitation of this study was the inclusion of all age ranges in our dataset. As different sites have different demographics, this may affect the power of our model. Furthermore, identifying ASD is most important in earlier stages of life. The ABIDE I dataset included only a small number of children under the age of 10 (146 subjects) spread across multiple sites and none below the age of 5. Thus, the results presented in this study may not generalize to studies on younger children.
The ABIDE I data exhibits numerous inherent variability due to its multi-site nature. Here we presented a deep learning model that was able to navigate the intricacies of this data and generalize over multiple sites by using graph theoretical features. Our model was able to, on average, perform on-par with previously reported deep learning models using the same number of subjects.

DATA AVAILABILITY STATEMENT
The datasets generated for this study are available on request to the corresponding author.

ETHICS STATEMENT
We used the data collected as part of the ABIDE database and complied with everything that they have asked to be included in any manuscript using that data. The original ethics statement form the (Di Martino et al., 2014) manuscript is as follows: All contributions were based on studies approved by local IRBs, and data were fully anonymized (removing all 18 HIPAA protected health information identifiers, and face information from structural images). All data distributed were visually inspected prior to release.

AUTHOR CONTRIBUTIONS
AK designed the experiments including writing the experiment code as well as writing the manuscript. RS provided advice and guidance on the experimental design and contributed to many manuscript revisions and final proofreading. Both authors contributed to the article and approved the submitted version.

FUNDING
This work was partially funded by the Department of Biomedical Engineering Research and Academic Excellence awards at the University of Calgary.