A Hybrid Machine Learning Method for Fusing fMRI and Genetic Data: Combining both Improves Classification of Schizophrenia

We demonstrate a hybrid machine learning method to classify schizophrenia patients and healthy controls, using functional magnetic resonance imaging (fMRI) and single nucleotide polymorphism (SNP) data. The method consists of four stages: (1) SNPs with the most discriminating information between the healthy controls and schizophrenia patients are selected to construct a support vector machine ensemble (SNP-SVME). (2) Voxels in the fMRI map contributing to classification are selected to build another SVME (Voxel-SVME). (3) Components of fMRI activation obtained with independent component analysis (ICA) are used to construct a single SVM classifier (ICA-SVMC). (4) The above three models are combined into a single module using a majority voting approach to make a final decision (Combined SNP-fMRI). The method was evaluated by a fully validated leave-one-out method using 40 subjects (20 patients and 20 controls). The classification accuracy was 0.74 for SNP-SVME, 0.82 for Voxel-SVME, 0.83 for ICA-SVMC, and 0.87 for Combined SNP-fMRI. Experimental results show that better classification accuracy was achieved by combining genetic and fMRI data than by using either alone, indicating that genetics and brain function represent different, but partially complementary, aspects of schizophrenia etiopathology. This study suggests an effective way to reassess biological classification of individuals with schizophrenia, which is also potentially useful for identifying diagnostically important markers for the disorder.

In the last few years there has been a growing interest in the use of machine learning algorithms for analyzing fMRI data. Machine learning algorithms can be used to train classifiers to decode stimuli, behaviors, and other variables of interest from fMRI data (Haynes and Rees, 2006; O'Toole et al., 2007; Pereira et al., 2009). Demirci et al. (2008) applied a projection pursuit technique to components obtained via independent component analysis (ICA) of fMRI activation maps, to classify individuals as being either schizophrenia patients or healthy controls. Shinkareva et al. (2006) presented a unified feature selection and classification procedure to classify subjects into groups based on four-dimensional spatio-temporal data. Zhang et al. (2005) applied the adaptive boosting algorithm (AdaBoost) (Freund and Schapire, 1997) to classify subjects into groups (drug-addicted subjects and healthy non-drug-using controls) based on the observed 3D brain images. Ford et al. (2003) used Fisher linear discriminant analysis on fMRI brain activation maps to extract spatial characteristics and to classify healthy controls versus patients with schizophrenia, Alzheimer's disease, and mild traumatic brain injury.
To date, limited work has been done on the use of genotypic information to help distinguish patients from controls, although Struyf et al. (2008) demonstrated that SVMs can distinguish bipolar disorder and schizophrenia patients from normal controls with high accuracy by combining gene expression data with demographic and clinical data.

Introduction
Schizophrenia is a severe, chronic brain disease that disrupts normal thinking, speech, and behavior. Schizophrenia diagnosis currently relies on clinical examination and the illness course, with many subcategories reflecting different aspects of this complex and likely biologically heterogeneous mental disease. Despite the diagnostic reliability achieved by quantifiable examination of overt psychiatric symptoms, researchers have also used biological indices in attempts to classify schizophrenia patients (Murray et al., 1992; Malaspina et al., 1998; Sponheim et al., 2001, 2003). Recently, there have been increasing efforts to utilize brain functional magnetic resonance imaging (fMRI) and examine genetic variation to study potential schizophrenia biomarkers, in order to better understand the pathology of schizophrenia. While most such studies focus on identifying associations between genetics and brain function in schizophrenia, we look at this problem from a different perspective, using biological and genetic information to help classify the disorder. We attempt to improve classification accuracy and provide preliminary data suggesting that combining biological and genetic information best reflects the underlying pathophysiology, which ultimately may aid in the diagnosis of schizophrenia and its subcategories. We also predict that by achieving better classification, intrinsic connections between genetic variation and biological function can be identified.
Many researchers now agree that schizophrenia may develop as a result of interplay between genetic predisposition (for example, inheriting certain susceptibility genes) and environmental exposure. Genetic factors play an important role in schizophrenia: persons who have immediate relatives with a history of schizophrenia have a significantly increased risk of developing the disorder compared with the general population. However, even monozygotic twins have only about 42% concordance for the disease (Lee et al., 2005). Environmental factors may well lead to subtle brain alterations that increase the risk of schizophrenia. Thus combining fMRI data (which capture brain function, presumably reflecting both genetic and environmental influences) with genetic information is potentially a useful way to help classify schizophrenia (Hariri and Weinberger, 2003; Pearlson and Folley, 2008; Calhoun et al., 2009; Liu et al., 2009; Potkin et al., 2009).
In this paper we present a supervised machine learning method to classify schizophrenia and control individuals that incorporates fMRI and SNP data. The method to fuse information from both modalities comprises four stages. At the first stage, a support vector machine based classifier ensemble (SVME) is constructed by using signature SNPs selected from a large SNP pool (SNP-SVME). At the second stage, an SVME is trained with a subset of voxels (Voxel-SVME). At the third stage, fMRI activation components obtained with ICA are used to construct a single SVM classifier (ICA-SVMC). Finally, at the fourth stage, the results obtained from the above three stages are combined into a single module using majority voting (Combined SNP-fMRI). We first explain the data collection and preparation procedures and describe the proposed method in detail. Then we present the experimental results, followed by discussion and conclusion.

Data and experiments

Subjects
We investigated fMRI and SNP data from 40 subjects, 20 schizophrenia patients (age: 40.2 ± 9.8, three females) and 20 healthy controls (age: 42.5 ± 15.5, eight females). All participants provided written, informed, IRB-approved consent at Hartford Hospital. Patients met criteria for DSM-IV-TR schizophrenia based on the structured clinical interview for DSM-IV (SCID; First et al., 1995) and review of the case file by a clinician. Healthy subjects were screened to ensure they were free from DSM-IV Axis I or Axis II psychopathology assessed using the SCID (Spitzer et al., 1996) and also interviewed to determine that there was no history of psychosis in any first-degree relative. All selected subjects were Caucasian/non-Hispanic. Twenty chronic SZ patients were selected and 16 of them had available, contemporaneous positive and negative syndrome scale (PANSS) scores (Kay et al., 1987). For those 16 SZ patients, PANSS total score was 67.6 ± 30.0 (mean ± SD), positive symptom score 15.4 ± 4.1, and negative symptom score 14.5 ± 6.7. Seventeen SZ patients had available medication information. These were taking 26 types of first- and second-generation antipsychotics in variable doses, with most patients taking more than one such drug. The most commonly prescribed medicines included olanzapine, risperidone, quetiapine, haloperidol, divalproex, escitalopram, and aripiprazole.

SNP data collection and preprocessing
A saliva sample was obtained for each subject and DNA extracted. Genotyping was performed using the Illumina BeadArray™ platform and the GoldenGate™ assay (Oliphant et al., 2002; Fan et al., 2003). The PG Array of Genomas Inc. was used (the detailed composition has been published as a patent application, Ruano, 2006). The SNP array consists of 384 SNPs from 222 genes derived from six physiological systems: neurobiology, metabolism, cell proliferation, cardiovascular, inflammation, and cholesterol biochemistry.
Over all systems, the following pathways were represented: insulin resistance, glucose metabolism, energy homeostasis, adiposity, apolipoproteins and receptors, fatty acid and cholesterol metabolism, lipases, receptors, cell signaling and transcriptional regulation, growth factors, drug metabolism, blood pressure, vascular signaling, endothelial dysfunction, coagulation and fibrinolysis, vascular inflammation, cytokines, and behavior (satiety).
Genotyping analysis software, GenCall, was used to cluster the intensities from the genotyping microarray into three clusters: AA, AB, and BB, without assuming dominant or recessive inheritance. On the basis of the GenCall score, a number between 0 and 1 indicating how close to the center of the cluster a sample lies, we chose a threshold to select only reliable genotype results. SNPs with a GenCall score of 0.25 or higher were selected, resulting in 367 SNPs. Genotypes are inherently categorical and can be represented as discrete numbers, e.g., 1 for one homozygous genotype, 0 for the heterozygous genotype, and −1 for the other homozygous genotype. In our study, each subject has a feature vector with 367 discrete numbers.

fMRI data collection and preprocessing
fMRI data were collected during performance of an auditory oddball task (Kiehl and Liddle, 2003), which consists of detecting an infrequent sound within a series of frequent sounds. The same auditory stimuli were used and found to be effective in eliciting fMRI BOLD patterns differentiating healthy controls from schizophrenia subjects (Kiehl et al., 2005). Auditory stimuli were presented to each participant by a computer stimulus presentation system via earphones. Subjects were presented with three types of sounds: target (1000 Hz with probability p = 0.1), novel (non-repeating random digital noises, p = 0.1), and standard (500 Hz, p = 0.8). Subjects were instructed to press a button with their right index finger every time they heard a target stimulus and not to respond to standard or novel sounds.
Scans were acquired at the Olin Neuropsychiatry Research Center at the Institute of Living on a Siemens Allegra 3 T dedicated head MRI scanner equipped with 40 mT/m gradients and a standard quadrature head coil. The functional scans were acquired using gradient-echo echo-planar-imaging with the following parameters: repeat time = 1.50 s, echo time = 27 ms, field of view = 24 cm, acquisition matrix = 64 × 64, flip angle = 70°, voxel size = 3.75 × 3.75 × 4 mm³, slice thickness = 4 mm, gap = 1 mm, 29 slices, ascending acquisition.
Six "dummy" scans were performed at the beginning to allow for longitudinal equilibrium, after which the paradigm was automatically triggered to start by the scanner. Data were preprocessed using the software package SPM2 (http://www.fil.ion.ucl.ac.uk/spm/). Images were realigned using INRIalign, a motion correction algorithm unbiased by local signal changes (Freire and Mangin, 2001). Data were spatially normalized into the standard Montreal Neurological Institute space (Friston et al., 1995), resliced to 3 × 3 × 3 mm³, and spatially smoothed with a 10 × 10 × 10 mm³ Gaussian kernel. Data for each participant were analyzed by multiple regression incorporating regressors for the novel, target, and standard stimuli and their temporal derivatives plus an intercept term. The target-related contrast images were used in this study. Finally, we used a mask based upon a one-sample t-test against zero activation to select meaningful voxels. This results in 7,060 voxels in each fMRI image.

Methods

The hybrid machine learning method
A two-class supervised learning problem can be written as a training set $\{(x_1, y_1), (x_2, y_2), \ldots, (x_m, y_m)\}$ with m samples (subjects in this study). Each sample $x_i$ has d features and a class label $y_i$. From a set of training samples, the machine learning algorithm establishes a classifier, which represents a hypothesis, h. Given unseen samples, the classifier predicts the corresponding y value. An ensemble method constructs a set of classifiers $\{h_1, \ldots, h_T\}$, chooses a set of weights $\{\alpha_1, \ldots, \alpha_T\}$, and builds a weighted average classifier $H(x) = \mathrm{sign}\left(\sum_{t=1}^{T} \alpha_t h_t(x)\right)$.
The flowchart of the proposed supervised machine learning method can be seen in Figure 1. There are four stages to fuse fMRI and genetic data and to classify schizophrenia. The first stage is to select signature SNP loci and construct an SVME for SNPs, termed SNP-SVME. Two steps are involved: (1) to select a subset of candidate SNPs from the whole SNP pool by the forward sequential feature selection method (FSFS) (Liu, 2005); (2) to construct an SVME by a feature selective AdaBoost method (FSA) (Howe, 2003). The second stage is to construct an SVME for fMRI images with the optimal subset of voxels to reach the best classification performance. We first average neighboring voxels to reduce computational complexity, and then construct an SVME using the FSA on the averaged voxels. The third stage is to obtain an SVM classifier using independent components extracted from fMRI activation maps by ICA. The fourth and final stage is to combine the three classification models obtained from the above stages into one model using majority voting.

SNPs subset selection and SVME
Classifying schizophrenia based on genetic data is complicated by small-sample-size classification problems (Fukunaga, 1990). Genetic data have high dimensionality compared to the generally small number of available subject samples. The dimensionality N is often considered large if it is in the range of hundreds. Genetic data, however, can have hundreds of thousands of dimensions (genes or loci). Some genes are related to the schizophrenia classification task, but many are presumably irrelevant. Learning algorithms can be confused by irrelevant or redundant features and construct poor classifiers (Jain and Chandrasekaran, 1982). To address the small-sample-size classification problem, we propose a two-step algorithm to select informative genes from a high-dimensional space and generate a classifier ensemble through SVM. The first step is a filter that removes most irrelevant features and selects a candidate SNP subset from the whole SNP pool using FSFS. The second step incorporates SNP selection into the AdaBoost SVM ensemble algorithm to construct an SVME with a signature SNP subset.

As shown above, the FSA algorithm runs for T iterations, and the final classification output of H is a weighted combination of T individual classifiers. Initially, all weights of training samples are set equally. On each round the weights of misclassified samples are increased so that the algorithm forces classifiers to focus on those samples in the training set.
Furthermore, within each iteration cycle, the FSA algorithm ranks all features by training error rate and selects the l features with the lowest training error rate. The number l is decided based on the leave-one-out (LOO) SVM performance with the weighted training samples used in this iteration. Thus the FSA algorithm selects the feature subset that contains the most discriminating information on each round and trains a classifier based on weighted training samples with the selected features. Accuracy and diversity of individual classifiers critically influence the classification performance of ensemble methods. The FSA increases the diversity among the classifiers by allowing a flexible feature space, which in turn enhances the overall performance of the SVME. Valentini and Dietterich (2002) analyzed the bias-variance decomposition of the error in SVM and showed that it offers a rationale to develop ensemble methods using SVMs as base learners. In this paper, the kernel function of the SVM is the radial basis function (RBF) kernel. SVM is a statistical learning method based on the structural risk minimization principle that has been shown to be very efficient in pattern recognition applications (Vapnik, 2000). However, the classification performance of SVM depends heavily on a proper setting of parameters. The RBF-SVM has two parameters: one is the RBF kernel parameter σ, and the other is C, which controls the trade-off between training error and the margin. On each round of the FSA algorithm, we compute the optimal parameters of the RBF-SVM by evaluating its accuracy and diversity with the weighted training dataset through the bias-variance decomposition of the error in SVM (Valentini and Dietterich, 2002).
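The reweighting and per-round feature selection described above can be sketched as follows. This is only an illustration under stated assumptions, not the authors' implementation: single-feature threshold stumps stand in for the RBF-SVM base learners, the number of selected features `l` is passed as a fixed parameter rather than chosen by LOO, and the names `fit_stump`, `fsa_fit`, and `fsa_predict` are hypothetical.

```python
import math

def fit_stump(X, y, w, j):
    """Best (weighted error, threshold, polarity) stump on feature j."""
    best = (float("inf"), 0.0, 1)
    for thr in sorted({row[j] for row in X}):
        for pol in (1, -1):
            err = sum(wi for row, yi, wi in zip(X, y, w)
                      if (pol if row[j] >= thr else -pol) != yi)
            if err < best[0]:
                best = (err, thr, pol)
    return best

def fsa_fit(X, y, T=5, l=2):
    """Boosting with feature selection in every round (assumed simplification)."""
    m, d = len(X), len(X[0])
    w = [1.0 / m] * m                      # all sample weights start equal
    ensemble = []
    for _ in range(T):
        # Rank features by weighted training error and keep the l best.
        ranked = sorted(fit_stump(X, y, w, j) + (j,) for j in range(d))
        chosen = ranked[:l]

        def h(row, chosen=chosen):
            # Stand-in base classifier: stumps weighted by (0.5 - error).
            score = sum((0.5 - err) * (pol if row[j] >= thr else -pol)
                        for err, thr, pol, j in chosen)
            return 1 if score >= 0 else -1

        eps = sum(wi for row, yi, wi in zip(X, y, w) if h(row) != yi)
        eps = min(max(eps, 1e-10), 1 - 1e-10)   # avoid log(0)
        alpha = 0.5 * math.log((1 - eps) / eps)
        # Misclassified samples gain weight, correct ones lose weight.
        w = [wi * math.exp(-alpha * yi * h(row)) for row, yi, wi in zip(X, y, w)]
        total = sum(w)
        w = [wi / total for wi in w]
        ensemble.append((alpha, h, [j for _, _, _, j in chosen]))
    return ensemble

def fsa_predict(ensemble, row):
    """H(x) = sign(sum_t alpha_t * h_t(x))."""
    return 1 if sum(alpha * h(row) for alpha, h, _ in ensemble) >= 0 else -1

# Toy data: feature 0 separates the classes, features 1 and 2 are noise.
X = [[1, 0, 1], [1, 1, 0], [1, 0, 0], [-1, 1, 1], [-1, 0, 1], [-1, 1, 0]]
y = [1, 1, 1, -1, -1, -1]
ensemble = fsa_fit(X, y)
```

Counting how often each feature index appears in the third element of the ensemble tuples gives the kind of selection frequency that the paper uses as a feature importance measure.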

Voxels selection and SVME
The goal of this stage is to select informative voxels to aid in diagnostic classification. As mentioned above, each fMRI image has 7,060 non-zero meaningful voxels. The number of non-zero voxels is very large compared to the number of samples, so it is necessary to decrease the dimensionality while retaining the group discrimination information. First, we merge each 3 × 3 × 3 block of non-zero neighboring voxels by averaging. The resultant images have 261 large voxels. In the second step, we apply the FSA algorithm described in section "SNPs subset selection and SVME" to further select informative voxels and construct an SVME. At each FSA iteration, voxels ranked with high discriminative values are used for training an SVM classifier. The final decision is a weighted ensemble of individual classifiers.
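The voxel-merging step can be sketched as below. The paper does not state how the 3 × 3 × 3 blocks are aligned to the image grid or how partially filled blocks are handled, so this sketch assumes blocks start at index 0 and that only non-zero voxels enter each average. The function name `merge_blocks` is hypothetical.

```python
def merge_blocks(volume, block=3):
    """Average the non-zero voxels inside each block x block x block cube.

    `volume` is a nested list indexed as volume[x][y][z]. Cubes that contain
    no non-zero voxel are dropped. Returns {block index: mean value}.
    """
    nx, ny, nz = len(volume), len(volume[0]), len(volume[0][0])
    merged = {}
    for bx in range(0, nx, block):
        for by in range(0, ny, block):
            for bz in range(0, nz, block):
                vals = [volume[x][y][z]
                        for x in range(bx, min(bx + block, nx))
                        for y in range(by, min(by + block, ny))
                        for z in range(bz, min(bz + block, nz))
                        if volume[x][y][z] != 0]
                if vals:
                    key = (bx // block, by // block, bz // block)
                    merged[key] = sum(vals) / len(vals)
    return merged

# Toy 3 x 3 x 6 volume: the first cube holds activation, the second is empty.
vol = [[[2.0] * 3 + [0.0] * 3 for _ in range(3)] for _ in range(3)]
vol[0][0][0] = 0.0            # one masked-out voxel inside the first cube
merged = merge_blocks(vol)
```

Here `merged` contains a single entry for block `(0, 0, 0)`, the mean of the 26 remaining non-zero voxels.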

ICA component extraction
In prior research, ICA has been applied to the analysis of fMRI data to discover hidden components presenting brain activation and characterize their spatial locations in healthy control subjects and patients with schizophrenia (Calhoun et al., 2004;Sui et al., 2009).

Forward sequential feature selection (FSFS) method
The FSFS algorithm is a good choice for irrelevancy removal. It applies independent evaluation criteria without involving any learning algorithm, so it does not inherit the bias of a learning algorithm, and it is also computationally efficient (Liu, 2005). The FSFS algorithm starts the search from an empty SNP set. As the search proceeds, SNPs are added into the SNP subset one at a time. On each round, the best SNP for classification among the unselected ones is chosen based on a distance measure. Distance measures are also known as separability, divergence, or discrimination measures. We try to find the SNP that separates the patients and healthy controls as far as possible. The distance measure used in this paper is the Mahalanobis distance. The SNP subset grows until it reaches the full set of original SNPs. A rank list is computed according to how early an SNP is added into the list. Then a certain number of SNPs are selected to construct a candidate SNP subset for the second step. Both prior knowledge of the SNP dataset and experience are used to decide how many SNPs are selected. In order to keep more informative SNPs, we select approximately the top 40% of SNPs in the rank list to construct a candidate SNP subset. The candidate SNP subset is much smaller than the original SNP set, but still contains unrelated SNPs that need to be removed.
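A minimal sketch of this forward search is given below. Several details are assumptions, since the paper does not specify them: the Mahalanobis distance is computed between the two class means using a pooled covariance, a small ridge term keeps the covariance invertible, and the helper names (`solve`, `mahalanobis_sq`, `fsfs_rank`) are hypothetical.

```python
def solve(A, b):
    """Solve A x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(n):
            if r != c:
                f = M[r][c] / M[c][c]
                M[r] = [a - f * v for a, v in zip(M[r], M[c])]
    return [M[i][n] / M[i][i] for i in range(n)]

def mahalanobis_sq(X, y, feats, ridge=1e-3):
    """Squared Mahalanobis distance between class means on features `feats`."""
    groups = {lab: [[row[j] for j in feats] for row, yi in zip(X, y) if yi == lab]
              for lab in (1, -1)}
    mu = {lab: [sum(col) / len(g) for col in zip(*g)] for lab, g in groups.items()}
    k = len(feats)
    cov = [[0.0] * k for _ in range(k)]
    for lab, g in groups.items():
        for row in g:
            dev = [v - m for v, m in zip(row, mu[lab])]
            for a in range(k):
                for b in range(k):
                    cov[a][b] += dev[a] * dev[b]
    dof = len(X) - 2                       # pooled within-class covariance
    cov = [[cov[a][b] / dof + (ridge if a == b else 0.0) for b in range(k)]
           for a in range(k)]
    diff = [p - q for p, q in zip(mu[1], mu[-1])]
    return sum(d * s for d, s in zip(diff, solve(cov, diff)))

def fsfs_rank(X, y):
    """Add one feature per round, keeping the one that maximizes the distance."""
    remaining, ranked = list(range(len(X[0]))), []
    while remaining:
        best = max(remaining, key=lambda j: mahalanobis_sq(X, y, ranked + [j]))
        ranked.append(best)
        remaining.remove(best)
    return ranked

# Toy genotype codes (-1, 0, 1): column 1 is informative, column 0 is not.
X = [[0, 1], [1, 1], [-1, 1], [0, 0], [0, -1], [1, -1], [-1, -1], [0, 0]]
y = [1, 1, 1, 1, -1, -1, -1, -1]
rank = fsfs_rank(X, y)   # -> [1, 0]
```

A candidate subset in the spirit of the text would then be the leading part of the rank list, for example `rank[:round(0.4 * len(rank))]`.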

Feature selective AdaBoost (FSA) method
The second step is constructing an SVME by the FSA method. AdaBoost, proposed by Freund and Schapire (1997), can be used in conjunction with any other iterative learning algorithm to improve its performance. Here, we use AdaBoost with SVM to build an SVM classifier ensemble. In addition, we modify AdaBoost to add a feature selection function, yielding a feature selective AdaBoost method. The FSA algorithm aims at training classifiers to get the best performance and selecting features with the best discriminating power simultaneously. The FSA algorithm is given below.

For comparison, we also trained the SVMC with all 367 SNPs and all 7,060 voxels. The LOO accuracies are 0.4 (367 SNPs) and 0.675 (7,060 voxels). The experimental results suggest that SNP and voxel selection is necessary. At the first stage, we examined the SNP database using the two-step method described in section "The hybrid machine learning method". After the most irrelevant SNPs were filtered out from the whole SNP dataset using FSFS, 150 SNPs were selected. These 150 SNPs were then used as input features of the FSA algorithm. The number of iterations for FSA was set to 20 empirically, since the performance saturated after 20 classifiers. At each iteration the algorithm selected a certain number of SNPs from the 150 SNPs and trained an SVM classifier. The number of SNPs selected in each iteration was estimated by the LOO algorithm on the weighted training dataset. SNPs carrying more discriminating information are expected to be selected more frequently. The importance of each SNP to the classification task can be expressed as the ratio of the number of times the SNP was selected to the number of FSA iterations. Figure 2 shows the importance of each individual SNP, and the 15 most important SNPs are listed in Table 2.
The basic ICA model defines a generative model for the observed data, with the goal of identifying hidden independent components from linearly mixed observations: $O = AS$, $Z = WO$. In the above equation, O is an observation matrix that can be composed of measurements from MRI images. S contains the independent components, which consist of unknown sources such as brain activation networks. A is a linear mixing matrix, relating the sources to the mixed measurements. W is an unmixing matrix. If W equals the inverse of A, then Z, the estimated component matrix, is equivalent to S, the source matrix. There are many ICA algorithms based on different independence criteria. The ICA algorithm we use here is the infomax algorithm, which attempts to find the W matrix by maximizing an entropy function (Bell and Sejnowski, 1995; Cardoso, 1997). We use the modified Akaike information criterion (AIC) method proposed by Li et al. to estimate the number of components (Akaike, 1974; Li et al., 2007). At this stage, five components are extracted from the fMRI image of each sample. These five components are used as classification features to train a linear SVM classifier.
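The relation between the mixing and unmixing matrices can be checked with a small numerical example, assuming the usual linear model O = AS and Z = WO. This only illustrates the algebra: W is set to the exact inverse of a known A, whereas infomax has to estimate W from O alone. The helper names and the toy matrices are hypothetical.

```python
def matmul(P, Q):
    """Plain matrix product for nested lists."""
    return [[sum(p * q for p, q in zip(row, col)) for col in zip(*Q)] for row in P]

def inv2(A):
    """Inverse of a 2 x 2 matrix."""
    (a, b), (c, d) = A
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

# Two toy sources (rows) observed at four points (columns).
S = [[1.0, 0.0, -1.0, 0.0],
     [0.5, -0.5, 0.5, -0.5]]
A = [[2.0, 1.0],
     [1.0, 1.0]]          # assumed mixing matrix
O = matmul(A, S)          # mixed observations, O = A S
W = inv2(A)               # ideal unmixing matrix, W = A^-1
Z = matmul(W, O)          # estimated components, Z = W O
```

With this choice of W, `Z` reproduces `S` row by row.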

Classification combination
The fourth and final stage combines the results from the above three stages and makes a final decision via majority voting.
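A minimal sketch of the voting rule, assuming each model outputs +1 for patient and −1 for control (the label coding and function names are assumptions). With three binary voters a tie cannot occur.

```python
from collections import Counter

def majority_vote(predictions):
    """Return the label predicted by most models."""
    return Counter(predictions).most_common(1)[0][0]

def combined_snp_fmri(snp_svme, voxel_svme, ica_svmc):
    """Final decision from the three per-model predictions (+1 or -1)."""
    return majority_vote([snp_svme, voxel_svme, ica_svmc])

# The two fMRI models disagree, so the SNP model decides the outcome.
decision = combined_snp_fmri(snp_svme=-1, voxel_svme=1, ica_svmc=-1)   # -> -1
```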

Classification experiments and results
We next applied the hybrid machine learning method to the problem of separating patients from controls. All statistical results of our experiments are based on the LOO cross-validation method. Thirty-nine subjects were used for training, while one subject was used for testing. A total of 40 training-testing sets were implemented. The performance measures used in this paper are specificity, sensitivity, and accuracy. The test output of our method can be positive (patient) or negative (control). A true positive (TP) is a patient correctly identified as a patient; a false positive (FP) is a healthy subject wrongly identified as a patient. A true negative (TN) is a healthy subject correctly identified as healthy; a false negative (FN) is a patient wrongly identified as healthy. The specificity, sensitivity, and accuracy are defined as below:

$$\text{Sensitivity} = \frac{TP}{TP + FN}, \qquad \text{Specificity} = \frac{TN}{TN + FP}, \qquad \text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$

The classification results are given in Table 1.

At the second stage, the FSA selected a certain number of voxels containing the most discriminating information from the 261 large voxels and trained an SVM at each iteration. The number of voxels to be selected at each iteration was estimated by the LOO algorithm with the weighted training dataset used in that iteration. The importance of each voxel to the classification task can be expressed as the ratio of the number of times the voxel was selected to the number of FSA iterations. Figure 3 shows the location of selected voxels in the brain and their importance. The volume of each region represents the importance of voxels. Yellow indicates the most important regions, followed by orange and red. Table 3 lists the anatomical brain regions of selected voxels.

Discussion

Classification results
In the method described, three kinds of classification information were extracted from genetic and fMRI data in order to classify schizophrenia and healthy control subjects using three models: SNP-SVME, Voxel-SVME, and ICA-SVMC. Among them, Voxel-SVME and ICA-SVMC both extract information from fMRI data, while only SNP-SVME extracts classification information from SNP data. fMRI data have more weight than SNP data in the proposed method for two reasons: (a) fMRI images contain more discriminating information than SNP data, because brain function is logically closer to the expression of mental illness symptoms, and, as expected, the fMRI classification models performed better than the SNP classification model in our experiments; (b) although the Voxel-SVME and ICA-SVMC models are both constructed with fMRI data, the two models present discriminating information from different perspectives. This does not imply that the SNP classification model is unnecessary. In fact, when the two fMRI models disagree with each other, the decision of SNP-SVME is especially important because this model makes the decision based on a totally different data source. A necessary and sufficient condition for an ensemble of classifiers to be more accurate than any of its individual members is that the classifiers are accurate (better than random guessing) and the errors are at least somewhat uncorrelated (Dietterich, 2000). The proposed method meets this requirement by constructing individual classification models from different perspectives and different data sources. From the data shown in Table 1, the proposed four-stage method achieves better classification accuracy by combining genetic data and fMRI data than by using either alone. The results indicate that even though abnormal brain function and genetic variation are both related to a clinical diagnosis of schizophrenia, they reflect different aspects of schizophrenia etiopathology and cannot replace each other in terms of reflecting the disease.
Overall, 87% accuracy was achieved, suggesting that combining genetic and brain functional information best represents the majority of symptomatic information used currently to arrive at a clinical diagnosis. For misclassified cases, many reasons may be involved, including the small size of the SNP array, the rather simple and non-specific brain activation patterns reflected in the auditory stimuli paradigm, and the sub-optimal sensitivity of the model. One observation worthy of note is that two patients were consistently misclassified by all classification models. This may be due to inaccuracy in all models, or to a discrepancy between biological/genetic and clinical interview-based diagnosis.
The schizophrenia patients used in this study were chronic and all taking antipsychotic medication. Aware of the potential effects of such medication on brain function, we assume that these drugs had a common, general effect on all 20 patients, since most patients were using multiple medicines (1-5 types of medicines, and a total of 26 types of medicines were prescribed) at various dosages. This study is a proof-of-concept with a small sample size and a limited number of SNPs, to demonstrate the power of combining genetics with brain function in a classification framework. For a full validation, the proposed method will need to be applied to a much larger group of subjects, including multiple SZ subcategories (including schizo-affective disorder) and multiple clinical treatment groups (including current-naïve subjects), and using more SNPs. Future work will also focus on early differentiation of sub-groups (which in the case of prodromal subjects can take weeks to months), prediction of treatment response, or early diagnosis at the time of first presentation.

Gene selection
As shown in Table 2, the top 15 SNPs ranked by the proposed method were located in 14 genes. Among them, some are well-known putative schizophrenia susceptibility genes, such as COMT (Handoko et al., 2005; Shifman et al., 2006; Nicodemus et al., 2007), DISC1 (St Clair et al., 1990; Hodgkinson et al., 2004; Cannon et al., 2005; Callicott et al., 2005; Nicodemus et al., 2007; Saetre et al., 2008; Liu et al., 2009), MTHFR (Godfrey et al., 1990; Zintzaras, 2006; Gilbody et al., 2007; Roffman et al., 2008), and HTR3B (Maziade et al., 1995; Levinson et al., 1998; Gurling et al., 2001; Frank et al., 2004; Yamada et al., 2006). Some are brain related genes including GAD2 (De et al., 2004; Arai et al., 2009), SLC6A4 (Shi et al., 2008; Zaboli et al., 2008), and ABCB1 and ABCC (Bozina et al., 2008), possible candidates for schizophrenia susceptibility.

Voxel selection
From Table 3, the brain regions contributing most to the classification of schizophrenia patients and healthy controls consist of inferior, middle, and medial frontal gyri, cingulate gyrus, superior temporal gyrus, and precuneus. Since our input fMRI data were contrast images (target stimulus vs. standard stimulus) collected in the auditory oddball test, it is reasonable that voxels in these regions were selected. The results are in accordance with previous structural and functional brain findings (Barta et al., 1990; Pearlson et al., 1996; Calhoun et al., 2004; Cavanna and Trimble, 2006; Garrity et al., 2007).

Conclusion
We propose a hybrid machine learning method for fusing fMRI and genetic data to separate individuals with schizophrenia from healthy controls. Experimental results showed that better classification accuracy is achieved by combining genetic and fMRI data than by using either alone, suggesting that genetics and brain function represent different aspects, partially complementary to each other, of schizophrenia etiopathology. The method is able to extract the discriminating information to classify schizophrenia effectively and is potentially useful for identifying important diagnostic markers for schizophrenia. Given the limited sample size and relatively small SNP array, this manuscript presents preliminary results, and we are now in the process of attempting to replicate these findings in an independent sample.