Deep learning for neuroimaging: a validation study

Deep learning methods have recently made notable advances in the tasks of classification and representation learning. These tasks are important for brain imaging and neuroscience discovery, making the methods attractive for porting to a neuroimager's toolbox. Success of these methods is, in part, explained by the flexibility of deep learning models. However, this flexibility makes the process of porting to new areas a difficult parameter optimization problem. In this work we demonstrate our results (and feasible parameter ranges) in application of deep learning methods to structural and functional brain imaging data. These methods include deep belief networks and their building block the restricted Boltzmann machine. We also describe a novel constraint-based approach to visualizing high dimensional data. We use it to analyze the effect of parameter choices on data transformations. Our results show that deep learning methods are able to learn physiologically important representations and detect latent relations in neuroimaging data.


Introduction
One of the main goals of brain imaging and neuroscience-and, possibly, of most natural sciencesis to improve understanding of the investigated system based on data.In our case, this amounts to inference of descriptive features of brain structure and function from non-invasive measurements.Brain imaging field has come a long way from anatomical maps and atlases towards data driven feature learning methods, such as seed-based correlation [2], canonical correlation analysis [33], and independent component analysis (ICA) [1,24].These methods are highly successful in revealing known brain features with new details [3] (supporting their credibility), in recovering features that differentiate patients and controls [28] (assisting diagnosis and disease understanding), and starting a "resting state" revolution after revealing consistent patters in data from uncontrolled resting experiments [29,35].Classification is often used merely as a correctness checking tool, as the main emphasis is on learning about the brain.A perfect oracle that does not explain its conclusions would be useful, but mainly to facilitate the inference of the ways the oracle draws these conclusions.
As an oracle, deep learning methods are breaking records taken over the areas of speech, signal, image, video and text mining and recognition by improving state of the art classification accuracy by, sometimes, more than 30% where the prior decade struggled to obtain a 1-2% improvements [19,21].What differentiates them from other classifiers, however, is the automatic feature learning from data which largely contributes to improvements in accuracy.Presently, this seems to be the closest solution to an oracle that reveals its methods -a desirable tool for brain imaging.
Another distinguishing feature of deep learning is the depth of the models.Based on already acceptable feature learning results obtained by shallow models-currently dominating neuroimaging field-it is not immediately clear what benefits would depth have.Considering the state of multimodal learning, where models are either assumed to be the same for analyzed modalities [26] or cross-modal relations are sought at the (shallow) level of mixture coefficients [23], deeper models better fit the intuitive notion of cross-modality relations, as, for example, relations between genetics and phenotypes should be indirect, happening at a deeper conceptual level.
In this work we present our recent advances in application of deep learning methods to functional and structural magnetic resonance imaging (fMRI and sMRI).Each consists of brain volumes but for sMRI these are static volumes-one per subject/session,-while for fMRI a single subject dataset is comprised of multiple volumes capturing the changes during an experimental session.Our goal is to validate feasibility of this application by a) investigating if a building block of deep generative models-a restricted Boltzmann machine (RBM) [17]-is competitive with ICA (a representative model of its class) (Section 2); b) examining the effect of the depth in deep learning analysis of structural MRI data (Section 3.3); and c) determining the value of the methods for discovery of latent structure of a large-scale (by neuroimaging standards) dataset (Section 3.4).The measure of feature learning performance in a shallow model (a) is comparable with existing methods and known brain physiology.However, this measure cannot be used when deeper models are investigated.As we further demonstrate, classification accuracy does not provide the complete picture either.To be able to visualize the effect of depth and gain an insight into the learning process, we introduce a flexible constraint satisfaction embedding method that allows us to control the complexity of the constraints (Section 3.2).Deliberately choosing local constraints we are able to reflect the transformations that the deep belief network (DBN) [15] learns and applies to the data and gain additional insight.

A shallow belief network for feature learning
Prior to investigating the benefits of depth of a DBN in learning representations from fMRI and sMRI data, we would like to find out if a shallow (single hidden layer) model-which is the RBMfrom this family meets the field's expectations.As mentioned in the introduction, a number of methods are used for feature learning from neuroimaging data: most of them belong to the single matrix factorization (SMF) class.We do a quick comparison to a small subset of SMF methods on simulated data; and continue with a more extensive comparison against ICA as an approach trusted in the neuroimaging field.Similarly to RBM, ICA relies on the bipartite graph structure, or even is an artificial neural network with sigmoid hidden units as is in the case of Infomax ICA [1] that we compare against.Note the difference with RBM: ICA applies its weight matrix to the (shorter) temporal dimension of the data imposing independence on the spatial dimension while RBM applies its weight matrix (hidden units "receptive fields") to the high dimensional spatial dimension instead (Figure 2).

A restricted Boltzmann machine
A restricted Boltzmann machine (RBM) is a Markov random field that models data distribution parameterizing it with the Gibbs distribution over a bipartite graph between visible v and hidden variables h [10]: is the normalization term (the partition function) and E(v, h) is the energy of the system.Each visible variable in the case of fMRI data represents a voxel of an fMRI scan with a real-valued and approximately Gaussian distribution.In this case, the energy is defined as: where a j and b j are biases and σ j is the standard deviation of a parabolic containment function for each visible variable v j centered on the bias a j .In general, the parameters σ i need to be learned along with the other parameters.However, in practice normalizing the distribution of each voxel to have zero mean and unit variance is faster and yet effective [27].A number of choices affect the quality of interpretation of the representations learned from fMRI by an RBM.Encouraging sparse features via the L 1 -regularization: λ W 1 (λ = 0.1 gave best results) and using hyperbolic tangent for hidden units non-linearity are essential settings that respectively facilitate spatial and temporal interpretation of the result.The weights were updated using the truncated Gibbs sampling method called contrastive divergence (CD) with a single sampling step (CD-1).Further information on RBM model can be found in [16,17].In this section we summarize our comparisons of RBM with SMF models-including Infomax ICA [1], PCA [14], sparse PCA (sPCA) [37], and sparse NMF (sNMF) [18]-on synthetic data with known spatial maps generated to simulate fMRI.

Synthetic data
Figure 1a shows the correlation of spatial maps (SM) and time course (TC) estimates to the ground truth for RBM, ICA, PCA, sPCA, and sNMF.Correlations are averaged across all sources and datasets.RBM and ICA showed the best overall performance.While sNMF also estimated SMs well, it showed inferior performance on TC estimation, likely due to the non-negativity constraint.
Based on these results and the broad adoption of ICA in the field, we focus on comparing Infomax ICA and RBM.
Figure 1b shows the full set of ground truth sources along with RBM and ICA estimates for a single representative dataset.SMs are thresholded and represented as contours for visualization.Results over all synthetic datasets showed similar performance for RBM and ICA (Figure 1c), with a slight advantage for ICA with regard to SM estimation, and a slight advantage for RBM with regards to TC estimation.RBM and ICA also showed comparable performance estimating cross correlations also called functional network connectivity (FNC).Data used in this work comprised of task-related scans from 28 (five females) healthy participants, all of whom gave written, informed, IRB-approved consent at Hartford Hospital and were compensated for participation 1 .All participants were scanned during an auditory oddball task (AOD) involving the detection of an infrequent target sound within a series of standard and novel sounds2 .

An fMRI data application
Scans were acquired at the Olin Neuropsychiatry Research Center at the Institute of Living/Hartford Hospital on a Siemens Allegra 3T dedicated head scanner equipped with 40 mT /m gradients and a standard quadrature head coil [4,9].The AOD consisted of two 8-min runs, and 249 scans (volumes) at 2 second TR (0.5 Hz sampling rate) were used for the final dataset.Data were post-processed using the SPM5 software package [12], motion corrected using INRIalign [11], and subsampled to 53 × 63 × 46 voxels.The complete fMRI dataset was masked below mean and the mean image across the dataset was removed, giving a complete dataset of size 70969 voxels by 6972 volumes.
Each voxel was then normalized to have zero mean and unit variance.The RBM was constructed using 70969 Gaussian visible units and 64 hyperbolic tangent hidden units.The hyper parameters (0.08 from the searched range) for learning rate and λ (0.1 from the searched range [1 × 10 −2 ,1 × 10 −1 ]) for L 1 weight decay were selected as those that showed a reduction of reconstruction error over training and a significant reduction in span of the receptive fields respectively.Parameter value outside the ranges either resulted in unstable or slow learning ( ) or uninterpretable features (λ).The RBM was then trained with a batch size of 5 for approximately 100 epochs to allow for full convergence of the parameters.
After flipping the sign of negative receptive fields, we then identified and labeled spatially distinct features as corresponding to brain regions with the aid of AFNI [5] excluding features which had a high probability of corresponding to white matter, ventricles, or artifacts (eg.motion, edges).
We normalized the fMRI volume time series to mean zero and used the trained RBM in feed-forward mode to compute time series for each fMRI feature.This was done to better compare to ICA, where the mean is removed in PCA preprocessing.
The work-flow is outlined in Figure 2, while Figure 3 shows comparison of resulting features with those obtained by Infomax ICA.In general, RBM performs competitively with ICA, while providing-perhaps, not surprisingly due to the used L 1 regularization-sharper and more localized features.While we recognize that this is a subjective measure we list more features in Figure S2 of Section 5 and note that RBM features lack negative parts for corresponding features.Note, that in the case of L 1 regularized weights RBM algorithms starts to resemble some of the ICA approaches (such as the recent RICA by Le at al. [20]), which may explain the similar performance.However, the differences and possible advantages are the generative nature of the RBM and no enforcement of component orthogonality (not explicit at the least).Moreover, the block structure of the correlation matrix (see below the Supplementary material section) of feature time courses provide a grouping that is more physiologically supported than that provided by ICA.For example, see Figure S1 in the supplementary material section below.Perhaps, because ICA working hard to enforce spatial independence subtly affects the time courses and their cross-correlations in turn.We have observed comparable running times of the (non GPU) ICA (http://www.nitrc.org/projects/gift)and a GPU implementation of the RBM (https://github.com/nitishsrivastava/deepnet).

Validating the depth effect
Since the RBM results demonstrate a feature-learning performance competitive with the state of the art (or better), we proceed to investigating the effects of the model depth.To do that we turn from fMRI to sMRI data.As it is commonly assumed in the deep learning literature [22] the depth is often improving classification accuracy.We investigate if that is indeed true in the sMRI case.Structural data is convenient for the purpose as each subject/session is represented only by a single volume that has a label: control or patient in our case.Compare to 4D data where hundreds of volumes belong to the same subject with the same disease state.

A deep belief network
A DBN is a sigmoidal belief network (although other activation functions may be used) with an RBM as the top level prior.The joint probability distribution of its visible and hidden units is parametrized as follows: where l is the number of hidden layers, P (h l−1 , h l ) is an RBM, and P (h i |h i+1 ) factor into individual conditionals: The important property of DBN for our goals of feature learning to facilitate discovery is its ability to operate in generative mode with fixed values on chosen hidden units thus allowing one to investigate the features that the model have learned and/or weighs as important in discriminative decisions.We, however, not going to use this property in this section, focusing instead on validating the claim that a network's depth provides benefits for neuroimaging data analysis.And we will do this using discriminative mode of DBN's operation as it provides an objective measure of the depth effect.
DBN training splits into two stages: pre-training and discriminative fine tuning.A DBN can be pre-trained by treating each of its layers as an RBM-trained in an unsupervised way on inputs from the previous layer-and later fine-tuned by treating it as a feed-forward neural network.The latter allows supervised training via the error back propagation algorithm.We use this schema in the following by augmenting each DBN with a soft-max layer at the fine-tuning stage.

Nonlinear embedding as a constraint satisfaction problem
A DBN and an RBM operate on data samples, which are brain volumes in the fMRI and sMRI case.A five-minute fMRI experiment with 2 seconds sampling rate yields 150 of these volumes per subject.For sMRI studies number of participating subjects varies but in this paper we operate with a 300 and a 3500 subject-volumes datasets.Transformations learned by deep learning methods do not look intuitive in the hidden node space and generative sampling of the trained model does not provide a sense if a model have learned anything useful in the case of MRI data: in contrast to natural images, fMRI and sMRI images do not look very intuitive.Instead, we use a nonlinear embedding method to control whether a model learned useful information and to assist in investigation of what have it, in fact, learned.
One of the purposes of an embedding is to display a complex high dimensional dataset in a way that is i) intuitive, and ii) representative of the data sample.The first requirement usually leads to displaying data samples as points in a 2-dimensional map, while the second is more elusive and each approach addresses it differently.Embedding approaches include relatively simple random linear projections-provably preserving some neighbor relations [6]-and a more complex class of nonlinear embedding approaches [30,32,34,36].In an attempt to organize the properties of this diverse family we have aimed at representing nonlinear embedding methods under a single constraint satisfaction problem (CSP) framework (see below).We hypothesize that each method places the samples in a map to satisfy a specific set of constraints.Although this work is not yet complete, it proven useful in our current study.We briefly outline the ideas in this section to provide enough intuition of the method that we further use in Section 3.
Since we can control the constraints in the CSP framework, to study the effect of deep learning we choose them to do the least amount of work-while still being useful-letting the DBN do (or not) the hard part.A more complicated method such as t-SNE [36] already does complex processing to preserve the structure of a dataset in a 2D map -it is hard to infer if the quality of the map is determined by a deep learning method or the embedding.While some of the existing method may have provided the "least amount of work" solutions as well we chose to go with the CSP framework.It explicitly states the constraints that are being satisfied and thus lets us reason about deep learning effects within the constraints, while with other methods-where the constraints are implicit-this would have been harder.
A constraint satisfaction problem (CSP) is one requiring a solution that satisfies a set of constraints.

One of the well known examples is the boolean satisfiability problem (SAT)
. There are multiple other important CSPs such as the packing, molecular conformations, and, recently, error correcting codes [7].Freedom to setup per point constraints without controlling for their global interactions makes a CSP formulation an attractive representation of the nonlinear embedding problem.Pursuing this property we use the iterative "divide and concur" (DC) algorithm [13] as the solver for our representation.In DC algorithm we treat each point on the solution map as a variable and assign a set of constraints that this variable needs to satisfy (more on these later).Then each points gets a "replica" for each constraint it is involved into.Then DC algorithm alternates the divide and concur projections.The divide projection moves each "replica" points to the nearest locations in the 2D map that satisfy the constraint they participate in.The concur projection concurs locations of all "replicas" of a point by placing them at the average location on the map.The key idea is to avoid local traps by combining the divide and concur steps within the difference map [8].A single location update is represented by: where P d (•) and P c (•) denote the divide and concur projections and β is a user-defined parameter.
While the concur projection will only differ by subsets of "replicas" across different methods representable in DC framework, the divide projection is unique and defines the algorithm behavior.In this paper, we choose a divide projection that keeps k nearest neighbors of each point in the higher dimensional space also its neighbors in the 2D map.This is a simple local neighborhood constraint that allows us to assess effects of deep learning transformation leaving most of the mapping decisions to the deep learning.
Note, that for a general dataset we may not be able to satisfy this constraint: each point has exactly the same neighbors in 2D as in the original space (and this is what we indeed observe).The DC algorithm, however, is only guaranteed to find the solution if it exists and oscillates otherwise.
Oscillating behavior is detectable and may be used to stop the algorithm.We found informative watching the 2D map in dynamics, as the points that keep oscillating provide additional information into the structure of the data.Another practically important feature of the algorithm: it is deterministic.Given the same parameters (β and the parameters of P d (•)) it converges to the same solution regardless of the initial point.If each of the points participates in each constraint then complexity of the algorithm is quadratic.With our simple k neighborhood constraints it is O(kn), for n samples/points.We use a combined data from four separate schizophrenia studies conducted at Johns Hopkins University (JHU), the Maryland Psychiatric Research Center (MPRC), the Institute of Psychiatry, London, UK (IOP), and the Western Psychiatric Institute and Clinic at the University of Pittsburgh (WPIC) (the data used in Meda et al. [25]).The combined sample comprised 198 schizophrenia patients and 191 matched healthy controls and contained both first episode and chronic patients [25].At all sites, whole brain MRIs were obtained on a 1.5T Signa GE scanner using identical parameters and software.Original structural MRI images were segmented in native space and the resulting gray and white matter images then spatially normalized to gray and white matter templates respectively to derive the optimized normalization parameters.These parameters were then applied to the whole brain structural images in native space prior to a new segmentation.The obtained 60465 voxel gray matter images were used in this study.Figure 4 shows example orthogonal slice views of the gray matter data samples of a patient and a healthy control.

A schizophrenia structural MRI dataset
The main question of this Section is to evaluate the effect of the depth of a DBN on sMRI.To answer this question, we investigate if classification rates improve with the depth.For that we sequentially investigate DBNs of 3 depth.From RBM experiments we have learned that even with a larger number of hidden units (72, 128 and 512) RBM tends to only keep around 50 features driving the rest to zero.Classification rate and reconstruction error still slightly improves, however, when the number of hidden units increases.These observations affected our choice of 50 hidden units of the first two layers and 100 for the third.Each hidden unit is connected to all units in the previous layer which results in an all to all connectivity structure between the layers, which is a more common and conventional approach to constructing these models.Note, larger networks (up to double the umber of units) lead to similar results.We pre-train each layer via an unsupervised RBM and discriminatively fine-tune models of depth 1 (50 hidden units in the top layer), 2 (50-50 hidden units in the first and the top layer respectively), and 3 (50-50-100 hidden units in the first, second and the top layer respectively) by adding a softmax layer on top of each of these models and training via the back propagation.
We estimate the accuracy of classification via 10-fold cross validation on fine-tuned models splitting the 389 subject dataset into 10 approximately class-balanced folds.We train the rbf-kernel SVM, logistic regression and a k-nearest neighbors (knn) classifier using activations of the top-most hidden layers in fine-tuned models to the training data of each fold as their input.The testing is performed likewise but on the test data.We also perform the same 10-fold cross validation on the raw data.All models demonstrate a similar trend when the accuracy only slightly increases from depth-1 to depth-2 DBN and then improves significantly.Table 1 supports the general claim of deep learning community about improvement of classification rate with the depth even for sMRI data.Improvement in classification even for the simple knn classifier indicates the character of the transformation that the DBN learns and applies to the data: it may be changing the data manifold to organize classes by neighborhoods.Ideally, to make general conclusion about this transformation we need to analyze several representative datasets.However, even working with the same data we can have a closer view of the depth effect using the method introduced in Section 3.2.Although it may seem that the DBN does not provide significant improvements in sMRI classification from depth-1 to depth-2 in this model, it keeps on learning potentially useful transformaions of the data.We can see that using our simple local neighborhoodbased embedding.Figure 5 displays 2D maps of the raw data, as well as the depth 1, 2, and 3 activations (of a network trained on 335 subjects): the deeper networks place patients and control groups further apart.Additionally, Figure 5 displays the 54 subjects that the DBN was not train on.These hold out subjects are also getting increased separation with depth.This DBN's behavior is potentially useful for generalization, when larger and more diverse data become available.
Our new mapping method has two essential properties to facilitate the conclusion and provide confidence in the result: its already mentioned local properties and the deterministic nature of the algorithm.The latter leads to independence of the resulting maps from the starting point.The map only depends on the models parameter k-the size of the neighborhood-and the data.In this section we focus on sMRI data collected from healthy controls and Huntington disease (HD) patients as part of the PREDICT-HD project (www.predict-hd.net).Huntington disease is a genetic neurodegenerative disease that results in degeneration of neurons in certain areas of the brain.The project is focused on identifying the earliest detectable changes in thinking skills, emotions and brain structure as a person begins the transition from health to being diagnosed with Huntington disease.We would like to know if deep learning methods can assist in answering that question.

A large-scale Huntington disease data
For this study T1-weighted scans were collected at multiple sites (32 international sites), representing multiple field strengths (1.5T and 3.0T) and multiple manufactures (Siemens, Phillips, and GE).The 1.5T T1 weighted scans were an axial 3D volumetric spoiled-gradient echo series (≈ 1 × 1 × 1.5 mm voxels), and the 3.0T T1 weighted scans were a 3D Volumetric MPRAGE series (≈ 1 × 1 × 1 mm voxels).The images were segmented in the native space and the normalized to a common template.After correlating the normalized gray matter segmentation with the template and eliminating poorly correlating scans we obtain a dataset of 3500 scans, where 2641 were from patients and 859 from healthy controls.We have used all of the scans in this imbalanced sample to pre-train and fine tune the same model architecture (50-50-100) as in Section 3.3 for all three depths3 .Table 2 lists the average F-score values for both classes at the raw data and all depth levels.Note the drop from the raw data and then a recovery at depth 3. The limited capacity of levels 1 and 2 has reduced the network ability to differentiate the groups but representational capacity of depth 3 network compensates for the initial bottleneck.This, confirms our previous observation on the depth effect, however, does not yet help the main question of the PREDICT-HD study.Note, however, while Table 1 in the previous section evaluates generalization ability of the DBN, Table 2 here only demonstrates changes in DBN's representational capacity with the depth as we use no testing data.To further investigate utility of the deep learning approach for scientific discovery we again augment it with the embedding method of Section 3.2.Figure 7 shows the map of 3500 scans of HD patients and healthy controls.Each point on the map is an sMRI volume, shown in Figures 6 and 7.Although we have used the complete data to train the DBN, discriminative fine-tuning had access only to binary label: control or patient.In addition to that, we have information about severity of the disease from low to high.We have color coded this information in Figure 7 from bright yellow (low) through orange (medium) to red (high).The network 4discriminates the patients by disease severity which results in a spectrum on the map.Note, that neither t-SNE (not shown), nor our new embedding see the spectrum or even the patient groups in the raw data.This is a important property of the method that may help support its future use in discovery of new information about the disease.

Conclusions
Our investigations show that deep learning has a high potential in neuroimaging applications.Even the shallow RBM is already competitive with the model routinely used in the field: it produces physiologically meaningful features which are (desirably) highly focal and have time course cross correlations that connect them into meaningful functional groups (Section 5).The depth of the DBN does indeed help classification and increases group separation.This is apparent on two sMRI datasets collected under varying conditions, at multiple sites each, from different disease groups, and pre-processed differently.This is a strong evidence of DBNs robustness.Furthermore, our study shows a high potential of DBNs for exploratory analysis.As Figure 7 demonstrates, DBN in conjunction with our new mapping method can reveal hidden relations in data.We did find it difficult initially to find workable parameter regions, but we hope that other researchers won't have this difficulty starting from the baseline that we provide in this paper.

Figure 1 :
Figure 1: Comparison of RBM estimation accuracy of features and their time courses with SMFs.

Figure 2 :
Figure 2: The processes of feature learning and time course computation from fMRI data by an RBM.The visible units are voxels and a hidden unit receptive field covers an fMRI volume.

Figure 3 :
Figure 3: Intrinsic brain networks estimated by ICA and RBM.

Figure 4 :
Figure 4: A smoothed gray matter segmentation of a patient and a healthy control: each is a training sample.

Figure 5 :
Figure 5: Effect of a DBN's depth on neighborhood relations.Each map is shown at the same iteration of the algorithm with the same k = 50.The color differentiates the classes (patients and controls) and the training (335 subjects) from validation (54 subjects) data.Although the data becomes separable at depth 1 and more so at depth 2, the DBN continues distilling details that pull the classes further apart.

Figure 6 :
Figure 6: A gray matter of MRI scans of an HD patient and a healthy control.

Figure 7 :
Figure 7: Patients and controls group separation map with additional unsupervised spectral decomposition of sMRI scans by disease severity.The map represents 3500 scans.

Table 1 :
Table1summarizes the precision and recall values in the F-scores and their standard deviations.Classification on fine-tuned models (test data)

Table 2 :
Classification on fine-tuned models (HD data)