A Hierarchical Bayesian Model for the Identification of PET Markers Associated to the Prediction of Surgical Outcome after Anterior Temporal Lobe Resection

We develop an integrative Bayesian predictive modeling framework that identifies individual pathological brain states based on the selection of fluorodeoxyglucose positron emission tomography (PET) imaging biomarkers and evaluates the association of those states with a clinical outcome. We consider data from a study on temporal lobe epilepsy (TLE) patients who subsequently underwent anterior temporal lobe resection. Our modeling framework treats the observed profiles of regional glucose metabolism in PET as the phenotypic manifestation of a latent individual pathologic state, which is assumed to vary across the population. The modeling strategy we adopt allows the identification of patient subgroups characterized by latent pathologies differentially associated with the clinical outcome of interest. It also identifies imaging biomarkers that characterize the pathological states of the subjects. In the data application, we identify a subgroup of TLE patients at high risk of post-surgical seizure recurrence after anterior temporal lobe resection, together with a set of discriminatory brain regions that can be used to distinguish the latent subgroups. We show that the proposed method achieves high cross-validated accuracy in predicting post-surgical seizure recurrence.


ROI ABBREVIATIONS
ROI abbreviations from the case study are shown in Table S1.

IMPORTANCE-SAMPLING APPROACH TO CROSS-VALIDATION PREDICTION
In this section, we detail the importance-sampling approach to cross-validation prediction of Gelfand (1996). The cross-validation predictive density for the $i$th observation may be written as
$$p(Y_i \mid X, Y_{-i}) = \int p(Y_i \mid X, \eta, \beta)\, p(\eta, \beta \mid X, Y_{-i})\, d\eta\, d\beta,$$
where we use $p(\eta, \beta \mid X, Y)$ as an importance sampling density for $p(\eta, \beta \mid X, Y_{-i})$, and $Y_{-i}$ denotes the non-hold-out outcomes. More specifically, we estimate the probability that the $i$th observation has outcome $Y_i = 1$ as
$$\widehat{P}(Y_i = 1 \mid X, Y_{-i}) = \frac{\sum_{t=1}^{T} w^{(t)}\, P(Y_i = 1 \mid X, \eta^{(t)}, \beta^{(t)})}{\sum_{t=1}^{T} w^{(t)}},$$
where the importance weights are given by
$$w^{(t)} = \frac{1}{p(Y_i \mid X, \eta^{(t)}, \beta^{(t)})},$$
and $\eta^{(t)}, \beta^{(t)}$ are the MCMC samples at the $t$th iteration.

Table S1. Temporal lobe epilepsy dataset: Abbreviations of regions used in case study.
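This estimator is straightforward to compute from the MCMC output. Below is a minimal, illustrative Python sketch (the paper's implementation is in R); `lik_i` and `prob_i` are assumed to hold, for each of the T retained draws, the likelihood p(Y_i | X, η^(t), β^(t)) and the predictive probability P(Y_i = 1 | X, η^(t), β^(t)):

```python
import numpy as np

def loo_predictive_prob(lik_i, prob_i):
    """Importance-sampling estimate of P(Y_i = 1 | X, Y_-i) in the style of
    Gelfand (1996), from T full-posterior MCMC draws.

    lik_i  : length-T array, p(Y_i | X, eta^(t), beta^(t)) at each draw
    prob_i : length-T array, P(Y_i = 1 | X, eta^(t), beta^(t)) at each draw

    The weight w^(t) = 1 / p(Y_i | X, eta^(t), beta^(t)) converts draws from
    the full posterior p(eta, beta | X, Y) into draws targeting the
    leave-one-out posterior p(eta, beta | X, Y_-i).
    """
    w = 1.0 / np.asarray(lik_i)
    return float(np.sum(w * np.asarray(prob_i)) / np.sum(w))
```

When all per-draw likelihood values are equal, the weights cancel and the estimate reduces to the plain posterior mean of the per-draw predictive probabilities.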

CONNECTIVITY NETWORK BASED ON RS-FMRI
Rs-fMRI data were acquired on a 3T MRI system (Siemens Trio, Erlangen, Germany). Functional imaging was performed with the following parameters: TR = 2000 ms, TE = 30 ms, FOV = 210 mm, matrix = 64 × 64, slice thickness 4 mm, 34 slices. The imaging sessions included multiple fMRI recordings, each lasting 5 to 15 minutes. For the resting-state fMRI analysis, 20 minutes of BOLD fMRI data were used for each subject. Preprocessing of the rs-fMRI data was performed using FSL (FMRIB Software Library) version 5.0.7 (Oxford, United Kingdom, www.fmrib.ox.ac.uk/fsl) and included head movement artifact correction, non-brain tissue elimination, high-pass filtering (100 s), spatial smoothing, and mean-based intensity normalization. Tissue-type segmentation was performed on each participant's structural image using FAST (FMRIB's Automated Segmentation Tool) (Zhang et al., 2001) before alignment to the respective BOLD images. White matter and cerebrospinal fluid signals were obtained using the segmented masks. Functional connectivity between the 47 ROIs was estimated by placing a 6-mm spherical seed in Montreal Neurological Institute (MNI) space at the location of each of the 47 ROIs. The coordinates of each sphere in standard MNI space are listed in Table S2 below. Each patient's fMRI BOLD image was registered to the patient's high-resolution structural image using FLIRT (FMRIB's Linear Image Registration Tool) (Jenkinson et al., 2002; Greve and Fischl, 2009), and the high-resolution structural image was registered to standard MNI space using FNIRT (FMRIB's Non-linear Image Registration Tool) (Andersson et al., 2007). The transformation matrix and warpfields were inverted and then applied to the 47 spherical seeds to obtain spherical seeds in each individual's BOLD space. Functional connectivity between each pair of nodes was computed as the partial Pearson correlation between the averaged regional time series. This provided us with a 47 × 47 correlation matrix.
An edge was then considered as included in the connectivity network if the correlation between the regions exceeded a given threshold. The threshold was chosen so that the average number of neighbors for each region was approximately 5, yielding a connectivity structure close to a three-dimensional lattice.

COMPARISON TO PENALIZED REGRESSION METHODS THAT DO NOT CONDITION ON A LATENT STATE
In addition to the comparison to multi-step approaches in Section 3.3, we compared to methods such as elastic net (Zou and Hastie, 2005), ridge regression (Hoerl and Kennard, 1970), and the Least Absolute Shrinkage and Selection Operator (LASSO) method of Tibshirani (1996), which, in particular, do not condition on latent states but rather use the X data as the covariates. Mixing and regularization parameters for the elastic net were chosen using a two-dimensional grid search to minimize cross-validated error. LASSO and ridge regression regularization parameters were chosen through 5-fold cross-validation based on the one standard error rule. Leave-one-out cross-validation was used to assess predictive performance. An AUC of 0.64 was obtained with elastic net. Both LASSO and ridge regression failed to predict post-surgical outcome, with all hold-out observations classified as seizure-free.
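The one standard error rule used above selects, among the candidate regularization parameters, the most parsimonious (most heavily penalized) value whose cross-validated error is within one standard error of the minimum. A minimal sketch, assuming the CV error curve and its standard errors have already been computed:

```python
import numpy as np

def one_se_rule(lambdas, cv_mean, cv_se):
    """One-standard-error rule: return the largest (most penalized) lambda
    whose CV error is within one SE of the minimum. Assumes `lambdas` is
    sorted in increasing order of penalty strength. Illustrative sketch."""
    cv_mean = np.asarray(cv_mean)
    cv_se = np.asarray(cv_se)
    i_min = int(np.argmin(cv_mean))
    cutoff = cv_mean[i_min] + cv_se[i_min]
    eligible = np.where(cv_mean <= cutoff)[0]
    return lambdas[int(eligible.max())]   # largest eligible penalty
```

Compared with simply minimizing the CV error, this rule trades a small increase in error for a sparser, more stable model.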

PERFORMANCE EVALUATION ON PET-BASED SYNTHETIC DATA
To assess our method, we evaluate its performance using synthetic data intended to mimic real PET data. We also investigate comparisons with multi-step approaches, as well as with approaches that do not condition on latent states.

Synthetic data
We simulated synthetic PET data based on the data available to us from the University of California, Los Angeles Seizure Disorder Center. Our approach is similar in spirit to the simulation strategies adopted by Zhang et al. (2014) and Quirós et al. (2010) for fMRI data. Specifically, we considered real PET data from a single patient and obtained synthetic data on n = 20 subjects as
$$X_{i,j} = \delta_{i,j}\, X_j, \qquad j = 1, \ldots, R,$$
where X denotes the PET data from the selected patient from the real PET study, and δ_i is simulated as described below. We generated data for K = 2 groups of subjects, with n_1 = 10 subjects in group 1 and n_2 = 10 subjects in group 2. For patients in group 1, we set δ_i to be an R-dimensional vector with δ_{i,j} ∼ Unif(0.5, 1) for all γ_j = 0 and δ_{i,j} ∼ Unif(0.4, 0.7) for all γ_j = 1. For patients in group 2, we set δ_i to be an R-dimensional vector with δ_{i,j} ∼ Unif(0.5, 1) for γ_j = 0 and δ_{i,j} ∼ Unif(0.8, 1.1) for γ_j = 1. The true map of γ was set to define 6 ROIs with γ_j = 1, located in the bilateral caudate, bilateral lentiform nuclei, and bilateral primary visual cortices, as shown in Figure S1. In order to simulate the clinical outcome data, we set
$$Y_i \sim \text{Bernoulli}(p_i), \qquad \text{logit}(p_i) = \xi_i^\top \beta,$$
with ξ_i = (1, I(η_i = 1))^⊤ and β = (2.3, −4.5)^⊤. Our final synthetic dataset comprised data on R = 47 regions for n = 20 subjects. Data were column-centered prior to the analysis.
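The generative scheme above can be sketched as follows. This is illustrative Python with a randomly generated stand-in for the reference PET profile and arbitrary stand-in indices for the six discriminatory ROIs; following the description above, the perturbation δ_i is taken to act multiplicatively on the reference profile:

```python
import numpy as np

rng = np.random.default_rng(1)
R, n1, n2 = 47, 10, 10
gamma = np.zeros(R, dtype=int)
gamma[:6] = 1                        # stand-in indices for the 6 discriminatory ROIs
X = rng.uniform(2.0, 10.0, size=R)   # stand-in for the reference PET profile

def draw_subject(group):
    """One synthetic subject: multiplicative perturbation of the reference."""
    delta = rng.uniform(0.5, 1.0, size=R)          # non-discriminatory regions
    lo, hi = (0.4, 0.7) if group == 1 else (0.8, 1.1)
    delta[gamma == 1] = rng.uniform(lo, hi, size=int(gamma.sum()))
    return delta * X

Xraw = np.vstack([draw_subject(1) for _ in range(n1)] +
                 [draw_subject(2) for _ in range(n2)])
eta = np.array([1] * n1 + [2] * n2)

# Outcome model: logit P(Y_i = 1) = beta_0 + beta_1 * I(eta_i = 1)
beta = np.array([2.3, -4.5])
p = 1.0 / (1.0 + np.exp(-(beta[0] + beta[1] * (eta == 1))))
Y = rng.binomial(1, p)

Xc = Xraw - Xraw.mean(axis=0)        # column-center prior to analysis
```

With β = (2.3, −4.5), subjects in group 1 have outcome probability around 0.10 and those in group 2 around 0.91, so the latent state strongly separates the outcomes.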

Results
As done in the case study, we set hyperparameters to be weakly informative. We set the unscaled variance of the ICAR prior to c_k = 5, for k = 1, 2. We placed a vague prior on the hyperparameters of the mixing proportions π, that is, α_k = 1, and fixed the prior shape and scale parameters of the inverse gamma priors on σ_k and σ_0 to be non-informative with a_0 = 2, b_0 = 1, a_k = 2, and b_k = 1. We also fixed the prior mean and covariance of β to m_β = 0 and V_β = 5 · I_K, respectively. We set the neighborhood matrix, S, of the MRF prior and the ICAR prior to a binary matrix with the bilateral caudate, bilateral lentiform nuclei, bilateral primary visual cortices, bilateral associative visual cortices, ipsilateral thalamus, and contralateral Broca's area as neighbors. We first show results obtained by setting the MRF parameters to e = −1.4, which implies a lower bound on the prior probability of selection of 20% of the total number of regions, and f = 0.1; we comment on sensitivity below.
When running the MCMC, we initialized the chain with 0.5R randomly selected regions as discriminatory and an equal number of randomly selected observations assigned to each of the two clusters. We initialized the remaining parameters as μ_k^{(0)} = 1, for k = 1, 2, σ_0^{(0)} = 1, and β^{(0)} = 0. We ran the MCMC sampler for 20,000 iterations, with the first 10,000 sweeps discarded as burn-in. The Geweke z-statistic ranged from 0.61 to 1.14, failing to reject the null hypothesis of equality between the means of the first 10% and last 50% of the chain. The Raftery-Lewis dependence factor, calculated on the sampled values of β, ranged from 1.19 to 1.26, indicating low autocorrelation of the chain. With our R implementation, 1,000 iterations of the MCMC algorithm ran in 33.9 s on an Intel Core i7 station (2.50 GHz) with 16 GB RAM. Figure S2 shows the marginal posterior probabilities of inclusion (PPIs) for each of the 47 brain regions. The median model here results in the accurate identification of the 6 discriminatory regions. Also, a classification of the subjects into two subgroups, based on the posterior mode of the class allocation, correctly assigned all subjects to their subgroups. Finally, the posterior means and 95% credible intervals for β_0 and β_1 were estimated as 2.82 (1.13, 4.88) and −4.26 (−6.46, −2.28), respectively, yielding good estimates of these parameters, with the true values contained in the 95% credible intervals.
In order to better assess the selection performance of our method, we looked at results over repeated simulations and calculated the FPR (false positive rate), FNR (false negative rate), accuracy, and F1-score, all averaged over 30 replicated datasets. Here, the FPR is defined as FPR = FP/(FP + TN), with FP the number of false positives and TN the number of true negatives. The FNR is defined as FNR = FN/(FN + TP), with FN the number of false negatives and TP the number of true positives. Accuracy is defined as (TP + TN)/(TP + TN + FP + FN). Lastly, the F1-score is defined as
$$F_1 = \frac{2\,\bigl(TP/(TP+FP)\bigr)\,\bigl(TP/(TP+FN)\bigr)}{TP/(TP+FP) + TP/(TP+FN)}.$$
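As a check on these definitions, a minimal Python sketch computing all four metrics from a binary selection vector and the true inclusion vector:

```python
def selection_metrics(selected, truth):
    """FPR, FNR, accuracy and F1 for a binary selection against the truth."""
    tp = sum(s and t for s, t in zip(selected, truth))        # true positives
    fp = sum(s and not t for s, t in zip(selected, truth))    # false positives
    fn = sum(not s and t for s, t in zip(selected, truth))    # false negatives
    tn = sum(not s and not t for s, t in zip(selected, truth))  # true negatives
    fpr = fp / (fp + tn)
    fnr = fn / (fn + tp)
    acc = (tp + tn) / (tp + tn + fp + fn)
    prec = tp / (tp + fp)
    rec = tp / (tp + fn)
    f1 = 2 * prec * rec / (prec + rec)   # harmonic mean of precision, recall
    return fpr, fnr, acc, f1
```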
With the hyperparameter settings described above, we obtained FPR = 0.000, FNR = 0.030, an accuracy of 0.996, and an F1-score of 0.967. For cluster allocation, we quantified performance via the Rand index (RI) of Rand (1971). Let a denote the number of pairs of subjects assigned to the same cluster in both the estimated and the true partition, and b the number of pairs assigned to different clusters in both partitions. The Rand index is then
$$RI = \frac{a + b}{\binom{n}{2}},$$
and takes values between 0 and 1. The larger the index, the more accurate the clustering result. We obtained a mean Rand index of RI = 0.98 over the 30 replicates. Furthermore, the means and standard deviations of the posterior mean estimates for β_0 and β_1 were 1.91 (0.81) and −3.67 (1.16), respectively.
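The Rand index simply counts the subject pairs on which two partitions agree. A minimal sketch:

```python
from itertools import combinations

def rand_index(labels_a, labels_b):
    """Fraction of subject pairs on which two clusterings agree: same
    cluster in both partitions, or different clusters in both."""
    pairs = list(combinations(range(len(labels_a)), 2))
    agree = sum((labels_a[i] == labels_a[j]) == (labels_b[i] == labels_b[j])
                for i, j in pairs)
    return agree / len(pairs)
```

Note that the index is invariant to relabeling of the clusters, which is why it is a natural choice when component labels are not identified.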
When investigating sensitivity to the prior choices, we noticed that modest changes of the values of the model parameters did not affect the accuracy of the estimation results while, as expected, we observed some sensitivity of the selection results to the parameters of the Markov random field prior, e and f . Tables S3-S4 report results on FPR, FNR, accuracy, F 1 -score, Rand index and estimated β's, all calculated over 30 replicates, for several choices of these parameters. Larger values of e lead to lower FNRs, while larger values of f lead to higher FPRs and lower precisions. Also, as expected, lower accuracy of variable selection was associated with a lower Rand index for cluster allocation and larger errors in the estimation of β 0 and β 1 .

Comparison study
For comparison, we first assessed the performance of our unified method, which performs simultaneous clustering based on selected variables and outcome prediction, against multi-step approaches that focus solely on clustering or solely on outcome prediction. We considered three multi-step procedures. In the first two approaches, regularized logistic regression through either (a) the Least Absolute Shrinkage and Selection Operator (LASSO) method of Tibshirani (1996) or (b) the elastic net penalized regression of Zou and Hastie (2005) was first used to perform variable selection, using the observed outcome Y as the dependent variable. The LASSO regularization parameter, λ, was chosen to minimize the 5-fold cross-validation error. Regularization and mixing parameters in the elastic net were optimized using a two-dimensional grid search to minimize cross-validated error. Imaging data were standardized by centering and scaling prior to penalized regression, as suggested by Efron et al. (2004). A Gaussian mixture model was then fitted to the selected regions using the expectation-maximization (EM) algorithm to obtain estimates of the cluster memberships η (Fraley and Raftery, 2006). These estimates of η were then used as covariates in a logistic regression. The third multi-step approach mimics a procedure in which a subset of regions is selected a priori and used in a subsequent two-stage analysis. In this approach, permutation p-values were calculated for each region within each subject, and the subject-level p-value maps were combined into a group-level p-value map using Fisher's method (Fisher, 1950). A subset of the regions was selected by thresholding the group-level p-value map with family-wise error rate control at the 0.05 level. Then, a k-means cluster analysis was performed on the selected regions to obtain estimates of η. Finally, a logistic regression was fitted using these estimates of η as covariates.
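The group-level combination step in the third approach uses Fisher's method, which refers −2 Σ log p_i to a chi-square distribution with 2k degrees of freedom. A self-contained sketch, using the closed-form chi-square survival function available for even degrees of freedom:

```python
import math

def fisher_combined_pvalue(pvals):
    """Fisher's method for combining k independent p-values:
    X2 = -2 * sum(log p_i) ~ chi-square with 2k degrees of freedom under
    the global null. For even df = 2k the survival function has the
    closed form  P(X2 > x) = exp(-x/2) * sum_{j<k} (x/2)^j / j!."""
    k = len(pvals)
    x2 = -2.0 * sum(math.log(p) for p in pvals)
    h = x2 / 2.0
    return math.exp(-h) * sum(h**j / math.factorial(j) for j in range(k))
```

For a single p-value the method returns that p-value unchanged, which is a convenient sanity check.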
For all methods, we report results averaged over 30 replicated datasets. We evaluated variable selection performance via FNR, FPR, accuracy, and F1-score, and cluster allocation performance via the Rand index. Results are reported in Table S5. Lower selection accuracy is observed for all three multi-step methods. Whereas the LASSO multi-step approach tends to underselect brain regions (FNR, 55.0%), the elastic net and logistic regression approaches tend to overselect (FPR, 8.5% and 9.4%, respectively). As for the estimation of β, our unified method achieves estimates with both higher accuracy and lower variance than all three multi-step methods. These results demonstrate that a unified and probabilistically coherent modeling approach can achieve improved estimation performance relative to multi-step approaches. Next, we demonstrate that, when the predictors are characterized by a true latent structure, as in our simulated scenario, a modeling approach that explicitly accounts for such structure achieves superior prediction performance compared to approaches that do not properly account for this heterogeneity in the data. As done in the case study, we focus the comparison on elastic net, LASSO, and ridge regression, which do not condition on latent states but rather use the X data as the covariates. However, instead of calculating cross-validation predictions, for this comparison we simulated an additional set of n_test = 15 synthetic observations, which comprised our validation set, and calculated posterior predictive probabilities for our method as described in Section 2.3.5. As above, regularization and mixing parameters for the elastic net were optimized using a two-dimensional grid search to minimize cross-validated error. LASSO and ridge regularization parameters were chosen to minimize 5-fold cross-validation error. Predictive accuracy was assessed for all methods through hold-out validation on the set of n_test = 15 synthetic observations.
We calculated two measures of predictive performance: the AUC and the cross-entropy error measure, defined as
$$CE = -\frac{1}{n_{test}} \sum_{i=1}^{n_{test}} \bigl[ Y_i \log \hat{Y}_i + (1 - Y_i) \log(1 - \hat{Y}_i) \bigr],$$
where Ŷ_i is the predicted probability of outcome Y_i = 1 for subject i. Mean AUC and cross-entropy error, averaged over 30 replicates, are reported in Table S6 and show that our method attains both lower cross-entropy and higher AUC compared to the competing regularized techniques.
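Both measures are simple to compute from held-out outcomes and predicted probabilities. An illustrative Python sketch, with the AUC written in its Mann-Whitney rank formulation:

```python
import numpy as np

def auc(y_true, y_prob):
    """AUC as the probability that a randomly chosen positive receives a
    higher predicted probability than a randomly chosen negative
    (ties count one half)."""
    y_true = np.asarray(y_true)
    y_prob = np.asarray(y_prob)
    pos = y_prob[y_true == 1]
    neg = y_prob[y_true == 0]
    wins = (pos[:, None] > neg[None, :]).sum() \
        + 0.5 * (pos[:, None] == neg[None, :]).sum()
    return float(wins / (len(pos) * len(neg)))

def cross_entropy(y_true, y_prob):
    """Average negative log-likelihood of the observed outcomes under
    the predicted probabilities."""
    y = np.asarray(y_true)
    p = np.asarray(y_prob)
    return float(-np.mean(y * np.log(p) + (1 - y) * np.log(1 - p)))
```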

MCMC ALGORITHM FOR POSTERIOR INFERENCE
A generic iteration of the MCMC algorithm comprises the following steps:

1. Update σ_k and σ_0: These are Gibbs steps, sampling σ_k, for k = 1, . . . , K, and σ_0 from their inverse-gamma full conditionals.

2. Update π: This is a Gibbs step, with π ∼ Dirichlet(α_1 + n_1, α_2 + n_2, . . . , α_K + n_K).

3. Jointly update (γ, {μ_k}_{k=1}^K): This is a joint Metropolis-Hastings step. To propose a new candidate γ′, randomly choose between two transition moves:

a. Add/delete: Randomly choose one of the R indices in γ, and change its value either from 0 to 1, or from 1 to 0.

b. Swap: Randomly choose one index with γ_j = 1 and one with γ_j = 0, and swap their values.
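The add/delete move and the MRF prior's contribution to the acceptance ratio can be sketched as follows. This is an illustrative Python fragment, assuming the common MRF parameterization p(γ) ∝ exp(e · 1′γ + f · γ′Sγ); the exact prior expression and the full Metropolis-Hastings ratio are given in the main paper:

```python
import numpy as np

def mrf_log_prior(gamma, S, e, f):
    """Unnormalized MRF log-prior log p(gamma) = e * sum_j gamma_j
    + f * gamma' S gamma, with S the binary neighborhood matrix.
    (Assumed parameterization, not copied from the paper.)"""
    g = np.asarray(gamma, dtype=float)
    return e * g.sum() + f * (g @ S @ g)

def propose_add_delete(gamma, rng):
    """Add/delete move: flip one randomly chosen entry of gamma."""
    g_new = np.array(gamma, copy=True)
    j = rng.integers(len(gamma))
    g_new[j] = 1 - g_new[j]
    return g_new
```

Because the add/delete proposal is symmetric, the proposal densities cancel and the MRF prior enters the log acceptance ratio only through the difference `mrf_log_prior(g_new, ...) - mrf_log_prior(gamma, ...)`, alongside the likelihood and μ_k terms.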
An identifiability problem arises in finite mixture models due to the invariance of the likelihood under K! permutations of the component labels, resulting in equal posterior values under each permutation. This issue is usually referred to as "label-switching" and leads to problems with estimating component-specific quantities. In order to account for this, we adapted the post-MCMC decision-theoretic relabeling method of Stephens (2000) to our modeling framework. This method defines an appropriate loss function based on the Kullback-Leibler divergence, and postprocesses the MCMC output to minimize the posterior expected loss.

CONVERGENCE TESTS
Convergence of each individual chain was assessed using the Raftery-Lewis diagnostic and the Geweke test (Geweke et al., 1991). Convergence across multiple chains was assessed using the Gelman-Rubin potential scale reduction factor, based on the implementation in the R package "coda". Values of the diagnostic tests for the results reported in the paper are given in Table S7 and indicate convergence to the stationary distribution.
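The Geweke test compares the means of an early and a late segment of the chain via a z-score. The sketch below uses naive standard errors for brevity; the coda implementation instead estimates the segment variances from the spectral density at frequency zero, which is more appropriate for autocorrelated chains:

```python
import numpy as np

def geweke_z(chain, first=0.1, last=0.5):
    """Geweke-style z-score comparing the mean of the first 10% of a chain
    with the mean of the last 50%. Uses naive standard errors, adequate
    only for weakly autocorrelated chains (illustrative sketch)."""
    x = np.asarray(chain, dtype=float)
    n = len(x)
    a = x[: int(first * n)]            # early segment
    b = x[int((1 - last) * n):]        # late segment
    se2 = a.var(ddof=1) / len(a) + b.var(ddof=1) / len(b)
    return float((a.mean() - b.mean()) / np.sqrt(se2))
```

A chain whose early and late segments share the same mean yields a z-score near zero, while a drifting (non-stationary) chain yields a large absolute z-score.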

CONFLICT OF INTEREST STATEMENT
The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.