Bayesian feature selection for radiomics using reliability metrics

Introduction: Imaging of tumors is a standard step in diagnosing cancer and making subsequent treatment decisions. The field of radiomics aims to develop imaging-based biomarkers using methods rooted in artificial intelligence applied to medical imaging. However, a challenging aspect of developing predictive models for clinical use is that many quantitative features derived from image data exhibit instability or a lack of reproducibility across different imaging systems or image-processing pipelines. Methods: To address this challenge, we propose a Bayesian sparse modeling approach for image classification based on radiomic features, in which the inclusion of more reliable features is favored via a probit prior formulation. Results: We verify through simulation studies that this approach can improve feature selection and prediction given correct prior information. Finally, we illustrate the method with an application to the classification of head and neck cancer patients by human papillomavirus status, using as our prior information a reliability metric quantifying feature stability across different imaging systems.


Derivation of the full conditionals
In this section, we show the derivations of the full conditionals. Sampling from these equations is what allows us to carry out the MCMC steps summarized in Section 1.2 below. As the data matrix is split into selected and non-selected features, we derive the full conditionals separately for the non-selected variables $x_{j(\gamma^c)}$, given the latent binary vector of selection indicators $\gamma$, and for the selected features $x_{jk(\gamma)}$, given $\gamma$ and the group-specific means $\mu_{jk}$. Note that the dimensions of the group-specific mean vectors $\mu_{k(\gamma)}$ change depending on the number of non-zero elements of $\gamma$. To handle the changing dimension without the use of reversible jump Markov chain Monte Carlo, we place a normal pseudoprior (Carlin and Chib, 1995; Dellaportas et al., 2002) on the prior means for the non-selected variables:
$$\mu_{0(\gamma^c)} \sim N(0, \Gamma_{0(\gamma^c)}). \quad \text{(S1)}$$
We define $\Gamma_{0(\gamma^c)}$ as a diagonal matrix with elements $\Gamma_{0(\gamma^c)} = \mathrm{Diag}(\omega^2_{p_\gamma+1}, \ldots, \omega^2_p)$, and we assume an inverse-gamma prior on the diagonal terms, $\omega^2_j \sim IG(\tilde{a}_j, \tilde{b}_j)$, $j = p_\gamma + 1, \ldots, p$.
The full conditional of $x_{j(\gamma^c)} \mid \gamma$ is obtained by noting that $\sigma^2_{0j}$ follows an inverse-gamma distribution, so that $\sigma^2_{0j}$ can be integrated out analytically. Similarly, the full conditional of $x_{jk(\gamma)} \mid \mu_{jk}, \gamma$ is obtained by integrating out $\sigma^2_{kj}$, which also follows an inverse-gamma distribution.
The full conditional of $\mu_{k(\gamma)} \mid \gamma$ follows from the model specification: since $\mu_{0k(\gamma)}$ follows a normal distribution with mean $\nu_{k(\gamma)}$ and variance $h_1 \Gamma_{k(\gamma)}$, and $\nu_{k(\gamma)}$ follows a normal distribution with mean $m_{k(\gamma)}$ and variance $h_1 \Gamma_{k(\gamma)}$, the resulting full conditional is again normal by conjugacy. Finally, the full conditional for $\mu_{0(\gamma^c)} \mid \gamma$ is derived in the same manner. These derived equations are used in the MCMC algorithm to sample from the posterior distribution, as described in the next section.

MCMC algorithm
Here we provide the joint marginal posterior distribution of the parameters $(\gamma, \mu_{0k})$, where $\mu_{0k} = (\mu_{k(\gamma)}, \mu_{0(\gamma^c)})$. The unnormalized full conditional distributions of the model were derived in Section 1.1 above; in these expressions, $n_k$ denotes the number of subjects in the $k$th component.
We now provide more details on the sampling approach for the variable selection indicators $\gamma$ and the component means $\mu_{0k}$ that depend on them. To simplify the sampling under this probit prior, we follow the work of Albert and Chib (1993) and Chiang et al. (2017) and introduce a latent variable $z_j$ such that $\gamma_j = 1$ if and only if $z_j > 0$.
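As a concrete illustration, the Albert and Chib (1993) data-augmentation update can be sketched as follows. This is a hedged sketch in Python rather than the paper's implementation (which is in Matlab): the latent $z_j$ is drawn from a unit-variance normal truncated to be positive when $\gamma_j = 1$ and non-positive when $\gamma_j = 0$. The probit linear predictor `eta` (for example, built from the hyperparameters $\alpha_0$, $\alpha_1$ and a feature reliability score) is an assumed placeholder, not a quantity defined in this section.

```python
import numpy as np
from scipy.stats import truncnorm

def sample_z(gamma, eta, rng):
    """Albert-Chib update: draw z_j from N(eta_j, 1) truncated to
    (0, inf) if gamma_j = 1 and to (-inf, 0] if gamma_j = 0.
    eta_j is an assumed probit-prior linear predictor, e.g.
    alpha0 + alpha1 * r_j with r_j a reliability score (hypothetical)."""
    z = np.empty_like(eta)
    for j, (g, m) in enumerate(zip(gamma, eta)):
        if g == 1:
            a, b = -m, np.inf        # standardized bounds for (0, inf)
        else:
            a, b = -np.inf, -m       # standardized bounds for (-inf, 0]
        z[j] = truncnorm.rvs(a, b, loc=m, scale=1.0, random_state=rng)
    return z
```

Given the sampled $z_j$, the conjugate updates for the probit-prior coefficients proceed as in a standard Bayesian probit regression.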

Sampling steps
Here we provide a summary of the MCMC algorithm, which, at a high level, consists of two steps. In each step, the parameter under consideration is updated conditional on the current values of the remaining parameters:

1. The first step of the MCMC algorithm is a Metropolis-Hastings (MH) step on $\gamma$, following the "add-delete-swap" approach to search over the space of possible feature combinations (Savitsky et al., 2011). Specifically, we choose an index $j$ at random, and then propose changing the value of $\gamma_j$ from 0 to 1 or from 1 to 0, or swapping two $\gamma_j$'s with opposite values, accepting the proposed new $\gamma^N$ over the original $\gamma^O$ with the MH acceptance probability.

2. The next step is a random-walk MH step to sample the $\mu_{k(\gamma)}$. New values are proposed for every element $\mu_{kj}$ that is in the model at that iteration, as $\mu^N_{kj} = \mu^O_{kj} + \epsilon$, with $\epsilon \sim N(0, v^2)$, for $j = 1, \ldots, p_\gamma$ and $k = 1, \ldots, K$. Since the proposal is symmetric, this transition is accepted with probability $\min\left\{ \frac{p(\mu^N_{kj} \mid \cdot)}{p(\mu^O_{kj} \mid \cdot)},\; 1 \right\}$. (S4)
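For illustration, the add-delete-swap proposal in step 1 can be sketched as follows. This is a minimal Python sketch of the proposal mechanism only; the paper's implementation is in Matlab, and the acceptance step, which requires the model-specific marginal likelihoods, is not reproduced here.

```python
import numpy as np

def propose_gamma(gamma, rng):
    """One add-delete-swap proposal for the binary inclusion vector gamma.
    With probability 1/2, flip a randomly chosen gamma_j (add or delete);
    otherwise swap a randomly chosen included index with a randomly chosen
    excluded index. Falls back to a flip when a swap is impossible."""
    gamma_new = gamma.copy()
    ones = np.flatnonzero(gamma == 1)
    zeros = np.flatnonzero(gamma == 0)
    if rng.random() < 0.5 or len(ones) == 0 or len(zeros) == 0:
        j = rng.integers(len(gamma))
        gamma_new[j] = 1 - gamma_new[j]          # add or delete
    else:
        gamma_new[rng.choice(ones)] = 0          # swap: drop one feature ...
        gamma_new[rng.choice(zeros)] = 1         # ... and add another
    return gamma_new
```

An add or delete move changes the model size by one, while a swap keeps it fixed, which helps the chain move within a given sparsity level.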

Posterior summarization
Based on the sampled values for $\gamma$, we can compute the marginal posterior probability of inclusion for each feature $j$ as the MCMC average of $\gamma_j$. To identify a final set of selected features $\gamma^*$, we take the set of features whose marginal posterior probabilities of inclusion exceed a chosen threshold. Given the selected set of features, there are several possible approaches for obtaining posterior estimates of the corresponding $\mu$ values. If there are sufficiently many MCMC iterations in which the set of included features matches the final selected set, the posterior estimate of $\mu_{jk}$ for each selected feature $j$ in each group $k$ can be obtained as the average of the sampled values over those iterations. For settings with a larger number of selected predictors, matching the full configuration of $\gamma$ becomes a combinatorial search problem, and there may not be a sufficient number of iterations with the same configuration; instead, $\mu_{jk}$ can be estimated as the marginal average for each selected predictor $j$ over the iterations in which it was included, or additional MCMC iterations can be run to resample $\mu_{k(\gamma)}$ with $\gamma$ fixed to the final set of selected features.
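A minimal sketch of this posterior summarization, assuming the MCMC draws are stored as arrays (the names `gamma_draws` and `mu_draws` are hypothetical):

```python
import numpy as np

def summarize_selection(gamma_draws, mu_draws, threshold=0.5):
    """gamma_draws: (T, p) array of sampled inclusion indicators.
    mu_draws: (T, K, p) array of sampled group means (only meaningful
    at iterations where the corresponding feature was included).
    Returns the marginal PPIs, the thresholded selection, and the
    marginal posterior mean of mu_jk averaged over the iterations
    in which feature j was included."""
    ppi = gamma_draws.mean(axis=0)                 # MCMC average of gamma_j
    selected = np.flatnonzero(ppi > threshold)
    T, K, p = mu_draws.shape
    mu_hat = np.full((K, p), np.nan)               # NaN for unselected features
    for j in selected:
        keep = gamma_draws[:, j] == 1              # iterations with j included
        mu_hat[:, j] = mu_draws[keep][:, :, j].mean(axis=0)
    return ppi, selected, mu_hat
```

This corresponds to the marginal-average estimator described above; the conditional estimator (averaging only over iterations whose full $\gamma$ matches $\gamma^*$) would restrict `keep` to those iterations instead.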

Prediction
The fitted model can also be used to make a classification prediction for newly observed data. Given a new observation with feature data $x^{new}$, we use the predictive distribution of $x^{new}$, $p_k(x^{new} \mid \gamma^*)$, to classify the sample. Note that this distribution uses only the selected set of features $\gamma^*$. In the term $b_k^{new}$ of this formulation, $\mu_{jk}$ is the posterior estimate computed as described in the previous section. We want to find the probability that the new data sample belongs to each of the groups $k$, given the existing observed data. Thus, using $g^{new}$ as the class membership variable of the new observation, we want to find the prediction probability $P(g^{new} = k \mid x^{new})$. We assume that the membership proportions between classes are exchangeable between the training and future/validation data sets, and so can estimate the probability of an observation being in a given group $k$, $\pi_k = P(g = k)$, as $\hat{\pi}_k = n_k / n$. This assumption allows us to combine the expressions above, and the new data observation is then assigned to the class $k$ for which this probability has the greatest value.
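A simplified numerical sketch of this classification rule follows. Here the predictive density is replaced by a plug-in Gaussian using the posterior mean estimates and an assumed common scale `sigma_hat`; this is a simplifying assumption for illustration, not the paper's exact predictive form, which integrates over the posterior and involves the term $b_k^{new}$.

```python
import numpy as np
from scipy.stats import norm

def predict_class(x_new, mu_hat, sigma_hat, n_k):
    """Assign x_new (restricted to the selected features gamma*) to the
    class maximizing pi_k * p_k(x_new | gamma*), with pi_k estimated by
    the training proportions n_k / n.  p_k is approximated by a plug-in
    Gaussian with posterior means mu_hat[k] and common scale sigma_hat
    (an assumption made for this sketch)."""
    n_k = np.asarray(n_k, dtype=float)
    log_pi = np.log(n_k / n_k.sum())                       # log pi_k = log(n_k / n)
    log_lik = np.array([norm.logpdf(x_new, loc=mu_hat[k], scale=sigma_hat).sum()
                        for k in range(len(n_k))])
    log_post = log_pi + log_lik
    probs = np.exp(log_post - log_post.max())
    probs /= probs.sum()                                   # P(g_new = k | x_new)
    return int(np.argmax(probs)), probs
```

Working on the log scale before normalizing avoids numerical underflow when many features contribute to the likelihood.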

ADDITIONAL SIMULATION SCENARIOS
In this section, we summarize comparative performance for the following additional simulation scenarios: stronger class imbalance, varying number of predictors, varying signal strength, and varying ratio of discriminatory to non-discriminatory features. The basic simulation design is as in the primary simulation reported in Section 4.1.2 of the main manuscript. All simulation results reflect averages over 100 simulated data sets.
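For reference, the evaluation metrics used throughout these tables can be computed directly from confusion-matrix counts, as in the following small self-contained sketch:

```python
import numpy as np

def mcc(tp, fp, tn, fn):
    """Matthews correlation coefficient from confusion-matrix counts
    (here applied to feature selection: selected vs. truly discriminatory).
    Returns 0 by convention when any marginal total is zero."""
    num = tp * tn - fp * fn
    den = np.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return num / den if den > 0 else 0.0

def youden(tp, fp, tn, fn):
    """Youden's index J = TPR - FPR = sensitivity + specificity - 1,
    used here to summarize class-label prediction on the test set."""
    tpr = tp / (tp + fn)
    fpr = fp / (fp + tn)
    return tpr - fpr
```

Both metrics equal 1 for perfect selection or classification and 0 for chance-level performance, which makes them convenient summaries under class imbalance.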

Stronger class imbalance
In addition to the even split (50% vs. 50%) and moderate imbalance (60% vs. 40%) settings summarized in Section 4 of the main manuscript, we performed an additional simulation study with a stronger class imbalance (90% vs. 10%), keeping the total sample size fixed at n = 100. Simulation results are summarized in terms of variable selection (Table S1) and classification accuracy on a test set (Table S2). The strong class imbalance makes both the variable selection and prediction tasks more challenging, resulting in lower MCC values for feature selection and lower Youden's index for the prediction of class labels than in the simulations included in the main manuscript with either balanced or moderately imbalanced data.

Table S2. Classification accuracy: The mean TPR, FPR, and Youden's index when using data with a 90% vs. 10% class imbalance as described in Section 2.1.

Varying number of predictors
The primary simulation setting includes 4 discriminatory features and 100 noise features. Here, we consider additional scenarios with varying $p$, keeping the ratio of discriminatory to noise features the same. Specifically, we considered scenarios with the following settings for the number of discriminatory features $p_d$ and the number of non-discriminatory features $p_n$: $p_d = 1$, $p_n = 25$; $p_d = 2$, $p_n = 50$; and $p_d = 8$, $p_n = 200$. Simulation results are summarized in terms of variable selection (Table S3) and classification accuracy on a test set (Table S4). In general, the results were consistent with the primary simulation, in that the proposed RVS method achieved the highest MCC for feature selection and the highest Youden's index for classification. Across all methods, classification accuracy improved as the number of discriminatory predictors increased.

Varying signal strength
The primary simulation assumes a mean for the discriminatory features of 1 in the first group and −1 in the second group. Here, we consider reducing the signal strength by decreasing the separation between the group means for the discriminatory features. Specifically, we considered settings with group means for the discriminatory features of −0.6 in the first group vs. 0.6 in the second group, and −0.8 in the first group vs. 0.8 in the second group. The most noticeable trend was a decrease in Youden's index for correct group label prediction on the test data across all methods with decreasing separation between the groups.

Varying ratio of discriminatory to non-discriminatory features
In this section, we summarize simulation results for scenarios with the same number of non-discriminatory features as in the primary simulation ($p_n = 100$), but either fewer ($p_d = 2$) or more ($p_d = 10$) discriminatory variables. In the setting with very sparse signal, the lasso tended to select too many features, resulting in a higher FPR and lower MCC relative to the other simulation scenarios considered. Across all methods, the setting with a lower ratio of discriminatory features to noise features resulted in lower classification accuracy (Youden's index) on the test set.

Table S8. Classification accuracy: The mean TPR, FPR, and Youden's index in simulation settings with varying ratio of discriminatory to non-discriminatory features, as described in Section 2.4.

Timing comparison
In this subsection, we consider the scalability of RVS, lasso, and SVM in terms of computing time for both increasing p and increasing n. All three methods were run using Matlab R2020b. We conducted simulations with various settings for p, the number of predictors, as described in Section 2.2. As shown in Figure S1, the proposed method (RVS) is slower than lasso or SVM; however, the time to run 100,000 MCMC iterations for p = 208 is still less than 1 minute on average, which is not exorbitant. As illustrated in Figure S2, we considered settings with n up to 10,000. These simulation results show that the timing for RVS was stable for increasing n, while the time for SVM increased relatively steeply for n ≥ 5000.

SENSITIVITY ANALYSIS
To assess the model's sensitivity to the hyperparameter settings, we examined how the posterior probability of inclusion (PPI) changes for both the discriminatory variables and the noise variables as the hyperparameter values are varied. The plots in Figure S3 show the change in the average PPI for discriminatory variables and noise variables in the simulation setting as the values of the three parameters $c$, $\alpha_0$, and $\alpha_1$ are varied.

CONVERGENCE
In this section, we provide additional details on the convergence of $\mu$ in the RVS model for both the simulations and the case study. Traceplots for the selected variables for an example simulated data set from the primary simulation described in Section 4.1.2 of the main manuscript are provided in Figure S4. Traceplots for the selected variables for the case study described in Section 4.2 of the main manuscript are provided in Figure S5. These plots were created using the mcmc_trace() function from the R package bayesplot (Gabry et al., 2019). $\hat{R}$ values were computed as described in Section 4.1.5 of the main manuscript. A histogram of the $\hat{R}$ values obtained across the 100 simulated data sets for the primary simulation described in Section 4.1.2 of the main manuscript is provided in Figure S6. The mean $\hat{R}$ value was 1.002, the median was 1.001, and the maximum was 1.048. $\hat{R}$ values for the case study are reported in Table S9.

Figure S2: Timing comparison with increasing n. The x-axis shows the total number of samples n, and the y-axis shows the mean computational time in seconds, along with a vertical bar for the 95% confidence interval across simulated data sets. For RVS, the runtime includes 100,000 MCMC iterations. For the lasso, the runtime includes cross-validation for penalty parameter selection.

Figure S3: Change in average PPI for discriminatory variables (in black) and noise variables (in red) as the values for $c$, $\alpha_0$, and $\alpha_1$ are varied.

We also explored how reducing the number of MCMC iterations affected convergence. When we ran shorter chains with 10,000 iterations as the basis for inference, the mean $\hat{R}$ value was 1.018, the median was 1.011, and the maximum was 1.374 across all simulated data sets. These results suggest that most chains converged well, with a limited number of cases showing somewhat higher $\hat{R}$ values. Since the computational time was not prohibitive, we allowed for a more generous number of MCMC iterations to ensure convergence was reached for all features in all settings. For the case study, the mean, median, and maximum $\hat{R}$ values for $\mu$ with 10,000 post-burn-in iterations were 1.0018, 1.0016, and 1.0044.
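For completeness, the $\hat{R}$ diagnostic referenced above can be computed as in the following sketch. This is a standard split-chain potential scale reduction factor, not necessarily the exact variant used in the manuscript.

```python
import numpy as np

def rhat(chains):
    """Split-chain potential scale reduction factor (Gelman-Rubin R-hat).
    chains: (m, n) array of m chains with n post-burn-in draws each.
    Each chain is split in half before comparing between-chain and
    within-chain variances; values near 1 indicate convergence."""
    m, n = chains.shape
    half = n // 2
    splits = np.vstack([chains[:, :half], chains[:, half:2 * half]])
    m2, n2 = splits.shape
    chain_means = splits.mean(axis=1)
    W = splits.var(axis=1, ddof=1).mean()      # within-chain variance
    B = n2 * chain_means.var(ddof=1)           # between-chain variance
    var_plus = (n2 - 1) / n2 * W + B / n2      # pooled variance estimate
    return np.sqrt(var_plus / W)
```

Applied per parameter (here, each $\mu_{jk}$), this yields the values summarized in the histograms and in Table S9.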