Identifying predictors of clinical outcomes using the projection-predictive feature selection—a proof of concept on the example of Crohn’s disease

Objectives: Several clinical disease activity indices (DAIs) have been developed to noninvasively assess mucosal healing in pediatric Crohn's disease (CD). However, their clinical application can be complex. Therefore, we present a new way to identify the most informative biomarkers for mucosal inflammation from current markers in use and, based on this, how to obtain an easy-to-use DAI for clinical practice. A further aim of our proof-of-concept study is to demonstrate how the performance of such a new DAI can be compared to that of existing DAIs.
Methods: The data of two independent study cohorts, with 167 visits from 109 children and adolescents with CD, were evaluated retrospectively. A variable selection based on a Bayesian ordinal regression model was applied to select clinical or standard laboratory parameters as predictors of an endoscopic outcome. The predictive performance of the resulting model was compared to that of existing pediatric DAIs.
Results: With our proof-of-concept dataset, the resulting model included C-reactive protein (CRP) and fecal calprotectin (FC) as predictors. In general, our model performed better than the existing DAIs. To show how our Bayesian approach can be applied in practice, we developed a web application for predicting disease activity for a new CD patient or visit.
Conclusions: Our work serves as a proof of concept, showing that the statistical methods used here can identify biomarkers relevant for the prediction of a clinical outcome. In our case, a small number of biomarkers is sufficient, which, together with the web interface, facilitates clinical application. However, the retrospective nature of our study, the rather small amount of data, and the lack of an external validation cohort do not allow us to consider our results as the establishment of a novel DAI for pediatric CD. This needs to be done in the future with the help of a prospective study with more data and an external validation cohort.


Supplementary Appendix for "Identifying predictors of clinical outcomes using the projection-predictive feature selection—a proof of concept on the example of Crohn's disease"
1 Statistical methods

1.1 Rationale
The rationale behind our statistical approach (a Bayesian ordinal regression model and a projection-predictive feature selection) is as follows:
1. Existing disease activity indices (DAIs) (1-6) largely rely on the discretization of predictors, which comes with a loss of information and precision.
2. Regression models provide a natural way for prediction (7) and can avoid the discretization of predictors.
3. Since the outcome (the endoscopic score) is ordinal, an ordinal regression model is the most appropriate one.
4. Bayesian statistics has various advantages compared to frequentist statistics, most notably a more intuitive interpretation (8)(9)(10). Thus, it makes sense to use a Bayesian ordinal regression model.
5. In a Bayesian framework, the projection-predictive feature selection (PPFS) implemented in the R (11) package projpred (12)(13)(14) is a state-of-the-art variable selection method (13)(14)(15) which is especially suitable for prediction problems like ours. In particular, the PPFS allows for valid post-selection inference (apart from the potential, but usually small (15), overfitting induced by the selection of the submodel size) because it does not discard the uncertainty inherent to the reference model (here, the full model containing all candidate predictors) (15).
The rather complex calculation of the predictive probabilities (based on the model selected by the PPFS) is performed most conveniently by a Shiny (16) web application that users interact with via a simple graphical interface.
In settings with a binary outcome, a receiver operating characteristic (ROC) curve is a widely applied and useful tool to compare different diagnostic tests (or DAIs, which may be considered as diagnostic tests). In our case, however, the outcome (the endoscopic score) has four categories, hence the need for a more general method.

1.2 Bayesian statistics
An introduction to Bayesian ordinal regression models (especially in the context of the R package brms (17)(18)(19)) is given in (20). For readers not familiar with Bayesian statistics in general, we give a short introduction here, taken mainly from the shinybrms (21) application: The fundamental principle of Bayesian statistics is Bayes' theorem. In the context relevant for this publication, Bayes' theorem may be simplified to the statement that the joint posterior density of all parameters in the regression model is proportional to the product of their joint prior density and the likelihood. The posterior distribution (corresponding to the posterior density) reflects the knowledge about the parameters after seeing the data. The prior distribution (corresponding to the prior density) reflects the knowledge about the parameters before having seen the data. The likelihood corresponds to the distribution of the outcome conditional on the parameters and the predictors. Thus, after having specified the likelihood and the prior (distribution), a Bayesian data analysis aims to infer the posterior (distribution) and then to perform analyses based on the posterior, e.g., by plotting marginal posterior distributions and calculating their 2.5%, 50%, and 97.5% quantiles. Inferring the posterior is the Bayesian way to estimate unknown parameters.
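To make "posterior ∝ prior × likelihood" concrete, the following minimal numerical illustration (in Python, with purely hypothetical values; the analyses in this appendix were carried out in R) evaluates Bayes' theorem for a parameter restricted to three candidate values:

```python
from math import comb

import numpy as np

# Hypothetical example: a success probability theta with three candidate
# values, a uniform prior, and 7 successes observed in 10 binomial trials.
theta = np.array([0.2, 0.5, 0.8])
prior = np.array([1 / 3, 1 / 3, 1 / 3])
y, n = 7, 10
likelihood = np.array([comb(n, y) * t**y * (1 - t) ** (n - y) for t in theta])

# Bayes' theorem (simplified): posterior proportional to prior times likelihood.
unnormalized = prior * likelihood
posterior = unnormalized / unnormalized.sum()
# The posterior concentrates on theta = 0.8, the candidate value most
# compatible with the observed data under this prior.
```

The same principle carries over unchanged to the continuous, multi-parameter posterior of a regression model; there, the normalization is handled implicitly by MCMC sampling.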

1.3 Projection-predictive feature selection
The projection-predictive feature selection (PPFS) implemented in the R package projpred (12)(13)(14) is a predictor selection method whose resulting models have a predictive performance close to the gold standard prediction method (Bayesian model averaging or, equivalently, a spike-and-slab prior applied to the full model) (15). In the following, we give more details on the PPFS in general as well as on its use in our proof-of-concept study. In doing so, we mention the values to which the arguments of the involved R functions are set, at least for the most important ones. The full R code is supplied together with our manuscript.
The PPFS requires a reference model, which should be the best model (in terms of predictive performance) one can construct (13). Our approach of constructing the reference model from the full model (i.e., the model including all candidate predictors) with a regularized horseshoe (RH) prior (22) is a standard approach for constructing the reference model, as previously described (13)(14)(15)(23). The RH prior is a so-called "shrinkage" prior for population-level regression coefficients (also termed "fixed" effects). The idea of the RH prior is to shrink irrelevant coefficients to zero and leave relevant coefficients as they are. We infer the global scale parameter of the RH prior from the ratio of the number of a priori deemed relevant regression coefficients to the number of a priori deemed irrelevant regression coefficients. Thereby, we set the number of a priori deemed relevant regression coefficients to 5, which is based on a rough guess of the total number of regression coefficients implied by the existing disease activity indices (DAIs). However, we have found our results to be largely insensitive to this number.
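As a sketch of this global-scale heuristic, the calculation below follows the formula proposed for the RH prior in (22). Only the value 5 is taken from our text; the total number of candidate coefficients, the number of observations, and the residual pseudo-scale are hypothetical placeholders:

```python
import numpy as np

# Heuristic for the global scale tau0 of the regularized horseshoe prior,
# following Piironen & Vehtari (2017): tau0 = p0 / (D - p0) * sigma / sqrt(n).
p0 = 5        # a priori deemed relevant coefficients (as in the appendix)
D = 20        # total number of candidate coefficients (hypothetical)
n = 131       # number of observations (hypothetical placeholder)
sigma = 1.0   # residual pseudo-scale for a non-Gaussian model (assumption)

tau0 = p0 / (D - p0) * sigma / np.sqrt(n)
# Small tau0 values encode the prior belief that only a few of the D
# coefficients are substantially different from zero.
```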
With the reference model set up, the PPFS first tries to find the smallest subset of predictors which still achieves a predictive performance as close as possible to the reference model's predictive performance. To avoid the investigation of all possible subsets of predictors (i.e., all possible submodels, which is computationally expensive), the PPFS runs a heuristic search to identify a single (best) submodel for each submodel size. For a multilevel reference model as in our case, the so-called "forward search" is the only option available in projpred. During this forward search, the reference model is "projected" onto each submodel that the search passes through (in fact, this projection is a central component of the PPFS and is achieved by minimizing the discrepancy from the reference model's predictive distribution to the respective submodel's predictive distribution). For the "cumulative" outcome family we are using here, we apply the latent projection (24) as implemented in projpred. For post-projection analyses, we rely on projpred's default of so-called "response-scale analyses", which means that internally, the ordinal intercepts (the latent category thresholds) from the reference model are employed to turn a latent Gaussian submodel back into an ordinal regression model. To save computation time, we stop the forward search at submodel size 3 (this is sufficient here to see that size 2 should be chosen, see Figure 2 of the main article). For the number of clusters (of posterior draws) employed during that search, we use the default value of 20. We also tried a high value of 800 clusters (results not shown), but the full-data predictor ranking stayed the same and the fold-wise predictor rankings essentially did, too (there were only two CV folds with very minor differences in the predictor rankings). Thus, we picked the default value of 20 clusters, which is computationally easier to handle. After the search part, the PPFS consists of a performance evaluation part, yielding
a plot of a user-specified predictive performance measure in dependence of the submodel size. As predictive performance measure, we choose the mean log predictive density (MLPD) because it is well-established (13,25) and because exponentiating it to the base of the natural logarithm gives the geometric mean predictive density (GMPD) as a closely related predictive performance measure with an intuitive interpretation in our case: Here, the GMPD is the geometric mean of the predictive probabilities at the observed outcome categories (and hence restricted to the interval [0, 1]). For the creation of the predictive performance plot that is used for deciding on a submodel size, the projpred package projects the reference model onto the best submodel of each size again. In these re-projections, we use 2666 thinned posterior draws (2666 is calculated as 8000 / 3, rounded down to the nearest integer; 8000 is the number of post-warmup posterior draws for the reference model, see section 1.4 here in the Supplementary Appendix). We cannot use the full 8000 posterior draws but only approximately every third of them due to computational constraints. However, since the results were only slightly affected by our change from the default of 400 thinned draws (not shown here) to 2666 thinned draws, we think that switching to the full 8000 draws would have had only a very minor impact on the results. By inspecting the predictive performance plot created from the PPFS's performance evaluation results, the user then has to select a submodel size.
Here, we choose the submodel size from which onward the predictive performance measure does not improve anymore.
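The relation between the MLPD and the GMPD can be checked numerically (in Python, with hypothetical predictive probabilities):

```python
import numpy as np

# Hypothetical predictive probabilities at the observed outcome categories:
pp = np.array([0.6, 0.4, 0.7, 0.5])

mlpd = np.mean(np.log(pp))  # mean log predictive density
gmpd = np.exp(mlpd)         # geometric mean predictive density

# The GMPD equals the geometric mean of the predictive probabilities and
# is therefore restricted to the interval [0, 1]:
assert np.isclose(gmpd, pp.prod() ** (1 / len(pp)))
```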
With a chosen submodel size, the selected submodel is given by the (best) submodel corresponding to that size. The reference model is then projected once again onto the selected submodel. Here, we perform this final projection using the "draw-by-draw" method (13). These final projection results are used for our Shiny web application as well as for Figure 5 of the main article.
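As an illustration of how latent category thresholds turn a latent linear predictor into predictive probabilities for the four ordinal outcome categories (the general mechanism behind a cumulative model with a logit link; all numeric values below are hypothetical):

```python
import numpy as np

def logistic(x):
    return 1.0 / (1.0 + np.exp(-x))

# Hypothetical latent thresholds (3 thresholds separate 4 ordered categories)
# and a hypothetical latent linear predictor eta for one visit:
thresholds = np.array([-1.0, 0.5, 1.8])
eta = 0.3

# Cumulative model: P(y <= k) = logistic(threshold_k - eta); the category
# probabilities are successive differences of these cumulative probabilities.
cdf = logistic(thresholds - eta)
probs = np.diff(np.concatenate([[0.0], cdf, [1.0]]))
# probs holds one probability per outcome category and sums to 1.
```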
For the PPFS, it is recommended to perform a cross-validation (CV) around the heuristic search and the performance evaluation (13,15). Here, we perform a 25-fold CV.
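The greedy forward search described above can be sketched generically as follows (a conceptual Python sketch, not projpred's actual implementation; the scoring function and the relevance weights are hypothetical stand-ins for projpred's projection-based closeness to the reference model):

```python
# Generic greedy forward search: at each step, add the candidate predictor
# whose submodel scores best, yielding one (best) submodel per size.
def forward_search(score, candidates, max_size):
    """score(subset) -> higher is better; returns the predictor ranking."""
    selected = []
    remaining = list(candidates)
    while remaining and len(selected) < max_size:
        best = max(remaining, key=lambda c: score(selected + [c]))
        selected.append(best)
        remaining.remove(best)
    return selected

# Hypothetical relevance weights standing in for closeness to the reference
# model (the predictor names are only illustrative):
relevance = {"FC": 3.0, "CRP": 2.0, "ESR": 1.0, "albumin": 0.5}
ranking = forward_search(lambda s: sum(relevance[c] for c in s),
                         relevance.keys(), max_size=3)
# The search stops at size 3, mirroring the early stopping described above.
```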
Furthermore, we guard against the possibility of an illegitimate advantage of our approach (which considers group-level effects for the patient identifiers (patient IDs) as a candidate predictor) compared to the existing DAIs (which do not use group-level effects, see section 1.5 here in the Supplementary Appendix) by treating the original patient IDs as new patient IDs during predictions. This is achieved by setting projpred.mlvl_pred_new (a global R option used by the projpred package) to TRUE. This choice also reflects the later prediction task in clinical practice: In the Shiny application that we propose, all user input is treated as belonging to a new patient because the purpose of the Shiny application is to generalize prediction beyond the dataset used for training the model.
The reason why we subject the IBD center as well as the group-level effects for the patient IDs to variable selection is simply a technical one: Within projpred, it is currently a lot easier (and, in some regard, more efficient in terms of memory) to consider all predictors from the reference model as candidate predictors than to exclude some. Retrospectively, our approach can be justified by the fact that neither of these two terms has been among the two most relevant terms (neither in the full-data nor in the fold-wise searches; fold-wise predictor rankings not shown here for the sake of brevity). If either of these two terms had been among the two most relevant, then it would have made sense to run a final variable selection with these two terms excluded from the set of candidate predictors. Nonetheless, first running the "simpler" variable selection (with these two terms considered as candidate predictors) can give helpful insight: If they had been among the two most relevant terms, this would have meant that they are relevant for prediction, which would have revealed some important peculiarity of the data.

1.4 Reference model fit and analysis
For fitting our reference model, we increase the number of Markov chain Monte Carlo (MCMC) iterations per Markov chain from the default (2000) to 4000. Given that half of this number is for warmup (by default) and given the default of 4 Markov chains, this yields a total of 8000 post-warmup posterior draws used for inference here. Furthermore, we set the target Metropolis acceptance rate to 0.99 (the default in the R package rstanarm (26), for example, is 0.95, but we had to increase it a little further to aim for smaller step sizes of the underlying MCMC sampler and hence to eliminate spurious divergences) and the maximum tree depth to 15 (which is the default for most rstanarm models, for example).
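The draw counts used here and in section 1.3 follow from simple arithmetic on the defaults mentioned in the text:

```python
# Number of post-warmup posterior draws for the reference model:
chains = 4                    # default number of Markov chains
iter_per_chain = 4000         # increased from the default of 2000
warmup = iter_per_chain // 2  # by default, half of the iterations are warmup

post_warmup_draws = chains * (iter_per_chain - warmup)  # 8000 draws
thinned_draws = post_warmup_draws // 3                  # 2666 draws used in
                                                        # the re-projections
```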
As part of a Bayesian modeling workflow (27), it is recommended to perform posterior predictive checks (PPCs): A good model should be able to generate outcome values close to the observed ones when given the observed data to learn from. Here, we do not present PPCs for the reference model since the reference model should have a PPC performance at least as good as that of the selected submodel. We were able to confirm this in our case (results not shown). Furthermore, Figure 6 of the main article includes the predictive performance of the selected submodel and thus can also be interpreted as a conservative out-of-sample PPC for the reference model.
As another step to ensure that we have a reasonable reference model, we use the R package loo (28) to estimate the Pareto k-values and the effective number of parameters from a Pareto-smoothed importance sampling leave-one-out cross-validation (PSIS-LOO CV) (29,30):
1. For each observation (visit), we obtain a single estimated Pareto k-value. High Pareto k-values (> 0.7) indicate highly influential observations (29,30), which is usually undesired and often requires adjustments to the model. Even more importantly, they also indicate that the corresponding PSIS-LOO CV weights (as well as all downstream quantities based upon them, including the effective number of parameters from the next step) may be unreliable.
2. After inspecting the Pareto k-values from the original PSIS-LOO CV, we update the PSIS-LOO CV results for visits with a Pareto k-value > 0.7 by refitting the model with each of these observations left out, as implemented in the brms package. Then, after these updates, we obtain a single effective number of parameters for the whole model. We wish to have an effective number of parameters smaller than the number of visits and smaller than the total number of parameters in the model (31).
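The two steps above can be sketched as follows (a conceptual Python sketch with purely hypothetical numbers; in practice, loo and brms perform these computations):

```python
import numpy as np

# Step 1: flag observations whose estimated Pareto k exceeds the 0.7
# threshold (hypothetical k values for illustration):
pareto_k = np.array([0.12, 0.45, 0.72, 0.30, 0.55])
needs_refit = np.flatnonzero(pareto_k > 0.7)  # indices requiring exact refits

# Step 2: after the refits, the effective number of parameters p_loo is the
# difference between the in-sample log predictive density and the LOO
# estimate of the expected log predictive density (hypothetical values):
lpd_insample = -120.0
elpd_loo = -155.0
p_loo = lpd_insample - elpd_loo  # should be smaller than both the number of
                                 # observations and the total parameter count
```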
Since the primary purpose of our regression model is prediction, we do not visualize the marginals of the selected submodel's projected posterior directly, but we create conditional-effects plots (Figure 5 of the main article) whose interpretation is explained in the respective figure caption.

1.5 Comparison of the selected submodel to existing pediatric Crohn's disease activity indices
Missing values in our dataset caused different numbers of usable visits for the different DAIs when performing a complete-case analysis for each DAI. In principle, it would have been desirable to train the DAIs and cross-validate their predictive performance on a common dataset (the largest possible one), but reducing our proof-of-concept dataset to the largest possible common dataset across all DAIs would have left only 46 visits. Thus, the comparison of the DAIs presented in Figure 6 of the main article takes the maximum possible number of visits for a separate complete-case analysis of each DAI (which results in datasets of differing sizes for the different DAIs, as indicated by "N" in Figure 6). A future validation study of our proof-of-concept analysis presented here should conduct the comparison of all DAIs on a common dataset.
The comparison itself takes place via cross-validated predictive probabilities (CV-PPs; termed "predictive probabilities for the observed endoscopic inflammation" in the main article): First, for each existing DAI, we fit one Bayesian ordinal regression model similar to the reference model described in section 2.4 of the main article (and in sections 1.3 and 1.4 here in the Supplementary Appendix), but with the respective DAI (before any categorization of its score value, i.e., using its underlying continuous score) as the only predictor. For the regression coefficient of this single predictor, we use brms's default (flat) prior. (In principle, a flat prior for regression coefficients is not recommended due to the risk of overfitting, but in our case, a more restrictive prior could have meant a systematic disadvantage for the existing DAIs, and thus an advantage for our selected submodel. In this regard, the flat prior is a conservative choice with respect to the comparison with our selected submodel.) Then, for a given DAI (either an existing one or the selected submodel) and a given visit i ∈ {1, …, N} (with N denoting the number of visits), the CV-PP is the probability given to the actually observed outcome category of visit i by the DAI's regression model when leaving the CV fold containing visit i out of the model building process. Thus, higher CV-PPs are better than lower ones. For the existing DAIs, we use a 25-fold CV, as in the PPFS yielding the selected submodel. Finally, the CV-PPs are illustrated using separate boxplots (one per DAI) with the individual values (those of which the boxplots consist) overlaid as jittered points. Furthermore, we use such jittered boxplots also for the CV-PP differences, comparing the selected submodel vs. each existing DAI, so that a CV-PP difference greater than zero speaks in favor of the selected submodel.
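The logic of the CV-PP differences can be illustrated with hypothetical values (a Python sketch; in the actual analysis, one boxplot is drawn per DAI and per difference):

```python
import numpy as np

# Hypothetical CV-PPs (one value per visit) for the selected submodel and
# for one existing DAI:
cvpp_submodel = np.array([0.70, 0.55, 0.62, 0.48, 0.66])
cvpp_existing = np.array([0.50, 0.52, 0.40, 0.45, 0.58])

# Pairwise differences per visit; values above zero favor the selected submodel:
diff = cvpp_submodel - cvpp_existing
median_diff = np.median(diff)
# A median difference above zero indicates that the selected submodel assigns
# higher probabilities to the observed outcome categories for most visits.
```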
For each of the existing DAIs, this CV-PP approach implicitly assumes that its underlying continuous score has a linear effect on the latent predictor of its ordinal regression model.In section 2.2, we provide a sensitivity analysis showing that our results would not have changed much when allowing these effects to be nonlinear (using brms's built-in support for smooth terms; details are provided in our publicly available R code).

2.1 Reference model
In the reference model, we observed only a single visit with a high Pareto k-value (> 0.7). Given such a small number of highly influential observations, we consider the model adequate in this regard. The refit for this single visit (see section 1.4 here in the Supplementary Appendix) was successful, yielding an estimate for the effective number of parameters of approximately 38.6, which is much smaller than the number of visits (131, after exclusion of visits with missing values) and the total number of model parameters (143), as desired.

2.2 Sensitivity analysis for nonlinear effects of the existing DAIs
Figure A1 replicates Figure 6 from the main article, but allowing for nonlinear effects in the existing DAIs' ordinal regression models. As can be seen by comparing Figure A1A to Figure 6A, this extra flexibility mostly leads to very similar CV-PP values, a noticeable exception being the MINI, for which the median CV-PP increases from ca. 45% to ca. 50%. Interestingly, however, comparing Figure A1B to Figure 6B reveals that even for the MINI, this extra flexibility does not change the CV-PP difference to the selected submodel considerably (for the MINI, the median CV-PP difference even increases when allowing for nonlinear effects, hence speaking even more strongly in favor of the selected submodel).

Figure A1. The same as Figure 6 from the main article, but allowing for nonlinear effects in the existing DAIs' ordinal regression models.