Regression calibration utilizing biomarkers developed from high-dimensional metabolites

Addressing systematic measurement error in self-reported data is a critical challenge in association studies of dietary intake and chronic disease risk. The regression calibration method has been used for error correction when an objectively measured biomarker is available; however, biomarkers have been developed for only a few dietary components. This paper proposes using high-dimensional objective measurements to construct biomarkers for many more dietary components and to estimate diet-disease associations. It also discusses the challenges of variance estimation in high-dimensional regression and presents a variety of techniques to address this issue, including cross-validation, degrees-of-freedom corrected estimators, and refitted cross-validation (RCV). Extensive simulations are performed to study the finite-sample performance of the proposed estimators. The proposed method is applied to the Women's Health Initiative cohort data to examine the association between the sodium/potassium intake ratio and total cardiovascular disease.


1. Introduction
The field of nutritional epidemiology plays a crucial role in understanding the impact of dietary patterns on human health. The ongoing exploration of associations between dietary components and chronic disease risks continually uncovers valuable insights. For instance, the well-established link between obesity and cancer risk (1) serves as a testament to the significance of this research. In order to effectively prevent and control chronic diseases, it is imperative to acquire detailed information on how key energy balance factors associate with the risks of major chronic illnesses [World Cancer Research Fund/American Institute for Cancer Research (2)]. Investigating the complex working mechanisms of these energy balance factors necessitates a comprehensive examination of the connections between multiple dietary components and disease risks. Establishing such associations, however, is far from simple. A major challenge stems from biases in dietary assessment, which are notoriously difficult to address (3). Strong evidence (4) suggests that the misreporting of dietary energy intake is associated with individual characteristics, such as body mass index (BMI). These systematic measurement errors result in estimation biases that cannot be automatically rectified (5). Moreover, correcting measurement errors becomes increasingly challenging when attempting to model dietary components jointly in the context of their relationships with chronic diseases.
Correcting measurement errors has been an important subject in statistical methodology development, greatly influencing nutritional studies (6). Various strategies have been developed to address these errors (7-15). One notable method, regression calibration, is particularly useful for handling covariate-dependent measurement errors and offers ease of implementation (16). Studies within the Women's Health Initiative (WHI) have demonstrated the effectiveness of joint regression calibration approaches in addressing measurement errors when objective biomarkers are available for all modeled dietary intakes (4, 17-19). These biomarkers inform calibration equations for self-reported measurements of exposure variables, which then provide calibrated intake estimates to better assess associations between dietary exposures and disease risks.
There is a significant research gap in generating reliable calibrated estimates for numerous nutritional and physical activity variables using single objective measurements. Consequently, regression models with multiple predictors have been developed from feeding studies to obtain calibrated estimates (4, 20). For instance, in the WHI Nutrition and Physical Activity Assessment Study (NPAAS), regression-based biomarkers have been established for single dietary components or energy balance factors (21). To address the systematic measurement errors in self-reported food frequency questionnaire (FFQ) data from a large cohort, blood and urine measurements were collected for a subgroup, while a feeding study (NPAAS-FS) was conducted on another smaller subgroup in which both blood and urine measurements and assessed dietary intake information were collected. This novel feeding study design aimed to improve the accuracy of capturing measurement errors in the FFQ (21). However, there are challenges in applying the regression calibration method, as the classical measurement error assumption is violated by the feeding study-based biomarker development procedure, which regresses the consumed nutrient on blood and urine measurements and personal characteristics. This issue arises because the residual of the regression model is independent of the predicted value instead of the actual one. Ignoring this violation results in biased estimates of the calibrated dietary intake and the diet-disease association due to Berkson-type errors (22). When developing biomarkers from objectively measured variables of low dimension, new calibration methods have been developed to account for Berkson-type errors in association studies of univariate nutritional variables (20). Zhang et al. 
(23) have extended this approach to multivariate nutritional variables, providing consistent estimators for disease associations of a single dietary component and valid confidence intervals for disease association parameters under rare disease settings. Nevertheless, for some macronutrient intakes, suitable biomarkers cannot be developed from low-dimensional measurements. High-dimensional metabolites offer an opportunity to establish valid biomarkers, but it remains an open question how to obtain valid inferences for such biomarkers.
In this paper, we concentrate on high-dimensional objective measurements for a univariate exposure of interest, where the sample size is smaller than the dimension of the variables, in constructing a biomarker model. High-dimensional variable selection constitutes a significant part of the rapidly advancing frontier of modern statistics. Over the past few decades, numerous studies have been dedicated to understanding the performance of various variable selection techniques. Frank and Friedman (24) first proposed a technique called bridge regression. Breiman (25) introduced the nonnegative garrote for shrinkage estimation and variable selection. Lasso, an l-1 regularized least squares method, was introduced by (26) for variable selection. Nonconcave penalized likelihood estimators, such as the smoothly clipped absolute deviation (SCAD), were proposed by (27) and (28). Efron et al. (29) presented least angle regression for variable selection and introduced the LARS algorithm. Zou and Li (30) proposed one-step sparse estimates for nonconcave penalized likelihood models and introduced the local linear approximation algorithm for optimizing nonconcave penalized likelihoods.
Building a biomarker model with high-dimensional sparse data requires predictive performance that can effectively address the challenges associated with such data. One issue that arises when working with high-dimensional models is collinearity among covariates, which can result in spurious correlations between variables (31). Numerous researchers have explored penalized regression techniques, such as Lasso and SCAD, to handle high-dimensional sparse data. Alternatively, variable selection can also be done by ranking predictive power using random forest (RF) (32). Variance estimation in high-dimensional models presents its own challenges, due to factors such as collinearity among covariates and the presence of spurious correlations. A variety of techniques have been proposed to address the issue of variance estimation in high-dimensional regression methods. Cross-validation (CV), a popular resampling technique, has been widely applied to assess the performance of different models and obtain unbiased variance estimates (33). The bootstrap, another resampling method, has been employed to estimate the variability of regression parameters (34). Degrees-of-freedom corrected estimators, such as the generalized degrees of freedom and the effective degrees of freedom, provide better error variance estimates by accounting for the complexity of the models (35, 36). The refitted cross-validation (RCV) method is a modification of the standard cross-validation procedure that improves the estimation of error variance in high-dimensional regression (37).
The remainder of the paper is organized as follows. In Section 2, we introduce the framework of the present study and the notation. In Section 3, we introduce the different methods and detail the variance estimation procedures. In Section 4, we conduct extensive simulations to evaluate the finite-sample performance of our proposed estimators. In Section 5, we apply our method to the WHI data to estimate the effect of macronutrient intakes on the risk of various chronic diseases. Finally, in Section 6, we present our conclusions and discussion.

2. Framework and notation
We aim to investigate the association between a particular type of dietary intake Z ∈ R [such as the (log-transformed) ratio of dietary sodium to potassium] and the time, T, to the emergence of a specific chronic illness. Rather than directly observing Z, however, we only gather information on self-reported dietary intake Q ∈ R, which may deviate from Z depending on individual characteristics:

Q = (1, Z, V^⊤) a + ǫ_q.    (1)

Here, a ∈ R^(2+q) is an unknown parameter vector, and ǫ_q is a random error with mean 0 that is independent of Z and V.
We also take into account potential confounding factors, referred to as personal characteristics V ∈ R^q, where q is the number of covariates. To model the hazard of the response, we employ a Cox model:

λ(t; Z, V) = λ_0(t) exp(θ_z Z + θ_v^⊤ V),    (2)

where θ_z is the parameter we are interested in, θ_v is the coefficient vector for V, and λ_0(t) represents a "baseline" hazard function.
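As a numerical illustration of the hazard model above, the short Python sketch below evaluates λ(t; Z, V) = λ_0(t) exp(θ_z Z + θ_v V) for a scalar V; the parameter values and the constant baseline hazard are illustrative assumptions, not quantities estimated in this paper.

```python
import math

def cox_hazard(t, z, v, theta_z=0.3, theta_v=-0.1, baseline=0.05):
    """Hazard at time t for covariates (z, v) under a constant baseline hazard."""
    return baseline * math.exp(theta_z * z + theta_v * v)

# Increasing z by one unit multiplies the hazard by exp(theta_z):
h1 = cox_hazard(1.0, z=0.0, v=0.0)
h2 = cox_hazard(1.0, z=1.0, v=0.0)
```

In this proportional-hazards form, the baseline hazard cancels out of hazard ratios, which is why only θ_z is needed to interpret the diet-disease association.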
In the NPAAS feeding study (NPAAS-FS), we furnish participants' meals with standardized food that closely mimics their regular diet and has well-documented nutrient content (21). The true unobservable dietary intake within the 2-week feeding period is denoted as X, which we model as

X = Z + ǫ_x.    (3)

In our current model, we assume that ǫ_x is independent of Z and V. However, condition (3) could be considered somewhat restrictive, given that the design of the feeding study is based on reported long-term dietary intake and not the actual diet. To address this, we modify this assumption so that the true short-term unobserved diet X does not necessarily need to be centered around Z. Additional specifics can be found in Section 3. One intricate issue related to the feeding study is the measurement error arising from food packaging. For example, a pack of chips labeled as 100 calories might in reality contain 101 calories. Consequently, the observed short-term dietary intake X̃ during the feeding study can be expressed as X̃ = X + ǫ̃_x, where ǫ̃_x ∼ N(0, σ̃_x²) is independent of ǫ_x, Z, and V. The study is organized into three stages: the feeding study (Sample 1) for biomarker development, the biomarker substudy (Sample 2) for calibration equation development, and the association study (Sample 3) using the complete cohort to establish the disease association.
When self-reported intake Q data from the feeding study sample are available, the bias of self-reported dietary intake can be directly calibrated (refer to Section 3.4). However, self-reported dietary intake Q is usually not available concurrently in Sample 1. To acquire such data, a long-term feeding study would be necessary, wherein participants report the dietary intake provided over the preceding months (e.g., 3 months). Furthermore, the Q value obtained just prior to the feeding period in NPAAS-FS is not collected at the same time as the biomarker W, and it might be inappropriately highly correlated with X̃.
As an alternative, we could employ a high-dimensional biomarker W ∈ R^p, comprised of p blood and urine measurements obtained objectively, as a bridge between X̃ from the feeding study sample and Q from a separate, larger sample. We assume that the blood and urine measurements W are influenced by the short-term diet X, whereas the self-reported questionnaire data are directly impacted by the long-term diet Z. We assume that W is possibly high-dimensional and follows a parametric model:

W^⊤ = (1, X, V^⊤) B + ǫ_w^⊤,

where B ∈ R^{(2+q)×p} is a matrix of unknown parameters and ǫ_w ∼ N(0, σ_w² I_p) is independent of ǫ_x, ǫ̃_x, ǫ_q, Z, V, and B. In practical terms, our best option is to utilize the baseline Q gathered at a separate time (for instance, at baseline for Sample 3) for Sample 1. This baseline Q has been effectively used in studies concerning various dietary components [e.g., protein and carbohydrate; (22)]. However, a time gap exists between the data collection for this baseline Q and the timing of the (X, V, W, Z) measurements in Sample 1. Consequently, there is a concern that the conditional distribution (Q|X, V, W, Z) in Sample 1 may differ from that in Samples 2 and 3 for specific dietary components. Even when Q is available, the feeding study's sample size is usually restricted, which could lead to less than optimal efficiency for disease association estimates. In such instances, we treat Q as unavailable in Sample 1 and use W to predict X̃.
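Under the stated assumptions, the measurement model linking Z, X, X̃, Q, and W can be simulated directly. The following Python sketch generates the three layers with illustrative parameter values (a, B, and all error standard deviations are assumptions chosen for illustration only):

```python
import numpy as np

# Generative sketch of the measurement model in this section:
#   X = Z + eps_x (short-term vs long-term intake),
#   X_obs = X + eps~_x (food-packaging error),
#   Q = (1, Z, V) a + eps_q (self-report model),
#   W^T = (1, X, V) B + eps_w (high-dimensional biomarker model).
rng = np.random.default_rng(0)
n, p, q = 500, 50, 1
Z = rng.normal(0.0, 1.0, n)                         # long-term intake
V = rng.normal(0.0, 1.0, (n, q))                    # personal characteristics
X = Z + rng.normal(0.0, 0.3, n)                     # eps_x independent of (Z, V)
X_obs = X + rng.normal(0.0, 0.1, n)                 # observed short-term intake
a = np.concatenate([[0.2, 0.8], [0.1] * q])         # illustrative self-report model
Q = np.column_stack([np.ones(n), Z, V]) @ a + rng.normal(0.0, 0.5, n)
B = rng.normal(0.0, 0.5, (2 + q, p))                # illustrative biomarker loadings
W = np.column_stack([np.ones(n), X, V]) @ B + rng.normal(0.0, 1.0, (n, p))
```

A simulation of this form underlies the numerical experiments in Section 4, where the sparsity and magnitude of the rows of B are varied.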
The process of estimating the association between Z and T is divided into three stages, each utilizing distinct, non-overlapping samples drawn from the same underlying population: 1. the biomarker development stage, 2. the calibration stage, and 3. the association stage. Each stage employs a different sample, and the sample size used in stage k is denoted as n_k. In Stage 1, there are n_1 samples, and for each individual i we observe (X̃_i, W_i^⊤, V_i^⊤) and possibly Q_i; in Stage 2, n_2 samples are available, and for each individual i we observe (Q_i, V_i^⊤, W_i^⊤); in Stage 3, n_3 samples are available, and for each individual i we observe (Q_i, V_i^⊤, T_i^*, Δ_i), where T_i^* = min(T_i, C_i), Δ_i = I(T_i ≤ C_i), T_i is the time of disease occurrence, and C_i is a potential censoring time. Conventionally, T_i and C_i are assumed to be independent given (Q_i, V_i^⊤). During the first stage, we utilize data from the biomarker development sample to develop the biomarker. This model can be constructed by regressing the observed short-term dietary intake X̃ on one of the following: (i) blood/urine measurements W and personal characteristics V; (ii) blood/urine measurements W, self-reported dietary intake Q, and personal characteristics V; (iii) self-reported dietary intake Q and personal characteristics V.
As indicated earlier, self-reported dietary intake Q may be deemed unavailable during Stage 1. If that is the case, we treat Q as unavailable and opt for choice (i) in Stage 1. When Q is accessible in Stage 1, choice (ii) might enhance the estimation of X. If the biomarker W is not available, option (iii) directly models X based on Q and V, but the effectiveness might be hampered by the limited sample size n_1. In Stage 2, a calibration equation is developed using self-reported log-transformed dietary intake Q and personal characteristics V to predict actual intake X if option (i) or (ii) is implemented in Stage 1. If option (iii) is chosen, Stage 2 can be skipped, and Stage 3 proceeds to the disease association analyses with the available data on Q, V, and the composite survival outcome (T*, Δ). In summary, the high-dimensional regression calibration procedure has three stages: biomarker construction, calibration, and estimation. In Stage 1, the relationship between the true dietary intake X and the high-dimensional biomarker W is established. If self-reported dietary intakes Q are not available, option (i) can be used. If Q is available in Stage 1, whether or not W is also available, the relationship between X and Q can be directly established with option (iii). If both W and Q are available in Stage 1, any of options (i), (ii), and (iii) can be used. As discussed, (i) might lead to Berkson-type error (38) and (iii) might have low efficiency. For Stage 2, we develop bias correction methods to account for the bias introduced by the Berkson-type error. For Stage 3, we can use a multivariate approach to jointly study the associations between multiple dietary components and disease risks.

3. Methods
We first consider the case where σ̃_x is known. We propose methods to estimate σ̃_x in the discussion section. In the real data analysis, where σ̃_x is not available, we vary this parameter to perform a sensitivity analysis.
With high-dimensional data on blood and urine measurements (W), we first need to estimate, among the n_1 subjects in the biomarker discovery sample, the coefficients from regressing the observed short-term dietary intake X̃ on the high-dimensional blood and urine measurements (W) as well as subject characteristics (V). Three different approaches, Lasso, SCAD, and RF, are used to conduct variable selection in high-dimensional statistical inference. We describe each approach explicitly for every method in the following subsections.
3.1. Method 1: the naïve three-step approach with multiple exposures

In the first step, we fit a linear regression of X̃ on W and V:

X̃_i = (1, W_i^⊤, V_i^⊤) β_1 + ǫ_i.

With the Lasso approach, the coefficients β̂_1 minimize the penalized least squares (PL_Lasso) below:

PL_Lasso(β_1) = (2n_1)^{-1} Σ_{i=1}^{n_1} {X̃_i − (1, W_i^⊤, V_i^⊤) β_1}² + λ Σ_j |β_1j|.

Lasso performs variable selection by shrinking coefficient estimates toward zero, leading to a sparse model. The tuning parameter λ is selected through cross-validation.
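The Lasso step above can be sketched with a small coordinate-descent solver. This is a minimal illustration on simulated data, not the implementation used in the paper, and the tuning parameter is fixed rather than cross-validated:

```python
import numpy as np

def soft_threshold(rho, lam):
    """Soft-thresholding operator, the building block of Lasso coordinate descent."""
    return np.sign(rho) * max(abs(rho) - lam, 0.0)

def lasso_cd(X, y, lam, n_iter=200):
    """Minimize (1/2n)||y - X b||^2 + lam * ||b||_1 by cyclic coordinate descent."""
    n, p = X.shape
    b = np.zeros(p)
    col_ss = (X ** 2).sum(axis=0) / n
    for _ in range(n_iter):
        for j in range(p):
            r_j = y - X @ b + X[:, j] * b[j]      # partial residual excluding feature j
            rho = X[:, j] @ r_j / n
            b[j] = soft_threshold(rho, lam) / col_ss[j]
    return b

# Illustrative sparse problem: only the first two coefficients are nonzero.
rng = np.random.default_rng(0)
n, p = 100, 20
X = rng.standard_normal((n, p))
beta = np.zeros(p); beta[:2] = [1.0, -0.5]
y = X @ beta + 0.1 * rng.standard_normal(n)
b_hat = lasso_cd(X, y, lam=0.1)
```

Note the characteristic shrinkage: the estimated signal coefficients are pulled toward zero by roughly λ, which is exactly the attenuation the bias-correction methods later in this section must contend with.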
With the SCAD approach, a nonconvex penalty is used whose first derivative, PL′_SCAD(β_1j), is continuous and is given by

p′_λ(β) = λ { I(β ≤ λ) + (aλ − β)_+ / ((a − 1)λ) I(β > λ) }

for some a > 2 and β > 0. Similar to Lasso, λ in SCAD is selected through cross-validation based on the smallest mean squared error (MSE), whereas a is set to 3.7 based on simulation results and a Bayesian statistical point of view from (27).
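For concreteness, the SCAD penalty derivative with a = 3.7 can be coded directly; this is a minimal sketch of the formula from (27):

```python
def scad_derivative(beta, lam, a=3.7):
    """SCAD penalty derivative p'_lam(|beta|): flat (Lasso-like) near zero,
    linearly decaying between lam and a*lam, and zero beyond a*lam so that
    large coefficients are not shrunk at all."""
    beta = abs(beta)
    if beta <= lam:
        return lam
    if beta < a * lam:
        return (a * lam - beta) / (a - 1.0)
    return 0.0
```

The vanishing derivative for |β| ≥ aλ is what gives SCAD its unbiasedness for large signals, in contrast to the constant Lasso shrinkage.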
Other than the penalized regression approaches described above, RF is another choice for variable selection. The basic concept is to grow regression trees of the general form

f(x) = Σ_{m=1}^{M} c_m I(x ∈ R_m),

where R_1, . . . , R_M denotes a partition of the feature space and c_m is the fitted constant in region R_m. We then repeat this procedure to build the RF, considering approximately the square root of the total number of predictors at each split. The advantage of RF is that we can assess the contribution of each variable to the regression trees and their relative importance.
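The piecewise-constant tree form above can be illustrated with a single-split regression stump fit by minimizing the residual sum of squares; the data and the one-variable restriction are illustrative simplifications of the trees RF actually grows:

```python
import numpy as np

def fit_stump(x, y):
    """Fit f(x) = c1 * 1{x <= s} + c2 * 1{x > s}: search every candidate split s
    and keep the one minimizing the within-region residual sum of squares."""
    best = None
    for s in np.unique(x)[:-1]:
        left, right = y[x <= s], y[x > s]
        rss = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if best is None or rss < best[0]:
            best = (rss, s, left.mean(), right.mean())
    return best[1:]  # (split point, c_left, c_right)

# Two clearly separated regions; the stump should split between x=3 and x=10.
x = np.array([1., 2., 3., 10., 11., 12.])
y = np.array([0., 0.1, -0.1, 5., 5.1, 4.9])
s, c1, c2 = fit_stump(x, y)
```

The RSS reduction achieved by each variable's splits, accumulated over all trees, is the importance measure used later to pick the 10 top-ranked variables.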
For each method, we consider direct selection and post selection. For direct selection, we apply the estimated model from each approach to predict the long-term dietary intake directly. For post selection, we subsequently perform a linear regression on the selected variables (Ŝ) from each approach; this applies to Lasso and SCAD. For RF, the 10 most important variables are taken as the final selected variables. For both direct and post selection, we consider two ways to deal with W and V: one is to include both W and V in the variable selection, while the other is to include only W. To be more specific, in Lasso and SCAD, the penalization is applied to (W, V) or to only W, respectively. In RF, the decision trees are built by considering (W, V) or only W, respectively. With the estimated β̂_1 from the prior step, we compute X̂_1i = (1, W_i^⊤, V_i^⊤) β̂_1 to predict the long-term dietary intake (Z) among the n_2 calibration samples, and run a regression of X̂_1 on the self-reported food frequency questionnaire data (Q) and V, using the n_2 calibration samples, to build the calibration equation and estimate the parameter γ_1. Using the Stage 3 sample, we then estimate Z as Ẑ_1i = (1, Q_i, V_i^⊤) γ̂_1 for i = n_1 + n_2 + 1, · · · , n_1 + n_2 + n_3. Finally, we estimate the association between Z and the time-to-event endpoint (T*, Δ) by solving the score equation for the Cox model, where τ is a pre-specified large number chosen so that the at-risk probability at τ is positive. In application, τ is typically defined as the largest follow-up time in the Stage 3 sample.

3.2. Method 2: three-step with bias correction

As shown in (20) for the low-dimensional setting, Method 1 leads to a bias factor in Ẑ_1 when using X̂_1, and a bias-corrected estimator has been proposed there. For this high-dimensional setting, we therefore propose a similar bias-corrected estimator X̂_2i = X̂_1i / BF̂, where BF̂ is an estimated version of the bias factor that depends on σ̃_x².
For direct selection, we use K-fold cross-validated errors to compute Var̂(X|W_s, V_s) in penalized regression and RF to obtain BF̂. Denote the predicted values for the k-th fold, when using regression parameters estimated from the other K − 1 training folds, as X̂_1k when using (W, V) as predictors and as X̂_2k when using only V as predictors. With X̂_1 and X̂_2, BF̂ can then be calculated. For post selection, we first obtain the selected variables from Lasso, SCAD, or RF. Afterward, we estimate the coefficients by refitting a linear regression. To facilitate interpretation, we consider both W and V in variable selection for the remainder of this subsection. Consequently, we can refit a low-dimensional model of X̃ on the selected variables. Here, W_s and V_s denote the selected W and V variables, while β^PS_WV and β_V represent the corresponding coefficients in the refitted equations. From there, BF̂ can be estimated using Var̂(X|W_s, V_s). As demonstrated above, obtaining a precise estimate of Var(X|W_s, V_s) is crucial for a reliable estimate of BF. Chatterjee and Jafarov (39) revealed that the estimator of Var(X|W_s, V_s) mentioned earlier has a downward bias when using Lasso. Therefore, we compute and compare three different estimators of Var(X|W_s, V_s) in our study involving post selection. We describe each below.
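The K-fold cross-validated error variance used for direct selection above can be sketched as follows; ordinary least squares stands in for the fitted learner, and the data and fold count are illustrative:

```python
import numpy as np

def cv_residual_variance(X, y, K=5, seed=0):
    """Pool held-out squared prediction errors across K folds to estimate the
    residual variance Var(X | predictors) without in-sample optimism."""
    n = len(y)
    idx = np.random.default_rng(seed).permutation(n)
    folds = np.array_split(idx, K)
    sse = 0.0
    for k in range(K):
        test = folds[k]
        train = np.concatenate([folds[j] for j in range(K) if j != k])
        b, *_ = np.linalg.lstsq(X[train], y[train], rcond=None)  # refit on K-1 folds
        sse += ((y[test] - X[test] @ b) ** 2).sum()
    return sse / n

# Illustrative data with true error variance 1.
rng = np.random.default_rng(1)
n = 200
X = np.column_stack([np.ones(n), rng.standard_normal((n, 3))])
y = X @ np.array([0.5, 1.0, -1.0, 0.0]) + rng.standard_normal(n)
var_hat = cv_residual_variance(X, y)
```

Because every residual is computed on data not used in the fit, this estimate avoids the in-sample optimism that biases Var̂(X|W_s, V_s) downward.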
(i) K-fold cross-validation

We fit the penalized regression or RF on the cross-validated training data and obtain the predicted X̃ with the selected (W, V) for each fold. Denote the selected subset as Ŝ_k for each training set, with W_{Ŝ_k} the selected W and V_{Ŝ_k} the selected V in the K − 1 training folds for fold k. We then fit a linear regression of X̃_k on W_{Ŝ_k} and V_{Ŝ_k}, with predicted values denoted X̂_1k, and a linear regression of X̃_k on V_k, with predicted values denoted X̂_2k. After doing this for all K folds, we obtain the estimated values of X̃ for the whole Sample 1. With X̂_1 and X̂_2, BF̂ can then be calculated.

(ii) Modified variance estimator

When performing penalized regression for variable selection, the choice of the regularization parameter λ is crucial for obtaining an accurate finite-sample estimator. The value of λ influences both the number of variables selected and the extent to which their estimated coefficients are shrunk toward zero. If λ is set too large, not all signal variables will be selected, resulting in rapidly degrading performance (mainly characterized by a significant upward bias) as the true β becomes less sparse with a larger signal per element. Conversely, if λ is set too small, many noise variables will be selected, which allows spurious correlations to decrease the variance estimate, leading to considerable downward bias. Based on the simulation results in (40), a balance must be maintained when selecting an appropriate λ. Let ŝ_λ be the number of nonzero elements in b̂ at the regularization parameter λ selected with K-fold (usually 5-10) cross-validation. Then we estimate the error variance as

Var̂(X|W_s, V_s) = (n_1 − ŝ_λ)^{-1} Σ_{i=1}^{n_1} {X̃_i − (1, W_i^⊤, V_i^⊤) b̂}².
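A minimal sketch of this degrees-of-freedom correction: with ŝ_λ selected variables, the residual sum of squares is divided by n − ŝ_λ rather than n (the residuals below are illustrative numbers, not model output):

```python
import numpy as np

def corrected_error_variance(residuals, s_hat):
    """Degrees-of-freedom corrected error variance: RSS / (n - s_hat),
    offsetting the optimism of residuals from a model with s_hat fitted terms."""
    n = len(residuals)
    return float((residuals ** 2).sum() / (n - s_hat))

res = np.array([0.5, -1.2, 0.3, 0.8, -0.4, 1.1, -0.7, 0.2])
naive = float((res ** 2).sum() / len(res))          # RSS / n
corrected = corrected_error_variance(res, s_hat=2)  # RSS / (n - s_hat)
```

The correction always enlarges the estimate, counteracting the downward bias that selection-induced overfitting imparts to the naive RSS / n.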

(iii) Refitted cross-validation estimator (RCV)
This estimator is derived from the RCV procedure proposed by (37). We first split the dataset into two roughly equal parts: [X̃^(1), W^(1), V^(1)] and [X̃^(2), W^(2), V^(2)]. We then perform penalized regression or RF on the first part. For penalized regression, we fit Lasso or SCAD on W and V with a cross-validated λ̂ to obtain the nonzero estimated coefficients for W and V. In the case of RF, we select the 10 most important variables based on the residual sum of squares (RSS). We then refit the model with the selected W and V to obtain the post-selected estimators of their coefficients, denoted as β̂^PS(1)_WV. Subsequently, using the selected W and V in W^(2) and V^(2), we compute the residual variance estimate Var̂_1(X|W, V) on the second part, with a degrees-of-freedom correction in the denominator,
where ŝ^(1) is the number of selected variables in the first part.
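One direction of this RCV computation can be sketched as follows; a simple correlation screen stands in for Lasso/SCAD selection, and the data are illustrative:

```python
import numpy as np

def half_sample_variance(X, y, sel_idx, fit_idx, k=2):
    """RCV-style variance estimate in one direction: select variables on one
    half (here by a correlation screen), refit OLS on the other half, and use
    the degrees-of-freedom corrected denominator n/2 - s_hat."""
    cors = np.abs([np.corrcoef(X[sel_idx, j], y[sel_idx])[0, 1]
                   for j in range(X.shape[1])])
    S = np.argsort(cors)[-k:]                     # variables selected on one half
    Xs = X[fit_idx][:, S]
    b, *_ = np.linalg.lstsq(Xs, y[fit_idx], rcond=None)  # refit on the other half
    rss = ((y[fit_idx] - Xs @ b) ** 2).sum()
    return rss / (len(fit_idx) - len(S))

# Illustrative data: 2 signal variables among 30, true error variance 1.
rng = np.random.default_rng(2)
n, p = 120, 30
X = rng.standard_normal((n, p))
y = 1.5 * X[:, 0] - 1.0 * X[:, 1] + rng.standard_normal(n)
idx = rng.permutation(n)
var1 = half_sample_variance(X, y, idx[:60], idx[60:])
```

Because selection and refitting use disjoint halves, spurious correlations picked up during selection cannot deflate the residuals computed on the refitting half.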
Repeating the mirror-image procedure on [X̃^(2), W^(2), V^(2)], we obtain λ̂_2, the variables selected in the second part, and Var̂_2(X|W, V). Finally, BF̂ can be derived from these two variance estimates. With BF̂, we have X̂_2i = X̂_1i / BF̂. We then run a regression of X̂_2 on the self-reported food frequency questionnaire data (Q) and V, using the n_2 calibration samples, to build the calibration equation and estimate the parameter γ_2. Using the Stage 3 sample, we then estimate Z as Ẑ_2i = (1, Q_i, V_i^⊤) γ̂_2 for i = n_1 + n_2 + 1, · · · , n_1 + n_2 + n_3. Finally, we estimate the association of Z with the time-to-event endpoint (T*, Δ) by solving the score equation for the Cox model. For Method 2 to work, we can relax Equation (3) to require only that the conditional mean E[Z|W, V] in Sample 2 equals the conditional mean E[X|W, V] in Sample 1.

3.3. Method 3: three-step with self-reported data

If the self-reported data Q from the feeding study are accessible and we presume that the distribution of (Q|Z, V) remains consistent between the controlled feeding study and the cohort, the bias in the naive estimator can be rectified by simply incorporating Q into the biomarker development equation. The sequence of the first method remains unchanged, but in the first step of the regression model, the log-transformed self-reported food frequency questionnaire data (Q) are included. Specifically, in the first step, the predictors W, V, and Q are utilized to construct the biomarker. Following this, in the second step, we employ W, V, and Q to estimate Z. Lasso, SCAD, and RF, as previously described, are all applied in Method 3 for variable selection and effect estimation in high-dimensional statistical inference, considering both direct selection and post selection.
With the estimated β̂_3 from the first step, we compute X̂_3i and run a regression of X̂_3 on the self-reported food frequency questionnaire data (Q) and V, using the n_2 calibration samples, to construct the calibration equation and estimate the parameter γ_3. Using the Stage 3 sample, we then estimate Z as Ẑ_3i = (1, Q_i, V_i^⊤) γ̂_3 for i = n_1 + n_2 + 1, · · · , n_1 + n_2 + n_3. Finally, we estimate the association of Z with the time-to-event endpoint (T*, Δ) by solving the score equation for the Cox model. For Method 3 to work, we can relax Equation (3) to require only that the conditional mean E[Z|W, Q, V] in Sample 2 equals the conditional mean E[X|W, Q, V] in Sample 1.

3.4. Method 4: direct estimation
We build the estimating equation by regressing X̃ on Q and V in the first step and apply it directly in the third step. That is, we build the calibration equation from the feeding study by regressing X̃ on Q and V, use the calibration equation to predict Z, and perform a Cox regression of the survival outcome (T*, Δ) on Ẑ and V in the full cohort to estimate the association parameter.
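Method 4's direct calibration step can be sketched as follows; the generative model, coefficients, and sample sizes are illustrative assumptions, and the Cox regression step is omitted:

```python
import numpy as np

# Sketch: fit the calibration equation X~ ~ (1, Q, V) in the feeding study
# (Sample 1), then apply it to the cohort's (Q, V) to get calibrated intakes.
rng = np.random.default_rng(3)

def gen(n):
    """Illustrative data: Z long-term intake, Q self-report, X_obs feeding-study intake."""
    Z = rng.normal(0.0, 1.0, n)
    V = rng.normal(0.0, 1.0, n)
    Q = 0.2 + 0.8 * Z + 0.1 * V + rng.normal(0.0, 0.5, n)
    X_obs = Z + rng.normal(0.0, 0.3, n)
    return Z, V, Q, X_obs

n1, n3 = 150, 1000
Z1, V1, Q1, X1 = gen(n1)                        # feeding study (Sample 1)
D1 = np.column_stack([np.ones(n1), Q1, V1])
gamma, *_ = np.linalg.lstsq(D1, X1, rcond=None)  # calibration coefficients

Z3, V3, Q3, _ = gen(n3)                         # full cohort (Sample 3)
Z_hat = np.column_stack([np.ones(n3), Q3, V3]) @ gamma  # calibrated intakes
```

The calibrated Ẑ would then enter the Cox model in place of the error-prone Q, which is the essence of regression calibration.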
For Method 4 to work, we can relax Equation (3) to require only that the conditional mean E[Z|Q, V] in the cohort equals the conditional mean E[X|Q, V] in the feeding study.

4. Simulation
We simulate data with varying levels of sparsity, effect size, and effect shape within the context of high-dimensional statistical inference. Our goal is to investigate how the sparsity, effect size, and effect shape among different measurements influence the bias and variance of the various estimators. We compare the bias, empirical standard deviation (SD), estimated standard error (SE), and coverage rate of a nominal 95% confidence interval (CR) across different sample sizes, effect shapes, effect sizes, and correlation structures. Here the CR is computed from the asymptotic SE formula in Theorem 1 of (20), with the variance term for γ̂_k estimated from 100 bootstrap samples using data from the first two samples, given that there is no closed-form variance formula for γ̂_k when W is high-dimensional. We examine scenarios both with and without penalties applied to the personal characteristics V during the first stage of penalized regression. Time-to-event outcomes are generated using the Cox model,
where Z, V, X, and Q ∈ R, while W ∈ R^p is high-dimensional. In this study, ǫ_x and ǫ_q are independently sampled from normal distributions with mean zero and standard deviations σ_x and σ_q.
The censoring time is sampled from a mixture of a uniform distribution Unif(0, 10) and a point mass at 10, with equal probability. Three settings are considered: (i) a baseline setting, (ii) a weak biomarker effect with a strong self-reported data effect, and (iii) a strong biomarker effect. We experiment with three sparsity levels of W (2, 5, and 10) and consider two different patterns of effect size for W: equivalent and random. More details on the parameter settings can be found in the Supplementary material (Section 1.1). The bias, mean estimated standard error (SE), empirical standard deviation (SD), and coverage rate (CR) of the 95% nominal confidence interval for all four methods from 100 simulations are listed in Tables 1, 2 for Lasso penalized regression. In general, the post selection methods perform slightly better than the direct selection methods, with lower SDs and SEs. For a few settings, the direct selection approach does not perform stably in terms of bias and SD. Direct selection without forcing the inclusion of the personal characteristics showed more stable results than direct selection forcing their inclusion for Method 2, but the variance is larger in general for all other methods. For the post selection approach, the performances of the three variance estimation methods are shown as 2.1 (K-fold cross-validation), 2.2 (modified variance estimator), and 2.3 (RCV) in Tables 2, 4, 6. Method 2.3 (RCV estimation under post selection within Method 2) performs the best among the three approaches across different settings and patterns. Key advantages of Method 2.3 include lower bias and smaller standard deviations (SD) and standard errors (SE), along with good coverage rates (CR).
Tables 1, 2 show the results using Lasso penalized regression when forcing the personal characteristics into the model. As the sparsity level increases, the performance of most methods degrades, with higher biases and lower coverage rates. Methods 3 and 4 demonstrate good performance in most settings. However, when the biomarker signal is strong and the FFQ signal is relatively weak [i.e., Setting (iii)], Method 2 generally produces the most efficient results compared with Methods 3 and 4. When we have strong biomarker effects [Setting (iii)], Method 2.3 outperforms the other methods.
Tables 3, 4 present the results for SCAD penalized regression when the personal characteristics are forced into the model. Corresponding results without forcing the personal characteristics can be found in Supplementary Tables 4, 5. When comparing SCAD with Lasso, we observe a similar trend in terms of bias control and standard deviation (SD) across the various effect size patterns and settings. In addition, when comparing the three approaches for variance estimation in constructing the bias factor (BF) using Method 2 with post selection, the RCV approach continues to outperform the others in controlling bias and providing the most efficient results. With direct selection using SCAD in Method 2, the bias is generally well-controlled, and the coverage rate (CR) is promising when the personal characteristics are not forced into the model for variable selection. These results are comparable to those obtained with Lasso.
However, when variables are post selected, SCAD's performance is not as strong as Lasso's. This is particularly noticeable in scenarios with high sparsity levels, where SCAD struggles to control bias effectively. In summary, Lasso demonstrates superior performance in variance estimation and bias control compared to SCAD.
RF offers an alternative approach for constructing the biomarker prediction model in the first stage. Tables 5, 6 display the results obtained using RF. When the 10 most important variables are directly selected with RF, the estimated bias is considerably large in most scenarios. However, when post selection is applied to the variables selected by RF, the results with variance estimation approaches 2.2 and 2.3 both exhibit small bias and promising coverage rates (CR). These outcomes are comparable to those achieved with Lasso for Method 2 using the RCV estimation (2.3). In summary, while RF does not provide accurate estimates of the association parameters when using direct selection, its performance is similar to Lasso's when post selection is employed.
Overall, in the linear settings, Lasso provides a consistent estimator in most cases and largely attenuates the bias compared with SCAD and RF. For more general model settings, RF has potential advantages when the linear model does not hold. The post selection option with RCV variance estimation for BF construction provides consistent estimation of the associated parameters with stable CR and is recommended, especially for sparse high-dimensional data structures.
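As a concrete illustration of the RCV idea recommended here, the following sketch estimates the residual variance by selecting variables with Lasso on one half of the sample, refitting OLS on the other half, then swapping the roles and averaging. This is a simplified version under a linear model; the function name and tuning choices are assumptions, not the paper's exact procedure.

```python
import numpy as np
from sklearn.linear_model import LassoCV, LinearRegression

def rcv_sigma2(X, y, rng=None):
    """Refitted cross-validation (RCV) estimate of the noise variance.

    Split the sample in half; select variables with Lasso on one half;
    refit OLS with only the selected variables on the other half and
    estimate sigma^2 from the refit residuals; swap halves and average.
    """
    rng = np.random.default_rng(rng)
    n = len(y)
    idx = rng.permutation(n)
    halves = (idx[: n // 2], idx[n // 2:])
    estimates = []
    for fit_idx, refit_idx in (halves, halves[::-1]):
        lasso = LassoCV(cv=5).fit(X[fit_idx], y[fit_idx])
        support = np.flatnonzero(lasso.coef_)
        if support.size == 0:
            # nothing selected: fall back to the intercept-only model
            resid = y[refit_idx] - y[refit_idx].mean()
            dof = len(refit_idx) - 1
        else:
            ols = LinearRegression().fit(X[refit_idx][:, support], y[refit_idx])
            resid = y[refit_idx] - ols.predict(X[refit_idx][:, support])
            dof = len(refit_idx) - support.size - 1
        estimates.append(resid @ resid / dof)
    return float(np.mean(estimates))
```

Because the OLS refit uses data not involved in the selection, the spurious fit from selection does not deflate the residuals, which is what stabilizes the variance estimate in sparse high-dimensional settings.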

. Data analysis
We exemplify our methodologies utilizing data from the WHI NPAAS feeding study (n = 153), the NPAAS biomarker study (n = 450), and the comprehensive WHI cohort data [comprising the WHI Observational Study (OS) and the Dietary Modification Trial Control Arm (DM-C), n = 122,970]. The log-transformed self-reported ratio of sodium to potassium intake from the FFQ serves as Q. Covariates such as age, BMI, race/ethnicity, education level, self-reported physical activity, and smoking status are considered as V. The high-dimensional 24-h urine measurements, acquired via nuclear magnetic resonance (NMR) and gas chromatography-mass spectrometry (GC-MS) platforms, are denoted as W. The disease outcome under consideration is total cardiovascular disease (CVD). The prevalence of CVD events is <10% (41), suggesting that the rare disease assumption is not substantially violated. Follow-up commences at the time of FFQ measurement (the year-1 visit in DM-C and enrollment in OS) and continues until the earliest of the specific CVD outcome under consideration, death, loss to follow-up, or September 30, 2010.
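Schematically, the regression-calibration step uses the substudy (where an objective intake measure is available) to learn E(X | Q, V), then predicts a calibrated exposure for every cohort member. A minimal sketch, assuming a linear calibration model and omitting the high-dimensional W for brevity; the function and variable names are hypothetical, not the paper's code.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def calibrate(q_sub, v_sub, x_sub, q_cohort, v_cohort):
    """Fit the calibration regression of the objective intake x on the
    self-report q and covariates v in the substudy, then predict a
    calibrated intake for the full cohort from its (q, v)."""
    Z_sub = np.column_stack([q_sub, v_sub])
    model = LinearRegression().fit(Z_sub, x_sub)
    return model.predict(np.column_stack([q_cohort, v_cohort]))
```

The predicted values would then enter the Cox model for the disease outcome in place of the error-prone self-report.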
In our analytical process, hazard rates are modeled as implicitly conditioned on the continued survival of the study subject. This implies that death is not viewed as a source of censoring in our formulation; rather, death merely constrains the follow-up period during which hazard rate information is collected for the subject. This differs from treating death as censoring non-fatal outcomes, as would be the case in a competing-risk formulation. We scrutinized the normality of the log-transformed self-reported intake (Q), the log-transformed metabolites from 24-h urine measurements (W), and the log-transformed evaluated sodium/potassium ratio (X) utilizing the NPAAS-FS data.
The estimated HR and corresponding 95% confidence interval for a 20% increase in the sodium-to-potassium ratio are shown in Table 7 for the Lasso, SCAD, and RF methods.
We observe that the estimated HR is >1 in all cases, indicating a higher risk of CVD with an increased sodium-to-potassium ratio, regardless of the high-dimensional approach used. These findings are consistent with those reported in previously published studies (20). The most conservative estimate for σ²ₓ, namely 0, is used to construct the BF in Method 2. Moreover, RCV variance estimation is employed to construct the BF for the post selection approach with Method 2. The estimate of the associated parameter derived from Method 2 is smaller in scale than that from Method 1 and similar to those from Methods 3 and 4. We note that the 95% CI does not include an HR of 1 with Lasso and RF in most cases, indicating a significant association between calibrated dietary intake and the risk of CVD. Conversely, the 95% CI with SCAD is less efficient, with larger variance, indicating a nonsignificant association between calibrated dietary intake and the risk of CVD in several instances.
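For reference, with a log-transformed exposure in the Cox model, a 20% increase in the sodium-to-potassium ratio adds log(1.2) to the covariate, so the HR and CI per 20% increase are obtained by rescaling the log-hazard coefficient. A small sketch; the inputs are illustrative, not values from Table 7.

```python
import math

def hr_per_20pct_increase(beta, se, z=1.96):
    """HR and 95% CI for a 20% increase in a ratio-scale exposure,
    assuming the Cox model uses the log-transformed exposure so that
    a 20% increase shifts the covariate by log(1.2)."""
    delta = math.log(1.2)
    hr = math.exp(beta * delta)
    lower = math.exp((beta - z * se) * delta)
    upper = math.exp((beta + z * se) * delta)
    return hr, lower, upper
```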

. Discussion
We investigated the prerequisites for a valid biomarker in high-dimensional space for regression calibration purposes. Various methods for handling high-dimensional data (i.e., Lasso, SCAD, and RF) and approaches to variable selection (i.e., direct and post selection) were applied and compared across different scenarios, such as sparsity level and pattern of effect size. This paper offers researchers a comprehensive understanding of how to handle high-dimensional data in calibrated regression studies. Building linear regression models in high-dimensional space presents challenges, such as overfitting and multicollinearity, which can lead to inadequate estimation.
In order to identify the most effective measurements associated with consumed dietary intakes in the feeding study, Lasso, SCAD, and RF were applied for variable selection within the high-dimensional dataset. Overall, Lasso demonstrated more stable variable selection than the other two approaches. Method 2, with the BF constructed using RCV estimation under the Lasso post selection approach, consistently provided good estimation in most cases.
It is worth noting that various factors, such as filtering conditions and methods for obtaining tuning parameters, can influence the accuracy of the biomarker prediction model when using penalized regression methods and RF. Depending on these choices, the accuracy of the estimated association parameters can vary significantly. Consequently, researchers should carefully consider these factors to achieve the most accurate and reliable results when working with high-dimensional data in calibrated regression studies.
Identifying effective measurements associated with consumed dietary intakes is crucial for biomarker construction. Statistical inference presents challenges with penalized estimators. In this paper, a bootstrapping approach was employed for variance estimation in high-dimensional data for penalized regression and RF. However, there are alternative approaches for variance estimation in high-dimensional data with penalized regression that could be considered in future analyses.
One issue with the estimated covariance matrix relates to zero components. Specifically, when coefficients are zero, the approximate covariance matrix yields an estimated variance of zero. Although the estimation of non-zero components is robust, the signs of zero components can be either negative or positive. This issue is also present in the sandwich formula for the covariance matrix developed by (31). Wasserman and Roeder (43) proposed a two-stage procedure for valid inference. Their method randomly divides the data into training and testing datasets: penalized linear regression is used on the training data to select informative variables in the first stage, while ordinary least squares (OLS) is applied to the testing data to compute standard errors. A drawback of the single-split method is that the results may depend on how the data are split. To address this, Meinshausen et al. (44) suggested a multi-split method, which repeats the single split multiple times. Lockhart et al. (45) introduced the covariance test statistic to test the significance of predictor variables that enter the current Lasso model. For ultra-high-dimensional cases where the sample size is equal to or smaller than the variable dimension, the sure independence screening (SIS) technique proposed by (31) can be considered for variable screening in future work.
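The single-split idea of Wasserman and Roeder can be sketched as follows: select variables with Lasso on one half of the data, then compute OLS coefficients and standard errors on the held-out half. This is a simplified illustration; the even split and tuning via cross-validation are our own choices, not the authors' exact procedure.

```python
import numpy as np
from sklearn.linear_model import LassoCV

def single_split_inference(X, y, rng=None):
    """Two-stage single-split inference: Lasso selection on the training
    half, then OLS with standard errors on the testing half, so the
    selection step does not invalidate the standard errors."""
    rng = np.random.default_rng(rng)
    n = len(y)
    idx = rng.permutation(n)
    train, test = idx[: n // 2], idx[n // 2:]
    # Stage 1: variable selection on the training half
    support = np.flatnonzero(LassoCV(cv=5).fit(X[train], y[train]).coef_)
    # Stage 2: OLS on the testing half, restricted to the selected variables
    Xs = np.column_stack([np.ones(len(test)), X[test][:, support]])
    coef, *_ = np.linalg.lstsq(Xs, y[test], rcond=None)
    resid = y[test] - Xs @ coef
    sigma2 = resid @ resid / (len(test) - Xs.shape[1])
    se = np.sqrt(sigma2 * np.diag(np.linalg.inv(Xs.T @ Xs)))
    return support, coef[1:], se[1:]  # drop the intercept
```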

Data availability statement
The data analyzed in this study is subject to the following licenses/restrictions: the data can only be accessed through the collaborative mode as described on the Women's Health Initiative website. Requests to access these datasets should be directed to www.whi.org.