High-Dimensional Mediation Analysis With Confounders in Survival Models

Mediation analysis is a common statistical method for investigating the mechanism of environmental exposures on health outcomes. Previous studies have extended mediation models with a single mediator to high-dimensional mediators selection. It is often assumed that there are no confounders that influence the relations among the exposure, mediator, and outcome. This is not realistic for the observational studies. To accommodate the potential confounders, we propose a concise and efficient high-dimensional mediation analysis procedure using the propensity score for adjustment. Results from simulation studies demonstrate the proposed procedure has good performance in mediator selection and effect estimation compared with methods that ignore all confounders. Of note, as the sample size increases, the performance of variable selection and mediation effect estimation is as well as the results shown in the method which include all confounders as covariates in the mediation model. By applying this procedure to a TCGA lung cancer data set, we find that lung cancer patients who had serious smoking history have increased the risk of death via the methylation markers cg21926276 and cg20707991 with significant hazard ratios of 1.2093 (95% CI: 1.2019–1.2167) and 1.1388 (95% CI: 1.1339–1.1438), respectively.


INTRODUCTION
Mediation analysis was firstly used to deal with the causal chain of events as the primary exposure has an effect on the outcome through affecting one or more mediators in psychological studies, and gradually extended to sociological and biomedical researches (Baron and Kenny, 1986;MacKinnon et al., 2002;Preacher and Hayes, 2008;Biesanz et al., 2010;Huan et al., 2016). Of note, the mediators are usually measured after the intervention, but before the main outcome of interest. Mediation effect is often assessed through a regression-based analysis procedure by decomposing the total effect that describes the relationship between the exposure and the outcome variable into direct effect and indirect effect (Baron and Kenny, 1986;MacKinnon et al., 2007). In the past couple of decades, the topic of mediation analysis has received a great deal of attention, particularly in the area of causal inference (Robins and Greenland, 1992;Ten Have et al., 2007;Albert, 2008;Sobel, 2008;VanderWeele, 2009;Pearl, 2014). Researches in mediation analysis have been generalized from the case of a single mediator to multiple mediators (Albert and Nelson, 2011;Zhang and Wang, 2013;VanderWeele and Vansteelandt, 2014;Daniel et al., 2015), even to the case of high-dimensional mediators (Huang and Pan, 2016;Zhang et al., 2016;Zhao and Luo, 2016;Chén et al., 2018;Sohn and Li, 2019;van Kesteren and Oberski, 2019;Zhao et al., 2020). Recently, much progress has been made in extensive of mediation methods to survival models (Lange and Hansen, 2011;VanderWeele, 2011;Wang and Zhang, 2011;Huang and Yang, 2017).
The regression-based or structural equation modeling approach is commonly used to assess mediation effect. This approach assumes that there are no confounders influencing the relationships among exposure and mediator, mediator and outcome, and exposure and outcome. Randomization to levels of the exposure guarantees that there are no confounders that influence both the relation of exposure-mediator and exposure-outcome. However, the assumption that individuals are randomly assigned to exposure, especially for research about smoking and lung cancer, is difficult to achieve.
Propensity score method can be used to solve such a problem with a non-randomized exposure which usually appears in observational studies (Rosenbaum and Rubin, 1983). Previous studies have focused on mediation analysis with confounders in the case of a single mediator. For example, Valente et al. (2017) introduced confounders in mediation analysis and described how to address confounders with design-based techniques and analysis-based approaches. Coffman (2011) proposed to use the calculated propensity score to adjust for confounders between the mediator and the outcomes. However, methods for highdimensional mediation selection adjusting for confounders, especially for survival outcome, are still yet to be developed.
For example, in a lung cancer study, it is showed that smoking increases the risk of lung cancer patients' progression to death through DNA methylation markers (Luo et al., 2020). However, as an observational (or non-randomized) study, it is unrealistic for a subject to be randomly assigned to the exposure, as moral and ethical factors, in the research of how smoking affects the lung cancer patients' risk of progression to death mediated by DNA methylations. Therefore, the relationship among smoking status, DNA methylations, and overall survival may be confounded by baseline characteristics, such as age, gender, and other physical health indicators. However, highdimensional mediation analysis for survival analysis subject to confounders is still to be developed.
In this paper, we study mediator selection and indirect effect estimation via high-dimensional mediation analysis in survival models with confounders. For observational studies, as the exposure is not randomly assigned, we propose to use the propensity score approach to adjust confounding effects. The key ideas are as follows. Firstly, we adjust for baseline confounders based on the calculated propensity score which serves as a covariate in the mediation models. Secondly, we reduce the dimension of potential mediators from ultra high-dimensional to moderate (i.e., one that is less than the sample size) using sure independence screening (SIS) method (Fan and Lv, 2008). Thirdly, we conduct variable selection via Cox proportional hazards model with the minimax concave penalty (MCP) (Zhang, 2010). Finally, we carry out the Sobel and joint significance test for mediation effect.
The rest of the paper proceeds as follows. In the next part, we introduce the notations and models, definition of propensity score, and develop the proposed procedure. Then, we provide simulation studies to evaluate the performance of our proposed procedure and a real data application to analyze the mediation effects of high-dimensional DNA methylation markers on the causal effect of smoking on lung cancer in an epigenomewide study. Finally, we conclude the paper through discussing limitations and other feasibilities.

Notations and Models
For individual i, i = 1, 2, · · · , n, we let D i denote the time from onset to an event (death) and C i be the potential censoring time. The observed survival time is T i = min(D i , C i ), and the failure indicator is δ i = I(D i ≤ C i ), where I(·) is an indicator function. Let X i be the exposure (smoking status, i.e., smoker or non-smoker), M i = (M 1i , M 2i , · · · , M pi ) T be a p-dimensional continuous mediator vector (including all the methylation information), p n. In observational studies, the assumption that no confounders influence the relation of exposure-mediator, mediator-outcome, and exposure-outcome is violated. Let Z = (Z 1 , · · · , Z m ) T denotes for the baseline confounders. Figure 1 illustrates how confounders Z influence the relation of X − M, M − Y, and X − Y.
For survival outcome (Cox, 1972), the high-dimensional mediation models with confounders can be expressed as follows, where Eq. (1) is the Cox proportional hazards model which describes the relationship between the exposure X, mediators M and the time-to-event variable; Eq.
(2) characterizes how the exposure variables influence the mediators; λ 0 (t) is the baseline hazard function; γ * is the direct effect of the exposure on the outcome; β = (β 1 , · · · , β p ) T is the coefficient vector relating the mediators to the outcome adjusting for the effect of exposure and confounders; ϕ = (ϕ 1 , · · · , ϕ m ) T is the coefficient vector relating the confounders to the outcome; FIGURE 1 | The directed acyclic graph describes high-dimensional mediation with confounders affecting the relation among exposure, mediator, and outcome. α = (α 1 , · · · , α p ) T is the coefficient vector relating the exposure to the mediators; φ k = (φ k1 , · · · , φ km ) T is the coefficient vector relating the confounders to the mediator; c k is the intercept term; e ki ∼ N(0, σ 2 ) is the residual.

Propensity Score
The propensity score is proposed to help remove the selection bias result from potential confounders of X (Rosenbaum and Rubin, 1982). The propensity score is defined as the probability that an individual i, i = 1, · · · , n be allocated to the treatment group often estimated using logistic regression models, π i = Pr (X i = 1 | Z 1i , · · · , Z mi ), given measured confounders Z = (Z 1 , · · · , Z m ) T . This method is often used to minimize the influence of observed baseline covariates on the exposure. There are many propensity-based techniques for estimating average causal effect, including sub-classification (Rosenbaum and Rubin, 1984), matching (Rosenbaum and Rubin, 1985), and inverse propensity weighting (Robins et al., 1995). In this article, we focus on incorporating the calculated propensity score as the covariate to adjusting the confounding effects. According to Rosenbaum and Rubin (1984), the propensity score is assessed by using baseline measured confounders as covariates in a logistic regression model with treatment status as the outcome as following logit (P (X i = 1)) = θ 0 + θ 1 Z 1i + · · · + θ m Z mi , where θ = (θ 1 , · · · , θ m ) T denotes the coefficients of confounders on the exposure, and θ 0 denotes the intercept. Hence, the propensity score, π i , the probability to be assigned to the intervention group can be expressed as .
The superiorities of propensity score over the classical regression adjustment method have been described elsewhere for the nonmediation model (Schafer and Joseph, 2008;VanderWeele, 2010). Briefly, propensity score approaches allow the inclusion of a large scale of confounders through reducing the potential covariates into a single numerical summary. More importantly, the comparison between subjects in treatment group and control group who have the same propensity score equals the comparison of control conditions with randomly assigned (Rosenbaum and Rubin, 1982).

Methodology
Since the assumption of no confounders affecting the relation among exposure, mediator and outcome is violated in observational researches, we propose a new method using the propensity score as a covariate in the high-dimensional mediation model as follows, where π i is the covariate of calculated propensity score; ϕ is the effect of the covariate on the outcome;φ k is the effect of the covariate on the mediator. We will compare this with the method of adjusting all confounders as covariates and the method of ignoring confounders. The goal of variable selection is to identify S = k : α k β k = 0 , which are the significant mediators between the exposure and the outcome when the number of potential mediators p is much larger than the sample size n, and the traditional statistics methods for Cox regression analysis fail to work (Luo et al., 2020). Besides, there are confounders influence the relationship of exposure, mediators, and outcome. To solve this problem, we propose the following procedure for highdimensional mediation analysis with confounders in survival models. The overall workflow is as follows (Figure 2): Step 0: We first construct the propensity score of confounders through a logistic regression model of exposure vs. baseline confounders, and use it as a covariate in the mediation models.
Step 1: For k = 1, · · · , p, we select a subset S 1 = k: 1 ≤ k ≤ p of size d = 2n/log (n) based on SIS method, where · is the ceiling function (Fan and Lv, 2008). For the mediators in S 1 are among the top d strongest P-values for the response variable. SIS procedure has been a general technique to reduce dimensionality from high to a small scale that is below the sample size. Here we use d = 2n/log (n) instead of d = n/log (n) to increase the probability for identifying important mediators, considering that both α k and β k have to be selected as nonzero to ensure a specific mediator to be selected.
Step 2: Among all the screened mediators M k , k ∈ S 1 from Step 1, we further identify the subset S 2 = k: β k = 0 via MCPbased Cox model. We obtain mediators M k through the penalized log-partial likelihood optimization with shape parameter a > 1. Breheny and Huang (2011) implemented the MCP procedure with the R package ncvreg.
Step 3: For k ∈ S 2 , a variable M k is considered as a mediator between the exposure and outcome only if the indirect effect is significant. Here, we considered two methods to test the mediation effects, including the Sobel test (i.e., product method; Sobel, 1982) and the joint significant test.
Followed with the Sobel test for indirect effect, we have the P-value for testing the null hypothesis H 0 : α k β k = 0 of no indirect effect where σ α k β k is the estimate of the Sobel standard error (SE) (Sobel, 1982); α k is the ordinary least squares estimator for α k ; β k is the estimate of β k , by refitting regression Eq. (3) with the mediators obtained in step 2. The joint significant test for indirect effect is based on the path-specific (i.e., X → M and M → Y) P-values (MacKinnon et al., 2002) and does not provide an estimate. The P-value for testing H 0 : α k = 0 is given as and the P-value for testing H 0 : β k = 0 is Thus, the P-value for the joint significance test is defined as We have the revised P-value via the Bonferroni's method in order to adjust for multiple comparisons where |S 2 | is the number of elements in set S 2 . Hence, we can reject the null hypothesis of no IE k if P k < 0.05, and conclude that the variable M k is the significant mediator between the exposure and outcome. Remark 1: Luo et al. (2020) proposed a compositional mediation framework to identify biomarkers which mediate the influence of smoking on lung cancer survival with highdimensional candidates. They used a regression-based approach, which relies on the assumptions that there are no confounders that influence the relations between exposure and mediator, and exposure and outcome. This assumption holds if subjects are randomly assigned to levels of exposure, but generally random assignment is not possible in observational studies. We propose the use of propensity scores to adjust for confounders in highdimensional mediation analysis in survival models.
Remark 2: Our method has three advantages. First, different from Luo et al. (2020), our approach is a simultaneous inference for high-dimensional mediation analysis with multiple confounders in survival models implemented with a propensity score method. The propensity score method can help remove the selection bias that may result when subjects are not randomly assigned to levels of exposure in observational studies. Second, compared with regression adjustment approach that include the confounders in mediation model directly, our method is more concise since we focus on incorporating the logit propensity score as a covariate in the mediation analysis. An advantage of propensity scores is that they reduce multiple potential confounders into a single numerical summary. Third, our method has a substantial improvement over method that does not include propensity scores.

SIMULATION STUDIES
In this section, we evaluate the performance of the proposed mediator selection and mediation effect estimation method through simulation studies. In order to investigate how the sample size impacts the performance, three sample size levels (N = 300, N = 500, N = 1, 000) are presented with potential mediators number p = 10, 000. For each scenario, 500 replications of simulated data sets are conducted. Besides, we also consider two censoring rate settings of 15 and 30%.
To summarize, only the first four mediators have significant mediation effects, which satisfy the condition of α k β k = 0. In this part, we conduct a comparison of our proposed method with the other two approaches, including models ignoring confounders (Naïve approach) and models adjusting all 10 confounders as covariates (Z approach). We use the proposed procedure to identify significant mediators and estimate mediation effects, where the proposed approach uses the logit propensity score estimated through logistic regression as the covariate to adjust for confounding effects. Through the simulation studies, we want to demonstrate that propensity score methods can be used to adjust for confounding in the high-dimensional mediation selection and estimation. PS, method of using the propensity score as the covariate; Naïve, method of ignoring confounders; Z approach, method of using confounders as covariates. TPR, true positive rates; FP, false positive; FDP, false discovery proportion (=V/R, where V is the number of false discoveries and R is the number of total discoveries); all the three measures are the average value over 500 times.
Frontiers in Genetics | www.frontiersin.org  PS, method of using the propensity score as the covariate; Naïve, method of ignoring confounders; Z approach, method of using confounders as covariates. MSE, mean square error; -, means the not available value.  PS, method of using the propensity score as the covariate; Naïve, method of ignoring confounders; Z approach, method of using confounders as covariates. MSE, mean square error; -, means the not available value.

Frontiers in Genetics | www.frontiersin.org
Simulation results are presented in Tables 1-3. Table 1 evaluates the performance of mediator selection of the proposed approach in comparison to the other two approaches using the true positive rate (TPR), the number of false positive (FP), and false discovery proportion (FDP) of selection after the significance test for mediation effects based on the joint and the Sobel methods. The TPR of the proposed propensity score approach is lower than the Z approach when the sample size is 300, but performs similarly to the Z approach as the sample size increases. And the proposed method has lower FP and FDP rates than the Z approach. The Naïve approach has lower TPR and higher FP and FDP rates, indicating the deficiency in identifying significant mediators due to confounding effects. Take sample size 500 as an example, the FP and FDP rates based on the joint test are 0.004 and 0.0008 for the proposed approach; 0.026 and 0.0056 for the Z approach; and 4.342 and 0.5168 for the Naïve approach. Selection results based on the joint test are similar. Besides, as the censoring rate increases, the TPR rates decrease, especially for the lower sample size. Similar results can be seen for the setting with a 30% censoring rate.
Tables 2, 3 show the estimation of mediation effects with censoring rate by 15 and 30%, respectively. The bias of the indirect effect estimator using the PS approach is very small. The Naïve approach is biased severely. It is important to note that the proposed method even has slightly better performance than the Z approach including all confounders as covariates in the estimation of indirect effects.
In summary, the results demonstrate that the bias of the mediation effect estimator of our proposed methods for highdimensional mediation analysis using the calculated propensity score to adjust confounding influence is nearly unbiased. Besides, with the increase of sample size, the ability in mediator selection including TPR, the number of FP, and FDP shows good performance as well as the Z approach. The Naïve approach ignoring the confounders produces a severe bias in both mediator selection and mediation effects estimation. Compared with the classical regression method for mediation analysis with confounders, the procedure we proposed is more concise and efficient.

REAL DATA ANALYSIS
As we know, smoking is an important risk factor for lung cancer, one of the deadliest cancer worldwide (Herbst et al., 2008). With the development of sequencing technology, both Illumina Infinium HumanMethylation27 and HumanMethylation450 are widely used platforms that allow measuring highdimensional DNA methylation levels of roughly 27 and 450 k respectively (Bibikova et al., 2011). As the individual level phenotype and genotype data are available, researchers have indicated that methylation markers are acting as mediators between smoking and lung function or lung cancer patient's overall survival (Zhang et al., 2016;Luo et al., 2020). The TCGA (The Cancer Genome Atlas) lung cancer cohort study had been used for mediation analysis to identify the methylation markers (Luo et al., 2020). However, the assumption that samples are randomly assigned to the smoking or non-smoking group is violated. Hence, it is of great importance to adjust for confounding effects when conducting high-dimensional mediation analysis.
We apply the proposed method using the calculated propensity score as a covariate in high-dimensional mediation analysis with survival outcome to a lung cancer dataset including lung squamous cell carcinoma and lung adenocarcinoma. There are 1,299 lung cancer patients aged 33-90 years and 907 of them had DNA methylation profile measured using the Illumina Infinium HumanMethylation 450 platform. DNA methylation values were recorded for each array probe in each sample via BeadStudio software. A total of 365,307 probes were included in the analysis.
To identify the potential methylation mediators between the tobacco smoking and the overall survival, we apply the highdimensional mediator model with smoking status assessed at their initial diagnosis (smoker/non-smoker) as the exposure variable, DNA methylation measured concurrently as the highdimensional mediators, and the survival time as the outcome variable. The overall survival time is defined as the number of days from the initial diagnosis to the death or the last followup date. Subjects with no observed time, exposure, and other covariates are excluded; there are 696 patients with 269 deaths left. Covariates including age at initial diagnosis, gender, and radiotherapy (yes/no) are considered.
We first adjust for the baseline confounders including age, gender, and radiotherapy using the calculated propensity score. Due to the fact that the relationships between methylation and the outcome are much stronger than those between exposure and methylation in the analysis data set, we add top d = 2n/log(n) CpGs using SIS method based on the path from smoking to the methylation in order to improve the probability to recognize significant mediators. Then, we run a variable selection on the CpGs screened in the above step. Finally, we carry out the significance test for the mediation effects.
The analysis results are presented in Table 4. We identify CpGs mediating the relationship between smoking and the overall survival of lung cancer patients with Bonferroni's adjusted P < 0.05. Since smoking generally increases the risk of progression to death and reduces the overall survival of lung cancer patients with the total effect of 1.3436 (95% CI: 1.0377-1.7400), we focus on the mediators with the log-hazard indirect effect αβ = 0 (smoking increases the mortality). Our method finds two CpGs (cg21926276 and cg20707991) mediating the relationship of smoking and risk of progression to death, while methods including all confounders as covariates and methods ignoring confounders only find cg20707991 to be a significant mediator. The methylation site cg21926276 has been reported as a mediator of smoking and the risk of progression to death (Luo et al., 2020). All the two genes in which methylation sites locate are associated with lung cancer or tumor growth in previous studies. For example, the gene H19 (cg21926276 locate) is related to both lung cancer and tumor growth, methylation of which has been thought of as a sensitive marker of tobacco history (Bouwland-Both et al., 2015;Matouk et al., 2015). The gene PTPRN2 (cg20707991 locate) is also associated with lung cancer and survival of cancer patients (Anglim et al., 2008;Wielscher et al., 2015). Besides confirming the previously reported genes, cg20707991 is identified as a novel marker for the survival of lung cancer patients. The CpGs are the DNA methylation sites. Chromosomes and Genes are where the CpGs locate. α is the estimation of the effect of exposure on methylation. β is the estimation of the effect of methylation on the risk of progression to death. P(Sobel) is the Sobel test P-values and P(Joint) is the joint test P-values, which are corrected by Bonferroni's method.
Based on the above analysis, compared with non-smokers, the risk of death for those smokers is 1.3436 (95% CI: 1.0377-1.7400). Mediation analysis using Cox proportional hazards model discovers that the effect of having serious smoking history on the increased risk of progression to death is mediated through methylation markers including cg21926276 and cg20707991; the hazard ratio for each mediator is 1.2093 (95% CI: 1.2019-1.2167) and 1.1388 (95% CI: 1.1339-1.1438), respectively. Interventions can be explored on these markers to improve medical care for the detection and treatment of lung cancer among smokers.
To sum up, through the mediation analysis of smoking, DNA methylation, and the survival time of the lung cancer patients, we found two CpGs mediating the smoking and the mortality. Our findings not only were in line with previous studies which found that the gene that CpGs locate were important biomarkers for lung cancer, but also uncovered the mediation role of the markers connecting the smoking exposure and the survival time.

DISCUSSION
The motivation of this study is that the assumption of no confounders affecting the relationship of exposure, mediators and outcome in the classical mediation model is difficult to be satisfied with observational studies. Hence, how to adjust these confounders is an important and practical question. The propensity score method can summarize a large scale of confounders into a single value which is more concise than the methods with a regression adjustment for all the potential confounders. Thus, motivated by the above facts, we develop a new method that using the propensity score as a covariate to adjust for confounding effects in high-dimensional mediation models.
In this article, we focus on how to adjust for confounding influences when the exposure is not randomly assigned in observational studies. We propose a new model for highdimensional mediation analysis using propensity score methods to adjust for confounding effects. To identify the significant mediators from high-dimensional potential candidate variables, we mainly combine the sure screening technique, MCP-based penalty, and Sobel and joint methods for significance tests. We evaluate the performance of the proposed procedure via several simulation studies and a real data application.
Compared with the mediation analysis which includes all the confounders as covariates, our proposed approach for high-dimensional mediation analysis using the calculated propensity score to adjust confounding influence would be an improvement in mediator selection and indirect effect estimation. The simulation results also show that the proposed method can obtain a nearly unbiased estimation for indirect effects. It is also interesting to note that if confounders are omitted from the model, then the estimates for mediation effects will be severely biased. In conclusion, we suggest using the calculated propensity score to adjust for confounders among the exposure, mediators, and the outcome when evaluating mediation.
As mentioned previously, propensity score methods have many other applications, such as matching, weighting, and sub-classification. It is of interest to explore the performance of high-dimensional mediation selection and estimators using propensity score weighting. Also, the propensity score in our current approach is only valid for single exposure. Analysis approach for the high-dimensional mediators with more than two exposure status is still to be developed. The present simulation results do not address the cases that confounders only affect mediators and the outcome. It is of future interest to developed methods involving estimating propensity score for high-dimensional mediators. Of note, the Sobel test and the joint significant test we used are conservative, which paves the way for developing a more powerful test method, such as the Divide-Aggregate Composite null Test (DACT; Liu et al., 2021). The DACT method is especially useful for the composite null hypothesis of no mediation effect in largescale genome-wide epigenetic studies. It is desirable to consider such a powerful test method for mediation effects in the future research.

DATA AVAILABILITY STATEMENT
The TCGA (The Cancer Genome Atlas) lung cancer data we used in our real data analysis can be found in (https://xenabrowser. net/) without limitation. Our procedure is implemented using the R tool. The corresponding R code can be found at https://github. com/luo-chengwen/HIMAsurvival-PS.