Applying the estimand and target trial frameworks to external control analyses using observational data: a case study in the solid tumor setting

Introduction: In causal inference, the correct formulation of the scientific question of interest is a crucial step. The purpose of this study was to apply causal inference principles to external control analysis using observational data and illustrate the process to define the estimand attributes. Methods: This study compared long-term survival outcomes of a pooled set of three previously reported randomized phase 3 trials studying patients with metastatic non-small cell lung cancer receiving front-line chemotherapy and similar patients treated with front-line chemotherapy as part of routine clinical care. Causal inference frameworks were applied to define the estimand aligned with the research question and select the estimator to estimate the estimand of interest. Results: The estimand attributes of the ideal trial were defined using the estimand framework. The target trial framework was used to address specific issues in defining the estimand attributes using observational data from a nationwide electronic health record-derived de-identified database. Combined, the two frameworks allow the estimand and the aligned estimator to be clearly defined while accounting for key baseline confounders, index date, and receipt of subsequent therapies. The hazard ratio estimate (point estimate with 95% confidence interval) comparing the randomized clinical trial pooled control arm with the external control was close to 1, which is indicative of similar survival between the two arms. Discussion: The proposed combined framework provides clarity on the causal contrast of interest and the estimator to adopt, and thus facilitates design and interpretation of the analyses.


Introduction
Several causal inference frameworks, including the estimand framework (EF), the target trial emulation framework (TTF), and the PICO framework, exist to help define a precise scientific question for comparative assessments in clinical research and development (i.e., whenever a treatment effect is estimated or a hypothesis related to a treatment effect is tested). (1) There are overlapping but complementary elements in these frameworks, suggesting the potential for a combined application. However, this presents challenges to investigators as there are limited practical examples and guidance for combined application of the frameworks.
The EF is one such framework which has increasingly been adopted by health authorities and pharmaceutical companies since initial publication in August 2017. (2) The EF enables specifying a precise scientific question by using five attributes that define the estimand (i.e., the treatment effect of interest or the "what to estimate"). These five interrelated attributes are population, treatment, variable of interest (endpoint), intercurrent event handling, and the summary measure. An intercurrent event is an event occurring after treatment initiation that affects either the interpretation or the existence of the measurements associated with the endpoint. For example, if performing a comparative assessment on overall survival between two different treatments, candidates for intercurrent events include, among others, early discontinuation of treatment or treatment switching after disease progression. In general, the definition of the estimand comes first and is derived from the scientific objective of the trial or study. Together with considerations about missing data, the framework then informs the choice of the estimator.
The ICH E9 addendum, which introduced the EF, acknowledges that usually an iterative process will be necessary to reach an estimand that is of clinical relevance for decision making and for which a reliable estimate can be computed. If it turns out not to be possible to develop an appropriate trial design or to derive an adequately reliable estimate for a particular estimand, an alternative estimand, trial design, and method of analysis may need to be considered. However, practical examples in the literature that describe such an iterative process to redefine an initial target estimand, and that also consider aspects of identifiability (and hence the estimator), are limited. While the focus of the ICH E9 addendum is on randomized clinical trials (RCTs), the principles are also applicable whenever a treatment effect is estimated (i.e., also for single arm trials and observational studies).
However, estimation of a causal effect from observational data often has additional challenges compared to an RCT. These include bias due to baseline confounding, selection bias, missing data and defining the index date for comparison.
The TTF is another causal framework that can be used to more precisely specify the scientific question in a comparative assessment. (3) TTF complements the EF by addressing gaps related to the analysis of observational data and is the application of design principles of an RCT to the specific setting of a non-randomized comparative assessment. (3)(4)(5)(6) TTF entails defining a hypothetical randomized trial to address a precise scientific question, and then further specifying how it can be emulated (i.e., approximated) by non-randomized data. The essential components of a target trial protocol are eligibility criteria, treatment strategies, treatment assignment, start and end of follow-up, outcomes, causal contrasts, and the analysis approach (estimator). (3) The framework can also be utilized when a combination of clinical trial and observational data is used, for example to contextualize a single arm clinical trial with observational data. (7) Combining the EF and the TTF provides a structured approach to enhance the scientific rigor for causal inference for observational and/or non-randomized data. Together they bring more transparency on the causal estimand, supporting specifying the attributes of the estimand and the assumptions to be made to draw causal conclusions.
Another framework that aims to define a precise scientific question is the PICO framework, (8) which has traditionally been used in epidemiology for observational studies. The EF and TTF both extend the PICO framework, with the former adding intercurrent events and the population-level summary measure, and the latter adding the causal contrast, assignment procedures, and the start/end of follow-up.
In this study we jointly apply the EF and TTF to perform a comparative effectiveness assessment in patients with non-small cell lung cancer (NSCLC) using data from a set of pooled control arms of three RCTs (9)(10)(11) as well as electronic health record (EHR)-derived de-identified observational data. The objective of our case study was to determine whether there is a difference in overall survival (OS) between patients with metastatic NSCLC receiving front-line chemotherapy in pivotal trials vs patients with metastatic NSCLC who received front-line chemotherapy as part of routine care, had patients not received a subsequent therapy. This case study contributes to the important goal in pharmacoepidemiology of assessing whether observational data (including EHR-derived data) can emulate (and thus supplement or replace, e.g., for regulatory decision-making) the control arm of an RCT. The iterative process (as indicated in the EF) to arrive at the final scientific question is illustrated in the Methods section. The case study illustrates the applicability of the EF to observational data, and furthermore complements the EF with the TTF to account for specific challenges in observational data not directly addressed by the EF (and vice versa, as the handling of intercurrent events is not explicitly addressed in the TTF). In sum, the present study provides insights into where the two frameworks are complementary, and provides a practical example of jointly applying them.

Methods
Applying the frameworks to the research question
Before discussing details on the joint application of the EF and TTF to the final scientific question, we want to provide insights into the iterations needed to arrive at the final question, as we hope this provides practical guidance on the iterative process outlined in the EF. We were interested in comparing the treatment effect of the same front-line treatment given in a clinical trial and in clinical practice when subsequent treatments would be similar. We started with the scientific question: "Is there a difference in overall survival (OS) between patients with metastatic NSCLC receiving front-line chemotherapy in pivotal trials vs patients with metastatic NSCLC who received front-line chemotherapy as part of routine care?". EHR-derived observational data from routine clinical practice suggest a larger heterogeneity in subsequent second-line cancer treatments as compared to a clinical trial setting. (12) This difference in the range of potential subsequent therapies may introduce complexities in estimating causal treatment effects for longer-term outcomes such as OS and ultimately complicate interpretation.
As a consequence, the initial research question was revised to: "Is there a difference in overall survival (OS) between patients with metastatic NSCLC receiving front-line chemotherapy in pivotal trials vs patients with metastatic NSCLC who received front-line chemotherapy as part of routine care, had patients not received a subsequent therapy?". Hence, instead of considering the entire treatment strategy (front-line and subsequent therapy), which is complicated by heterogeneity in subsequent therapies between clinical trial and clinical practice settings, the iteration resulted in a scientific question about the treatment effect of the front-line regimens. We now focus on the joint application of the EF and TTF to the final scientific question. Table 1 displays the EF/TTF attributes that define the estimand aligned with the scientific research question. We define the hypothetical target trial structured according to the estimand framework, and we define the study that attempts to emulate it leveraging elements from the EF and TTF.
The average treatment effect on the treated (ATT) is the estimand of primary interest. This is the treatment effect difference of using front-line chemotherapy in a clinical trial versus in clinical practice, and hence the target population is defined by the clinical trials population.

Clinical trial data
Individual patient-level data (PLD) were used from the Roche-sponsored phase III, open-label randomized clinical trials IMpower130 (ClinicalTrials.gov identifier: NCT02367781), IMpower131 (ClinicalTrials.gov identifier: NCT02367794) and IMpower132 (ClinicalTrials.gov identifier: NCT02657434). Methods and primary findings have been previously reported. (9-11) These three trials included patients who were chemotherapy-naive and had stage IV NSCLC. OS was the primary endpoint for the three trials. To address the objective of the present study only the PLD from the control arms were used. The control arms received platinum-based chemotherapy as follows:
• IMpower130 included patients with non-squamous NSCLC treated with carboplatin plus nab-paclitaxel
• IMpower131 included patients with squamous NSCLC treated with carboplatin plus nab-paclitaxel
• IMpower132 included patients with non-squamous NSCLC treated with carboplatin or cisplatin plus pemetrexed
As these three clinical trial control arms had similar settings in terms of disease, therapy, and inclusion/exclusion criteria and also had similar survival outcomes such as median survival time (Appendix Figure 1), they were pooled together to increase the sample size and are collectively referred to as the RCT arm in this study.

Observational data
The observational comparator (OC) arm of this study was developed using the nationwide Flatiron Health EHR-derived de-identified database. This longitudinal database is comprised of patient-level structured data (e.g., laboratory values and prescribed treatments) and unstructured data (e.g., biomarker reports) curated from technology-enabled chart abstraction from physicians' notes and other documents. During the study period, the de-identified data originated from approximately 280 cancer clinics (approximately 800 sites of care, primarily community-based cancer centers). (13,14) Institutional Review Board approval of the study protocol was obtained prior to study conduct, and included a waiver of informed consent.
Cohort selection/Study sample
The OC cohort was selected to align with the eligibility (inclusion/exclusion) criteria of the three clinical trials. To be eligible for entry into the Flatiron Health Research Database, the patient's record must include >1 visit to a community oncology clinic documented in the EHR, and have confirmation of advanced NSCLC diagnosis and histological subtype (squamous vs. nonsquamous histology) through a review of unstructured data (i.e., clinical notes, radiology reports, or pathology reports). A start date of front-line therapy for advanced or metastatic NSCLC on or after April 16, 2015 and on or before May 31, 2017, matching the start and end dates of enrollment in the clinical trials, was also required. Patients with an ECOG performance status (PS) of 0, 1, or unknown were included. Patients had to have received at least one administration of the regimens of interest (i.e., carboplatin plus paclitaxel/nab-paclitaxel, carboplatin or cisplatin plus pemetrexed). Patients were excluded if they had potentially incomplete historical treatment data (i.e., >90-day gap between advanced diagnosis and structured activity in the EHR), therapy within 6 months prior to the start of front-line therapy for advanced stage disease, receipt of a clinical study drug, or multiple primary tumors. Patients with missing information or known to have a sensitizing mutation in the EGFR gene or an ALK fusion oncogene were excluded. All patients were followed until July 18, 2019. Detailed inclusion/exclusion criteria are provided in Appendix Table 5.

Statistical analyses
The following estimation approach was applied to target the ATT estimand with the attributes specified in Table 1. First, the inverse probability of treatment weighting (IPTW) method was used to balance baseline patient characteristics between the trial arm and the external control arm. A multivariable logistic regression model was used to estimate propensity scores, defined as the probability of being assigned to the RCT group conditional on all confounders (Appendix Table 4), which were selected based on clinical experts' knowledge and availability of the relevant variables. Given that we target the ATT as outlined above, patients from clinical trials were given a weight of one, while weights for OC patients were defined as the ratio of the estimated propensity score (PS) to one minus the estimated PS (i.e., the estimated odds of being in the RCT group). After IPTW weighting, differences in baseline characteristics were assessed through standardized mean differences (SMD) (Table 2). Patient characteristics were considered balanced if SMD < 0.10. (15) The weighted population was used in the subsequent analyses.
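The weighting and balance check described above can be sketched in a few lines of Python (the study itself used R). This is a minimal illustration, not the authors' implementation: it assumes the propensity scores have already been estimated, the function names `att_weights` and `weighted_smd` and the toy data are hypothetical, and the SMD uses one common definition (difference in weighted means divided by the square root of the average of the two weighted variances).

```python
import math

def att_weights(in_trial, propensity):
    """ATT weights: 1 for trial patients, ps / (1 - ps) for external controls."""
    weights = []
    for z, ps in zip(in_trial, propensity):
        if not 0.0 < ps < 1.0:
            raise ValueError("propensity scores must lie strictly between 0 and 1")
        weights.append(1.0 if z == 1 else ps / (1.0 - ps))
    return weights

def weighted_mean_var(values, weights):
    """Weighted mean and (population-style) weighted variance."""
    total = sum(weights)
    mean = sum(w * x for x, w in zip(values, weights)) / total
    var = sum(w * (x - mean) ** 2 for x, w in zip(values, weights)) / total
    return mean, var

def weighted_smd(values, in_trial, weights):
    """Absolute standardized mean difference between the two weighted arms."""
    trial = [(x, w) for x, z, w in zip(values, in_trial, weights) if z == 1]
    ctrl = [(x, w) for x, z, w in zip(values, in_trial, weights) if z == 0]
    m1, v1 = weighted_mean_var(*zip(*trial))
    m0, v0 = weighted_mean_var(*zip(*ctrl))
    pooled_sd = math.sqrt((v1 + v0) / 2.0)
    if pooled_sd == 0.0:
        return 0.0 if m1 == m0 else math.inf
    return abs(m1 - m0) / pooled_sd

# Toy data (hypothetical): 1 = RCT arm, 0 = observational comparator arm.
in_trial = [1, 1, 0, 0, 0]
ps_hat = [0.7, 0.6, 0.5, 0.2, 0.4]   # assumed output of a fitted logistic model
age = [61, 66, 70, 74, 63]

w = att_weights(in_trial, ps_hat)
balanced = weighted_smd(age, in_trial, w) < 0.10  # balance threshold from the text
print(w, balanced)
```

For a binary covariate coded 0/1, the same function returns a standardized difference in proportions, since the weighted variance of an indicator equals p(1 - p).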
Second, we artificially censored patients at the time of receipt of the first second-line treatment and used the inverse probability of censoring weighting (IPCW) method to weight the follow-up information of the remaining patients. The weights were estimated using both baseline and time-varying variables likely to impact treatment switching, selected based on clinical experts' knowledge, to adjust for any potential confounding created by the artificial censoring. Specifically, for each group we fit a Cox model that was used to estimate the probability of not being censored by time t given the baseline and time-varying covariates listed in Table 7 for the specific group. The IPCW weights are calculated as the inverse of the conditional probability of not being censored. We truncated the follow-up time at Month 21 because there were few patients remaining in the RCT group after Month 21 and thus the positivity assumption was unlikely to hold. Then, in order to reduce the variance of the weighted estimator, we calculated the stabilized IPCW weight, (15) which is the probability of not being censored conditional on selected baseline covariates divided by the probability of not being censored conditional on both baseline and time-varying covariates.
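In symbols, the unstabilized and stabilized IPCW weights described above can be written as follows. The notation is illustrative and not taken from the study protocol: C_i denotes the time of artificial censoring for patient i, X_i the baseline covariates, V_i the subset of baseline covariates retained in the numerator, and \bar{L}_i(t) the history of time-varying covariates up to time t.

```latex
% Unstabilized IPCW weight: inverse of the conditional probability
% of remaining uncensored up to time t.
\[
  w_i(t) \;=\; \frac{1}{\Pr\left(C_i > t \mid X_i, \bar{L}_i(t)\right)}
\]
% Stabilized IPCW weight: the numerator conditions on selected baseline
% covariates only, the denominator on baseline and time-varying covariates.
\[
  sw_i(t) \;=\;
  \frac{\Pr\left(C_i > t \mid V_i\right)}
       {\Pr\left(C_i > t \mid X_i, \bar{L}_i(t)\right)},
  \qquad V_i \subseteq X_i
\]
```

Because the numerator of the stabilized weight is at most 1, the stabilized weight never exceeds the unstabilized one, which is the source of the variance reduction.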
Standardized differences in weighted means and proportions (SMD) were used to assess differences in censoring confounders between the two groups after weighting, using the same definition of balance as above.
The treatment effects were estimated using weighted survival analysis methods. Specifically, we estimated the hazard ratio (HR) using an IPTW-IPCW weighted Cox proportional hazards model and the 95% CI for the HR using a bootstrap approach. (16) We also used the IPTW-IPCW weighted Kaplan-Meier method to compute OS survival function estimates and a weighted log-rank test to compare across groups. Hence the double weighting estimation approach targets the ATT estimand with the attributes of the EF and TTF as specified in Table 1.
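To make the weighted Kaplan-Meier idea concrete, the following Python sketch replaces the usual counts of deaths and patients at risk with sums of weights. It is a simplification under stated assumptions: each patient carries a single fixed weight (as with IPTW alone), whereas the IPCW weights in the analysis vary over follow-up time, and the function name `weighted_km` is hypothetical.

```python
def weighted_km(times, events, weights):
    """Weighted Kaplan-Meier estimate.

    times   : follow-up time per patient
    events  : 1 if death observed at that time, 0 if censored
    weights : one fixed, non-negative weight per patient
    Returns a list of (time, survival) pairs at each distinct death time.
    """
    data = sorted(zip(times, events, weights))
    at_risk = sum(w for _, _, w in data)  # weighted number at risk at time 0
    surv, curve, i = 1.0, [], 0
    while i < len(data):
        t = data[i][0]
        deaths = leaving = 0.0
        # Gather everyone whose follow-up ends exactly at time t.
        while i < len(data) and data[i][0] == t:
            _, e, w = data[i]
            deaths += w * e
            leaving += w
            i += 1
        if deaths > 0:
            surv *= 1.0 - deaths / at_risk
            curve.append((t, surv))
        at_risk -= leaving  # deaths and censored patients leave the risk set
    return curve

# Toy example (hypothetical data): the second patient counts three times.
print(weighted_km([1, 2], [1, 1], [1.0, 3.0]))
```

With all weights equal to one, the function reduces to the ordinary Kaplan-Meier estimator.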
Missing values for covariates with a missing rate of less than 30% were imputed using the median (for age and time from initial diagnosis to index date) or the mode (for smoking history). Covariates with more than 30% of values missing (i.e., ECOG PS) were excluded from the analysis. We performed a sensitivity analysis by analyzing the whole follow-up time period for the RCT and OC groups instead of truncating it at Month 21. Also, to evaluate to what extent our estimation methods remove the potential bias on OS due to baseline confounders and intercurrent events, we applied the traditional IPTW-only method, which adjusts for baseline characteristics but not intercurrent events, and compared its K-M estimates and HR to those of our proposed method. To follow the structure of the estimand framework, we consider this IPTW-only estimation a supplementary analysis because it estimates an estimand different from our target estimand. R (3.6.1) was used for the analyses.
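The missing-data rule above amounts to a simple per-covariate decision, sketched below in Python. The function name `impute_or_drop`, the data layout, and the toy values are hypothetical; the text does not say how a covariate with exactly 30% missing values would be handled, so this sketch assumes it would be imputed.

```python
from statistics import median, mode

def impute_or_drop(columns, kinds, max_missing=0.30):
    """Apply the median/mode imputation rule covariate by covariate.

    columns : dict mapping covariate name -> list of values (None = missing)
    kinds   : dict mapping covariate name -> "continuous" or "categorical"
    Covariates with a missing rate above max_missing are dropped.
    """
    result = {}
    for name, values in columns.items():
        observed = [v for v in values if v is not None]
        missing_rate = 1.0 - len(observed) / len(values)
        if missing_rate > max_missing:
            continue  # excluded from the analysis (e.g., ECOG PS in the study)
        if kinds[name] == "continuous":
            fill = median(observed)
        else:
            fill = mode(observed)
        result[name] = [fill if v is None else v for v in values]
    return result

# Toy data (hypothetical values).
columns = {
    "age": [60, None, 70, 80],
    "smoking": ["yes", "yes", None, "no"],
    "ecog_ps": [0, None, None, 1],
}
kinds = {"age": "continuous", "smoking": "categorical", "ecog_ps": "categorical"}
print(impute_or_drop(columns, kinds))
```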

Cohort characteristics
A total of 849 patients were in the RCT arm and 3,340 patients were in the OC arm (refer to Appendix Table 2 for the OC cohort attrition table). Demographic and clinical characteristics of the study sample at baseline are presented in Table 2 (and in Appendix Table 3 stratified by RCT). Significant differences between the RCT and OC arms were observed in age, gender, race, ECOG-PS, tumor diagnosis type (de novo Stage IV/recurrent disease), histology, time from initial diagnosis to index date, and treatment type. Patients in the OC arm were older, included higher percentages of females, of races other than White and Asian, and of patients with recurrent disease and nonsquamous histology, had a shorter time from initial diagnosis to index date, and were less frequently treated with carboplatin plus paclitaxel/nab-paclitaxel.
The percentage of patients who switched to subsequent antineoplastic treatment during the whole follow-up period was higher in the OC arm than in the RCT arm (56.3% vs 52.9%; Table 3). Among patients who switched, the median time to treatment switch was shorter in the OC arm than in the pooled RCT arm (5.45 vs 6.24 months; 55.8% vs 46.1% switched in the first 6 months). Differences in pre-specified confounders for treatment switching, including age, histology, treatment type, and progression, were observed. Specifically, we saw a higher percentage of switching among patients with squamous histology and among patients with progression events during the follow-up period (Table 4).

Main analyses
A logistic model was fitted (Appendix Table 4) to account for imbalances between the RCT and OC arms in baseline characteristics and to estimate the propensity score (PS). IPTW weights were then calculated using the propensity scores estimated from the logistic model. We excluded a small percentage of patients (0.4%) with extreme weights (weight > 10) in the OC arm to avoid undesirable variability in estimates due to extremely large weights. (17) SMDs for patient variables were all below 0.1 (18) after IPTW (Figure 1), suggesting that balance was achieved on the selected baseline characteristics through IPTW weighting.
Patients were artificially censored at the time of treatment switching. The censoring mechanism was then modeled via a Cox regression model, and the probability of not being censored conditional on pre-specified patient and clinical characteristics was estimated (Table 4).
The stabilized IPCW weights were calculated as the ratio of the probability of not being censored conditional on race only to the probability of not being censored conditional on age, race, histology, and progression. Here, unlike traditional stabilized weights, race was included in both the numerator and the denominator to further increase the stability of the IPCW weight. (19) To stabilize estimation and reduce variability, extreme weights were trimmed at the 99th percentile for the OC arm and the 98th percentile for the RCT arm. After implementing both IPTW and IPCW, SMDs for the majority of baseline and time-varying confounders were reduced to values below 0.1 (Figure 2); SMDs for race, tumor type, and histology remained slightly larger than 0.1.
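The weight construction in this step can be sketched in Python as follows. The sketch rests on two assumptions that the text does not settle: "trimmed at the 99th percentile" is read as capping weights at that percentile rather than excluding patients, and the percentile is computed by linear interpolation between order statistics. All function names and numbers are hypothetical.

```python
def stabilized_weights(p_numerator, p_denominator):
    """Ratio of P(uncensored | numerator covariates) to
    P(uncensored | all covariates), one value per patient-time record."""
    return [n / d for n, d in zip(p_numerator, p_denominator)]

def percentile(values, q):
    """Percentile with linear interpolation between order statistics, q in [0, 100]."""
    ordered = sorted(values)
    pos = (len(ordered) - 1) * q / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)

def cap_weights(weights, q):
    """Cap every weight at the q-th percentile of the weight distribution."""
    cap = percentile(weights, q)
    return [min(w, cap) for w in weights]

# Toy probabilities (hypothetical): numerator conditional on race only,
# denominator conditional on age, race, histology, and progression.
p_num = [0.90, 0.85, 0.80, 0.95]
p_den = [0.60, 0.80, 0.20, 0.90]
sw = stabilized_weights(p_num, p_den)
print(cap_weights(sw, 99))  # 99th percentile, as used for the OC arm
```

Capping keeps every patient in the analysis at the cost of some residual bias, whereas excluding patients changes the analysis population; which of the two the study applied at this step is not stated.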
After accounting for treatment setting assignment at baseline and treatment switching using the IPTW-IPCW method, the HR estimated from the weighted Cox model was 0.94 (95% CI: [0.77, 1.13]), which suggests comparable overall survival between the RCT and OC arms.
Weighted Kaplan-Meier estimates of the survival functions (Figure 3) were comparable overall, although the curves of the two arms crossed. The two curves aligned well at months 7-14, while the RCT arm performed better at months 0-6 and worse at months 15-23. The difference in median survival time between the two arms was small (9.9 months with 95% CI: [8.6, 12.3] for OC vs 10.9 months with 95% CI: [9.6, 12.5] for RCT). These results suggested that, after accounting for imbalances in baseline characteristics and removing the confounding effects of treatment switching, patients in the OC arm had similar OS to those in the RCT arm.

Sensitivity analyses
A sensitivity analysis was performed to analyze the entire follow-up time period (i.e., no truncation) for the RCT and OC arms. The HR was 0.93 (95% CI: [0.77, 1.13]), which was similar to the primary analysis results. However, there were wider confidence intervals for K-M curves after month 21 for both the RCT and OC arms (Figure 4) due to the low number of events.

Supplemental analyses
In a supplemental analysis, we performed an IPTW-only analysis that adjusted for baseline characteristics through IPTW weighting but did not apply IPCW. This is a commonly used method in analyses of external control arms, and it targets a different estimand than the primary analysis. Although the HR of 0.92 (95% CI: [0.81, 1.05]) was similar to that of the primary analysis, there was a larger discrepancy in Kaplan-Meier estimates between RCT and OC than in the primary analysis, especially between Months 6 and 14.

Discussion
In this study we applied the EF and TTF to define a precise scientific question in comparative effectiveness research. As a case study to assess the feasibility of using observational data to construct an external control arm for single-arm trials, we conducted a retrospective cohort study to compare OS among patients with metastatic NSCLC exposed to front-line chemotherapy in RCTs versus routine clinical practice settings, while accounting for differences in subsequent treatments between these settings. To achieve this objective, we pooled clinical trial patients from the control arms of three RCTs (IMpower130, 131, and 132) and developed an OC cohort derived from de-identified EHR data obtained from routine clinical practice. OS was compared between the two arms, assuming a hypothetical scenario wherein patients in neither setting received subsequent therapy after the first-line chemotherapy. We found no evidence of a difference in OS between the two arms. When accounting for baseline confounding as well as differences in patterns of subsequent treatments between clinical trial and routine clinical practice patients, the long-term outcome of first-line treatment for patients with metastatic NSCLC was similar in the two settings.
Our approach attempts to clarify the causal contrast of interest by combining elements of the EF and TTF. The EF and TTF serve complementary purposes in answering the scientific question.
As formulated by Hernán and Robins, the TTF ensures that an appropriate comparative study is designed to help estimate the causal effect from the observed data. (3) While the causal contrast can be specified within the TTF, the EF adds clarity to the causal contrast through the explicit consideration of intercurrent events (i.e., events occurring post-baseline that can affect the assessment of treatment effects). Combined, the EF and TTF improve the transparency in the: 1) target of estimation (causal contrast), 2) assumptions and data needed to identify the causal contrast, and 3) limitations of available data.
To our knowledge, there are a limited number of studies that combine the EF and TTF. Recently, Hampson et al. (2022) (20) combined the EF and TTF using routine clinical care data to generate an external control arm. The approach described in our study adds to the limited number of use cases by accounting for a scenario where patterns of subsequent treatments differ between the clinical trial and routine clinical care data sources. We anticipate that many researchers will encounter this scenario in applications involving real-world external controls. Our study, unlike other studies, also illustrates the iterative nature of specifying an estimand. In practice, such iteration allows a comprehensive and transparent dialogue among stakeholders to reach consensus on the scientific question and its tractability given the available data (i.e., discern the identifiability of the estimand).
Strengths of this study include the combination of the EF and TTF, as well as a large sample size with extensive follow-up and a high proportion of patients with an event of interest. Furthermore, death events were ascertained with high accuracy in both RCT and observational data settings. (21) Lastly, model diagnostics indicated that the weights from IPTW and IPCW induced a good balance in the measured baseline and post-baseline confounders.
Limitations of the study include that, because data were pooled from disparate sources, full information was not available for all possible confounders. For example, there was limited capture of comorbidities, sites of metastasis, and smoking status within the OC arm compared to the RCT arm. The assumption of no unmeasured confounders underlies both IPCW (i.e., baseline as well as time-varying covariates jointly predicting treatment switch and outcome) (22) and IPTW (i.e., baseline covariates jointly predicting treatment setting and outcome).
About 43% (Table 2) of the patients in routine clinical care included in our study had missing ECOG-PS at the start of front-line therapy, some of whom may have had an ECOG-PS value above 1. For context, among adults with NSCLC who received first-line chemotherapy in the real-world setting, 13.6% had an ECOG-PS greater than 1 (Appendix Table 2). A second limitation was that the definition of time-zero differed across the RCT and OC arms. Time-zero was the date of randomization in the clinical trials as compared to the date of treatment initiation in the routine clinical practice cohort. The impact is believed to be small given that treatment was typically initiated within a few days post-randomization. A third limitation is that patients in the IMpower trials were enrolled globally while patients in the OC arm were from the United States only.
Although we account for potential patient confounders in our models, there could be residual confounding effects due to the difference in the region. A final limitation was that we pooled data from the control arms of the RCTs and hence assumed negligible heterogeneity in outcomes among the three clinical-trial cohorts. However, we believe trial heterogeneity posed little bias risk to our study because the survival estimates for each of the three trials were similar (Appendix Figure 1).
In conclusion, this study demonstrated that combining the EF and TTF approaches can improve the rigor in the design and analysis of comparative effectiveness studies, including retrospective observational studies. The EF approach alone does not suffice in specifying a study design, and the TTF alone can leave ambiguity in the inferential target. The combination of the two frameworks should be considered more often by researchers.