Surrogate endpoints for overall survival in randomized clinical trials testing immune checkpoint inhibitors: a systematic review and meta-analysis

Introduction There is debate about which surrogate endpoint and which metric best capture the treatment effect on overall survival (OS) in RCTs testing immune-checkpoint inhibitors (ICIs). Methods We systematically searched for RCTs testing ICIs in patients with advanced solid tumors. Inclusion criteria were: RCTs i) assessing PD-(L)1 and CTLA-4 inhibitors either as monotherapy or in combination with another ICI, and/or targeted therapy, and/or chemotherapy, in patients with advanced solid tumors; ii) randomizing at least 100 patients. We performed a meta-analysis of RCTs to compare the surrogacy value of PFS and modified-PFS (mPFS) for OS in RCTs testing ICIs, when the treatment effect is measured by the hazard ratio (HR) for OS, and by the HR and the ratio of restricted mean survival time (rRMST) for PFS and mPFS. Results 61 RCTs (67 treatment comparisons and 36,034 patients) were included in the analysis. In comparisons testing ICI plus chemotherapy, HRPFS and HRmPFS both had a strong surrogacy value (R2 = 0.74 and R2 = 0.81, respectively). In comparisons testing ICI as monotherapy, HRPFS was the best surrogate, although with only a moderate correlation (R2 = 0.58). In comparisons testing ICI plus other treatment(s), the associations were very weak for all the surrogate endpoints and treatment effect measures, with R2 ranging from 0.01 to 0.22. Conclusion In RCTs testing ICIs, the value of potential surrogates for HROS was strongly affected by the type of treatment(s) tested. The evidence available supports HRPFS as the best surrogate, and does not support the use of alternative endpoints, such as the mPFS, or treatment effect measures, such as the RMST.


Introduction
Overall survival (OS) is the gold-standard endpoint used to demonstrate the clinical efficacy of new cancer drugs in randomized clinical trials (RCTs). The primary effect measure of interest is the ratio of the hazards of death, namely the OS hazard ratio (HR OS ), assessed over the entire follow-up period (FUP) and estimated using the proportional hazards (PH) Cox model.
However, a reliable estimation of HR OS requires large RCTs with long FUP, which increases the costs and the time required before a new cancer drug becomes available to patients. To expedite drug approvals, the evaluation of new treatments in RCTs often relies on the assessment of their effects on surrogate endpoints, under the assumption that these effects accurately predict those on OS at the final analysis (1).
Progression-free survival (PFS) has long been used as a surrogate endpoint for OS in RCTs testing chemotherapy and targeted therapy in patients with advanced solid tumors. Also, the HR estimated from a Cox PH model for PFS (HR PFS ) is routinely used as a measure to empirically compare experimental versus control arms.
The weak correlation (R 2 <0.40) between HR PFS and HR OS observed in RCTs testing immune checkpoint inhibitors (ICIs) (2,3) may challenge the belief that HR PFS is a potential surrogate endpoint for OS. However, the poor correlation between HR PFS and HR OS may be attributed to ICIs' novel mechanisms for activating or rehabilitating self-immunity against tumors, which could result in delayed clinical effects and long-term responders, as well as in disease progression followed by tumor shrinkage (pseudoprogression). In these instances, the PFS curves may take some time to separate, and the immunotherapy agent curve may have a long tail, leading to the violation of the PH assumption on which the calculation of HR PFS is based (4).
The restricted mean survival time (RMST), namely the mean survival time to some prespecified time point t*, was proposed as an alternative treatment effect measure to address both delayed response and long-term responders issues, accounting for deviation from PH assumption (5). The treatment effect on PFS can be measured as the ratio of RMST (rRMST), which is the ratio of the area under the Kaplan-Meier (KM) PFS curve for the control group vs experimental group, from time 0 to a chosen time t*.
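The definition above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the function names `km_curve` and `rmst` and the toy follow-up times are assumptions introduced here. The RMST is computed as the area under the KM step function between 0 and t*, and the rRMST as control over experimental, so that values below one favor the experimental arm.

```python
from typing import List, Tuple


def km_curve(times: List[float], events: List[int]) -> List[Tuple[float, float]]:
    """Kaplan-Meier estimate as a list of (event_time, survival_just_after) steps."""
    data = sorted(zip(times, events))
    n_at_risk = len(data)
    surv = 1.0
    steps = []
    i = 0
    while i < len(data):
        t = data[i][0]
        n_events = 0
        n_leaving = 0
        # Group all subjects sharing the same time point.
        while i < len(data) and data[i][0] == t:
            n_events += data[i][1]
            n_leaving += 1
            i += 1
        if n_events > 0:
            surv *= 1.0 - n_events / n_at_risk
            steps.append((t, surv))
        n_at_risk -= n_leaving
    return steps


def rmst(times: List[float], events: List[int], t_star: float) -> float:
    """Area under the KM step function from 0 to t_star."""
    area = 0.0
    prev_t, prev_s = 0.0, 1.0
    for t, s in km_curve(times, events):
        if t >= t_star:
            break
        area += prev_s * (t - prev_t)
        prev_t, prev_s = t, s
    return area + prev_s * (t_star - prev_t)


# Hypothetical toy arms (months, 1 = event, 0 = censored); not trial data.
control_rmst = rmst([2, 4, 6, 8], [1, 1, 1, 1], t_star=8)
experimental_rmst = rmst([4, 8, 12, 16], [1, 1, 1, 1], t_star=8)
r_rmst = control_rmst / experimental_rmst  # below 1 favors the experimental arm
```

In practice t* has to be chosen so that both arms still have patients at risk; the sketch does not check this.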
Early overlapping PFS curves may also depend on pseudoprogression events, a documented type of response to ICIs that occurs when an initial apparent RECIST-defined progression is observed prior to eventual disease improvement. To account for pseudoprogressions, Wang et al. (6) recently proposed a novel endpoint, the modified PFS (mPFS), which omits the events of disease progression (but not deaths) within n months (e.g., 3 months) after randomization, showing that HR mPFS outperformed HR PFS as surrogate for HR OS in ICI trials.
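As an illustration of how mPFS differs from PFS at the patient level, the sketch below derives both from the same record. It rests on assumptions that are not stated in the text: the `Patient` record and function names are hypothetical, "within n months" is read as "at or before n months", and a patient whose early progression is omitted is assumed to be followed until death or last contact.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class Patient:
    progression: Optional[float]  # months to first progression, None if not observed
    death: Optional[float]        # months to death, None if alive at last contact
    last_follow_up: float         # months to last contact (censoring time)


def pfs(p: Patient) -> Tuple[float, int]:
    """Standard PFS: first of progression or death; otherwise censored."""
    observed = [t for t in (p.progression, p.death) if t is not None]
    if observed:
        return min(observed), 1
    return p.last_follow_up, 0


def mpfs(p: Patient, window: float = 3.0) -> Tuple[float, int]:
    """Modified PFS: progressions inside the window are ignored, deaths are not.

    Assumption: after an omitted early progression the patient contributes
    follow-up until death (event) or last contact (censored).
    """
    progression = p.progression
    if progression is not None and progression <= window:
        progression = None
    return pfs(Patient(progression, p.death, p.last_follow_up))


# Hypothetical patient with an early progression at month 2 and death at month 10.
early_progressor = Patient(progression=2.0, death=10.0, last_follow_up=10.0)
standard = pfs(early_progressor)   # event counted at the early progression
modified = mpfs(early_progressor)  # early progression omitted, death counted
```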
Here, we performed a systematic review and meta-analysis of RCTs testing ICIs in patients with advanced solid tumors to compare the surrogacy value for HR OS of both PFS and mPFS as endpoints and HR and rRMST as treatment effect measures, in strata of type of treatment administered in the experimental arm [i.e., ICI alone, ICI plus chemotherapy, ICI plus ICI or other treatment(s)].

Methods
The value of PFS and mPFS as surrogate endpoints for OS in RCTs testing ICIs was assessed using a meta-analytical approach based on pseudo individual patient-level data (IPD) (see details below). The treatment effect was measured by the HR for OS, and by the HR and the rRMST for the two surrogate endpoints.
We included RCTs: i) assessing PD-1, PD-L1 and CTLA-4 inhibitors either as monotherapy or in combination with another ICI, and/or targeted therapy, and/or anti-angiogenesis drugs, and/or chemotherapy, in patients with advanced solid tumors; ii) randomizing at least 100 patients; iii) displaying the KM survival curves for OS and PFS.
We excluded single-arm phase I and II trials (i.e., nonrandomized trials), RCTs conducted in (neo)adjuvant setting or in hematologic malignancies, and RCTs considering ICIs as control arm (either monotherapy or combined with other therapies).
Titles, abstracts, and full-text articles were reviewed independently by four authors (FC, LP, EP, IS). Inconsistencies were discussed by all authors to reach consensus. Reference lists of articles included in the final selection were reviewed to identify additional relevant papers. When duplicate publications were identified, only the most recent and complete were included.
Based on a predefined form, we extracted data on the following variables: study name, first author and year of publication, study design and blinding, trial phase, primary endpoint(s), underlying malignancy, number of patients, median FUP time, line of therapy, type of experimental and control treatment.

Quality assessment of trials
To ascertain risk of bias, we assessed the methodological quality of each trial using the Cochrane Risk of bias tool (version 5.2.0) (9). Responses in each domain (random sequence generation, allocation concealment, blinding of participants and personnel, blinding of outcome assessment, incomplete outcome data and selective outcome reporting) were assessed as having a 'low', 'unclear' or 'high' risk of bias.

Individual patient-level data reconstruction
Pseudo IPD for PFS and OS were reconstructed from the published KM curves. We used a validated web-based tool [WebPlotDigitizer (10)] to extract data coordinates from published KM curves. Then, pseudo IPD were reconstructed using the validated algorithm proposed by Guyot et al. (11).
To derive mPFS, disjointed PFS and OS pseudo IPD were matched using a simulation-based algorithm, as described in Wang et al. (6). Briefly, the algorithm matches PFS-OS pseudo IPD under the following conditions: i) for a given patient, the PFS duration should not exceed the OS duration; ii) patients with events in the OS pseudo IPD dataset should be a subgroup of patients with events in the PFS pseudo IPD dataset. Given that these requirements are insufficient to accurately capture the original matched PFS-OS IPD, we simulated 1000 qualified datasets of matched PFS-OS pseudo IPD for each included treatment arm.
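The two matching conditions can be written as a simple validity check. The sketch below is not the simulation-based algorithm of Wang et al.; it only tests whether a candidate set of matched PFS-OS records satisfies conditions i) and ii). The function names and the tuple layout `(pfs_time, pfs_event, os_time, os_event)` are assumptions made for illustration.

```python
from typing import Iterable, Tuple

# (pfs_time, pfs_event, os_time, os_event); events coded 1 = event, 0 = censored
MatchedRecord = Tuple[float, int, float, int]


def is_valid_pair(pfs_time: float, pfs_event: int, os_time: float, os_event: int) -> bool:
    """Check conditions i) and ii) for a single matched pseudo-patient."""
    if pfs_time > os_time:
        return False  # i) PFS duration must not exceed OS duration
    if os_event == 1 and pfs_event == 0:
        return False  # ii) a patient with an OS event must also have a PFS event
    return True


def is_qualified_dataset(records: Iterable[MatchedRecord]) -> bool:
    """A matched dataset qualifies only if every pair satisfies both conditions."""
    return all(is_valid_pair(*record) for record in records)


# Hypothetical candidate matchings, not reconstructed trial data.
qualified = is_qualified_dataset([(3.0, 1, 9.0, 1), (6.0, 1, 12.0, 0), (8.0, 0, 8.0, 0)])
rejected = is_qualified_dataset([(10.0, 1, 7.0, 1)])
```

Because many different matchings pass this check, a single qualified dataset is not unique, which is consistent with the use of 1000 simulated datasets per arm described above.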

Statistical analysis
The data extracted from the included three-arm trials were treated as two separate comparisons, with the control arm being duplicated in both comparisons. For this reason, our unit of analysis was the comparison between pairs of treatment arms, and not the trial. Each pairwise comparison was categorized according to the type of treatment administered in the experimental arm: ICI alone, ICI plus chemotherapy, or ICI plus ICI or other treatment(s).
For each comparison, we used the reconstructed pseudo IPD to estimate HR OS , HR PFS , HR mPFS , rRMST PFS , rRMST mPFS with their 95% confidence intervals (95% CI). rRMST was obtained as the ratio of the RMST in the control group to the RMST in the experimental group. Both an rRMST and an HR less than one favored the experimental treatment.
Within each treatment comparison, we compared differences in treatment effect measures by using the ratio of the surrogate effect measure (s; i.e., HR PFS , HR mPFS , rRMST PFS , or rRMST mPFS ) to the final effect measure (f; i.e., HR OS ). A ratio s/f < 1 indicated that the treatment effect size on the surrogate measure overestimates the treatment effect size on the final measure.
The pairwise agreement between the statistical significance of HR OS and each surrogate measure was assessed. In our analysis a p-value below 0.05 was considered statistically significant, which would in turn indicate the potential for drug approval. Cohen's kappa coefficient was used to measure the agreement between HR OS and each surrogate. Additionally, McNemar's statistic was used to test the null hypothesis that both measures provided a similar proportion of statistically significant findings.
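The agreement analysis can be illustrated with a small self-contained sketch. It assumes that each comparison contributes one pair of significance flags (HR OS significant or not, surrogate significant or not). The text does not say which form of McNemar's test was used; the exact binomial version is shown here as an assumption, and all function names and counts are hypothetical.

```python
from math import comb
from typing import Sequence, Tuple


def agreement_table(sig_os: Sequence[bool], sig_surrogate: Sequence[bool]) -> Tuple[int, int, int, int]:
    """Counts: a = both significant, b = OS only, c = surrogate only, d = neither."""
    a = b = c = d = 0
    for os_sig, surrogate_sig in zip(sig_os, sig_surrogate):
        if os_sig and surrogate_sig:
            a += 1
        elif os_sig:
            b += 1
        elif surrogate_sig:
            c += 1
        else:
            d += 1
    return a, b, c, d


def cohen_kappa(a: int, b: int, c: int, d: int) -> float:
    """Chance-corrected agreement; undefined if expected agreement equals 1."""
    n = a + b + c + d
    observed = (a + d) / n
    expected = ((a + b) * (a + c) + (c + d) * (b + d)) / n ** 2
    return (observed - expected) / (1.0 - expected)


def mcnemar_exact_p(b: int, c: int) -> float:
    """Two-sided exact McNemar p-value based on the discordant pairs only."""
    n = b + c
    if n == 0:
        return 1.0
    tail = sum(comb(n, k) for k in range(min(b, c) + 1)) / 2 ** n
    return min(1.0, 2.0 * tail)


# Hypothetical flags for 20 comparisons, not the study data.
sig_os = [True] * 15 + [False] * 5
sig_surrogate = [True] * 10 + [False] * 5 + [True] * 1 + [False] * 4
a, b, c, d = agreement_table(sig_os, sig_surrogate)
kappa = cohen_kappa(a, b, c, d)
p_value = mcnemar_exact_p(b, c)
```

Kappa summarizes how often the two measures reach the same conclusion beyond chance, while McNemar's test looks only at the discordant pairs, so the two can give different impressions of the same table.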
Meta-analyses of the treatment effect measures were conducted using random effects models. A weighted linear regression (WLS) model was used to quantify the association between the treatment effect on the final endpoint (HR OS ) and each surrogate measure (HR PFS , HR mPFS , rRMST PFS , and rRMST mPFS ). The coefficient of determination (R 2 ) was used to quantify the trial-level surrogacy value of each potential surrogate endpoint. The 95% CI for R 2 was estimated by bootstrap analysis with 1000 samples. According to ReSEEM guidelines (8), R 2 values equal to or higher than 0.7 represent strong correlations (and were therefore considered suggestive of surrogacy), values between 0.5 and 0.69 represent moderate correlations, and values lower than 0.5 represent weak correlations. The slope of the regression line was also reported as an alternative measure of surrogacy. For the treatment effects to be associated, we required that the slope significantly differed from zero. Finally, we calculated the surrogate threshold effect (STE), defined as the minimum treatment effect on the surrogate endpoint necessary to predict a significant OS benefit in a future trial.
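A minimal sketch of the trial-level regression step is given below, using NumPy. It is an illustration under stated assumptions rather than the authors' implementation: the regression is fitted on log-transformed effects, the weights are taken to be the number of patients per comparison, the bootstrap resamples comparisons with replacement and uses percentile limits, and the function names and toy values are hypothetical.

```python
import numpy as np


def wls_fit(x, y, w):
    """Weighted least-squares line y = a + b*x; returns (a, b, weighted R^2)."""
    x, y, w = (np.asarray(v, dtype=float) for v in (x, y, w))
    x_bar = np.average(x, weights=w)
    y_bar = np.average(y, weights=w)
    slope = np.sum(w * (x - x_bar) * (y - y_bar)) / np.sum(w * (x - x_bar) ** 2)
    intercept = y_bar - slope * x_bar
    ss_res = np.sum(w * (y - intercept - slope * x) ** 2)
    ss_tot = np.sum(w * (y - y_bar) ** 2)
    return intercept, slope, 1.0 - ss_res / ss_tot


def bootstrap_r2_ci(x, y, w, n_boot=1000, seed=0):
    """Percentile 95% CI for R^2, resampling comparisons with replacement."""
    x, y, w = (np.asarray(v, dtype=float) for v in (x, y, w))
    rng = np.random.default_rng(seed)
    r2_values = []
    for _ in range(n_boot):
        idx = rng.integers(0, len(x), size=len(x))
        # Skip degenerate resamples in which the line or R^2 is undefined.
        if np.ptp(x[idx]) == 0 or np.ptp(y[idx]) == 0:
            continue
        r2_values.append(wls_fit(x[idx], y[idx], w[idx])[2])
    lower, upper = np.percentile(r2_values, [2.5, 97.5])
    return lower, upper


# Hypothetical toy comparisons (not the study data).
ln_hr_pfs = np.log([0.60, 0.70, 0.80, 0.90, 1.00, 1.10])
ln_hr_os = np.log([0.70, 0.72, 0.85, 0.88, 0.97, 1.05])
n_patients = [300, 450, 500, 250, 600, 400]

intercept, slope, r2 = wls_fit(ln_hr_pfs, ln_hr_os, n_patients)
r2_lower, r2_upper = bootstrap_r2_ci(ln_hr_pfs, ln_hr_os, n_patients)
```

The resulting R 2 would then be read against the thresholds quoted above (0.7 for strong, 0.5 for moderate). The STE additionally requires prediction bands around the fitted line, which this sketch does not compute.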
A comprehensive description of statistical analysis was included in the Supplementary Material.

Patient and public involvement
Members of the study group meet regularly with patient representatives to discuss ongoing scientific projects and activities. During these meetings the project and its objectives were discussed, and we accepted the patients' suggestions, which mainly concerned the need to make the final version of the paper as clear and non-technical as possible, and to disseminate the results widely given their relevant implications for research and clinical practice.

Results
Overall, 61 RCTs comprising a total of 36,034 patients were included in the analysis (Supplementary Figure S2, Supplementary Table S1).
Table 1 shows the main features of included trials. Six phase I/II or II or II/III trials (10%) and 55 phase III trials (90%) were included. The publication years spanned from 2011 to 2021, with 29 (48%) published in 2021. Forty-one studies (67%) were in the first-line setting and 27 (44%) enrolled patients with non-small cell lung cancer. OS alone was the primary endpoint in 25 (41%) trials, PFS alone in 12 trials (20%), OS and PFS or OS and overall response rate (ORR) as co-primary in 22 (36%) and 1 trial (2%), respectively. The sole phase I/II included trial used ORR alone as primary endpoint.
The median FUP of trials was 18.9 months, ranging from 5.1 to 69.5 months.
Six trials had three treatment arms, for a total of 67 comparisons analyzed (Table 2). Thirty comparisons (45%) tested ICI alone, 22 comparisons (33%) evaluated the combination of ICI with chemotherapy, and 15 comparisons (22%) tested other ICI(s)-containing combinations, including an anti-PD-(L)1 combined with anti-angiogenesis drugs in 5 trials, the combination of an anti-PD-(L)1 with an anti-CTLA-4 drug in 7 trials, combo-immunotherapy (i.e., anti-PD-(L)1 plus anti-CTLA-4) plus chemotherapy in 2 trials, and an anti-PD-L1 combined with targeted therapy in one trial. The treatment administered in the control arm was chemotherapy alone in 56 comparisons (84%), anti-angiogenesis agent alone or in combination with chemotherapy in 6 comparisons (9%), targeted therapy in 2 comparisons (3%) and placebo in 3 comparisons (5%). Chemotherapy was administered as control arm in all the comparisons testing ICI plus chemotherapy, in 27 (90%) comparisons testing ICI alone, and in 7 (47%) comparisons testing ICI plus ICI or other treatment(s).
The proportional hazards assumption was not rejected (i.e., Grambsch-Therneau test p-value greater than 0.05) for both OS and PFS in 31 out of 67 comparisons (46%), for OS only in 15 comparisons (22%), for PFS only in 2 comparisons (3%), and neither for OS nor PFS in 19 comparisons (28%).
The PH assumption held for both OS and PFS in 82% of comparisons testing ICI plus chemotherapy, in 53% of comparisons testing ICI plus ICI or other treatment(s), and in only 5 (17%) comparisons testing ICI alone.
Supplementary Table S2 reports the quality assessment of trials according to the Cochrane Risk of bias tool. Overall, the quality of trials was high, as the risks of selection, attrition, reporting and other forms of bias for all the RCTs included in the analysis were low. The only potential biases affecting trials were performance and detection bias, since only 21 out of 61 RCTs had a double-blind design.
Supplementary Table S3 reports the treatment effect estimates and their ratios derived from pseudo IPD extracted from the published KM curves for each included comparison. The HRs for OS ranged between 0.43 and 1.03 in comparisons testing ICI alone, between 0.56 and 1.26 in comparisons testing ICI plus chemotherapy, and between 0.59 and 0.86 in comparisons testing ICI plus ICI or other treatment(s). Notably, pooled HR OS were very similar across strata: HR OS =0.77 (95% CI: 0.73, 0.82) for ICI alone, HR OS =0.78 (95% CI: 0.73, 0.84) for ICI plus chemotherapy, and HR OS =0.78 (95% CI: 0.73, 0.82) for ICI plus ICI or other treatment(s) (Figure 1).
Across the 67 comparisons, the time horizons t* defining RMST ranged from 8.7 months to 5.1 years (median 20.9 months [IQR: 15.0, 26.0]) and no important differences emerged between strata.
Figure 1 shows the pooled ratio between each surrogate treatment effect measure and HR OS by type of treatment administered in the experimental arm.
Notably, the HR PFS showed a larger treatment effect for the experimental treatment compared to that observed for HR OS (ratio of the two treatment effect measures <1) in 7 (23%) comparisons testing ICI alone, in 20 (91%) comparisons testing ICI plus chemotherapy, and in 8 (53%) comparisons testing ICI plus ICI or other treatment(s) (Supplementary Table S3).
Overall, the HR PFS significantly underestimated the protective treatment effect size observed on HR OS in comparisons testing ICI alone (pooled ratio HR PFS /HR OS =1.14, 95% CI: 1.06, 1.24) and, on the contrary, significantly overestimated it in comparisons testing the combination of ICI plus chemotherapy (pooled ratio HR PFS / HR OS =0.88, 95% CI: 0.83, 0.93).
Conversely, the HR mPFS significantly overestimated the protective treatment effect size observed on HR OS regardless of the type of treatment administered in the experimental arm (pooled ratio HR mPFS /HR OS was 0.90 (95% CI: 0.86, 0.95) for ICI alone, 0.93 (95% CI: 0.89, 0.97) for ICI plus chemotherapy, and 0.89 (95% CI: 0.83, 0.95) for ICI plus ICI or other treatment(s); Figure 1).
Supplementary Table S4 shows the pairwise agreement between the statistical significance of HR OS and each surrogate measure. In comparisons testing ICI alone, there was no agreement between statistical significance of HR OS and HR PFS (Cohen's [...]
Figure 2 shows the correlations between effects of ICIs on OS (y-axis) and the potential surrogate endpoints (x-axes; i.e., panel A: HR PFS , panel B: HR mPFS , panel C: rRMST PFS , and panel D: rRMST mPFS ) according to the type of treatment administered in the experimental arm.
In the ICI plus ICI or other treatment(s) stratum, the associations were very weak for all the surrogate endpoints and treatment effect measures, with the R 2 ranging from 0.01 to 0.22.
The surrogacy equations between the log-transformed treatment effects and the ln-HR OS estimated from the WLS regression, along with the R 2 coefficient, the prediction bands, and STE, were displayed in Figure 3 and Supplementary Figures S3 and S4 for ICI alone, ICI plus chemotherapy, and ICI plus ICI or other treatment(s), respectively.
The slope of the surrogacy WLS regression lines significantly differed from zero for all the analyses on ICI alone and ICI plus chemotherapy comparisons.

Discussion
There is controversy on the value of PFS as surrogate endpoint for OS in RCTs testing anticancer immunotherapy in patients with advanced solid tumors (2,3,79,80).Another issue of intense debate is the most adequate metric to capture the treatment effect on PFS, since the widely adopted HR relies on the PH assumption, which is frequently violated in immunotherapy trials (4).
To our knowledge, we reported here the most comprehensive and updated analysis exploring these two related issues: i) the value of the PFS as surrogate endpoint for OS in trials testing ICIs, as compared with the alternative endpoint represented by mPFS; ii) the suitability of the HR to measure the treatment effect on PFS, as compared with the alternative metric represented by RMST.
Our findings help to clarify the aforementioned conflicting results regarding the surrogacy value of PFS (2,3,79,80). When all the RCTs were pooled together, none of the endpoints (i.e., PFS and mPFS) or metrics (i.e., HR and RMST) investigated had a strong association with HR OS.
For example, we found that the R 2 of the association between HR PFS and HR OS was 0.38 (95% CI: 0.23, 0.52).This means that only 38% of the variability among treatment effects on OS is explained by the effects observed on PFS, far from the R 2 cut-off value of 0.70, which is considered optimal for a candidate surrogate endpoint by international guidelines (8).
However, when comparisons testing different types of treatments were analyzed separately, we observed a moderate (i.e., 0.5≤R 2 ≤ 0.69) or a strong (i.e., R 2 ≥0.7) association between HR PFS and HR OS in the two groups of comparisons testing ICI alone or combined with chemotherapy, respectively. This discrepancy between results of overall and stratified analyses might be explained by the fact that the HR PFS and HR OS estimates were well aligned within the two groups, but along two different regression lines. The regression line fitting the HR PFS over HR OS in comparisons testing ICI alone had lower values of both the intercept (-0.22) and the slope (0.43), compared to those estimated in the ICI plus chemotherapy group (intercept=0.05, slope=0.77).
The slope indicates the steepness of the regression line and captures the average rate of HR OS change when HR PFS increases by 1 unit. Slope values near zero indicate that negligible HR OS changes occur as the HR PFS changes, while slope values that are progressively further from zero indicate increasingly larger rates of HR OS change. Our results suggest that in both groups of trials an improvement of PFS favoring the experimental arm translated into an OS improvement, but the amount of OS gain with the same treatment effect on PFS was different. As a matter of fact, even if the pooled HR OS was very similar in the two groups (0.77 [95% CI: 0.73, 0.82] and 0.78 [95% CI: 0.73, 0.84] for ICI alone or combined with chemotherapy, respectively), the pooled HR PFS differed significantly between them (0.86 [95% CI: 0.77, 0.97] and 0.68 [95% CI: 0.62, 0.74], respectively). This might be explained by the fact that the biological effects and the impact on the natural history of disease of each type of treatment can differ substantially, and thus the amount of PFS improvement that eventually translates into an effect on OS is strictly treatment-dependent.
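The point about two different regression lines can be made concrete with a short calculation. The sketch assumes that the reported intercepts and slopes apply on the natural-log scale, as stated for the surrogacy equations, so that ln(HR OS) = intercept + slope × ln(HR PFS). The function name is hypothetical, and plugging in the pooled HR PFS values is purely illustrative.

```python
import math


def predict_hr_os(hr_pfs: float, intercept: float, slope: float) -> float:
    """Predicted HR for OS from a line fitted on the log scale (assumed form)."""
    return math.exp(intercept + slope * math.log(hr_pfs))


# Coefficients quoted in the text for the two strata.
ICI_ALONE = {"intercept": -0.22, "slope": 0.43}
ICI_CHEMO = {"intercept": 0.05, "slope": 0.77}

# Pooled HR PFS of each stratum fed into its own line.
os_ici_alone = predict_hr_os(0.86, **ICI_ALONE)
os_ici_chemo = predict_hr_os(0.68, **ICI_CHEMO)

# The same PFS effect (HR PFS = 0.70) read on the two lines.
same_pfs_alone = predict_hr_os(0.70, **ICI_ALONE)
same_pfs_chemo = predict_hr_os(0.70, **ICI_CHEMO)
```

Under these assumptions the two pooled HR PFS values map to predicted HR OS of roughly 0.75 and 0.78, close to the pooled HR OS reported for the two strata. An identical HR PFS of 0.70 gives about 0.69 on the ICI-alone line and about 0.80 on the ICI plus chemotherapy line, which illustrates why pooling the strata blurs the association.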
Homogeneity of treatment types and their biological mechanisms of action should be guaranteed in surrogacy analyses (8).If this principle is overlooked, even a strong association between the candidate surrogate and the true endpoint can be hidden, leading to the erroneous conclusion of absence of surrogacy.
Consistently, the absence of PFS surrogacy observed in the third group of comparisons [i.e., ICI plus ICI or other treatment(s)] can be attributed to the heterogeneous spectrum of therapies considered, including the combination of ICI with anti-angiogenesis agents, targeted therapies, or different types of ICIs.The limited number of RCTs available for each of these treatment types precluded the possibility to perform more specific analyses.Therefore, no conclusions can be drawn on the surrogacy value of PFS for immunotherapy strategies other than ICI alone or combined with chemotherapy.
Finally, our results showed that mPFS did not confer an advantage over PFS.
Concerning the issue of the best metric for measuring the treatment effect on PFS, it has been reported that the RMST can provide theoretical advantages over the HR.RMST is not affected by deviations from the PH assumption, which is common in immunotherapy trials due to the intrinsic mechanisms of action of ICIs that lead to delayed responses and long-term responders (4,81).
Although in 54% of included comparisons (36 out of 67) the PH assumption was violated (for PFS or OS only, or both), we unexpectedly found that the HR outperformed the rRMST, yielding higher R 2 values for PFS over OS in all of the explored contexts. Notably, although the largest difference favoring HR versus rRMST was observed in the group of trials testing ICI plus chemotherapy, where the PH assumption held in the majority of cases for both OS and PFS, no advantage for rRMST was observed in the other two groups of comparisons, where the PH assumption was largely violated.
Our analysis revealed that HR PFS is limited by the tendency to significantly underestimate the treatment effect size observed on HR OS in comparisons testing ICI alone (pooled ratio HR PFS /HR OS =1.14, 95% CI: 1.06, 1.24) and, on the contrary, to significantly overestimate it in comparisons testing the combination of ICI plus chemotherapy (pooled ratio HR PFS /HR OS =0.88, 95% CI: 0.83, 0.93).
This had in turn a relevant impact on the STE of HR PFS , which is defined as the minimum value of the HR PFS necessary to be observed in a future RCT to confidently predict an OS benefit. Our results showed that a HR PFS lower than 1 could be enough to predict a significant OS benefit in RCTs testing ICI alone, while a HR PFS lower than 0.70 is required in trials testing ICI plus chemotherapy.
The overestimation of the final effect on HR OS by HR PFS is expected, since all the events considered in the OS endpoint are also included in the PFS. Similar findings were previously reported in surrogacy analyses on RCTs testing chemotherapy or targeted therapy (80, 82). In contrast, the underestimation of the final effect on HR OS by HR PFS observed in trials testing ICI alone is quite unusual and it could be due to the pseudo-progression events. These events are specifically observed in patients treated with ICI alone and they can lead to an erroneous and systematic underestimation of the ICI effect on PFS without affecting patients' OS. In accordance with this hypothesis, the underestimation of the treatment effect on OS in comparisons testing ICI alone was not observed when considering the mPFS that accounts for the pseudo-progression events. However, the weak value of the R 2 precluded the possibility to use such endpoint in this context.
It is worth noting that the incidence of pseudo-progression events has been reported to be higher among patients treated with anti-CTLA4 monotherapy as compared to anti-PD(L)1 monotherapy. Consequently, the surrogacy value of mPFS might be substantially higher in trials exclusively testing anti-CTLA4 treatments. However, the limited number of RCTs assessing anti-CTLA4 monotherapy precluded conducting a more detailed analysis.
The immune-related RECIST (iRECIST) are new response criteria specifically designed to assess the activity of immunotherapy, and to correctly categorize real versus pseudo-progression events (83).The adoption of iRECIST to categorize the events included in the PFS would probably improve its tendency to underestimate treatment effects on OS in trials testing ICI alone, but additional study and validation of iRECIST are required.
Our study has several strengths. It is the most comprehensive and updated analysis on such a relevant topic, and it includes a large number of RCTs and patients. The efforts to reconstruct the IPD of more than 36,000 patients using a validated algorithm allowed us to reliably assess and compare different endpoints and metrics. Also, the wide range of treatment effect estimates reported in the included trials for both PFS and OS contributed to ensuring adequate generalizability of the analysis.
The main limitation of our analysis is that it is based on reconstructed rather than original IPD. Original IPD allow for checking the plausibility of randomization sequences, verifying data integrity and consistency, fitting bivariate and copula-based models, which are among the preferred methods of assessing trial-level associations, adjusting the analyses for baseline prognostic covariates, and accounting for the fact that each within-trial surrogate outcome is estimated with error (84). Nevertheless, the specific goal of our analysis was to assess surrogacy at trial-level, and we used only data from high-quality RCTs. Therefore, an analysis based on original IPD is unlikely to substantially change our conclusions.
Furthermore, results from the analysis stratified by type of treatment administered in the experimental arm require further validation before being considered conclusive. Finally, the shape of the PFS curves, and thus the surrogacy value of PFS, could be meaningfully affected by the specific inclusion criteria used to select the patient populations enrolled in RCTs, especially the enrichment for molecular and clinical biomarkers predictive of response to immunotherapy, such as expression levels of PDL1, tumor mutational burden, smoking habits and gender. The lack of original IPD precluded conducting such types of granular analyses.
A relevant consideration should be highlighted. Regulatory agencies consider the HR OS as the gold-standard measure for assessing treatment effects in RCTs and approving new drugs (85). As a result, we used it as the reference measure in our surrogacy analysis.
In conclusion, our results showed that HR PFS had a strong surrogacy value for HR OS in comparisons testing ICI in combination with chemotherapy and a moderate one in comparisons testing ICI alone. Therefore, it should remain the reference surrogate endpoint in such contexts. Even in the presence of significant deviation from the PH assumption, the available evidence does not support the use of alternative endpoints, such as the mPFS, or metrics, such as the RMST.
Finally, two caveats should be highlighted. First, the available evidence does not allow for an adequate investigation of the value of HR PFS as surrogate for other types of immunotherapy treatment strategies. Second, when using treatment effects on HR PFS to predict those on HR OS , the tendency to either underestimate or overestimate the final OS should be taken into account, depending on the type of treatment under investigation.

FIGURE 1 Forest plots showing meta-analytic pooled estimate (with 95% CI) of treatment effects on OS and potential surrogate endpoints, meta-analytic pooled estimate (with 95% CI) of the ratio between surrogate endpoint and HR OS , R 2 coefficient (with 95% CI) from the weighted linear regression and surrogate threshold effect (STE), by type of treatment administered in the experimental arm. The figure shows in the left panel the meta-analytic pooled estimate (circles) of treatment effects on potential surrogate endpoints, by type of treatment administered in the experimental arm. Horizontal lines indicate the 95% CI and the solid vertical line indicates a HR or rRMST of 1, which is the null-hypothesis value. Values <1 indicate a treatment effect in favor of the experimental arm, while values >1 indicate treatment effects in favor of the control. The meta-analytic pooled estimates of HR OS are also displayed. The central panel shows the meta-analytic pooled estimate (circles) of the ratio between surrogate endpoint and HR OS , by type of treatment administered in the experimental arm. Horizontal lines indicate the 95% CI, and the solid vertical line indicates a ratio of 1, which is the null-hypothesis value. Values <1 indicate a surrogate endpoint that overestimates the protective treatment effect size observed with HR OS , while values >1 indicate a surrogate endpoint that underestimates it. The right panel shows the R 2 coefficient (with 95% CI) estimated from the weighted linear regression model, by type of treatment administered in the experimental arm. Surrogate threshold effect (STE) values are also reported on the right.

FIGURE 2 Correlations between effects of ICIs on OS and the potential surrogate endpoints, PFS (A, C) and mPFS (B, D) by type of treatment administered in the experimental arm. The figure shows the correlations between effects of ICIs on OS and the potential surrogate endpoints, PFS (A, C) and mPFS (B, D) according to the type of treatment administered in the experimental arm [i.e., ICI alone, ICI plus chemotherapy, and ICI plus ICI or other treatment(s)]. The treatment effects are measured by the HR for OS, and by the HR and the rRMST for the two surrogate endpoints. Each circle represents a comparison, and the surface area of the circle is proportional to the number of patients in the corresponding comparison. Red circles represent comparisons with ICI alone as experimental arm, blue circles represent comparisons with ICI plus chemotherapy as experimental arm, and green circles represent comparisons with ICI plus ICI or other treatment(s) as experimental arm. Dashed lines represent weighted regression lines. The R 2 coefficients, with their 95% CI (displayed in square brackets), were reported in the legend.

FIGURE 3 The correlations between effects of ICI alone on OS and the potential surrogate endpoints, PFS (A, C) and mPFS (B, D). The figure shows the correlations between effects of ICI alone on OS and the potential surrogate endpoints, PFS (A, C) and mPFS (B, D). The treatment effects are measured by HR for OS, and by the HR and the rRMST for the two surrogate endpoints. Each circle represents a comparison, and the surface area of the circle is proportional to the number of patients in the corresponding comparison. The solid line represents the weighted regression line. Dashed lines represent 95% prediction bands based on the values predicted by the weighted regression model. The surrogate threshold effect (STE) is represented by the intersection point between the horizontal line y=1 and the upper 95% prediction band. The black diamond indicates the meta-analytic pooled estimate. The diamond's width represents the 95% CI of the surrogate pooled estimate, and its height represents the 95% CI of the HR OS pooled estimate. The surrogacy equation between the log-transformed treatment effects and the ln-HR OS estimated from the weighted linear regression, the R 2 coefficient, and the pooled ratio between surrogate endpoint and HR OS were also reported with their 95% CI (displayed in square brackets).
) guidelines to perform this systematic review and meta-analysis. We searched PubMed, Embase, and Scopus for phase II or III RCTs testing ICIs, published from the inception of each database to December 31, 2021. We also reviewed abstracts and presentations from all major conference proceedings, including the American Society of Clinical Oncology and the European Society for Medical Oncology, from January 2010 to December 2021.
ln-HR, natural logarithm of hazard ratio; ln-rRMST, natural logarithm of ratio of restricted mean survival time; mPFS, modified progression-free survival; OS, overall survival; PFS, progression-free survival; PH, proportional hazards; R 2 , coefficient of determination; RCT, randomized clinical trial; RMST, restricted mean survival time; rRMST, ratio of restricted mean survival time; STE, surrogate threshold effect; WLS, weighted linear regression; FUP, follow-up period.

TABLE 1
Main features of included trials.
OS, overall survival; PFS, progression-free survival; ORR, overall response rate.

TABLE 2
Main features of included comparisons.
-value 0.014 for HR OS vs HR PFS , rRMST PFS and rRMST mPFS ). In the ICI plus ICI or other treatment(s) stratum, the agreement was very poor with all the surrogate measures (Cohen's Kappa coefficient ranged from -0.06 for HR OS vs HR mPFS [McNemar's test p-value = 0.257] to 0.12 for HR OS vs HR PFS and rRMST PFS [McNemar's test p-value = 0.414]).