Can we predict discordant RECIST 1.1 evaluations in double read clinical trials?

Background In lung clinical trials with imaging, blinded independent central review with double reads is recommended to reduce evaluation bias and the Response Evaluation Criteria In Solid Tumor (RECIST) is still widely used. We retrospectively analyzed the inter-reader discrepancies rate over time, the risk factors for discrepancies related to baseline evaluations, and the potential of machine learning to predict inter-reader discrepancies. Materials and methods We retrospectively analyzed five BICR clinical trials for patients on immunotherapy or targeted therapy for lung cancer. Double reads of 1724 patients involving 17 radiologists were performed using RECIST 1.1. We evaluated the rate of discrepancies over time according to four endpoints: progressive disease declared (PDD), date of progressive disease (DOPD), best overall response (BOR), and date of the first response (DOFR). Risk factors associated with discrepancies were analyzed, two predictive models were evaluated. Results At the end of trials, the discrepancy rates between trials were not different. On average, the discrepancy rates were 21.0%, 41.0%, 28.8%, and 48.8% for PDD, DOPD, BOR, and DOFR, respectively. Over time, the discrepancy rate was higher for DOFR than DOPD, and the rates increased as the trial progressed, even after accrual was completed. It was rare for readers to not find any disease, for less than 7% of patients, at least one reader selected non-measurable disease only (NTL). Often the readers selected some of their target lesions (TLs) and NTLs in different organs, with ranges of 36.0-57.9% and 60.5-73.5% of patients, respectively. Rarely (4-8.1%) two readers selected all their TLs in different locations. Significant risk factors were different depending on the endpoint and the trial being considered. Prediction had a poor performance but the positive predictive value was higher than 80%. The best classification was obtained with BOR. Conclusion Predicting discordance rates necessitates having knowledge of patient accrual, patient survival, and the probability of discordances over time. In lung cancer trials, although risk factors for inter-reader discrepancies are known, they are weakly significant, the ability to predict discrepancies from baseline data is limited. To boost prediction accuracy, it would be necessary to enhance baseline-derived features or create new ones, considering other risk factors and looking into optimal reader associations.


Introduction
In 2004, the Food and Drug Administration recommended double radiology reads for clinical trials with blinded independent central review (BICR) to minimize evaluation bias (1).Due to the variabilities in observers' evaluations, diagnostic results can be discordant (2).In such situations, a third radiologist, the "adjudicator", is required so that a final decision can be made (3).The rate of discordance (a.k.a. the adjudication rate), which is the number of discordant evaluations out of the total number of patients in the trial, is the preferred high-level indicator that summarizes the overall reliability of assessments of trials (4).The monitoring of observers' variability through the adjudication rate requires a burdensome process, which all stakeholders aim to make cost-effective (5,6), with the goal of minimizing interreader discordances.
The Response Evaluation Criteria In Solid Tumor (RECIST) ( 7) is widely used and accepted by regulatory authorities for evaluating the efficacy of oncology therapies in clinical trials with imaging.The very purpose of RECIST is to assign each patient to one of the classes of response to therapy: progressive disease (PD), stable disease (SD) or responders (partial or complete response [PR, CR]) (7).When categorized as PD, patients must be withdrawn from the study and their treatment stopped.Depending on the development phase of the drug (8), different trial endpoints can be derived from the RECIST assessments.Indeed, in phase 2, the study endpoint generally relates to response (responder vs nonresponder) while in phase 3, it relates to progression.Each of these trial endpoints have their own statistical features linked to their respective kind of inter-reader discrepancy (KoD).Therefore, for a given trial, the most relevant KoD to monitor can differ from another trial.
During trials, patients undergo a sequential radiological RECIST 1.1 evaluation with a probability of discrepancy occurring at each radiological timepoint response (RTPR).We can hypothesize that each of the KoDs has a different likelihood of occurrence over time.We can also assume that "at-risk-periods" and "at-risk-factors" exist for discrepancies occurring during patient follow-up.From an operational standpoint, confirming these assumptions would be particularly relevant for BICR trials with double reads (3,9).
Clinical trials can often take a long time to complete, during which changes may arise from various sources: readers may become more experienced, tumor shapes may complexify, or operational parameters may have unintended impacts.Thus, it is essential to assess the broad trends while the trial is in progress to gain insight into any potential changes that may have occurred and their effects on the trial's ultimate results.
The issue around RECIST subjectivity has been widely discussed (10,11) and there is consensus on risk factors related to disease evaluation at baseline.These risk factors can be grouped into tumor selection (12) and quantification (13).However, it is still unclear which of these risk factors most impact the response and how they interact with each other to allow prediction at baseline as to which reads are more likely to become discrepant at follow-up.In addition, a data-driven approach using machine learning (ML) could be an opportunity to test whether baseline-derived features are predictors of inter-reader discordances.
We conducted a retrospective analysis of inter-reader discrepancies in five BICR RECIST lung trials with double reads with three primary objectives: 1) to investigate the discrepancy rate over time, aiming to identify influential high-level factors, 2) to identify risk factors for discrepancies related to RECIST-derived features at baseline, and 3) to use confirmed risk factors for discrepancies to evaluate predictive models.

Study data inclusion criteria
Our retrospective analysis included results from five BICR clinical trials (Trials 1-5) that evaluated immunotherapy or targeted therapy for lung cancer (Table 1).The selected BICR trials were conducted between 2017 and 2021 with double reads with adjudication based on RECIST 1.1 guidelines.All data were fully blinded regarding study sponsor, study protocol number, therapeutic agent, subject demographics, and randomization.For these five trials, a total of 1724 patients were expected, involving 17 radiologists.The central reads were all performed using the same radiological reading platform (iSee; Median Technologies, France).

Read paradigm
Two independent radiologists performed the review of each image and determined the RTPR in accordance with RECIST 1.1.According to the trial's endpoints (response or progression), specific KoDs triggered adjudications that were pre-defined in an imaging review charter.The adjudicator reviewed the response assessments from the two primary readers and endorsed the

Analysis plan
We considered four KoDs related to the standard endpoints used on trials: A. Two related to event or time-to-event of disease progression: 1. Progressive disease declared (PDD): Discrepant PD detection when only one of the readers declared PD during the follow-up.One example of patient follow-up with corresponding KoDs is provided in Table 2.
For each trial and KoD, our study addressed the distribution of the discrepancies, the risk factors for discrepancies, and their predictions:

Distribution of discrepancies
a) The rate of discrepant patients At trial completion (or near completion for Trial 1), we measured the ratio of the number of patients for whom a KoD was detected during their follow-up to the number of patients.
b) The rate of discrepant patients over time The temporal discrepancy rate can be written as the cumulative function of the distribution of probability for discrepancy at each  For each KoD and for each trial, we analyzed the rate of discrepancies over time.This rate can be simply calculated as the ratio of the number of discrepant patients included from the beginning of the study to the number of patients included from the start of the study during the same period (Equation 2).
Where t is the time, t=0 at trial onset (first patient in).c) The proportion of discrepancies occurring at each followup time For each KoD and each trial, we computed the average proportion of discrepant patients occurring between two time points (Equation 3).These proportions, which are functions of the progression free survival (PFS) curve (therefore of the survival curve) and the probability of KoD occurrence during an interval of time, are displayed along with the proportions of patients still evaluated until this time point.Where FU is an interval in patient follow-up time and t ranging [Baseline; FUmax] is the maximum follow-up duration measured in the study.
d) The probability of discrepancy during follow-up For each KoD and each trial, as presented earlier in Equation 1, we provided the probability of a patient having a discrepant diagnosis during a given follow-up interval.We computed this probability as the ratio of the number of discrepant diagnoses to the number of patients that were evaluated during this follow-up interval of time (Equation 4).
Where FU is a given follow-up time point.Our discrepancy analysis considered only patients who underwent at least one follow-up visit after baseline, for a clearer display, we resampled curves in a standardized time-frame of one month.

Baseline-derived risk factors for discrepancy
We wanted to identify the variabilities in the RECIST process of selection and/or measurement performed at baseline (Figure 1) which were likely to entail discrepancy in responses (Figure 2).For this aim, we arbitrarily considered risk factors likely 1) to quantify measurement variability at baseline in computing the Delta Burden between the two readers as the relative difference of their SOD (Abs (SOD1-SOD2)/(SOD1+SOD2)) and in measuring SPropSOD (Annex C); 2) to quantify the variability in TL selection at baseline in reporting when the two readers did not select their TLs in exactly the same organs, when they selected TLs in totally different organs, or when one of the reader selected a TL in a Inter-observer discrepancies at baseline.For a same lesion, two radiologist might consider non measureable a lesion due to its size not meeting the RECIST 1.1 measurability criteria of 1cm long axis for non lymph-node lesions (1a,1b,2a,2b) and 1.5cm short axis for lymph node (3a,3b).The measureability criteria can also be challenged for large lesions with ill defined margins and non robust measurement (4a,4b).Two radiologist might consider one same disease lesion belonging to lymph-node or not lymph-node organ resulting in a large measurement discrepancy due to the different method of measurement for the two type of organs (5a,5b).Measurement discrepancy can also be linked to the selected series for measurement such as arterial versus portal phase (6a,6b) or to the type of lesions such as cavitary lesions (7a,7b).
particularly infrequent location.We also reported when, at least, one of the readers did not select any TL at baseline.At least, we considered a risk factor when the two readers did not select all their NTL in the same organs.Comprehensive description of all the risk factors are provided in Annex A. By means of odds ratios (ODDs), we performed a univariate analysis testing associations between KoDs and a set of predefined features (14) (Annex A) derived from risk factors.These features applied to target lesions (TLs) and nontarget lesions (NTLs) and were stratified according to the different diseased organs (See Annex B).

Predictions of discrepancy derived from baseline evaluations
Since the association of a risk factor with a given outcome must be strong (ODD>10) to make classification effective (15), we used previously identified risk factors to train a ML model.After features reduction and classification, we documented the performances of two classification systems.

Statistics
All statistics were performed using base version and packages from R CRAN freeware.
Confidence Intervals (CIs) of discrepancies rates were computed by Clopper-Pearson exact CI method (16).We used "PropCIs" package.Multiple comparisons of continuous variables were performed using the Dunnett-Tukey-Kramer method for unequal sample size (17) with "DTK" package.Multiple comparisons of proportions were performed using Marascuilo test (18).
We derived the proportions of detected KoDs from the 95 th percentile of patient follow-up duration in trials and the 95 th percentile of follow-up duration until the first occurrence of KoDs.ODDs were computed using "fmsb" package (19), with associated p-values for significant associations.
Continuous variables were analyzed using two samples nonparametric Wilcoxon test (discrepant versus non-discrepant patient groups).
A predictive model of discrepant patient evaluation was trained and tested in a cross-validation with 80:20 split setting.We reported classification accuracy when McNemar's test indicated no significant bias in assessments due to imbalance in the data.We also reported the Area Under the Curve (AUC).
We evaluated the classification performances using two different algorithms: 1) a random forest (RF) algorithm (20) from the "caret" package after recursive feature elimination (21) and 2) a deep learning (DL) algorithm from the "h2o" package (22) after grid search.CIs were computed for AUC, accuracy (Acc), sensitivity (Se), specificity (Sp), positive predictive value (PPV), and negative predictive value (NPV) using the bootstrap method from the "DescTools" package.RECIST 1.1 inter-reader variability at baseline leading to endpoint discrepancy.The patient is presented at baseline with involvement of several mediastinal and hilar lymph nodes.Both readers followed RECIST 1.1 guidelines and accurately selected 2 large lymph nodes for measurement without any errors.During the follow-up period, both readers observed a partial response, with the first response documented in week 9.However, at week 18, the readers disagreed on the assessment of disease progression.for PDD, DOPD, BOR, DOFR, respectively.When combining the data from all five trials, a multiple comparison test showed that there were significant differences between the two KoDs related to time (DOPD and DOFR) and the two KoDs related to the event (PDD and BOR).The discrepancy rate for PDD was lower than for the other KoDs.

Discrepancy rates
Figure 3 displays the discrepancy rates over time for the five trials and the different KoDs, showing that, most of the time, DOFR was higher than DOPD.We observed that the rates of all KoDs increased as the trial progressed, even long after the completion of patient accrual.The KoD curves did not always feature smooth variations, and the curves of accrual displayed different shapes.It seems that a significant patient recruitment immediately from the start of the study will guarantee early meaningful KoD curves.
As depicted in Figure 4, the ratio of KoD occurrence and the proportion of patients remaining at this time point followed a steady downward trend over time.At the outset of follow-up, proportionally more DOFR and BOR occurred than DOPD and PDD.Additionally, Figure 4 demonstrates that the decrease in the proportion of KoD occurrence had distinct patterns.For some trials (Trials 1, 2, and 5), the decrease had a tendency similar to the proportion of patients still present at this time, while for others, this was not the case (Trials 3 and 4).
In Figure 5, we present the probability of KoDs occurring in relation to consecutive time points.In Figure 5, we can see that the probability of discrepancies occurring with DOFR was significantly higher at the beginning of follow-up, but became closer to the other KoDs at the following time point.
After 6 months, the probabilities of Trial 3 and 4 KoDs were generally less than 10%.They were higher for other trials.DOFR was the most likely KoD in the initial part of the patient evaluation, and the ordering of KoDs was stable when considering phase 3 studies i.e., DOFR > DOPD > BOR > PDD.
Regarding progression-related KoDs, the probability of PDD at the patient level was quite stable over cycles while the probability of DOPD tended to decrease over cycles.PDD was also globally lower than the KoDs of response at the beginning of evaluations.Proportion of discrepancies as distributed during follow-ups.For each trial and KoD, we computed the proportion of discrepancies (Equation 3) occurring at each follow-up.The DOPD, PDD, DOFR, and BOR KoDs are displayed in red, orange, blue, and green, respectively.The black curve displays the proportion of patients evaluated at this time point.BOR, best overall response; DOFR, date of first response; DOPD, declaration of progressive disease; KoD, kind of inter-reader discrepancy; PDD, progressive disease declared.

Evaluation of risks factors derived from baseline discrepancies
We found that for a single trial (Trial 3), one reader did not identify any disease in two patients out of 243 (0.8%), without any evidence that there was a significant risk factor of discrepancy, particularly for DOPD (p=0.68) or DOFR (p=0.1).
Except for Trial 3, classification of disease as non-measurable by at least one reader occurred in less than 7% of patients.It was very common for readers to select some of their TLs and NTLs in different organs, ranging from 36.0-57.9% and 60.5-73.5% of patients, respectively, but rarely (4.0-8.1%) did the two readers select all of their TLs in different locations.
We did not identify any set of risk factors which was relevant to all trials for a given KoD.
As summarized in Table 4, significant risk factors varied depending on the KoD and the trial examined.Analysis of pooled data revealed significant risk factors associated with BOR, whereas DOPD had a single risk factor (non-measurable disease reported by one reader) that could be identified at baseline.PDD had none.One or more readers choosing an infrequent disease location was not a risk factor for discrepancy.

Prediction of discrepancy derived from baseline evaluations
Our risk factors analysis showed that progression-related KoDs were marginally impacted by baseline evaluation.Therefore, our evaluation of predictive models focused mainly on the responserelated KoDs of DOFR and BOR.
Based on our features set ( 14), and a pre-processing of feature selection for RF algorithm, classification performances for response-related KoDs are summarized in Table 5.For each independent clinical trial or in pooled data, feature selection did not improve classification performances.Using the validation dataset of pooled data, DL outperformed the RF algorithm.Performances with DL were poor but with a PPV higher than 80% for all the KoDs.The best classification performances were obtained with BOR based on AUC.For DOPD, AUCs was 57.3 [57.1; 57.5].

Discussion of our results
At completion of trials, discrepancy rates based on DOPD or DOFR were comparable across trials with respective average values of 41.0% and 48.8%.Over time, the rates of KoDs steadily increased as the trials progressed, even after the end of patient accrual.The discrepancy rates for the time-related endpoints, DOPD and DOFR, were always higher than their event-related PDD and BOR counterparts.A higher proportion for DOPD was expected as the counting of DOPD occurrences encompassed those of PDD.Translated into clinical study endpoints, these observations mean that discrepancy rates for overall response rate are generally higher than for PFS.
Assuming part of the discrepancies was attributable to a delayed event detection by one of the readers, it could be expected that the proportion of DOFR would be higher than DOPD at an earlier stage, as progressive patients were withdrawn from the trial; the only room for patient response was then before progression.
As Figures 3-5 show, the number of recruited patients, PatIncl (t), at each time point is an operational-related function that can significantly vary between trials and is difficult to predict.The survival curve, Surv(t) (and the PFS curve), however, is dependent on the drug and/or disease, and can be predicted to some degree.The likelihood of a discrepancy occurring at each time point is more of a measure of the reading process, which is partially dependent on the readers' abilities and the complexity of the observed disease.To predict the rate of discrepancies over time, one must be able to completely regulate the trials, the drugs, and the readers' performances.The mathematical formulation of the temporal discrepancy rate (Equation 1) helps us understand the intricacies of trials by breaking down the components that interact with one another.When the rate of discrepancies increases, it can be difficult to determine if the cause is a pattern of patient recruitment or if readers are simply more prone to making mistakes.
Regarding our risk factor analysis, we found that at least one reader not detecting disease when it was likely present was very rare and not a major concern.Selecting non-measurable disease (i.e., NTLs only) was more prevalent and was considered a significant risk when pooling the five trials.
The analysis of discrepancies between DOPD and DOFR showed that readers often selected TLs (43.5% of patients) or NTLs (65.3%) in different organs without being critical of discrepancies.When all the data were pooled, most of the risk factors were significant for BOR.Selecting non-measurable disease was the only risk factor for DOPD discrepancies.The selection of infrequent disease by any reader was not a risk factor regardless of the KoD.
Feature selection did not improve RF classification performance.DL slightly outperformed the RF algorithm but with globally poor classification performance.Best performances (AUC based) were reached in detecting BOR (61.9).

Discussion around the literature 4.2.1 Discrepancy rate
The discrepancy rates we found at end of trial agree with the literature (8,23).Considering the naïve assumption of equiprobability between the four RECIST classes of response, the ranking of our measured discrepancies rates DR(DOFR) > DR ( D O P D ) a r e c o n s i s t e n t w i t h t h e b a s i c r u l e s o f combinatory probabilities.

Discrepancy timing
Most discrepancies occurring earlier for DOFR than for DOPD can be explained by the fact that a genuine response can only take place before a progression (except for pseudo progressions), because patients are withdrawn from trials after a PD.
As we observed in Figure 2, the ranking of KoD rates could change over time (crossing curves).This indicates that the probability of inter-reader discrepancy is not a stationary process in time (24).This factor should be considered when it comes to improving the modeling of trial monitoring, as illustrated by Equation 1, and when designing new metrics.The probability of PDD at each cycle seemed the most stable and lower than other KoDs.The NTL category only determines three events: CR, PD, and stability (Not CR/Not PD).The PR event is only determined by measurable disease.In support of this finding, Raskin et al. (25) showed that NTLs were an important factor for detecting PD, while Park et al. (26) revealed that selecting metastasis-only lesions as TLs may be more effective for determining response in kidney disease.
In our observations, for more than 5% of patients, at least one reader did not detect any measurable disease, so only NTLs were selected.Under these conditions, it is not surprising to observe a significant correlation with the occurrence of a response-related KoD (Table 4).
Moreover, the NTL category theoretically includes less defined and smaller lesions, making them more equivocal during the first evaluation of the disease.The high prevalence of selected NTLs in different organs (65% of patients) reflects this uncertainty when capturing the disease at baseline.Lheureux et al. ( 27) developed a comprehensive discussion about the equivocality associated with RECIST, which is responsible for concerns related to its reliability.Moreover, during follow-up, due to the "under-representation" of this category and the qualitative appreciation of non-measurable disease, RECIST recommends interpreting progression of NTLs by considering the entire disease.Indeed, it is rare to observe a PD event triggered solely by the NTL category.This is reported during paradoxical progression in approximately 10% of progression cases (28).Finally, since the CR event is quite rare in our patients in advanced clinical stage, the influence of a difference in the appreciation of the non-measurable disease ultimately presents little risk in terms of variability on the study's endpoints.Cases that are ultimately more at risk relate to patients with a low measurable tumor mass compared to the non-measurable disease.However, the detection of these cases is very difficult in the absence of quantification of NTLs [see scenario F in the supplementary appendix of Seymour et al. (29)].

TL selection
We showed that readers selected TLs in different organs in 36.0% to 57.9% of patients, with no association with DOPD or DOFR discrepancies and poor association with BOR.
In the study by Keil et al. (12), for 39% of patients, readers had chosen different TLs, demonstrating a strong association with DOPD.Keil

Sum of diameters value
Our study confirmed findings by Sharma et al. (31) who concluded that there was an association of the variability of SOD at baseline with the variability of the study endpoint.However, our percentage of specific sum of diameters (SPropSOD) analysis did not confirm or contradict other works about dissociated responses (28,32), probably because this phenomenon is reportedly observed in only around 10% of cases (28).

Location (TL lung & infrequent)
Regarding discrepancies in targeting the most frequent location, this was poorly associated with variability of responses.We considered five primary lung cancer trials, but for 17% of the patients, at least one reader did not select any lung TL.
Regarding discrepancies in identifying disease (TL or NTL) in infrequent locations, surprisingly we found no risk factors associated, although some authors discuss the controversial use of RECIST outside the most targeted disease locations (33).

Progression-related KoDs
Several studies (34, 35) have documented that more than half of discrepancies in reporting progressing disease are triggered by debatable detection of new lesions.Thus, they are not concerned with baseline evaluations, preventing prediction from baseline.Unlike Keil et al. (12), we did not measure a systematic risk factor associated with TL selection.The only exception was in a specific trial (Trial 4), when readers selected all of their TLs from completely different organs.

Classifications
According to Corso et al. (20), feature selection does not improve classification performances.For all the KoDs, DL outperformed RF.Overall, classification performance was low.Specifically regarding detection of DOPD, our findings were consistent with studies that reported that the majority of DOPD discrepancies were due to the misdetection of new lesions (34), which is not linked to baseline.The best classification performances were obtained for BOR, albeit with poor performances.
Even though certain studies suggest that baseline selection and measurement have considerable influence on the accuracy of response assessment (12,13), we found that, while they existed, these correlations were weak and heterogeneous.Therefore, additional root causes of variability could be studied, such as follow-up management (e.g., measurement variability of tumor burden or perception of NTL change), readers' associations, or fluctuating readers' perception (35).We also need to investigate the gap between selecting "exactly the same TLs" and "TLs in same organs", as the first definition is reported to have a strong association with variability in literature (36), while we found the second has a weak association.If strong heterogeneity within the same organ is confirmed, the issue of RECIST is no longer its subjectivity but its intrinsic inappropriateness in assessing patient follow-up.

Study limitations
First, we did not document some well-known risk factors for variability linked to image quality, such as variability due to reconstruction parameters (different image selection) or the timing in IV contrast injection.Indeed, we assumed that the imaging charter of the included trial adequately standardized the images so that these risks associated with these factors would be negligible.
Second, we did not analyze the impact of the selection of the same TLs by readers.When designing our study, we did not expect that collecting tumor coordinates would be of interest but assumed that checking the association with TLs in the same organ would be adequate.
Third, we did not investigate the impact of the inter-reader variability in assessing the "measurability" of tumor.Indeed, when a first reader considers a tumor as measurable and candidate to be TL included in his tumor burden, while the second reader considers the same tumor as NTL, the two readers would have a different in sensitivity at detecting responses.
Fourth, our study focused on lung data, therefore cannot be generalized to other diseases.For some other types of cancers CT is not the preferred modality, the disease spread differently, and the tumors feature different phenotypes, thus risk factors and training performances would be different.
Lastly, as we were blinded as to randomization, we were not able to refine our analysis by treatment/control.However, we can assume that KoD statistics and association with variability is different for treatment and control as those statistics are directly linked to the occurrence of the events of response or progression, supposedly different for the two arms.

Perspective on operations
Throughout the initial cycles of BICR with double reads, the rate of discrepancies is hard to analyze without a benchmark to refer to.This leads researchers to search for other key performance indicators to assess trial reliability and, eventually, take corrective measures as expeditiously as possible.
Even though we were not able to make powerful predictions, our analysis of the baseline data revealed that some inter-reader differences can affect the reliability of trial results.A tracking of baseline assessment during BICR could be a beneficial addition to trial quality control.
The poor predictive performances were probably, in part, obtained because a predictive model cannot avoid including follow up data.Another hypothesis could be that, based only on baseline data, the risk factors, and the features we considered were unable to fully capture the complexity of disagreements.In the future, we can imagine improving the features derived from tumor selection or even creating new ones; considering other risk factors in the classification, such as variability in scan selection or variabilities in involving the first follow-up and optimizing readers' associations.
However, an open question will remain: "How can we manage a baseline assessment with a high probability of becoming discrepant?".If future technologies can predict discrepancies, it is likely to be a prediction with limited explicability.Moreover, we can question whether the 2 + 1 adjudication paradigm is obsolete, as choosing between two medically justifiable differences (2) does not make sense.

Conclusion
For the discordance rate to be predictable over time, at each time point, we need to know patient accrual, patient survival, and probability of discrepancy.Discrepancies in date of responses occurred more often than those related to progressions.Careful thought should be given to corrective actions based on the analysis of KoD rates if less than 50% of patients have been enrolled.
Several risk factors for inter-reader discrepancies have been confirmed, albeit with relatively weak implications.We found that for around 50% of patients, readers chose tumors in different organs without impacting the variability of responses.The prediction performances of inter-reader discrepancies based on the baseline selection were poor.Baseline-derived features should be improved or new ones designed, other risks factors must be considered for predicting discordances, and optimal reader association must be investigated.

Take home messages
1) The discrepancy rate over time depends on patient accrual, PFS, and the probability of discrepancy during follow-up.
2) At the outset of follow-up, proportionally more DOFR and BOR occurs than DOPD and PDD.
3) The KoD rates are not stabilized in the first half period of total patient inclusion and should be interpreted carefully.
Baseline variability evaluation can help to determine risk of study endpoint variability.
4) The inter-reader variability in disease selection at baseline is frequent (50%).The general impact on the variability is more significant for response endpoints.
administration, original draft, review, and editing.Both authors contributed to the article and approved the submitted version.

FIGURE 3
FIGURE 3 The discrepancy rate over time.The discrepancy rates for the four KoDs are displayed along with the proportion of accrued patients.Curves for DOPD, PDD, DOFR, and BOR KoDs are displayed in red, orange, blue, and green, respectively, with corresponding 95% CIs.The black curve is the cumulative proportion of accrued patients.The five trials are represented from top left to bottom right: a) Trial 1, b) Trial 2, c) Trial 3, d) Trial 4, and e) Trial 5 Time scale is one month.

FIGURE 5
FIGURE 5 Probability of discrepancy over patients follow-up.The probability of discrepancy for the four KoDs is displayed.Probabilities of DOPD, PDD, DOFR, and BOR KoD occurrence are displayed in red, orange, blue, and green, respectively, with corresponding 95% CIs.The five trials are represented from top left to bottom right: a) Trial 1, b) Trial 2, c) Trial 3, d) Trial 4, and e) Trial 5. BOR, best overall response; CI, confidence interval; DOFR, date of first response; DOPD, declaration of progressive disease; KoD, kind of inter-reader discrepancy; PDD, progressive disease declared.

TABLE 2
Example of KoDs.

TABLE 1
Description of included trials.

TABLE 3
Discrepancy rates at end of trial.For the five clinical trials (rows), we reported, per patient, the double read discrepancy rates as percentages (raw number are in parenthesis).These were computed for each KoD (column): 1) discrepant PDD; 2) discrepant DOPD; 3) discrepant BOR; and 4) discrepant DOFR.BOR, best overall response; DOFR, date of first response; DOPD, declaration of progressive disease; PDD, progressive disease declared.

TABLE 4
Risks factors and occurrence of discrepancy.
For the five clinical trials (rows) and the different risk factors (columns), we reported the KoDs representing a significant risk: DOPD (red), PDD (Orange), DOFR (blue), BOR (green).In parentheses of corresponding colors are the values of the ODDs.In black parentheses are the percentages of patients with the potential risk factors.ODDs derived from Delta Burden and SPropSOD (Annex C) used optimized thresholds, so represented best cases.BOR, best overall response; DOFR, date of first response; DOPD, declaration of progressive disease; NTL, non-target lesion; PDD, progressive disease declared; SPropSOD, percentage of specific sum of diameters; TL, target lesion.

TABLE 5
Prediction of response KoDs derived from baseline features.
et al. had different study inclusion criteria, considering breast cancer, a single follow-up, no new target and no NTLs.Their mean number of TLs was 1.8 (2.3 in our study).Keil et al. adopted a strict definition of "same TLs" as those with the same coordinates, whereas ours was for those chosen in the same organ.Kuhl et al. (30) reported higher discrepancy rates than us (27% for DOPD).Readers selected different sets of TLs in 60% of patients with even a stronger association with readers' disagreement than Keil et al.Kuhl et al. adopted an even stricter definition of concordant selection than Keil et al.Kuhl et al. also included a broad spectrum of primary cancers.