Considering the Validity of the SAIL Trial: A Navel Gazer's Guide to the SAIL Trial

This article explores the validity of the Sustained Aeration of Infant Lungs (SAIL) randomized controlled clinical trial. The SAIL trial enrolled 460 of a planned 600 infants before it was stopped early for harm. We ask whether there were any threats to validity in the trial as conducted. We then explore which design elements of the trial could have been improved. Finally, we consider the implications for future trials in this arena. Clinical Trial Registration: www.clinicaltrials.gov, Identifier: NCT02139800


BACKGROUND
The Sustained Aeration of Infant Lungs (SAIL) trial asked the PICOT question: "In preterm infants born at 23+0 to 26+6 weeks gestation, who required resuscitation, did the randomized intervention of sustained inflation (SI) followed by nCPAP, as compared to standard resuscitation (nCPAP followed if needed by intubation) reduce the primary outcome of BPD or death by 36 weeks?" (1). This was an appropriate question following extensive animal work (2,3) and some human research suggesting a potential benefit of SI (4-8). Two systematic reviews of the human studies concluded that a larger trial was needed (9,10). Various authoritative bodies also recommended performing larger trials before SI could be considered standard practice (11,12).
At the 18 participating SAIL sites in 9 countries, two procedures were used to obtain consent. At 16 sites, women likely to deliver within the gestational age window were approached for antenatal consent, unless delivery was imminent. With IRB approval, 4 of these sites endorsed a deferred consent process if there was insufficient time for antenatal consent or if the maternal condition made antenatal consent inappropriate; the IRBs of 2 sites approved deferred consent only (1). Details can be found in the report (1) and the study protocol (13).
As we reported (1), the trial was closed early because of harm (excess early mortality) in the SI group and a very low probability of benefit with respect to the primary outcome if enrollment continued. Prior to trial start, the Data Safety Monitoring Board, the trial sponsors (NICHD), and the executive members of the trial agreed on stopping rules for efficacy. In addition, an extensive early reporting system for harms was initiated. Early deaths were defined as those occurring within the first 48 h of life. Monitoring of this and other adverse outcomes was pre-planned.
The trial was stopped after the enrolment of 460 of the planned sample size of 600 infants. Our abstract reports that: "Among 460 infants randomized (mean [...] 13.1%]; P = 0.29). Death at <48 h of age occurred in 16 infants (7.4%) in the sustained inflation group vs. 3 infants (1.4%) in the standard resuscitation group (aRD, 5.6% [95% CI, 2.1-9.1%]; P = 0.002). Blinded adjudication detected an imbalance of rates of early death possibly attributable to resuscitation (sustained inflation: 11/16; standard resuscitation: 1/3). Of 27 secondary efficacy outcomes assessed by 36 weeks' postmenstrual age, 26 showed no significant difference between groups" (1). Moreover, this excess of early deaths was most prominent in the pre-specified stratum containing the smallest and most immature infants. Of 16 early deaths in the sustained inflation group, 11 were in the 23- to 24-week stratum; of 3 early deaths in the standard resuscitation group, 2 were in the 23- to 24-week stratum.
When the Sustained Aeration of Infant Lungs (SAIL) trial was closed for harm in January 2018, the closure was followed by considerable soul searching among the investigators. Could the results have been anticipated? Should the study have been prematurely closed? How were the results to be interpreted? Following publication, these questions were quickly echoed by the neonatal community. In this paper, we address two questions: (i) How valid were the results of the SAIL trial, and thus how confident can we be about these results? (ii) Could the design of the trial have been improved? As a corollary, we also ask what the implications are for future study design.

The Validity of the SAIL Trial
The validity, or "believability", of a trial is composed of two parts: internal and external validity (14). Internal validity asks how well a trial was conducted, that is, whether it ensured appropriate safeguards against bias, or systematic deviations from the truth (15). In particular, were randomization and trial methods adequate to minimize threats of bias, which might render baseline risk factors between groups unequal (16)? External validity addresses whether the results are generalizable to the wider population, and asks "Do I as a clinician recognize these infants, and could I carry out the maneuver of interest in these infants?"
Internal validity, judged by scrutiny of demographic baseline variables, appears to carry no threats of bias. In particular, there were no imbalances of gender, birth weights, or receipt of antenatal corticosteroids (1). However, one center exclusively used a deferred consent process. Overall, 34 randomized infants from the deferred consent group were subsequently excluded because of lack of parental consent. There was no imbalance by group in those withdrawing consent: the CONSORT diagram (1) depicts 21/114 in the SI group vs. 13/111 in the standard resuscitation group [unadjusted RD 13.1% (−4.7, 30.9); P = 0.19 by Fisher exact]. Nonetheless, early stopping did constitute a possible threat to internal validity (17). Further potential concerns regarding validity arise because we did not monitor, at the bedside, adherence to the SI ranges chosen or the effectiveness with which the maneuver was performed. However, this would have been extremely expensive to do, and video recording would not have been accepted by some IRBs. Moreover, an extensive training programme was undertaken. Finally, results analyzed by geographical region showed no heterogeneity, i.e., there was no evidence of a differential effect on the primary outcome between sites where SI had been standard practice (Europe) and previously naïve sites (North America and Australia).
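For readers who wish to check the consent-withdrawal comparison quoted above (21/114 vs. 13/111), the following is a minimal sketch of a two-sided Fisher exact test written with the Python standard library only. The function name `fisher_exact_two_sided` and the variable names are illustrative choices of ours, not part of the trial's analysis code, and the two-sided P value is defined here, by assumption, as the sum of the probabilities of all tables no more likely than the observed one. The result can be compared with the P value reported in the text.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact P value for the 2x2 table [[a, b], [c, d]].

    Rows are groups; the first column counts events (here, withdrawals).
    """
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of x events in the first row
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = prob(a)
    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    # Sum every table that is no more likely than the observed one;
    # the small tolerance guards against floating-point ties.
    return sum(
        p for p in (prob(x) for x in range(lowest, highest + 1))
        if p <= p_obs * (1 + 1e-9)
    )


# Counts as quoted from the CONSORT diagram in the text
si_withdrawn, si_total = 21, 114
std_withdrawn, std_total = 13, 111

p_value = fisher_exact_two_sided(
    si_withdrawn, si_total - si_withdrawn,
    std_withdrawn, std_total - std_withdrawn,
)
print(f"Two-sided Fisher exact P = {p_value:.2f}")
```

An exact test is a natural choice for a table of this size because it does not rely on a large-sample approximation.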
External validity appears satisfactory, since the population of 23-26 week gestation infants is likely to be the same across well-resourced countries. However, the SAIL inclusion criteria specified: "Infants between 23 weeks 0 days' and 26 weeks 6 days' gestational age were eligible if they required positive pressure resuscitation because of inadequate respiratory effort or a heart rate <100 beats per minute (bpm)." This could be termed a "rescue" approach, and differs from other studies that enrolled all very small preterm infants, where a "prophylactic" approach was used (5). This may explain the lack of benefit from SI in the SAIL trial and is consistent with the observation in a rabbit model that effective sustained inflation requires an open glottis (18). In addition, some human studies suggest that a gain in Functional Residual Capacity is only seen when the infant is actively breathing (19,20). Nonetheless, in the SAIL CONSORT diagram, of 546 infants assessed for eligibility, only 86 were excluded as ineligible on delivery, of whom 70 had adequate respiration. Therefore, most infants in this GA category [mean (SD) 25.3 (0.97) weeks] were not judged to have adequate respirations at birth. This contrasts with video observations in which most preterm infants showed spontaneous respirations (21). However, those infants were more mature [mean (SD) GA of 26 (2) weeks]. The most immature infants in SAIL were more vulnerable, such that most of the deaths were in the 23-24 week gestation stratum. One implication of this finding is that caution should always be exercised in extrapolating findings from a more mature population to infants at the borderline of viability, who may respond differently or be more vulnerable to adverse effects of an intervention.
One other feature of external validity is whether the results make clinical sense. While a difference in rates of early death was evident, the causes of these early deaths were adjudicated by the site investigators, and there was no mandated post-mortem. No clear causal pathways mediating this excess of early death are obvious. Prior to conducting the trial, either pneumothorax or excess intraventricular hemorrhage would have been amongst the predicted causal pathways, but neither was elevated in the SI group (1). While this is intellectually unsatisfactory, it is counterweighed by the undoubted numerical excess of death in the first 48 h of life.
Even after rumination on these issues, observers may not agree. Perhaps all would agree that the most robust assessment of the validity of the findings of harm would be replication in a similar large trial or, failing that, secondary confirmation from pooled data of gestational age subgroups. Such a study has been performed by Foglia and colleagues, and is currently under review.

Trial Design Implications
We suggest that there are at least three potentially actionable implications arising from the above considerations: (i) The increasing awareness of extreme vulnerability at lower GAs argues strongly for more attention to gestational age stratification in future studies. However, to avoid further frustrations, comparisons within strata should be adequately powered to address substantive questions. (ii) DSMBs should pre-specify statistical significance levels not only for efficacy, but also for harm. While mortality in the first 48 h was a pre-specified safety outcome reported to the DSMB, stopping rules were not defined for all reportable safety outcomes. Most of the investigating team believe that the trial should have been stopped, because ultimately the imbalance in early mortality rates represented a true and important difference in treatment effect rather than a chance occurrence. However, pre-specifying the statistical significance levels that would trigger early stopping for defined safety outcomes might have made the process easier. (iii) Death is a common outcome in these vulnerable infants. Perhaps inadequate attention has been paid to the causes of death. Certainly one commentator has observed this to be a relatively important failing of recent neonatal trials (22).
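To make point (ii) concrete, the sketch below shows one way a pre-specified stopping rule for harm could be written down before a trial starts. It is an illustration only and is not the rule used by the SAIL DSMB: the function names, the threshold `ALPHA_HARM = 0.01`, and the interim counts are all hypothetical assumptions of ours, and the sketch makes no adjustment for repeated interim looks, which a real monitoring plan would need to address.

```python
from math import comb


def one_sided_fisher(events_trt, n_trt, events_ctl, n_ctl):
    """Probability of at least `events_trt` events in the treatment arm,
    given the total number of events, if the arms do not truly differ."""
    total_events = events_trt + events_ctl
    denom = comb(n_trt + n_ctl, total_events)
    upper = min(n_trt, total_events)
    return sum(
        comb(n_trt, x) * comb(n_ctl, total_events - x)
        for x in range(events_trt, upper + 1)
    ) / denom


def harm_boundary_crossed(events_trt, n_trt, events_ctl, n_ctl, alpha_harm):
    """True if the excess of events in the treatment arm crosses the
    pre-specified significance level for harm."""
    return one_sided_fisher(events_trt, n_trt, events_ctl, n_ctl) < alpha_harm


# HYPOTHETICAL threshold and interim counts, for illustration only
ALPHA_HARM = 0.01
look_1 = harm_boundary_crossed(8, 100, 1, 100, ALPHA_HARM)
look_2 = harm_boundary_crossed(10, 100, 1, 100, ALPHA_HARM)
print("Boundary crossed at hypothetical look 1:", look_1)
print("Boundary crossed at hypothetical look 2:", look_2)
```

Writing the rule as an explicit function of the observed counts and a fixed threshold means that the decision to stop does not depend on a judgment made after the interim data have been seen.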

CONCLUSION
We are collectively still learning how best to perform trials in the delivery room. But the SAIL trial has taught us some important lessons, and we believe these should inform the next generation of newborn resuscitation trials. We suggest that, as sustained inflation was applied and dosed in the SAIL trial, the human experience differs from the animal data. To this extent, the maneuver of sustained inflation is another example where human and animal results of therapy differ (23). However, this is not equivalent to rejecting the concepts underlying sustained inflation, involving inflationary opening of the lung. New trials are underway to evaluate incremental PEEP in the delivery room (24). We are not aware of any trials being undertaken in which an SI is delivered via an endotracheal tube, but since this is the modality used in the animal studies, such a paradigm shift may be indicated. In any case, the principle of using large randomized trials to assess delivery room interventions should not be discarded, despite their difficulty.

DATA AVAILABILITY STATEMENT
No datasets were generated or analyzed for this paper. However, the SAIL datasets will be available after a follow-up phase is completed, subject to NICHD regulations and approval by the study executives.