Predicting RTS,S Vaccine-Mediated Protection from Transcriptomes in a Malaria-Challenge Clinical Trial

The RTS,S candidate malaria vaccine can protect against controlled human malaria infection (CHMI), but how protection is achieved remains unclear. Here, we analyzed longitudinal peripheral blood transcriptome and immunogenicity data from a clinical efficacy trial in which healthy adults received three RTS,S doses 4 weeks apart followed by CHMI 2 weeks later. Multiway partial least squares discriminant analysis (N-PLS-DA) of transcriptome data identified 110 genes that could be used in predictive models of protection. Among the 110 genes, 42 had known immune-related functions, including 29 that were related to the NF-κB-signaling pathway and 14 to the IFN-γ-signaling pathway. Post-dose 3 serum IFN-γ concentrations were also correlated with protection, and N-PLS-DA of IFN-γ-signaling pathway transcriptome data selected almost all (44/45) of the representative genes for predictive models of protection. Hence, the identification of the NF-κB and IFN-γ pathways provides further insight into how vaccine-mediated protection may be achieved.


N-PLS-DA METHODOLOGICAL DETAILS
The transcriptome data set was represented as a multiway data set (subject × probe set [gene] × time). In the N-PLS-DA, the data were transformed into a series of components (similar to principal component analysis [PCA]; Jackson, 1991; Jolliffe, 2002), where the first component encapsulates the most variation in the total data set that correlates with controlled human malaria infection (CHMI) outcome, and the subsequent components encapsulate progressively less variation. Together, these components describe variations in the data set, which were encapsulated in predictive mathematical models. Hence, the kinetics of the changes induced by the vaccination were captured explicitly in each of the mathematical models.
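As an illustration of the multiway representation, the following minimal sketch extracts a single trilinear N-PLS component from a subject × probe set × time array. It is not the implementation used in the study: the function name `npls_first_component`, the simulated data, the array sizes and the 0/1 coding of CHMI outcome are all assumptions made for the example, and only the first component is computed.

```python
import numpy as np

def npls_first_component(X, y):
    """First trilinear N-PLS component (illustrative sketch only).

    X : array of shape (subjects, probe sets, time points)
    y : array of shape (subjects,), outcome coded 0/1 (assumed coding)
    """
    Xc = X - X.mean(axis=0)   # center each probe set/time cell across subjects
    yc = y - y.mean()         # center the outcome
    # Covariance of every (probe set, time point) cell with the outcome
    Z = np.einsum("i,ijk->jk", yc, Xc)
    # Leading singular vectors give one weight vector per mode
    U, s, Vt = np.linalg.svd(Z, full_matrices=False)
    w_probe, w_time = U[:, 0], Vt[0, :]
    # Subject scores: project each subject's probe-by-time slab on both weights
    t = np.einsum("ijk,j,k->i", Xc, w_probe, w_time)
    return t, w_probe, w_time

# Simulated data (hypothetical sizes, not the trial data)
rng = np.random.default_rng(0)
n_subjects, n_probes, n_times = 20, 50, 5
y = np.array([1.0] * 10 + [0.0] * 10)   # 1 = protected (assumed coding)
X = rng.normal(size=(n_subjects, n_probes, n_times))
X[:10, :5, 2:] += 2.0                   # late response in five probe sets

t, w_probe, w_time = npls_first_component(X, y)
```

Because the component carries a separate weight vector for the time mode, the shape of the response over time enters the model directly, which is the sense in which the kinetics are captured explicitly.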
Each mathematical model was generated through the iterative selection of probe sets and the selection of the minimal number of components required from the transformed data set to achieve optimal model performance. Model performance was evaluated by a double cross validation (DCV) approach. DCV resulted in 10 collections (ensembles) of 10 models of correlation, yielding a total of 100 individual models, with performance statistics. The difference in model performance was identified using the DQ² statistic. This statistic is based on a least-squares method for analyzing the difference between prediction and CHMI outcome (Westerhuis et al., 2008) and was more discriminatory than using the fraction of correctly classified outcomes. The consideration of two or three components was typically sufficient for optimal prediction performance. Each model typically consisted of data from 2-40 probe sets, and optimal performance was typically observed after several rounds of probe set selection. Predictive performance was validated using label permutation. The worst, average and best model performance measures in a given ensemble of models were always higher than the most frequent performance measure generated by label permutation.
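To make the role of the DQ² statistic concrete, the sketch below contrasts it with the ordinary Q² on a toy set of predictions. It follows the general idea attributed to Westerhuis et al. (2008), in which a prediction that overshoots its own class label is not penalized. The 0/1 class coding, the function names and the example values are assumptions for illustration and are not taken from the study.

```python
import numpy as np

def q2(y_true, y_pred):
    """Ordinary Q2: 1 - PRESS / TSS."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    press = np.sum((y_true - y_pred) ** 2)
    tss = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - press / tss

def dq2(y_true, y_pred, low=0.0, high=1.0):
    """Discriminant Q2 (sketch): errors beyond the class label are ignored."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    resid = y_true - y_pred
    # A prediction above the high label (or below the low label) for a
    # sample of that class is on the correct side, so its error is dropped.
    beyond = ((y_true == high) & (y_pred > high)) | \
             ((y_true == low) & (y_pred < low))
    press_d = np.sum(np.where(beyond, 0.0, resid) ** 2)
    tss = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - press_d / tss

# Toy example: 1 = protected, 0 = non-protected (assumed coding)
outcome = [1, 1, 0, 0]
predicted = [1.4, 0.8, -0.3, 0.1]

print(round(q2(outcome, predicted), 2))   # 0.7
print(round(dq2(outcome, predicted), 2))  # 0.95
```

All four toy predictions fall on the correct side of a 0.5 threshold, so the fraction correctly classified is 1.0 whatever their distance from the labels. A continuous measure such as DQ² can still rank models that classify equally well.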

SUPPLEMENTARY TABLE 1
Genes/probe sets selected by the data-driven modeling.
Columns: Gene; Probe set ID; Frequency of use in models; Cluster.
b For genes with more than one probe set, the data from the probe set that was most frequently represented in the models were considered as the representative data for that gene in the manuscript.

SUPPLEMENTARY TABLE 3
Genes/probe sets used in the IFN-driven modeling.
Columns: Gene; Probe set ID; Frequency of use in models; Cluster.
b For genes with more than one probe set, the data from the probe set that was most frequently represented in the models were considered as the representative data for that gene in the manuscript.
c Genes/probe sets that were not selected by the modeling process.

SUPPLEMENTARY FIGURE 2
The evaluation of IFN-pathway gene expression for a potential microarray-batch effect using the validation-transcriptome data set.

Unlike the principal transcriptome data set, the validation transcriptome data set was generated from a single kit of microarrays (Vahey et al., 2010). Mean RNA-expression levels relative to pre-dose 1 (prePI) are shown at pre-dose 3 (prePIII) and at 1, 3 and 14 days after dose 3 (1dPIII, 3dPIII and 14dPIII, respectively), with respect to the protection status of subjects (protected [PR], non-protected [NP] and non-protected with delayed parasitemia [DL]), for each of the four clusters (A-D) of probe sets among the 116 probe sets (110 genes) identified by the data-driven model. The error bars indicate the standard error of the mean (SEM). Also, simulated modeling suggested that such a batch effect (which may have confounded the identification of protection status at 14dPIII) would have been mitigated by the N-PLS-DA because data from several time points were included (not shown).