Multi-Omic Data Integration Allows Baseline Immune Signatures to Predict Hepatitis B Vaccine Response in a Small Cohort

Background Vaccination remains one of the most effective means of reducing the burden of infectious diseases globally. Improving our understanding of the molecular basis for effective vaccine response is of paramount importance if we are to ensure the success of future vaccine development efforts. Methods We applied cutting edge multi-omics approaches to extensively characterize temporal molecular responses following vaccination with hepatitis B virus (HBV) vaccine. Data were integrated across cellular, epigenomic, transcriptomic, proteomic, and fecal microbiome profiles, and correlated to final HBV antibody titres. Results Using both an unsupervised molecular-interaction network integration method (NetworkAnalyst) and a data-driven integration approach (DIABLO), we uncovered baseline molecular patterns and pathways associated with more effective vaccine responses to HBV. Biological associations were unravelled, with signalling pathways such as JAK-STAT and interleukin signalling, Toll-like receptor cascades, interferon signalling, and Th17 cell differentiation emerging as important pre-vaccination modulators of response. Conclusion This study provides further evidence that baseline cellular and molecular characteristics of an individual’s immune system influence vaccine responses, and highlights the utility of integrating information across many parallel molecular datasets.


Table_S2_CpG_day_analysis.xlsx
Differentially methylated sites identified in epigenetic analysis, comparing each post-vaccination time point to its pre-vaccine counterpart. Results were considered significant with p-value ≤ 0.005 and absolute effect size change of ≥ 5%.

Table_S3_mRNA_day14_vs_day0_DE_genes.xlsx
Differentially expressed genes found when comparing Day 14 (post-vaccination) to Day 0 (prevaccination) samples in a paired analysis. Filtered for adjusted p-value < 0.05 and absolute fold change > 1.5.

Table_S5_proteins_day180_responder_vs_non_DA_proteins.xlsx
Differentially abundant proteins identified from comparing responder and non-responder participants (defined using Day 180 titre measures) using pre-vaccine profiles (samples). Filtered for adjusted p-value < 0.05.

Table_S6_proteins_day180_responder_vs_non_pathways.xlsx
Enriched pathways identified by Sigora when tested on DA proteins for responders vs. nonresponders (Day 180 titre measures). Filtered for significance using Bonferroni-corrected p-value ≤ 0.001.

Table_S7_CpG_day180_titre_DM_sites.xlsx
Differentially methylated sites identified in epigenetic analysis, using Day 180 titres as a continuous variable. Results were considered significant with p-value ≤ 0.005 and absolute effect size change of ≥ 5%.

Table_S8_CpG_day180_titre_pathways.xlsx
Enriched pathways found using resources from Reactome.org using the differentially methylated sites correlated with Day 180 titre measures. Results were considered significant with p-value ≤ 0.005.

Table_S9_networkanalyst_CpG_mRNA_proteins_day180_response_pathways.xl sx
Enriched Reactome pathways identified by Sigora using nodes from integrated network of responders vs. non-responders (defined with Day 180 titre values), from epigenetics, transcriptomics, and proteomics results. Filtered for significance using Bonferroni-corrected pvalue < 0.001.

hways.xlsx
Enriched KEGG pathways identified by Sigora using nodes from integrated network of responders vs. non-responders (defined with Day 180 titre values), from epigenetics, transcriptomics, and proteomics results. Filtered for significance using Bonferroni-corrected pvalue < 0.001.

Table_S11_networkanalyst_mRNA_proteins_day180_response_pathways.xlsx
Enriched Reactome pathways identified by Sigora using nodes from integrated network of responders vs. non-responders (defined with Day 180 titre values), from transcriptomics and proteomics results. Filtered for significance using Bonferroni-corrected p-value < 0.001.

Table_S12_networkanalyst_mRNA_proteins_day180_response_KEGG_pathway s.xlsx
Enriched KEGG pathways identified by Sigora using nodes from integrated network of responders vs. non-responders (defined with Day 180 titre values), from transcriptomics and proteomics results. Filtered for significance using Bonferroni-corrected p-value < 0.001.

Table_S13_diablo_features_network_pathways.xlsx
Reactome pathways found to be enriched by Sigora for node table from NetworkAnalyst combining three feature sets from the output of DIABLO (epigenetics, proteomics, and transcriptomics). Filtered for significance using Bonferroni-corrected p-value < 0.001.

Table_S14_diablo_features_network_KEGG_pathways.xlsx
KEGG pathways found to be enriched by Sigora for node table from NetworkAnalyst combining three feature sets from the output of DIABLO (epigenetics, proteomics, and transcriptomics).

Immune cell phenotyping
Cell fixation was done for two main reasons. First, given that we have longitudinal samples, we wanted the longitudinal samples from the patient as well as the entire cohort to be analyzed together to minimize batch effects. Second, fixing allows us to keep the cells at a particular static state and allows us to make comparison between samples from participants across all cohorts.
After plasma collection, the remaining blood fraction was diluted twice in RPMI and kept on ice, retaining a total of 900uL of the mixture for downstream flow cytometry analysis. Smart tube lysing and fixing solutions were used while follow the manufacturer's recommendations, and as described in Lee et al, 2019 [3]. Briefly, heparinized blood (225uL) was combined with 20mM EDTA (22uL) in cryovials before mixing via gentle pipetting and the addition of 2uL of fixable viability dye (eBioscience catalog # 65-0865-14). This mixture was then incubated in the dark at 4°C, followed by red blood cell lysis via the addition of 350uL of Smart lysis buffer (SmartTube Stable-Lyse V2) and a 15-minute incubation period at room temperature. The stained cells were then fixed with 1mL Smart store buffer (SmartTube Stable-Store V2) for 15 minutes before being transferred to -80°C for storage.
Complete blood counts were undertaken by BC Children's Hospital clinical laboratory to enumerate total white blood cell (WBC) count, plus a differential count including neutrophils, lymphocytes and monocytes using a Sysmex XN-3000 Hematology Analyzer.
For flow cytometry performed on whole blood, fresh samples were stained with fixable viability dye and stored in Smart tube solutions (Smart tube, CA, USA) in -80C. The day of analysis, fixed samples were thawed and cells with a panel of antibodies recognizing anchor markers to identify neutrophils, dendritic cells (DCs), NK cells, monocytes, T cell subsets, basophils and B Supplementary Table S1. Flow cytometry data were acquired on a LSRII flow cytometer (BD Biosciences) and analyzed using Flowjo software (version 9.9). Automated gating was undertaken on the same data where a panel of algorithms were used to clean and gate on the cell populations of interest using a combination of markers selected and a set gating strategy. First, Flowcut (https://github.com/jmeskas/flowCut) was used to detect anomalous events that happened during data acquisition. Events were removed if they had the minimum or maximum value in any of the scatter channels, with singlets selected using FSC-A and -W channels. Next, for each sample, flowDensity (a supervised gating algorithm) [4] was used to determine the thresholds for each marker in each of the biaxial plots in a data-driven manner based on the density distribution of the cells (Supplementary Figure 3 for gating strategy). Some challenging populations (such as NK subtypes, pDCs and gamma delta T cells) required a second gating step using flowPeaks, an unsupervised clustering algorithm [5]. Finally, cell counts for each cell population identified were obtained and normalized using the bead count, yielding 24 immune cell populations. For additional information, refer to the companion methods paper: Systems biology methods applied to blood and tissue for a comprehensive analysis of immune response to hepatitis B vaccine in adults.

Whole blood RNA sequencing
To complete bulk RNA-Seq, RNA was extracted using the PAXGene RNA purification kit (QIAGEN) from whole blood collected in PAXGene tubes. Assessment of RNA quality and quantity was performed using the Agilent 2100 Bioanalyzer (Santa Clara, CA, USA). Polyadenylated RNA was isolated using the NEBNext Poly(A) mRNA Magnetic Isolation Module (NEB, Ipswich, MA, USA), and cDNA libraries were created using the KAPA Stranded RNA-Seq library preparation kit (Roche, Basel, Switzerland). Samples were sequenced in two batches on the HiSeq2500 with single-end reads of length 100bp. Sequence quality was assessed using FastQC v0.11.7 (https://www.bioinformatics.babraham.ac.uk/projects/fastqc/) and MultiQC v1.0 [6]. Alignment to the reference Ensembl genome GRCh38 v91 [7] was done using STAR v2.5.4b [8] to generate position-sorted BAM files. Counts were generated using the htseq-count function from HTSeq v0.10.0 [9]. Analysis of RNA-Seq data was conducted using R v3.6.3 (https://www.R-project.org/). Count files were filtered to remove globin genes, and any genes with fewer than 10 counts across 15 samples (the number of biological replicates). After filtering and globin removal, samples had a median library size of 4,842,497, with a minimum of 2,217,471 and maximum of 14,755,180 reads. DESeq2 v1.26.0 [10] was used to identify differentially expressed (DE) genes. For comparison of responders and non-responders (defined in the Methods section), sequencing batch was included in the design formula to control any batch effects. DE genes were identified using the Wald statistics test, followed by filtering for significance using a combined threshold of adjusted p-value ≤0.05 and absolute fold-changes ≥1.5. In the case of comparing responders and non-responders (defined using Day 180 titre as previously described), only an adjusted p-value cut-off was used. For pathway enrichment analysis, Sigora v3.0.5 [11] was used with the Reactome database at a hierarchy level of four.
Correction for multiple comparisons was done using the Bonferroni method, filtering results based on a value of ≤ 0.001. For additional information, refer to the companion methods paper:

Systems biology methods applied to blood and tissue for a comprehensive analysis of immune response to hepatitis B vaccine in adults.
Since blood samples taken for omics analysis were not collected at the same time for all participants, we used an EigenR2 analysis [12] to check the contribution of sampling time to the overall variance for the RNA-Seq data, and found it to be negligible (approximately 2%). First, the beta value distributions of the DNA methylation (DNAm) data was assessed using the R statistical software. Hierarchical clustering using either the entire EPIC array or 59 SNP probe's beta values was performed using the EPIC array beta values. Quality of the DNAm data was checked by filtering sites according to the following criteria: if probes were interrogated as a SNP, had evidence of cross hybridizing to a region of the genome other than the designed target, had a SNP present at the CpG target or single base extension of the probe (polymorphic probes) [13]. Probes were removed if they had a bead count ≤ 3 in 5% of samples, or had 1% of samples with a bad detection p value ≥ 0.05. The dasen method [14] was used for normalization, and variability associated with differing cell type proportions were corrected by regressing out the beta coefficients from linear regressions of the complete blood counts [15; 16]. Batch effects between Sentrix IDs were observed using Principal Component Analysis (PCA) where ComBat [17] was used to correct for Sentrix ID. Data from technical replicates were also assessed to ensure pre-processing and normalization procedures resulted in more comparable measures between replicates. 101,864 non-variable CpGs were removed as previously described [18] to reduce multiple test correction and improve computational efficiency. For additional information, refer to the companion methods paper: Systems biology of blood and tissue for comprehensive analysis of immune response to hepatitis B vaccine in adults.

Whole blood proteomic analysis
White blood cells (WBC) were collected from fresh whole blood were stored as dry pellet at -80°C until proteomic analyses were performed. WBC were lysed as previously described [19] then 10µg of WBC lysates and plasma were digested as previously reported [20]. Peptides were desalted, dried then chemically dimethylated with light, medium, and heavy formaldehyde [21].
After labelling, 2µg of peptides were injected into the EasynLC-1000 chromatography system (Thermo) and liquid chromatography and mass spectrometry analysis (LC-MS) were performed.
Data analysis was done using MaxQuant (v1.5.5.1) and the human UniProt database (downloaded on 15 July, 2017). Batch correction was performed on log2-transformed data using ComBat from the "sva" R package [22]. A paired t-test was used to test for differentially For all tests, proteins were filtered for significance using an adjust p-value of 0.05. For additional information, refer to the companion methods paper: Systems biology methods applied to blood and tissue for a comprehensive analysis of immune response to hepatitis B vaccine in adults.

Microbiome analysis
From fecal samples, bacterial community composition was analyzed via high-throughput sequencing of bacterial 16S rRNA amplicons, on the Illumina MiSeq platform using a previously described analysis pipeline [23]. Multivariate analyses via the mixOmics R package was used to relate microbiome composition to immune responses.
To identify and remove contaminating sequences, three extraction blanks were included. These were generated by first performing three extraction blanks, then using the eluate from these as a template for PCR amplification, and then sequencing in the same run as the biospecimens.
Two positive controls, consisting of cloned SUP05 DNA, were also included (number of copies = 2*10^6). OTUs present in blanks are either introduced from extraction reagents, PCR reagents, or from neighbouring samples during either the extraction or the PCR amplification process.
Contaminating OTUs were identified as being present in at least 50% of blanks, with count geometric mean plus one standard deviation was greater than the samples, reflecting previously described properties of contaminants arising from extraction or PCR reagents [24]. This approach also avoids removing common contaminating OTUs from neighbouring samples either during the extraction or PCR amplification steps, as these are of high relative abundance in the samples but low relative abundance or sporadic presence in the blanks. Contaminating OTUs were removed from the dataset. Samples with total read counts under 1000 after contaminant removal were also removed from further analysis. For additional information, refer to the companion methods paper: Systems biology methods applied to blood and tissue for a comprehensive analysis of immune response to hepatitis B vaccine in adults.

Data Preparation
Prior to integration by DIABLO, normalized data were further processed. First, for each dataset and feature (cells, CpGs, transcripts, proteins, etc.) in turn, we removed those missing (no measurable quantity reported) in >25% of samples. Next, for each dataset and feature in turn, missing values were imputed to 1/2 the lowest observed value. Finally, only samples with complete profiles across all datasets were retained for integration. This is a technical limitation of the chosen integration approach.

Model Fitting and Performance Evaluation
Hepatitis-B specific antibody titers, measured 30-days following the final of 3 vaccine doses administered (208 days after the first vaccine dose), were the response. Titers were centered and scaled (mean = 0, sd = 1) prior to use. Models were fit either to individual omic data (using sparse projection to latent structure regression; sPLS) or jointly in an integrative approach (using DIABLO). Model performance was evaluated using leave-one-out cross-validation and the mean squared error metric. We carried out limited exploration of the model hyperparameter spaces (n, the number of components, k, the number of features retained per component) for both sPLS and DIABLO. The estimated out-of-sample error rate was also compared to that of an uninformative model (whereby predictions were obtained by simply sampling with replacement from the responses and a mean squared error computed) in order to assess whether the demonstrated level of performance was useful.