Methylation Data Processing Protocol & Comparison of Blood and Cerebral Spinal Fluid Following Aneurysmal Subarachnoid Hemorrhage

One challenge in conducting DNA methylation-based epigenome-wide association studies (EWAS) is the appropriate cleaning and quality-checking of the methylation values to minimize biases and experimental artifacts, while simultaneously retaining potential biological signals. These issues are compounded in studies that include multiple tissue types, and/or tissues for which reference data are unavailable to assist in adjusting for cell-type mixture, for example cerebral spinal fluid (CSF). For our study that evaluated blood and CSF taken from aneurysmal subarachnoid hemorrhage (aSAH) patients, we developed a protocol to clean and quality-check genome-wide methylation levels and compared the methylomic profiles of the two tissues to determine whether blood is a suitable surrogate for CSF. CSF samples were collected from 279 aSAH patients longitudinally during the first 14 days of hospitalization, and a subset of 88 of these patients also provided blood samples within the first two days. Quality control (QC) procedures included identification and exclusion of poor performing samples and low-quality probes, functional normalization, and correction for cell-type heterogeneity via surrogate variable analysis (SVA). Significant differences in rates of poor sample performance was observed between blood (1.1% failing QC) and CSF (9.12% failing QC; p = 0.003). Functional normalization increased the concordance of methylation values among technical replicates in both CSF and blood. Likewise, SVA improved the asymptotic behavior of the test of association in a simulated EWAS under the null hypothesis. To determine the suitability of blood as a surrogate for CSF, we calculated the correlation of adjusted methylation values between blood and CSF globally and by genomic regions. Overall, mean correlation (r < 0.26) was low, suggesting that blood is not a suitable surrogate for global methylation in CSF. However, differences in the magnitude of the correlation were observed by genomic region (CpG island, shore, shelf, open sea; p < 0.001 for all) and orientation with respect to nearby genes (3’ UTR, transcription start site, exon, body, 5’ UTR; p < 0.01 for all). In conclusion, the correlation analysis and QC pipelines indicated that DNA extracted from blood was not, overall, a suitable surrogate for DNA extracted from CSF in aSAH methylomic studies.


Study Design Overview
Our study population is comprised of individuals who have sustained an aSAH. Patient 24 DNA was obtained from two biological tissues, CSF (drained as standard of care) and blood. This 25 study investigated CSF samples collected longitudinally from 279 patients during the first 14 days 26 of hospitalization, and blood samples from 88 of these individuals collected within the first day of 27 hospitalization. Methylomic profiles were obtained using a genome-wide array, from which 28 methylation levels, quantified as beta-values (i.e., percent methylation) and M-values (i.e., a 29 transformation of the beta-values, which exhibit beneficial properties for statistical analysis), were 30 assessed for over 450,000 cytosine-phosphate-guanine (CpG) sites. QC analyses of methylation 31 23 The severity of aSAH was assessed by Fisher grade (Fisher, Kistler, and Davis 1980) 24 employing CT scan to assess hemorrhage burden and by Hunt and Hess scores (Hunt and Hess 25 1968) to assess symptom burden. Demographic and anthropometric characteristics such as age, 26 sex, race, height, and weight were collected from medical records (Table 1). Smoking status was 27 also collected. University. Each BeadChip, hereafter referred to as a plate, consists of eight chips of 12 samples 5 arranged in a layout of six rows by two columns. This enables 96 samples to be run on a single 6 plate. To avoid plate effects, all blood samples were assayed together on a single plate. CSF 7 samples were placed across 11 plates using several strategies to reduce the impact of technical 8 artifacts. First, all longitudinal samples from the same patient were included on the same chip 9 within the same plate so that longitudinal changes in methylation were not obscured by chip and 10 plate effects. Second, row and column positioning of samples from the same patient were carefully 11 assigned to available positions within a chip so that longitudinal changes in methylation were not 12 confounded with row and column effects. Third, cases and controls for DCI were balanced within 13 chips using a checkerboard pattern so that DCI was not confounded with row, column, chip, or 14 plate effects (see Supplemental Figure 1 for the plate map). To gauge technical variation, we 15 included four control samples of fixed methylation state (0%, 30%, 70%, and 100% methylated) 16 and four technical replicates (i.e., repeated assays of the same DNA sample) per plate. Two of the 17 control samples were placed in the same position across all plates and two were randomly placed. 18 For the plate of blood samples, all four technical replicates were randomly positioned duplicates.

19
In contrast, for the 11 plates of CSF samples, three of the four technical replicates were randomly 20 chosen duplicate samples, and one was the same sample replicated across all 11 plates.  23 ENmix (Xu et al. 2016) was employed to assess the quality of samples in our methylation 24 study, separately for blood and CSF samples. Samples having bisulphite control intensities less 25 than 3 standard deviations below the mean of all samples, and/or for which more than 1% of probes 26 were inadequately detected (i.e., detection p-values > 0.01 or with fewer than 3 beads) were 27 categorized as low-quality samples. These, along with outliers in total intensity or beta value 28 distribution were removed from our subsequent analyses (Xu et al. 2016). After the removal of 29 low-quality and outlier samples, we performed background correction (Xu et al. 2016) to remove 30 non-specific signals from the total signal, and performed dye bias correction (Xu et al. 2017). Sample quality differences by tissue type were tested using Fisher's exact test on counts of samples 1 passing or failing all sample QC filters. 2 We normalized the methylation data to bring Infinium Type I and Type II probes into 3 alignment and to reduce noise and technical variation due to batch effects (i.e., plate, chip, row, 4 and column effects). Specifically, we performed functional normalization, an extension of quantile 5 normalization, which makes use of the control probes on the array to regress out unwanted 6 variation in the methylation data (Fortin et al. 2014). Whether functional normalization improved 7 agreement between technical replicates was tested by comparing the squared differences in median 8 M-values between technical duplicates before and after normalization using a one-sided paired t-9 test.

12
After normalizing the data, we removed CpG sites from our analysis due to: (1) overlap of 13 methylation probes with known polymorphic sites (which can cause biased methylation    Owing to the lack of reference methylation data for cell types found in CSF after an aSAH 2 event we employed surrogate variable analysis (SVA) to perform reference-free adjustment for 3 cell-type heterogeneity across the samples in blood and CSF data. SVA, as implemented in the sva 4 R package (Leek et al. 2012), simultaneously models the effects of known sources of variation 5 (covariates) and unknown sources of variation (i.e., surrogate variables), conditional on a 6 phenotype of interest. Including the phenotype of interest in this modeling approach is necessary 7 to prevent the surrogate variables from accounting for variation due to, for example, differences 8 between cases and controls of disease, so as not to stymie subsequent analyses aimed at detecting 9 CpG sites associated with case/control status. For examining the utility of surrogate variables in 10 adjusting for cell-type heterogeneity in the absence of any particular phenotype-specific analyses, 11 we generated a random trait by randomly permuting one of our observed traits, DCI, to serve as 12 our outcome of interest. SVA was performed for this simulated trait along with age and gender as 13 covariates in the context of an EWAS, whereby each CpG was individually tested for association 14 with the simulated trait. Given the repeated measures in CSF, we grouped the CSF samples into 15 five subsets centered on their target days (days 1, 4, 7, 10 and 13) and substituted samples +/-1 16 day when a sample on the target day was unavailable. The goal of performing SVA cross-17 sectionally in CSF subsets is to retain the variation in methylation related to time. EWAS was also 18 performed for the simulated trait without adjusting for surrogate variables and the distribution of 19 p-values for SVA-adjusted and unadjusted EWAS scans under the null hypothesis were 20 qualitatively compared to determine effect of SVA on genomic inflation. We measured 21 inflation/deflation using the genomic inflation factor (λ), which is defined as the ratio of the 22 empirically observed to expected median of the distribution of the test statistic.  25 We compared the methylation profiles of individuals with blood samples collected within 26 the first day after hospitalization and CSF samples collected at day 1, 4, 7, 10 and 13. We used 65, 27 64, 65, 61 and 47 subjects to compare the methylation profiles of blood and CSF at day 1, day 4, 28 day 7, day 10 and day 13 respectively to facilitate individual level comparison. For this 29 comparison, we excluded CpG sites with a methylation beta value less than 10% or greater than 30 90% from all CpGs that passed QC, as methylation at these sites had little variation across samples and would not be informative for the analysis. The M-values at each qualifying CpG site were 1 adjusted for age, sex and the surrogate variables using the aforementioned random trait to remove 2 unwanted variation, and were then used to calculate correlation coefficients between the blood and 3 CSF profile.  (Table S1). QC analyses and filtering procedures were performed separately 13 for CSF and blood samples. Based on low average bisulphite intensity and/or high proportion of 14 poorly detected probes, we identified 89 (of 1012; 8.8%) poorly performing CSF samples ( Figure   15 1). Additionally, we identified 3 (0.3%) more CSF outliers based on low total intensity. In contrast, 16 no blood samples (0 of 92; 0%) failed these criteria. Figure 2 displays the beta-value distributions 17 of all samples collected, based on which one blood and one additional CSF samples were identified 18 as outliers. In total, poor sample performance was more common for CSF (93 of 1,012, 9.1%) than 19 for blood (1 of 92, 1.1%), and these differences in quality of methylomic profiling by tissue type 20 were statistically significant (Fisher's exact test p = 0.003). Table S1 gives counts of all samples 21 collected and samples retained after QC, for each collection time day.

22
After removing low-quality samples, we performed functional normalization to reduce 23 probe type (Infinium Type I vs. Type II) and batch (i.e., plate, chip, row, and column) effects. The 24 reduction in chip, row, and column effects can be visualized in the distribution of M-values, before 25 and after functional normalization, for samples profiled together on a plate (Figure 3). Row effects 26 are apparent for some chips as increasing means across adjacent samples. For example, before 27 normalization the third chip from the left in Figure 3A shows strong row effects indicated by means 28 forming an upwardly sloped trend across the first to fifth samples (which correspond to ascending 29 rows in the first column), followed by another upwardly sloped trend across the sixth to eleventh 30 samples (which correspond to ascending rows in the second column). Functional normalization increased concordance in median methylation between 34 technical replicate CSF samples (p = 1 0.015) (Figure 4). For the 4 technical replicate blood samples, the same trend of increased 2 concordance after functional normalization was observed; however, this trend was not statistically 3 significant (p = 0.153).  Based on QC analyses, CpG probes with multimodal beta-value distributions, low detection 10 quality across samples, and high technical variation across replicate samples were also filtered out 11 of analyses. CpG probe-level filtering criteria are summarized in Table 2. For each QC filtering 12 step, and overall, fewer CpGs were filtered out in blood than in CSF (p < 2.2 x 10-16 for all), 13 indicating that CSF samples may yield somewhat lower-quality methylation data, as is also evident 14 in Figure 1.

26
Because methylomic profiles differ widely by cell type, modeling cell type heterogeneity 27 across samples is crucial for valid cross-sample analyses of methylation data. However, external 28 cell type-specific reference data was not available for post-aSAH CSF for use in reference-based 29 adjustment. Therefore, we performed reference-free adjustment using SVA to remove unknown 30 sources of variation including cell type heterogeneity. We further excluded technical replicates from all samples that passed QC, leaving 70 blood samples and 154, 246, 217, 152, and 95 CSF 1 samples for days 1, 4, 7, 10 and 13 respectively. Ten surrogate variables (SVs) were generated for 2 the set of blood samples, and 13 SVs were generated for day 1 CSF samples. Fifteen, 15, 14 and 3 10 surrogate variables were generated for CSF samples for day 4, 7, 10 and 13 respectively. To   18 Following our long-term goal of understanding the methylomic changes occurring across 19 tissues after aSAH, we explored the suitability of peripheral blood collected within the first day of 20 hospitalization as a surrogate for the normally less accessible longitudinally collected CSF based 21 on the correlation of adjusted M-values between the two tissue types obtained from aSAH patients.

22
Specifically, we compared the methylation profile of blood collected within 48h of hospitalization 23 versus CSF samples collected at day 1, 4, 7, 10 and 13 post rupture respectively. Table 3 24 summarizes the numbers of CpGs used and the correlation coefficients for each day. In general, 25 the mean correlation (0.23 -0.26) was too low to use blood as a surrogate for post-aSAH CSF in 26 a global manner. 27 Differences were observed in the magnitude of the correlation by genomic position (CpG 28 island, shore, shelf, and open sea; p < 0.001 for all), with islands and shores showing greater 29 positive correlation than shelves and seas ( Figure 6, Supplemental Figures 4-7). Similarly, the   We also explored the question of whether methylomic profiles from blood samples could 21 serve as surrogates for less accessible CSF. Though significant positive correlations were 22 observed, especially for regulatory regions such as CpG islands and locations near transcriptional 23 start sites of genes, globally, the correlations in methylation values between blood and CSF were 24 too low for blood to serve as a useful surrogate for most scientific or clinical purposes. However, 25 to understand the methylomic changes that occur post-aSAH, we believe that CSF would be a most 26 relevant source, representing the central nervous system (CNS) environment and its proximity to 27 the hemorrhagic location.

28
This study benefited from several strengths including the thoughtful plate design aimed at 29 reducing confounding of experimental effects with technical artifacts, thorough and rigorous 30 application of data QC procedures, pairing of blood and CSF samples from the same patients, and assessment of methylomic profiles in a novel tissue type (post-aSAH CSF) that captures the CNS 1 environment post-aSAH. The study is also novel as this is the first to investigate methylation 2 patterns in DNA extracted from CSF over a longitudinal period after aSAH. Despite these 3 strengths, limitations of the current study include limited statistical power to resolve the intra 4 subject differences among the samples that may ultimately pose challenges in using this dataset 5 for future EWAS studies. Additionally, the cell composition of CSF may vary over time after 6 hemorrhage, which would also affect the methylation levels. Thus, longitudinal analyses of post-7 aSAH samples are challenging as cell-type heterogeneity may be confounded with days post 8 injury. Overcoming these challenges will be necessary to accomplish goals such as identifying 9 genes whose changes in methylation after injury are predictive of recovery outcomes. 10 In conclusion, this study is one of the first attempts to investigate DNA methylation at the 11 genome scale in a sample of aSAH patients, as well as one of the first to measure methylation in 12 CSF. Our analysis protocol showed that methylomic profiles can be obtained from CSF for use in 13 EWAS analysis and that QC steps can improve the analysis by eliminating low-quality data points 14 and reducing biases and experimental artifacts. Likewise, we show that blood, while readily 15 accessible, is not a sufficient surrogate for the methylomic status of CSF. Ultimately, efforts to 16 understand methylation profiles in aSAH patients, and changes that occur post-injury, may lead to 17 the discovery of biomarkers of clinical utility in predicting patient recovery.
Foremost, we thank the participants of this study for making this work possible. Funding 11 for this study was provided by the National Institute of Health (R01NR013610). We thank UPMC 12 for access of the clinical samples. 13 14 Public Data Access 15 The data can be accessed through dbGAP: phs001990.v1.p1.       between count sum and sample size.