Epigenetic DNA Methylation Profiling with MSRE: A Quantitative NGS Approach Using a Parkinson's Disease Test Case

Epigenetics is a rapidly developing field focused on deciphering chemical fingerprints that accumulate on human genomes over time. As the nascent idea of precision medicine expands to encompass epigenetic signatures of diagnostic and prognostic relevance, there is a need for methodologies that provide high-throughput DNA methylation profiling measurements. Here we report a novel quantification methodology for computationally reconstructing site-specific CpG methylation status from next generation sequencing (NGS) data using methyl-sensitive restriction endonucleases (MSRE). An integrated pipeline efficiently incorporates raw NGS metrics into a statistical discrimination platform to identify functional linkages between shifts in epigenetic DNA methylation and disease phenotypes in samples being analyzed. In this pilot proof-of-concept study we quantify and compare DNA methylation in blood serum of individuals with Parkinson's Disease relative to matched healthy blood profiles. Even with a small study of only six samples, a high degree of statistical discrimination was achieved based on CpG methylation profiles between groups, with 1008 statistically different CpG sites (p < 0.0025, after false discovery rate correction). A methylation load calculation was used to assess higher order impacts of methylation shifts on genes and pathways and most notably identified FGF3, FGF8, HTT, KMTA5, MIR8073, and YWHAG as differentially methylated genes with high relevance to Parkinson's Disease and neurodegeneration (based on PubMed literature citations). Of these, KMTA5 is a histone methyl-transferase gene and HTT is Huntington Disease Protein or Huntingtin, for which there are well established neurodegenerative impacts. The future need for precision diagnostics now requires more tools for exploring epigenetic processes that may be linked to cellular dysfunction and subsequent disease progression.

Epigenetics is a rapidly developing field focused on deciphering chemical fingerprints that accumulate on human genomes over time. As the nascent idea of precision medicine expands to encompass epigenetic signatures of diagnostic and prognostic relevance, there is a need for methodologies that provide high-throughput DNA methylation profiling measurements. Here we report a novel quantification methodology for computationally reconstructing site-specific CpG methylation status from next generation sequencing (NGS) data using methyl-sensitive restriction endonucleases (MSRE). An integrated pipeline efficiently incorporates raw NGS metrics into a statistical discrimination platform to identify functional linkages between shifts in epigenetic DNA methylation and disease phenotypes in samples being analyzed. In this pilot proof-of-concept study we quantify and compare DNA methylation in blood serum of individuals with Parkinson's Disease relative to matched healthy blood profiles. Even with a small study of only six samples, a high degree of statistical discrimination was achieved based on CpG methylation profiles between groups, with 1008 statistically different CpG sites (p <0.0025, after false discovery rate correction). A methylation load calculation was used to assess higher order impacts of methylation shifts on genes and pathways and most notably identified FGF3, FGF8, HTT, KMTA5, MIR8073, and YWHAG as differentially methylated genes with high relevance to Parkinson's Disease and neurodegeneration (based on PubMed literature citations). Of these, KMTA5 is a histone methyl-transferase gene and HTT is Huntington Disease Protein or Huntingtin, for which there are well established neurodegenerative impacts. The future need for precision diagnostics now requires more tools for exploring epigenetic processes that may be linked to cellular dysfunction and subsequent disease progression.

INTRODUCTION
Non-infectious diseases were once considered to be the result of faulty genes. The push of the last 50 years to decode the human genome carried the hope that we would be able to identify gene mutation errors and ultimately develop corrective gene therapies. Identifying the associations between diseases and genetic variants has been a major challenge to improving human wellness (Weitzel et al., 2016). The hope is that sequencing genomes of thousands of people and matching genome sequences to medical histories will reveal connections between genome sequence variations and root causes of diseases (Sudmant et al., 2015).
Epigenetic modifications of genomes are now recognized as playing central interacting roles in genetic determinants of health and disease (Ladd-Acosta and Fallin, 2016). Knowledge of epigenetics is essential to predict integrated genetic processes (Biémont, 2010). DNA methylation occurs on cytosines in a cytosine-guanine dinucleotide motif (CpG) and is one of the important forms of DNA epigenetic modification (Feng et al., 2010;Jones et al., 2010;Cyr et al., 2013;Nielsen et al., 2016). In general, when methylation is present gene activation is suppressed. This gene expression regulation directed by epigenetics in a large part explains why an individual is not simply a pre-programed reflection of a parentally inherited genome, but instead during their life they can develop alternative paths toward different outcomes of health and wellness.
However, there is increasing uncertainty about how to best profile site-specific CpG methylation. Despite the historical prominence of bisulfite oxidation mutagenesis as a technique, an issue of the journal Methods in 2015 presented multiple approaches and ideas to make these measurements (Meissner, 2015). There is a clear need and opportunity for methods development in the field epigenetics.
In this paper we present a pilot feasibility study for the development of a high-resolution DNA methylation profiling technology based on computational reconstruction of the probabilities of cytosine methylation states from next generation sequencing (NGS) data. A small trial study of Parkinson's Disease (PD) subjects was selected for application of the methylation profiling and analysis tool set. Parkinson's Disease affects more than 200,000 individuals yearly in the US. Although PD is currently incurable, medications can control symptoms. The neuropathology of symptomatic-onset PD is well established (Houlden and Singleton, 2012;Poulopoulos et al., 2012). Unfortunately, at the point where a diagnosis can be made with high confidence, neurodegeneration has generally progressed 5-10 years (Miller and O'Callaghan, 2015). Thus, most intervention therapies post-diagnosis have limited efficacy to alter PD progression because heavy neuronal damage has already caused up to 60-80% loss of nigral dopamine neurons (Zwagerman and Richardson, 2013;Miller and O'Callaghan, 2015). There is an urgent need to develop early diagnostic PD biomarkers that would enable clinicians: (i) to intervene at the beginning of the course of neurodegeneration, and (ii) to monitor progress of therapeutic treatment responses in individual patients.
Given recent insights into potential methylation biomarkers in PD (Masliah et al., 2013;Pihlstrøm et al., 2015;Tan et al., 2016), we apply a novel NGS approach to profile blood DNA that reveals subtle but quantitative shifts in epigenetic DNA methylation profiles correlated with early symptomatic diagnoses. Blood was selected as the genomic DNA source because of the ease of collection that would make a screening or diagnostic test easy and accessible to all patients. We know lymphocytes are the largest source of DNA in whole blood (10 6 cells per ml) and recent work has shown that circulating lymphocytes in PD patients evidence quantifiable changes in cellular activities (Masliah et al., 2013;Alberio et al., 2014;Colasanti et al., 2014;Grozdanov et al., 2014;Ide et al., 2015;Tan et al., 2016). Thus, we apply a novel highresolution, quantitative methylation measurement technology to test the efficacy of this approach to reveal epigenetic profiles of PD in blood.

Sample Processing
A small sample size was chosen for this initial feasibility study as a 3 vs. 3 comparison of early-age PD males (symptomatic in their mid-30's) and healthy males (n = 6). Our logic was that if a blood-based epigenetic signal could to any degree be ascertained in a cohort of only six individuals, then it would justify the time, effort and expense of pursuing a much larger validation study. Thus, the goal of this project was to simply assess if a methylation signal could be discerned.
For this pilot study, blood serum samples from the Coriell Institute Biorepository (www.coriell.org) were selected for three males with early-age Parkinson's Disease (#ND20108, #ND24665, and #ND28170). Three healthy males in the same age and demographic groups were selected as controls (#ND16294, #ND19290, and #ND24642). Descriptions of each patient sample are available via the Coriell Institute's online catalog system. Genomic DNA (1 µg each individual; comprised mostly of B-cells) was shipped directly to commercial sequencing providers. Sequence data reported here were obtained from the Yale Core Center for Genomic Analysis (Illumina HiSeq 2500; New Haven, CT).
Fragmentation of the gDNA samples followed standard Illumina protocols except for an additional restriction enzyme digest step at the beginning of the work flow. A 1 µg aliquot of the DNA was first cut with a single methyl-sensitive restriction endonuclease, HpaII. Once digested, DNA was washed with Qiagen's QIAquick PCR Purification Kit and sheared to a median size of 300 bp using a Covaris AFA sonicator. DNA libraries were prepared using Illumina's Sample Prep Kits, and 71-cycle single-read sequencing was performed on genomic DNA libraries using Illumina's sequencing by synthesis (SBS) technology. Fragmentation, library preparation and quality control checks were all performed by the sequencing provider using industry standard protocols.

Methylation Quantification
Overall, our strategy was derived from an earlier approach for assessing DNA methylation in the late 1990's that used differential restriction fragment profiles to assess cytosine methylation sites (Reyna-López et al., 1997). Over the last decade, we have developed a computational approach to harness the power of methyl-sensitive restriction enzymes with the base-pair specific precision of NGS to statistically reconstruct CpG site-specific methylation in heterogenous cell population samples (Marsh and Pasqualone, 2014). In the hg19 reference genome, there are 2.4 million HpaII sites that can be quantified simultaneously in this approach. In general, this approach falls into the category of "MSRE-Seq" methods. There have been other methodological implementations of methylation-sensitive assays like this, most recently "DREAM" in 2012 (Jelinek et al., 2012;van Esterik et al., 2015;Bouwmeester et al., 2016), but the challenge of assessing the probability of methylation events from NGS sequence data across whole-genomes has limited earlier work from being successfully adopted in the field beyond a few laboratories.
Quantification of CpG methylation profiles and subsequent statistical processing were performed on a commercial bioinformatic pipeline and software platform (Genome Profiling LLC, Newark, DE). The gDNA preparation protocol is non-mutagenic and when coupled with the analytics platform enables concurrent gene variant and epigenetic analyses from a single NGS run. Both GATK (Broad Institute, Cambridge, MA) and Issac (Illumina, San Diego, CA) variant call pipelines have been used to extract SNP and INDEL data from these NGS fastq files in addition to the quantitative methylation profiles. Ultimately these high-resolution measurements of fractional methylation at CpG sites across a genome provide a deep data resource for integrating profile data with potential functional shifts in specific genes from know pathways.

Software Platform
The integrated platform workflow is managed via a shell pipeline running python and R scripts, starting from raw-read QC filtering of fastq sequence files to final analyses of methylation profiling differences between patient groups. The platform performs the following tasks: (1) quality control to filter sequence tags for >97% read confidence, (2) isolation of target sequence reads, (3) sequence compression to reduce complexity, (4) read alignment to hg19 reference genome, (5) CpG quantification for 5'-methyl-cytosine site distributions, (6) methylation profiling comparison between patient groups, and (7) output of data plots, tables and analyses in a report package. The platform consists of a series of python scripts optimized for distributed processing on a multi-core server. All analyses were performed using virtual machines under Amazon's Web Services cloud infrastructure.
There are two metrics recovered from the raw sequence read files that are based on independent characteristics of DNA fragmentation via restriction digests and random shearing and local scale differences in coverage bias. One metric is proportional to the probability that a CpG site is fully methylated in a heterogenous sample (all gDNA copies present in the sample extraction) and the other is related to the probability that the same CpG site is fully unmethylated. These metrics are derived for each CpG site and are utilized in subsequent statistical analyses. This multi-measurement approach provides greater statistical sensitivity and robust error control. Using ordination analytics to partition measured differences in CpG states into discrete vectors (e.g., non-metric multidimensional scaling; see below) allows for the identification of the most significant shifts in CpG methylation states among subject groups.
All sequence data reported here are publicly available by contacting the lead author or through the Sequence Read Archive of the US National Center for Biotechnology Information (Accession #PRJNA342035).

Methylation Load
Analytics are executed at three levels of comparison: individual CpG [i] methylation, individual genes by summation across measured CpG sites, and higher-order pathways by assessing methylation status of component genes. This is the concept of methylation load across a defined sequence (Ordway et al., 2007;Hogenbirk et al., 2013) and serves to highlight functional differences that may exist when comparing different subject groups. Here, we utilize Differential Methylation Load ( ML), which is a summation of site-specific differences in CpG methylation that are then summed across a gene or gene domain or combination of genes (pathways). This is simply calculated as the signed sum of the difference in %MET scores for each CpG site within the defined gene or region being scored: (1) where the differential methylation load of the jth gene or pathway is calculated across all ith CpG sites contained within that gene or pathway. This difference is 'signed' and always calculated as NORM − PD so that positive numbers reflect higher methylation loads in healthy males and negative numbers reflect higher methylation in the PD subjects.

Differential Methylation
One approach employed here utilizes differential gene expression (DGE) analyses using the R package, edgeR, (Bioconductor) (Robinson et al., 2010;Zhou et al., 2014). Using DGE tools for methylation analysis is common with other open source tools, like RnBEADS (Assenov et al., 2014), where thousands of quantitative variables are compiled across a much smaller number of samples (<50). Methylation score data is well suited for such an analysis because of the limited range of response (two orders of magnitude) in the quantitative variables in contrast to the higher range of responses that exist in DGE data sets. Also, these DGE analytics packages have very well developed falsediscovery rate calculations and adjustment procedures. Briefly, a methylation data table containing normalized "counts" (i.e., methylated cytosine counts normalized to total cytosines at specific CpG sites) for all six sample libraries was loaded into an edgeR library in R. These data were used as the source from which a DGEList (Digital Gene Expression List) object was created. Differential response between two sample groups was calculated for single CpG sites using a site-specific false discovery rate correction applied to each pairwise comparison.

Ordinate Analyses
An ordination procedure called non-metric multidimensional scaling (NMDS, Legendre and Legendre, 2012) analysis was used to analyze integrated pattern discrimination in methylation profiles. This technique is like a principal components analysis. The primary difference between DGE and NMDS is that DGE analysis is executed at a pairwise level across separate, individual CpG methylation score values, while in NMDS all CpG sites for each sample are integrated as one pattern and the patterns are then compared across samples to identify CpG sites that are conserved within a sample group while also divergent between different sample groups. The R package vegan contains several robust ordinate analyses.

Hierarchical Clustering
To assess pattern similarities between individual CpG sites, we used the R package pv-clust to run a hierarchical clustering tree with iterative bootstrap support (n = 1000) using only the top 40 CpG sites that provided the most discriminating power between sample groups in the NMDS analysis.

Graphics
The R package ggplot2 is extremely versatile at handling a broad range of data types and graphic formats (Wickham, 2009). The Perl module circos is an essential tool for representing multiple data tracks at very dense genomic scales (Krzywinski et al., 2009).

Methylation Quantification and Profiling Analyses
Interest in DNA methylation profiling technologies has grown rapidly in recent years. One of the first methods reviews appeared only 6 years ago (Laird, 2010). Within the last year, one issue of the journal Methods was devoted entirely to DNA methylation profiling methods, highlighting the variety of work that is under development (Meissner, 2015). The Methods review pointed to several gaps in current approaches, the key being the transition from qualitative to quantitative data. In addition, comparing results between studies can be problematic when different profiling methods are used. Epigenetic profiling research is wide open for development of standard clinical diagnostic methods. The broad application of a standard approach would advance the quantitative understanding of the role of epigenetics in many diseases.
To be effective, screening tests for disease risks should utilize non-invasive sampling of easily obtained tissues. In addition, genetic and epigenetic profiles would be obtained and interpreted together, yielding a comprehensive genomic assessment. Recent work has shown the relevance of integrating DNA methylation profiles and DNA sequence analysis obtained from blood (both cell-free and cell fractions) (McClay et al., 2015). Here we assess the efficacy of a computational platform based on a methylsensitive restriction enzyme approach to recover epigenetic information in blood relevant to a known disease state in earlyage Parkinson's Disease.
It is important that a methylation profiling approach also yields variant calls (SNPs) with the same efficiency as a commercial standard whole genome sequence (WGS) profile. We used the "Genome in a Bottle" (GIAB) gDNA standard that has been developed by the US National Institute of Standards (NIST) to be used as a reference for assessing genome variant sequence analysis pipelines (Cornish and Guda, 2015). There were only minor differences in the number of SNPs recovered from a standard WGS library preparation performed by Macrogen USA and our restriction digest library protocol ( Figure S1). Both methods resulted in identifying 94% of the known SNP sites in GIAB across all chromosomes. Of those SNP sites, both methods yielded greater than 99% concordance with the known GIAB genotype. Thus, the fragmentation step via restriction digests during library synthesis does not impact the downstream efficiency of variant call analyses.
The repeatability of replicate measures of CpG methylation measurements between different replicate libraries is important. Two samples (two replicates each) were assayed on independent sequencing lanes on a HiSeq X10 (Macrogen USA). Measurement precision was expressed as the coefficient of variation (CV), which represents a percentage deviation from the mean (n = 2 independent libraries) for each CpG site scored by our analytical platform ( Figure S2). The frequency distribution of precision for duplicate gDNA samples reveals that over 70% of CpG sites have a CV deviation that is less than 2.5%. A cumulative frequency distribution shows that overall90% of the CpG sites were scored with a CV that is less than 10%, indicating that the quantitative methylation metrics are highly repeatable.
To assess raw CpG measurements, Figure 1 compares frequency distributions of methylation scores for each individual.
Here we see that the blood gDNA samples are distinctly bimodal for all individuals. There is a sharp peak of CpG sites centered around 90% methylation. This peak comprises about one third of the CpG observations for each sample. A second broader peak is evident around 60% methylation and comprises the remaining two thirds of the distribution in each sample. At this broad level of comparison there is no apparent difference between these healthy normal and Parkinson's Disease males.
The human reference genome (hg19) has 13 million CpG sites, including 2.4 million CpGs found in the sequence CCGG, which is the recognition sequence for HpaII. For analyses, a quick filter for each CpG site was executed to identify those for which the group means differed by more than 10%. The final data matrix consisted of 504,373 CpG sites. Because most of the HpaII CCGG sites are located in or near defined gene bodies, this data set yielded a deep level of gene-specific profiles.
As a first pass to assess potential patterns in methylation between groups, the CpG data matrix was merged with functional annotation information for each individual CpG site, combining gene information bed files with KEGG functional classification nomenclature (Kyoto Encyclopedia of Genes and Genomes). Using this integrative information, we explored potential functional impacts of these CpG patterns that could be related to disease stress or symptoms. This post hoc processing is best described as data mining because it leads to formulation of hypotheses about early disease phenotypes, rather than specific conclusions about possible mechanisms. This is true especially in light of the small number of individuals in this exploratory pilot study with the primary goal of assessing efficacy of this CpG methylation profiling platform.
An agglomerative metric was used to assess differential CpG methylation status between PD and healthy subjects at higher genomic levels. Differential methylation load ( ML) was calculated as a summation of the difference in %MET scores for each CpG site within a defined region or gene group being scored (Equation 1). Broad levels of methylation were assessed for KEGG functional classes and gene body domains ( Figure 2). Here, methylation load was distinctly different across gene domains and across functional KEGG groups. A large difference appears in Cellular Processes where 3 kb-upstream promoter domains of genes were more hypomethylated in PD individuals, suggesting functional differences in gene expression regulation within this broad gene/pathway group. Another noticeable difference appears in the Disease Response class where introns in this gene group are more hypermethylated in PD individuals. This class contains genes involved with many cell surface recognition and cell signaling processes predominant in lymphocytes. Overall, methylation loading on genes between PD and healthy individuals was not equivalent across domains and functional KEGG groups. Thus, there are distinct differences to be be pursued at lower levels of analysis.

Individual CpG Methylation Patterns
Distinct differences in Figure 2 warrant a detailed CpG sitespecific analysis. The R package edgeR was utilized because of the strong false discovery rate correction (FDR) that is calculated for each single CpG site comparison. Only minor adjustments in the input matrix were needed to translate the methylation scores into a format appropriate for a DGE analysis. The resulting differential library sizes for the six samples ranged from 1.71 to 1.77 M with a common dispersion factor of 0.034. Using an FDR adjustment (Benjamini and Hochberg, 1995), rather than a fixed threshold, the site specific dispersion estimates ranged from 0.001 to 0.231 with a median value of 0.035.
An exact pairwise T-test (edgeR) was performed using dispersion adjusted CpG variances (FDR threshold 0.05). Results for the top 40 CpG metrics sorted by P-value are shown in Table 1. In total, 1008 CpG sites were statistically significant at the p <0.0025 level (after FDR adjustment). Group differences were evenly split with 499 sites showing higher methylation in PD males while 509 sites had higher methylation in healthy males. Table 1 shows how the top distinguished sites compare in terms of log fold change ratios around 5.0 and highly significant FDR P-values. It is important to point out that the analysis employed the two metrics that are generated for each CpG site from the analytics platform instead of a single extrapolated % methylation value. These paired metrics are designated by a prefix "M" or "U" in the table and indicate values that are proportional to the probability that a CpG site is uniformly methylated on all genome copies in the sample and the probability that the site is uniformly unmethylated among all genome copies, respectively. Overall, these results suggest there is a small, defined set of CpG sites whose methylation status discriminates post hoc between healthy and PD males. The fact that these differences were detected in blood is concordant with other studies (Masliah et al., 2013;Alberio et al., 2014;Colasanti et al., 2014;Grozdanov et al., 2014;Ide et al., 2015;Tan et al., 2016). A heatmap of CpG methylation score patterns was generated with the R package gplots using the heatmap.2 function. This routine provides clustering across both rows (CpG [i] ) and columns (samples) with cell values represented on a green to blue color scale (Figure 3). Only CpG sites with an FDR adjusted P-value <0.0025 were used. Stark visual differences in color blocks point out two main contrasts: (1) low signal to midsignal, and (2) mid-signal to high signal. These contrasts arise in comparing both PD to Healthy and Healthy to PD, so there were four response groups of CpG sites evident in total. The color differentiation in Figure 2 underscores the low P-values in Table 1. Essentially, low sample variances in CpG measurements within a group (PD vs. healthy) establish a statistical foundation upon which error rates are extremely low.
A challenge with high-dimensional data sets arises from concerns about applying pairwise comparisons across thousands of non-independent data observations. To avoid this, we pursued a second analysis approach based on group pattern recognition, rather than multiple individual observation comparisons. Using the same input matrix as in the DGE based analysis, we applied an ordinate analysis technique, similar to principle component analysis, that employed an iterative algorithm solution using similarity rankings of CpG sites and repetitively applied coefficients to the derived response variables to locate a sample CpG data vector in a mutlidimensional space (Legendre and Legendre, 2012).
This approach, called non-metric multidimensional scaling (NMDS), established a clear and direct discrimination of methylation score profiles of PD and healthy males. The two sample groups are distinctly separated with the first two ordinate dimensions plotted as x, y coordinates (Figure 4). Here, differential CpG methylation patterns discriminated between healthy vs. PD epigenetic profiles with high confidence. The ellipses in the figure represent 95% confidence intervals around the true group mean locations in the ordinate space. The spatial FIGURE 3 | Dendrogram Heatmap of CpGs: p < 0.0025. Focusing on the most statistically significant CpG site changes, there are 1008 CpG sites with a methylation score difference with a probability of a type-I error of p < 0.0025 (after applying a false-discovery rate adjustment). The resolution between Parkinson's Disease patients and healthy males reveals quantitative separation among these groups of CpG sites. Each row represents the score values for a single CpG [i] site, with a hierarchical clustering association based on % methylation across all six individual samples in the study.
Frontiers in Genetics | www.frontiersin.org FIGURE 4 | CpG discrimination of Parkinson's individuals from healthy individuals. Using an ordinate analysis technique of non-metric multidimensional scaling (NMDS), CpG methylation profiles were compared among individuals to isolate patterns conserved within groups while also differing between groups (Parkinson's patients vs. normal healthy males). The first two component axes are plotted to locate the individual sample points in a 2D plane. Ellipses drawn represent 95% Confidence Intervals around the location of the true group means.
gap between the ellipses indicates that there is a very low probability that this separation between groups could have arisen by chance alone. Iterative bootstrap analysis of randomized data reveal a statistical probability of p < 0.0001. This separation shows a distinct, conserved DNA methylation shift evident in the blood profiles between PD and healthy males. The tight colocalization of the PD sample points in the NMDS plane indicates a strong and consistent cytosine methylation profile that is shared among all three PD patients, while the profiles of the heathy individuals reveal higher interindividual variation.
The NMDS analysis does not produce a P-value for significance of individual CpG sites because it is not a multiple-pairwise test for significance. It is a combined pattern analysis technique to establish the overall discrimination of the underlying data matrix to separate between patient groups. However, the contribution of each individual CpG site to the overall separation between sample groups can be recovered and allows for a ranking of the discrimination power of individual sites. Using the top 60 ranked CpG sites a hierarchical clustering with bootstrap replication was performed (Figure 5) to identify parallel patterns of methylation shifts across samples. The color shades in Figure 5 highlight lowest branches at which replicate support is >97%. There are several distinct groupings of CpG sites. The clustering indicates that the pattern of methylation among these groups is highly similar and may suggest a hypothesis of coordinate mechanisms of methylation status.

Gene Level Methylation Patterns
Clustering top discriminating CpG sites based on methylation patterns (Figure 5) may indicate potential sites with similar mechanisms of regulating methylation/demethylation. To assess whether such cluster associations indicate parallel methylation will require an analysis of how methylation shifts between subject groups compare at higher-levels of cell functioning. First, a methylation load summation ( ML) was utilized to compare methylation levels across defined genes between patient groups. FIGURE 5 | Hierarchical Clustering of Top Discriminating CpG Sites. The top ranked CpG sites that contribute to the group discrimination in the NMDS ordinate analysis are isolated and compared for similarity in methylation patterns across all individuals. Iterative boot-strap analysis (1000 reps) using the R package pv-clust established confidence levels of support for branches. Color shading represent the lowest order branch with significant association supported by >97% of the runs.
A list of the top 40 hypo-methylated genes in PD males ( Table 2) provides a look at genes that are the most under methylated relative to healthy males. Because of the general release of transcriptional gene controls associated with hypo-methylation, we can postulate that these genes are likely more active in PD than in healthy controls, but direct testing is necessary before further inferences can be made.
A top 40 list ( Table 2) is just a beginning point for an analysis with sufficient patient sample size (n = 20 to 50) to warrant digging deeper to look for markers and mechanisms of disease phenotype progression. The total number of annotated genes for which a ML score could be calculated was 4044. In a concerted biomarker study, an investigator would be able to parse deeper into ranked gene lists to assess the functional relevance of smaller shifts in methylation load. Also important to consider in a deeper study is the break out of ML into gene body structures, such as introns (miRNAs, transcript splice variants), CpG Islands (coordinated local domain shifts, retrotransposons), and 2 kb upstream promoters (transcriptional regulation). With a sufficient patient sample size, a break-down ML could be evaluated separately in different functional contexts within, or in proximity to, a defined gene body. Thus, total ML across an entire gene is a high-level integrated score that reveals "genes of interest, " and focus research attention on gene body targets for which an assessment of methylation shifts should be executed on a finer-scale. Genes are rank ordered by methylation load score ( ML) and are presented with PFAM and RefSeq name designations (where available) as well as KEGG pathway associations. The gene list has been filtered for functional pathway assignments. Hypothetical genes or genes with unannotated functions were not included in this rank listing.
Overall, there is a noticeable representation of genes involved with cell surface signaling events, cell-surface interactions, and even neuron functioning. Keeping in mind that the primary source of gDNA in these samples are B-and T-cells, plus much smaller amounts of circulating cell-free DNA, the strong "immune" based response suggests that in blood a traceable signal indicative of PD exists at a functional gene level and not just as independent CpG methylation events. A KEGG pathway descriptor is included in the table but many genes are active in several pathways and thus the primary descriptor does not always relate to the most proximal function a gene may contribute to via altered methylation states in this specific study. Given the small number of individuals and the range of annotation possibilities, the presented pathways should be considered as hypotheticals.
Similarly, a list of the top 40 hyper-methylated genes in PD males (Table 3) provides a look at genes that are the most Genes are rank ordered by methylation load score ( ML) and are presented with PFAM and RefSeq name designations (where available) as well as KEGG pathway associations. The gene list has been filtered for functional pathway assignments. Hypothetical genes or genes with unannotated functions were not included in this rank listing.
over methylated relative to healthy males, and thus likely to be less active or repressed. Again there is an intriguing list of genes associated with cell signaling, surface recognition, cellular defense and neuron function. To assist in parsing through these gene lists (Tables 2, 3) to identify significant features we employed a custom PubMed (NCBI) literature citation tool to screen all indexed publications on Parkinson's Disease and/or neurodegeneration. Instead of trying to manually assess each gene in the list, this approach uses PubMed as a knowledge base and efficiently determines literature support for each gene in both tables in association with Parkinson's Disease.
Parsing through the PubMed citations (inclusive of Abstracts), a threshold of >100 total publications was used to hunt for genes in Tables 2, 3 that have been prominently linked to or associated with Parkinson's Disease or neurodegeneration or  Tables 2, 3, the top methylated genes were used to search the PubMed citations and identify which of these highly differentially methylated genes were associated with prior research on neurodegenerative diseases and processes. neural function. Figure 6 presents the most prominent genes identified: FGF3, FGF8, HTT, KMTA5, MIR8073, and YWHAG. These genes all show substantial differential methylation in this study as well as being prominently represented in the literature with some connection to PD and/or neurodegeneration. In Figure 6, the field of PD research has exponentially increased from 1980 to 2016. The representation of genes in this publication group are shown as percent of total publication values, so the percentage of publication numbers in the lower panel are normalized to the exponential increase already, and increases in the percentage values are indicative of real increases in the representation of specific genes within this literature.
Of the six genes in Figure 6, two stand out more prominently. First, KMT5A or SETx, is a protein N-lysine methyltransferase that monomethylates both histones and non-histone proteins. Histone 4 (H4K20) is a target and is methylated during mitosis and represents a specific tag for epigenetic transcriptional repression (Kalakonda et al., 2008;West et al., 2010;Malik et al., 2015). Second, HTT, is Huntington Disease Protein or Huntingtin, for which there are well established neurodegenerative actions (Gusella et al., 2014;Kaliszewski et al., 2015;Kumar et al., 2015;Labadorf and Myers, 2015). Simple literature based searches are not definitive and there is a great deal of language complexity when parsing through abstracts targeting in on just a few specific words. However, a knowledge based approach is one of the most efficient ways to tackle problems like this where long lists of genes (1000's) can be efficiently scored for topical relevance.
An ideogram figure combines differential methylation load, CpG gene promoter densities, CpG clustering associations, and impacted genes (Figure 7). Here, ( ML) is presented as the outer track scatter plot and clearly shows how the vast majority of chromosome domains (at 1 MB intervals) are equivalently methylated between both groups (gray points). There are only a very few regions where differential methylation is substantially evident (red or green points), and these hyper-and hypomethylation deviations appear mostly in regions where CpG density is highest in promoter regions.
The arcs in Figure 7 present the clustering similarities for just the top 20 CpG sites (for visual clarity). The red arcs connect CpG sites that are the most closely correlated, the orange arcs connect second-best correlations. It is interesting to note that the highest correlated CpG sites do not occur in the same gene or chromosomal region. Their locations are separated across chromosomes, but these CpG sites are rarely located in domain regions where there is a large ML between groups. The genes in which the top CpG sites are located are labeled around the outside of the ring, color coded by whether that CpG site is more methylated in the healthy group (green) or more methylated in the PD group (red). Overall, mapping the changes across methylation states helps to assist in identifying potential or rational hypotheses for "mechanism of action" differences between patient groups.
The differential methylation responses of CpG sites integrated into a genome profile (Figure 7) shows the overlaying of multiple data tracks that are necessary to bring some focus to genomic-based differences. All these variables come into play in piecing together the epigenome profile and assessing the distinguishing characteristics that could define PD processes/risk apart from healthy subjects. This integrated plot helps to pin-point associations not evident in the large data tables to suggest associations, correlations and response mechanisms that we can mentally explore and determine if there are observational patterns that warrant further focus.

PD Markers
The slow progression of many neurodegenerative diseases leaves a window of time for therapeutic interventions that could potentially ameliorate disease symptoms, retard advance of the disease and/or even cure the disease. There is a clear advantage for early detection and intervention. To facilitate early diagnoses, the hunt for markers of Parkinson's Disease has traditionally focused on behavioral, cognitive and motor activity profiling (Mehta and Adler, 2016). Within the last decade, work is progressing to identify quantitative molecular signals in blood, cerebral spinal fluid, and other peripheral tissues to identify any systemic traces of neurodegenerative stress or activity in source tissues that could be readily obtained for a screening assay (Schneider et al., 2016).
The idea that a neurodegenerative disease signal can be detected within peripheral tissues other than the cerebral cortex is well established. There are active efforts looking at skin tests to detect signs of PD via alpha-synuclein (SCNA) in epidermal nerves (Donadio, 2014;Donadio et al., 2016;Rodríguez-Leyva et al., 2016). Even gastric and colonic mucosa have been profiled for markers (Chung et al., 2016). Blood is probably the most active area of PD marker research looking at a range of potential targets from IGF-I titers (Bernhard et al., 2016), micro-RNA signatures , mitochondrial densities (Pyle et al., 2016), vitamins (Ding et al., 2013;Ide et al., 2015), lymphocytes (Tan et al., 2014(Tan et al., , 2016 and proteins like alpha-synuclein (SNCA) (Pihlstrøm et al., 2015).
As the field of epigenetics has rapidly developed, some attention has turned to assessing the potential for this geneticbased chemical information system to serve as a diagnostic source for PD (Ai et al., 2012). Pioneering work has been conducted by Desplats et al. (2011) describing an interaction between SCNA and DNMT1 (methylation maintenance) that likely contributes to the broader hypomethylation shifts in neurons of PD patients. Shifts in DNA methylation are a definite component of PD. The work of Pihlstrøm et al. (2015) shows correlated methylation of SNCA from both cerebral cortex and blood within PD patients, so the SCNA signature of the disease can be traced in blood. Likewise, Masliah et al. (2013) demonstrates genome-wide correlates of genome methylation patterns between both brain and blood samples in PD patients. An epigenetic signal linkage is evident between brain and blood in PD. The ability to assess epigenetic proximal signals of neurodegeneration in peripheral tissues establishes the utility of using blood as a tissue source for PD marker assays. An epigenetic response within peripheral blood cells to PD progression within patients is demonstrable, especially in circulating cells like lymphocytes (Tan et al., 2014(Tan et al., , 2016. The functional role of white blood cells moving through circulatory and lymph systems makes them an effective sentinel target for disease response markers as they perfuse through tissues in a high state of signal sensitivity monitoring micro tissue environments.
Despite the apparent utility of assaying for DNA methylation markers in PD, there are only a handful of studies published in the literature in the last 5 years pursuing this avenue of research. The main limitations seem to be methodological problems with accurately quantifying gene methylation and the cost of NGS based assay tools needed to cover large patient cohorts. Although this present study is a feasibility test of a DNA methylation platform, the low variance and high discrimination achieved suggest that circulating lymphocytes in PD subjects may provide gene-based epigenetic markers of disease states. The majority of these markers are not likely to be proximally related to PD symptoms, but instead secondarily related to the stress and response of the body to the presence of neurodegenerative processes. This disease signal is likely to be unique in terms of an immune system response to maintain an individual's current state of health.

CONCLUSIONS
An important synthesis of Figure 7 is the potential understanding of how methylation signals may transfer molecular information from single CpG sites to genes to pathways to cellular function. This cross level linkage is an important evidentiary source for triaging a list of "mostinformative" CpG sites that could be combined into a targeted panel assay. The real end goal is to utilize early-stage epigenetic shifts that may indicate changing disease risks as a clinical screening tool. The first step is to combine 96 CpG sites into a high-throughput qPCR screening platform (technologies like Fluidigm, NanoString) and begin testing for efficacy among subjects with defined disease phenotypes. We are not at that point with this study here, but we are a step closer to that point with the technology platform that has generated the data presented.
At a functional level, we can trace patterns in differential methylation back to pathways and genes that support hypothesislevel ideas about potential roles of DNA methylation in altered molecular, cellular and physiological activities of PD. These "signals" are evident in blood even though they convey a distinct neurodegenerative signature. Even with a small pilot cohort, we statistically recover epigenetic profiles that discriminate between healthy and PD blood profiles, with a functional relevance in many of the comparative differences. The quantitative sensitivity and repeatability (across patients) of the DNA methylation metrics presented here are substantial enough to warrant further exploration with larger patient cohorts and with the goal of trying to establish a blood-based marker screening test for early age Parkinson's Disease.
Overall, the results of this profiling work reveal a large number of differentially methylated CpG sites that can be identified with high statistical confidence. With a larger subject cohort, these results would support the design and development of a targeted panel assay of select CpG sites using qPCR measurement platforms. Ultimately, targeted panels for detecting shifts in sitespecific CpG methylation could make it possible to diagnose early onset stages of impending diseases or disease risks by identifying pre-symptomatic genomic changes via epigenetic mechanisms.

AUTHOR CONTRIBUTIONS
This work was conceived and designed by: AM and MG. AM performed the experiments. MC and AM analyzed the data. MG, AM and MC contributed reagents/materials/analysis tools. AM and MC wrote the first draft of the paper. All authors read and approved the final draft.

FUNDING
The development of this work was supported by grants from the US National Science Foundation: #0944557 to AM, #1316055 to co-PI AM, and #1355306 to AM. Additional funding from Genome Profiling LLC covered costs of sequencing and computation.

ACKNOWLEDGMENTS
We thank: the Micheal J. Fox Foundation and Dr. P. DesPlats (UCSD) for informative discussions on the application of DNA methylation profiling to Parkinson's Disease; The National Institute of Neurological Disorders and Stroke (NINDS) Human Genetics Repository administered by the Coriell Institute for providing well characterized samples; D. Landais, Ph.D., for providing extensive comments and guidance during the course of this study.

SUPPLEMENTARY MATERIAL
The Supplementary Material for this article can be found online at: http://journal.frontiersin.org/article/10.3389/fgene. 2016.00191/full#supplementary-material Figure S1 | Efficiency of Genome Variant Analyses. The NIST standard "Genome in a Bottle" was processed using a commercial sequencing lab (Macrogen USA) comparing standard whole genome library preparation and downstream variant analysis (WGS) to a HpaII fragmentation prior to library preparation and then normal sequencing and variant analysis (GenPro). Both methods recovered 94% of the known SNP variants in the Genome in a Bottle genotype.