MGMIN: A Normalization Method for Correcting Probe Design Bias in Illumina Infinium HumanMethylation450 BeadChips

The Illumina Infinium HumanMethylation450 Beadchips have been widely utilized in epigenome-wide association studies (EWAS). However, the existing two types of probes (type I and type II), with the distribution of measurements of probes and dynamic range different, may bias downstream analyses. Here, we propose a method, MGMIN (M-values Gaussian-MIxture Normalization), to correct the probe designs based on M-values of DNA methylation. Our strategy includes fitting Gaussian mixture distributions to type I and type II probes separately, a transformation of M-values into quantiles and finally a dilation transformation based on M-values of DNA methylation to maintain the continuity of the data. Our method is validated on several public datasets on reducing probe design bias, reducing the technical variation and improving the ability to find biologically differential methylation signals. The results show that MGMIN achieves competitive performances compared to BMIQ which is a well-known normalization method on β-values of DNA methylation.


INTRODUCTION
DNA methylation, as a well-known epigenetic marker, plays an essential role in biological processes and complex genetic diseases like cancer and diabetes (Irizarry et al., 2009;Paul et al., 2016). The Illumina Infinium HumanMethylation450 (450K) BeadChip (Bibikova et al., 2011) provides measurements of the level of methylation at over 480K CpG sites and has been widely used in epigenome-wide association studies (EWAS) and large-scale projects, such as The Cancer Genome Atlas (TCGA). The probes in the Infinium 450K BeadChip come in two different designs, type I (n = 135,501) and type II (n = 350,076), in order to increase the genomic coverage of the assay. However, the methylation values (β-values or M-values) derived from the two types of designs exhibit different distributions. Particularly, the type I probes possess a larger range of measurement than the type II probes (Dedeurwaerder et al., 2011). The differences between the two types of probe designs may impact the downstream analyses.
Several approaches have been published to correct the probe design bias. A peak-based correction (PBC) method normalizes type II probes to render them comparable with type I probes (Dedeurwaerder et al., 2011). In fact, PBC gets poor performance when the density distribution of methylation values does not show well-defined peaks. SQN (Touleimat and Tost, 2012) and SWAN (Maksimovic et al., 2012) select subset of probes with similar biological category to adjust the probe design bias. Beta MIxture Quantile dilation (BMIQ) is a model-based normalization approach to correct β-values of type II probes according to the beta distribution of β-values of type I probes, which appears to outperform PBC, SQN, and SWAN (Teschendorff et al., 2012).
In this work, we propose a method to correct the probe design bias based on the Gaussian Mixture Model (GMM) of the Mvalues of DNA methylation, which is called M-value Gaussian-MIxture Normalization (MGMIN). The method includes three steps: (i) fit Gaussian-mixture distributions to type I and type II probes separately, (ii) utilize a transformation of M-values into quantiles, (iii) perform a dilation transformation based on M-values to maintain the continuity of the data. We evaluate MGMIN using several independent datasets in terms of reducing the replicate technical variance and correcting the type II bias. By comparison with BMIQ, the results show that MGMIN improves the overall performance of normalization.

M-value
The β-value of DNA methylation for each probe is defined by the ratio of the methylated intensity (M) and the overall intensity (sum of methylated intensity and unmethylated intensity: where α is a constant offset (by default, α = 100) to regularize the β-value when the overall intensity is low. The β-value falls between 0 and 1 which follows a Beta distribution naturally. A β-value of 0 indicates the CpG site of the measured sample is fully unmethylated and a value of 1 indicates that the CpG site is completely methylated.
The M-value is calculated by the log2 ratio of the methylated intensity (M) vs. the unmethylated intensity (U): where α here is also an offset (by default, α = 1) to counteract the big changes caused by small intensity estimation errors. An Mvalue close to zero indicates that the measured CpG site is about hemimethylated. A positive M-value suggests that more copies of the measured CpG site are methylated than unmethylated and a negative M-value means more copies of the CpG site are unmethylated. The M-value has been widely used in two-color expression microarray analysis (Du et al., 2010). Due to more than 95% CpG sites have intensities more than 1,000 in Illumina methylation data, the α in β-value and M-value has an insignificant effect on observed results. So the relationship between β-value and M-value is shown as (with α ignored): According to the conclusions in Du et al. (2010), the M-value is more statistically valid in an analysis by modeling the distribution  Teschendorff et al. (2012).
of M-values because of it's homoscedastic. So we choose to adjust the M-values of type II probes into the distribution property of type I probes to correct the probe design bias.

MGMIN: M-value Gaussian-MIxture Normalization
Gaussian Mixture Model (GMM) has been widely applied as a clustering method in analyzing gene-expression microarray data (Yeung et al., 2001;Pan et al., 2002) and used to detect differential gene expression (McLachlan et al., 2006). In this paper, we apply GMM to distinguish different methylation states of CpG sites for further correction. The M-values of a single 450K microarray can be viewed as a finite Gaussian mixture model of several methylation states (hypomethylated-U, hemimethylated-H, hypermethylated-F). The probability density function of the M-value for a single CpG site (M i ) is defined as: where p(M i , θ ) represents the model density for M i with unknown parameter vector θ , K is the number of different methylation states (components), N(M i |µ k , σ 2 k ) is the probability density function of the kth Gaussian component, and π k is the mixing proportions which satisfy the constraint that K k=1 π k = 1 and 0 ≤ π k ≤ 1. The parameter vector θ consists of the mixing proportions π k , the mean value µ k and the standard deviation σ k , which can be estimated by the EM algorithm.
Next, we describe MGMIN in detail. First, M-values of type I and type II probes are modeled by GMM separately. Let µ S T and σ S T denote the mean value and standard deviation where S ∈ (U, H, F) and T ∈ (I, II). K I and K II are the numbers of components for type I and type II probes, which are both set as 3 by default.
Second, each probe is assigned to hypomethylated (U T ), hemimethylated (H T ), or hypermethylated (F T ) states by using the maximum probability criterion. Let U L T (U R T ) denote the U T probes with M-values smaller (larger) than µ U T , and let F L T (F R T ) represent the F T probes with M-values smaller (larger) than µ F T where T ∈ (I, II). Then, we calculate the probabilities of  U L II probes, i.e., where P represents the cumulative distribution function of the Gaussian component. These probabilities are transformed back to quantiles (M-value) by using the parameters µ U I and σ U I of type I probes, i.e., where P −1 returns the value of the inverse cumulative density function given the probability p and q is the normalized M-values for U L II . The similar operation is performed on F R II probes. Then, we merge the U R II , H II , and F L II probes into one set G on which a conformal (shift + dilation) transformation is performed. Some parameters are identified as minG = min{M G }, maxG = max{M G } and M G = maxG − minG. Similarly, the minimum value of F R II and the maximum value of U L II are also identified, i.e., minF = min{F R II } and maxU = max{U L II }. Two distance values can be calculated as where d f = M G ′ / M G is the dilation factor. So, the normalized M-values for type II probes consist of q for U L II , M G II ′ , and q for F R II .

M II
Lastly, the normalized M-values are transformed to β-values.
There are some important points to notice: (i) the initial values for µ and σ in EM algorithm are set as (−4,0,4) and (1,1,1) and small perturbations to the initial µ and σ will not affect the final model because MGMIN captures the natural property of the M-value of DNA methylation, (ii) K I will be changed to 4 automatically when µ F I − σ F I is smaller than µ F II − σ F II in order to ensure that µ F I can always be larger than µ F II and avoid the presence of an unexpected peak in transformed M-values of hypermethylated type II probes, (iii) if K I = 4, the F I will be the set of probes belonging to the component with the largest µ, while the U I contains the probes belonging to the component with the smallest µ and the other two components are assigned FIGURE 4 | The density curves of β-values for type I probes, type II probes and normalized type II probes (type II-MGMIN) for sample GSM815138 from GEO29290.
to H I , (iv) no thresholds need to be set by default or estimated by manual to distinguish the three different states of DNA methylation.

Datasets
We selected several public 450K datasets as following: Dataset 1: GSE29290 downloaded from GEO considered in Dedeurwaerder et al. (2011). We used the three replicates (GSM15136, GSM15137 and GSM15138) from the HCT116WT cell-line and matched bisulfite pyrosequencing (BPS) date for nine type II probes of sample GSM815138 (r3) ( Table 1 in Dedeurwaerder et al., 2011) to evaluate the performance of different methods.

MGMIN Needs No Default Initial Values of Parameters
Similar to the mixture model of BMIQ, MGMIN applies Gaussian mixture models for M-values instead of beta-mixture models for β-values. MGMIN also uses quantile information to correct the M-values of the type II probes into a distribution which is comparable with that of type I probes. MGMIN complies the inherent Gaussian mixture distributions for M-values of type I and type II probes to avoid setting any parameters manually, which is different from the default breakpoints in BMIQ. Thus, MGMIN needs less manual intervention than BMIQ. However, MGMIN is slightly inferior to BMIQ on some dataset (Table 1) due to the entire low quality of the dataset. Note that the PPV of BMIQ on Dataset 3 is lower than that of no normalization (RAW).

MGMIN Reduces Technical Variation
MGMIN is applied to Dataset 1 to identify the ability to improve reproducibility. The standard deviation (SD) for each probe across the three replicates was computed using no normalization (RAW), BMIQ, and MGMIN separately. As can be seen in Figure 1, both MGMIN and BMIQ almost made the density curves for the three replicates coincide with each other and reduced the technical variation significantly compared to no normalization. Compared to BMIQ, the standard deviation for type II probes adjusted by MGMIN is smaller (Figure 2). MGMIN also provided significant reduction of average absolute difference in β-values of type II probes between two samples in each of the three pairs of the three replicates (Figure 3).

MGMIN Reduces Probe Design Bias
MGMIN reduces the probe design bias via correcting the Mvalues of the type II probes such that the distribution curves for the M-values of the type I and type II probes show similar dynamic ranges and peaks (Figure 4). In Dedeurwaerder et al. (2011), the β-values for nine probes of type II by bisulfite pyrosequencing technique for sample GSM815138 (r3) were provided, which can be used as a gold-standard to evaluate the performance of different correction methods. Hence, we compared the normalized results of the nine type II probes in 450K arrays by MGMIN and BMIQ. As shown in Figure 5, although MGMIN performed slightly worse than BMIQ at the maximum value of the absolute deviation from BPS data, MGMIN significantly reduced the type II bias than BMIQ and raw data in terms of mean and root mean square error (RMSE) of the absolute deviation from the matched BPS values.

MGMIN Robustly Finds Informative Differential Methylation Probes Associated With HPV Status
The goal of a bias correction approach is to reduce the technical variation and identify the biological informative signals at the same time. We used a strategy similar to Teschendorff et al. (2012) to compare the result between MGMIN and BMIQ in identifying the differential methylation probes (DMPs) associated with HPV status. First, Dataset 2 consisting of two HPV+ and three HPV− fresh frozen HNC samples were used as the training set to obtain the DMPs associated with HPV status by the limma method (Smyth, 2005) and an FDR threshold 0.35 which was as same as (Teschendorff et al., 2012). Both Dataset 3 and Dataset 4 described in the methods section were used as test set. We reanalyzed Dataset 2 and got similar numbers of DMPs to those reported in Teschendorff et al. (2012) with no normalization method (Raw) or BMIQ method (shown in Table 1). The results in Table 1 shows that the positive predictive value (PPV) of MGMIN is slightly less than BMIQ in terms of GSE38266 (Dataset 3) whereas MGMIN outperforms BMIQ in GSE95036 (Dataset 4). The reason for MGMIN slightly inferior to BMIQ in Dataset 3 may be the entire low quality of the dataset (see Figure 6) which is that the ratio of samples passing filters is <0.9 (r = 0.88) under the least restrictive condition. Let τ p represent the p-value threshold for bad probes and τ r represent the threshold for the ratio of bad probes in a sample. The maximum value of τ r is set to 0.3 here in our opinion because a sample with more than 30% bad probes is vulnerable. We can get the same test dataset from GSE38266 with the one described in Teschendorff et al. (2012) which consists of 18 HPV+ and 14 HPV− samples under the following conditions: (i) τ p = 1e − 4 or 1e − 3 and τ r = 0.2 or 0.25, (ii) τ p = 1e − 2 and τ r = 0.1 or 0.15. Overall, MGMIN identified more true positive features than BMIQ.

DISCUSSIONS
In this paper, we have proposed a method called MGMIN for correcting the probe design bias of type II probes in Illumina Infinium 450K BeadChips, which can reduce the technical variation and improve the ability to find biologically differential methylation signals. We have shown that MGMIN outperforms BMIQ on multiple evaluation datasets in correcting the type II design bias and improving the data quality.
Similar to BMIQ, MGMIN uses quantile information to correct the M-values of type II probes while leaving the Mvalues of type I probes unchanged. The three-state beta-mixture distribution model in BMIQ sets two default breakpoints (0.2, 0.75) to divide the β-values into three classes: hypomethylated, hemimethylated, and hypermethylated, which works well for most cases. However, the result curves of BMIQ show obviously inconsistent in some samples with high heterogeneity. We set 3 or 4 classes for probes depending on the result of µ F T − σ F T to ensure that the fitted hypermethylated component of type II probes can be located in the left of the hypermethylated component of type I probes, which can partly eliminate the effects of the heterogeneity of samples.
Based on the results of Dataset 3, we think the high quality of dataset is the base of normalization, in other words, there is no meaning to correct the samples with low quality. It should be pointed out that the parameter estimation of MGMIN is slower than that of BMIQ (about 1.5 times), which can be relieved by reducing the number of iterations.
MGMIN can be used in the 450K methylation data preprocessing with other methods to normalize the values of the two type probes and improve the data quality.

DATA AVAILABILITY STATEMENT
The datasets for this study can be found in GEO: GSE29290, GSE38268, GSE38266, and GSE95036.