Variance-Preserving Estimation of Intensity Values Obtained From Omics Experiments.

Faced with the lack of reliability and reproducibility in omics studies, more careful and robust methods are needed to overcome the existing challenges in the multi-omics analysis. In conventional omics data analysis, signal intensity values (denoted by M and values) are estimated neglecting pixel-level uncertainties, which may reflect noise and systematic artifacts. For example, intensity values from two-color microarray data are estimated by taking the mean or median of the pixel intensities within the spot and then subjected to a within-slide normalization by LOWESS. Thus, focusing on estimation and normalization of gene expression profiles, we propose a spot quantification method that takes into account pixel-level variability. Also, to preserve relevant variation that may be removed in LOWESS normalization with poorly chosen parameters, we propose a parameter selection method that is parsimonious and considers intrinsic characteristics of microarray data, such as heteroskedasticity. The usefulness of the proposed methods is illustrated by an application to real intestinal metaplasia data. Compared with the conventional approaches, the analysis is more robust and conservative, identifying fewer but more reliable differentially expressed genes. Also, the variability preservation allowed the identification of new differentially expressed genes. Using the proposed approach, we have identified differentially expressed genes involved in pathways in cancer and confirmed some molecular markers already reported in the literature.


INTRODUCTION
The growing number of omics datasets (e.g., genomics, transcriptomics, proteomics, metabolomics) and the recent advances in multi-omics integration approaches have contributed to the better understanding of biological mechanisms and also the emergence of the personalized medicine. However, the lack of reliability and reproducibility in omics studies stands as one of the biggest obstacles in bridging the gap between research and practice of personalized medicine (Alyass et al., 2015;Karczewski and Snyder, 2018). Considering that inflated variability and non-robust estimation may lead to inaccurate and misleading results, this paper proposes improvements to the conventional estimation and normalization of the intensity values obtained from omics experiments. Specifically, the proposal is to estimate the intensity values by a method that accounts for the variability due to pixel-level uncertainties and to normalize these values by using LOWESS with suitably selected parameter values, preserving variation that may be relevant to subsequent analyses.
Image processing and fluorescence analysis are the preferred approaches for data quantification in microarray technologies. Although microarrays have been predominantly used since the end of the nineties to measure gene expression levels, they remain widely used to detect other omics data types, including microRNA expression, DNA methylation, single-nucleotide polymorphisms (SNPs), and copy number variants (CNVs) (Goodwin et al., 2016). After hybridization and cleaning of the target molecules, the array is scanned by activation with lasers at different wavelengths (one for each of the fluorophores used), and each laser channel generates an image. The pixel intensities within each spot in these microarray images are summarized to represent the hybridization signal. Depending on the platform (e.g., gene expression array, DNA methylation array, SNP array, and comparative genomic hybridization [CGH] array), the interpretation of this signal is different (e.g., gene expression levels, methylation levels, allele frequencies, and copy number alterations).
The continuance of the microarray technology can be mainly explained by the availability of many datasets in public repositories, such as the Gene Expression Omnibus (GEO) (Edgar et al., 2002;Barrett et al., 2012) and ArrayExpress (Kolesnikov et al., 2015), by the existence of well-established strategies for data analysis and experimental design, and by the low cost compared with the next-generation sequencing technologies. However, given that microarray analysis is still facing reliability and reproducibility problems, more robust and rigorous methods are needed to account for the high variability and biases introduced in all steps of a microarray experiment.
Several preprocessing and normalization procedures have been proposed to remove biases due to the inhomogeneity of the background and the different fluorescence properties of the dyes. However, biases introduced in the image analysis step, which includes spot segmentation and signal extraction, have not received the same attention, and those may partially explain the existing reliability and reproducibility problems in omics studies. Particularly, several factors, including image resolution, scanner settings, effectiveness of the segmentation algorithm, and unexpected behaviors during hybridization, may lead to errors in spot localization and classification of the pixels (as foreground or background, depending on whether it is situated within or around the spot). Thus, spot intensities are usually noisy and that high pixel-level variability leads to uncertainty in microarray quantification and correlates with variability between replicate spots on duplicate slides (Brown et al., 2001).
Given that even state-of-art image processing tools are susceptible to errors that significantly influence the variability of the data derived from microarray images (Ahmed et al., 2004), new segmentation and intensity extraction algorithms are still being developed in order to improve precision in spot quantification (Li et al., 2017;Karthik and Manjunath, 2018;Shao et al., 2019). Usually, these tools combine sophisticated algorithms and pixellevel analyses in order to obtain an accurate estimate of the signal intensity in each spot. However, to allow subsequent analyses to take into account possible errors and uncertainties arising from the image processing, the method output usually includes not only statistical measures of location (e.g., mean and median) of the foreground and background intensities of each channel of each spot but also measures of dispersion, including standard deviation and covariance between both channels.
Despite the common use of pixel-level variability measures as data quality criteria for filtering purpose, the conventional microarray analysis is solely based on statistical measures of location of the spot intensities Sun et al., 2011;Brady and Vermeesch, 2012). To improve robustness and reliability in microarray analysis, pixel-level uncertainties should be accounted for in the intensity log-ratio estimation and propagated to the next steps of the analysis.
Pixel-level uncertainties have been taken into account by many spot quantification algorithms in the literature, but requiring all pixel values to be available. Some of them are interested in improving the log-ratio estimator. Particularly, the method proposed by (Dodd et al., 2004) is a log-ratio estimator that corrects for signal saturation by regressing all pixel intensities at both test and control channels using a censored regression model. The META algorithm (Chan and Chang, 2009) estimates the intensity log-ratio by grouping the pixels according to their distance to the center of the spot and then weighting the log-ratio of each group in inverse proportion to its sample variance. A method that only uses pixel-level mean and variance summary statistics is the hierarchical maximumlikelihood estimator (Bakewell and Wit, 2005). However, it is not exactly based on the standard log-ratio representation of the spot intensity. It models the gene expression signal at control and treatment channels separately, incorporating the sample within-spot deviation and then performs the estimation using maximum likelihood. To the best of our knowledge, there is no intensity log-ratio estimator to be used after the image analysis phase (i.e., based solely on the pixel-level summary statistics) that takes into account pixel-level uncertainties.
The first contribution of this paper is a more robust estimator for the intensity log-ratio (M) and average log intensity (A) of a microarray spot that accounts for pixel-level variance and covariance between channels. For a spot t, these values are denoted by M t and A t , respectively . We derive these estimators by using the multivariate delta method (Casella and Berger, 1990). Specifically, we approximate the expected values of M t and A t by using their second-order Taylor's expansions, and the variance of M t and A t by using their firstorder Taylor's expansions. These expansions depend on the pixel-level variance and covariance between channels of the spot, whose sample estimates are readily accessible through standard output files of microarray image analysis tools.
After spot intensity estimation, it is necessary to perform a within-slide normalization to remove array-specific effects, intensity-dependent dye biases, and other systematic trends of the microarray data. The within-slide normalization based on the robust locally weighted regression (LOWESS) (Cleveland, 1979) is one of the most used techniques. The choice of the LOWESS parameters, particularly the smoothing parameter (also known as neighborhood size or bandwidth), dramatically affects the intensity and quality of the microarray data calibration. Although the smoothing parameter is still commonly set arbitrarily (around 0.2 and 0.4) Smyth and Speed, 2003;Drăghici, 2012), some data-driven methods have been proposed to select its optimal value (Berger et al., 2004;Futschik and Crompton, 2004a;Lee et al., 2008). All these methods are similar in that they choose the smoothing parameter by minimizing a measure of error of the LOWESS fit. Berger et al. (2004) use the mean-squared difference between the LOWESS estimates and the corresponding normalization reference levels as cost function. These normalization levels are the true spot-specific calibration errors, which are usually unknown. Thus, Berger et al. suggest to estimate them from control transcripts and replicate slides. However, they are not always available for all genes in a typical microarray experiment, making it hard to reliably use the method. Futschik and Crompton's selection method, named OLIN (Futschik and Crompton, 2004a;Futschik and Crompton, 2004b), has the advantage of not relying on a reference level. Its optimization procedures use the generalized cross-validation (GCV) criterion, an estimator of the prediction mean square error (PMSE), as cost function. Lee et al. (2008) proposes to select the smoothing parameter by minimizing the bootstrap estimate of the mean integrated square error (MISE) and show that their results are comparable to OLIN.
Although all these methods have shown superiority over LOWESS normalization with a fixed arbitrarily chosen smoothing parameter, they lack in taking into account any heteroskedasticity in the data. In addition, they usually suffer from a poor bias-variance trade-off, tending to choose small smoothing values, which yield unnecessarily complicated (with high variance) LOWESS fits.
The second contribution of this paper is a data-driven method for selecting the smoothing parameter of the LOWESS normalization process. Inspired by the previous proposed methods, we choose the optimal smoothing value by minimizing a mean squared error criterion. However, our selection method also takes into account heteroskedasticity of the microarray data and offers a better bias-variance trade-off by selecting from among the low-MSE fits the one that is the most parsimonious. The parameter selection is obtained by solving a discrete optimization problem and is based on conventionally accepted ideas for analysis of M-plots-a graphical tool showing the curve of the MSE against the effective degrees of freedom of the estimate .
Given that the primary application of DNA microarrays has been to measure gene expression levels, we focus in this paper on variation-preserving estimation and normalization methods for gene expression levels from two-channel (or two-color) microarrays. However, it is straightforward to adapt the same ideas to improve analysis of other types of microarray data, even from single-channel technologies.
The proposed methods were evaluated by a differential gene expression analysis from real intestinal metaplasia and normal microarray samples. The proposed estimators for the M t and A t values were compared with the conventional estimators that neglect the pixel-level variability. In addition, we compared the proposed method for selecting the LOWESS smoothing parameter with OLIN, as it is conceptually similar to the other existing methods and can be applied even to microarray experiments with few or no replicates. Results show that a more robust and conservative analysis is performed when the LOWESS smoothing parameter is selected by our method, potentially reducing the number of false-positive differential expressions. Besides, both the pixel-level variabilities incorporated by the proposed estimators for the M t and A t values and the variability preserved by our more parsimonious normalization method contributed to the identification of new differentially expressed genes. Thus, the proposed methods may also reduce the falsenegative rate.

MATERIALS AND METHODS
Two procedures that critically affect the adequacy of microarray data analysis are the spot quantification, which extracts summarized quantitative measures of the pixel intensities within each spot of the microarray slide, and the within-slide normalization, which removes dye-specific biases and other systematic noises simultaneously from all logged spot intensities (M t and A t values).
In the section Intestinal Metaplasia Database, we describe a gene expression dataset used to illustrate the application of our proposed methods. In the section Improved Estimators for the M t and A t values, we show our improved estimation method for the M t and A t values that incorporates pixel-level variability. In the section Estimators for the Variances of the M t and A t Values, we discuss some criteria that can be used for proper setting of the parameters of the LOWESS within-slide normalization and we propose an algorithm for selecting the optimal value for the smoothing parameter (denoted by f).

Intestinal Metaplasia Database
Due to a chronic inflammatory process, the normal squamous mucosa of the stomach may be replaced by columnar intestinaltype epithelium, characterizing a disease called intestinal metaplasia of the stomach. Since adenocarcinoma of the stomach and inflamed intestinal mucosa are strongly associated (Coussens and Werb, 2002), intestinal metaplasia may be a significant risk factor for gastric cancer.
We analyzed data from a two-color microarray experiment with tissues samples from 90 different subjects, being 35 from tissues representing type II intestinal metaplasia and 55 from tissues representing the normal condition, obtained from the Tumor Bank at A.C. Camargo Cancer Center/Antonio Prudente Foundation.
It was used the standard reference design (Churchill, 2002), in which each sample is hybridized against a pool of normal tissues using the same orientation of dye labeling. Gene expression levels were measured on Agilent Whole Human Genome Microarrays 4x44K G4112F (design ID 014850), each slide containing 41,093 unique probes. The scanned images of the microarray slides were processed by Agilent Feature Extraction software, version 9.5, where statistics (mean, standard deviation, and covariance) of the foreground and local background pixels were computed for each spot, in both test and reference channels. Each microarray spot contains about 60 foreground pixels.
This study was carried out in accordance with the recommendations of the international guidelines for investigations involving human beings with written informed consent from all subjects. All subjects gave written informed consent in accordance with the Declaration of Helsinki. The protocol was approved by the Ethics Institutional Committee of the A.C. Camargo Cancer Center (process number 1023/07).

Improved Estimators for the M t and A t Values
Usually, in microarray analysis, the test channel is denoted by (red), and the reference channel is denoted by G (green), following this usual notation, denoted by R tj and by G tj , the intensity value of the jth pixel within the th spot, respectively, in the test and reference channel. The relative expression of pixel j within spot is denoted by M tj and defined as follows: The average expression of pixel within spot is denoted by A tj and defined as follows: Usually, image analysis software does not provide all pixel intensity values within each spot. Nonetheless, it provides several descriptive statistics of the foreground and background pixel intensities, including sample estimates for the mean, median, variance, and covariance between the two channels.
To incorporate the pixel-level variability in the analysis, we derived an approximation of the expected values of M tj and A tj by using the multivariate delta method (Casella and Berger, 1990). Assuming that the functions (1) and (2)  The derivation is presented in Appendix 4. It is reasonable to assume that the variables R tj , G tj , M tj and A tj have a distribution with well-defined mean and variance. Particularly, Hoyle et al. (Hoyle et al., 2002) empirically showed that the distribution of the pixels within a spot is heavy-tailed (a non-Gaussian distribution) and well-approximated by a log-normal distribution. Consequently, M tj and A tj follow a distribution which is well-approximated by a Gaussian distribution and all the variables have at least the first and second moments finite.
Let R tc and G tc be non-zero estimates of, respectively, ( ) R tj and ( ) G tj , which represent average foreground signals after correction for removing the background influence. The subscript indicates dependence on the background correction. Also, let ˆ( ) σ 2 R t and ˆ( ) σ 2 G t be estimates of, respectively, Var (R tj ) and Var (G tj ), which are assumed to be independent of the background correction. Note that mean and variance estimates are calculated across observed foreground pixel intensities within the spot at the respective channel.
We can derive improved estimators for ( ) M tj and ( ) A tj as follows: Note that the conventional estimators for the M tj and A tj values, given by are approximations of, respectively, ( ) M tj and ( ) A tj derived from only the zeroth-order Taylor's expansion of the functions that define M tj and A tj . Thus, the conventional estimators ignore the known measures of pixel-variability, which represent uncertainties in the gene expression measurements. Figure 1 illustrates the differences between the estimators for the ( ) M tj and ( ) A tj for a randomly chosen microarray slide of the database described in the section Intestinal Metaplasia Database. Since these estimators may suffer from numerical instability if the corrected foreground signals, R tc and G tc , are very close to zero, we removed the background influence by applying the normexp method (Ritchie et al., 2007) with offset equals to 50. The top 20 spots with the highest pixel-level variability are highlighted in red plus symbols. Several of these spots have low average intensity (small estimates for ( ) A tj ) and a small difference between the intensities of the two channels (estimates for ( ) M tj close to zero), but they are not the majority. The differences between the proposed estimators, defined in Eq. (3) and (4), and the conventional estimators, defined in Eq. (5) and (6), are shown in Figures 1C, D. These differences are due to the distinct parts between their respective formulas. When computing the  M j estimates, the ratio of the pixel-level variability to the squared expected value in the test channel appears in Eq. (3) with an opposite sign to the same term in the reference channel. Thus, positive and negative differences between the estimates for ( ) M tj may occur if such terms do not cancel each other out. Figure 1C shows the ilde  M t estimates were smaller than the M t estimates for the genes with highest pixel-level variance, indicating a larger variance in their test channels. Figure 1D shows some  A t estimates were smaller than the Â t estimates. The reduction is explained by the fact that the additional terms in Eq. (4) are negative for any positive pixellevel variability in any channel.

Estimators for the Variances of the M t and A t Values
Since we have also available the sample covariance between R tj and G tj , denoted by ˆ( , ) σ R G t t , we applied the multivariate delta method for deriving estimators for the variances of the M tj and A tj . We calculated the variance of the first order Taylor's expansion of the functions (1) and (2) that define, respectively, M tj and A tj , as shown in Appendix 5. The variance estimators for M tj and A tj , for pixels j within spot t are: The variances of M tj and represent pixel-level uncertainties of the th spot. They can be used, for instance, for assessing the quality of the th spot or for constructing confidence intervals for the parameters ( ) M tj and ( ) A tj .

Optimal Selection of the LOWESS Parameters
To simplify the notation, we will denote the estimates for ( ) M tj and ( ) A tj , independently of the estimation method used, by,

respectively, M t and A t values.
It is necessary to remove from these M tj intensity values the dependent dye-specific biases and other systematic errors by using some within-slide normalization method.
In the LOWESS within-slide normalization method, one estimates for each microarray slide a smoothing function μ that maps each A t observed value to a smoothed M t value, ˆ( ) µ A t . Since ˆ( ) µ A t is considered an estimate of a dye-dependent bias, it must be subtracted from the corresponding observed M t value to obtain a residual value representing, presumably, the biologically relevant gene expression level.
An appropriate LOWESS estimation depends on the choice of its parameters. According to loader (Loader, 1999), the FIGURE 1 | Comparison between conventional and proposed estimation methods for the M t and A t values for the microarray slide with ID 251485069395_1.4. The M-A plots in (A) and (B) were obtained by using, respectively, the conventional and improved estimators for the M t and A t values. Plots (C) and (D) show the conventional against the improved estimates for, respectively, the M t and A t values. Top 20 genes with the highest pixel-level variance are highlighted in red plus symbols. The test channel contains RNA samples of normal gastric mucosa, and the control channel contains samples from a common reference. The background influence was removed from the foreground signals by the normexp method with offset.
weight function and the number of iterations of the robustness algorithm are not critical parameters. Cleveland (Cleveland, 1979) comments that good choices for these parameters are, respectively, the tricube function and three iterations. However, the degree of the local polynomials and the smoothing parameter f, which, in the nearest neighbor method, is a number between and indicating the proportion of data used in each local fit, affects the bias and the variance of the fit.
Specifically, the higher the degree of the local polynomial (related to the complexity of the model), the lower the bias of the fit (probably, fitting the data very well). However, the additional parameters of this more complex model increase the variance of the fitted values, yielding a poor generalization ability (i.e., the model will have a large error). Thus, to avoid unstable LOWESS estimates, several references as (Loader, 1999;Yang et al., 2001;Dudoit et al., 2002;Smyth and Speed, 2003) recommend using local polynomials of degree one, mainly in the presence of sparsity, as is the case of microarray data.
The effects of the smoothing parameter f on the bias and variance of the fit are opposite to those of the degree of the local polynomials. Since the f parameter indicates the number of observations that will be used in the local polynomial estimation, when f value is large, a simple polynomial may not fit well to all observations in the neighborhood, distorting or ignoring essential features. In other words, the estimation of the smoothing function can be significantly biased. On the other hand, when a low f value is chosen, the number of observations may be insufficient to capture the general behavior of the data, resulting in a very noisy (large variance) fitness function.
In the next section, we propose a method for selecting a value for the f parameter, focusing on microarray data normalization. Our method takes into account the intrinsic characteristics of the bias and variance of the fit as well as of gene expression data.

Lowess Smoothing Parameter Selection
For microarray data normalization, the ideal LOWESS fitted curve captures only trends and effects from systematic errors, retaining all biological variation. However, it critically depends on the choice of the f parameter value. Figure 2 illustrates the MA plot of the microarray slide shown in Figure 1B, with different LOWESS fits yielded by f values varying from 0.05 to 0.9. The improved estimation method was used to obtain the M t and A t values, that is, the  M t and  A t estimates.
The quality of a LOWESS estimator can be assessed by the MSE, which measures how close the estimator μ is of the true mean function μ : Since the real curve μ is unknown, we need a criterion to evaluate the MSE. Under the assumption of heteroskedasticity, Cleveland and Devlin  propose the Mallows' Cp criterion for local fitting that can be used as as MSE estimator. In the presence of heteroskedasticity, as usual for microarray data, the heteroskedasticity-robust Cp (HRCp) criterion, proposed by Liu and Okui (Liu and Okui, 2013), may be a more appropriate MSE estimator. We detail this MSE estimator next. Considering within-slide data points, the evaluation of the LOWESS smoothing function on any point is given by a linear combination of the observed points, whose are assigned according to the distance of A to the A t observed points: Consider the T × T matrix L which maps the observed to the fitted values: Supposing that the variance of M t , across T spots of a microarray slide, is constant and equals to σ 2 , the Mallows' Cp for local fitting is defined as: et al. (1988) shows that σ 2 can be estimated as follows: When heteroskedasticity is present, Mallows' Cp criterion is not an appropriate MSE estimator. Considering the T × T diagonal matrix Σ, whose th diagonal element is given by a nonhomogeneous variance σ t 2 of M t , a robust MSE estimation can be achieved by using the HRCp criterion, defined as: According to Loader (1999), σ t 2 can be estimated locally by calculating the error variance (the residual sum of squares divided by the corresponding degrees of freedom) of a nearly unbiased LOWESS fit, which can be yielded using a very small value for the smoothing parameter (e.g., f = 0.1. Since the local variance estimates can be very noisy, it may be appropriate to smooth them using a gamma kernel. Several authors suggest to choose the f value which minimizes a measure of error of the LOWESS fit, such as the MSE criterion (Berger et al., 2004;Futschik and Crompton, 2004a;Lee et al., 2008). However, other authors (Mallows, 1973;Loader, 1999) argue that a selection based only on minimizing the MSE criterion is a poor procedure since it ignores the intrinsic information of the bias and variance of the fit. Therefore, following their suggestion, we propose a method based on a graphical tool called M-plot. It is a graph of the MSE estimate as a function of the effective degrees of freedom of the fit.
M-plots illustrating the f parameter selection method for a typical microarray slide (ID 251485069395_1.4) are shown in Figure 3. Dots show MSE estimates (by HRCp criterion) and respective degrees of freedom (by v 2 definition) of LOWESS fits (on the M t and Â t estimates, in the first M-plot, and on the  M t and  A t , in the second M-plot) obtained with f parameter varying from to 0.2 We fixed the other LOWESS parameters (local polynomials of degree one, tricube weight function, and three iterations) so that the M-plot curve shows only the effect of the f parameter on the bias-variance compromise. Large f values tend to yield simple fits (with fewer degrees of freedom), which have a small variance, but a large bias. On the other hand, minimal f values tend to yield complex fits (with many degrees of freedom), which have a small bias, but a large variance.
For the microarray slide in Figure 3, a selection method based only on the minimization of the MSE curve would choose the smallest evaluated f value (0.2). However, any f value within the flattening region near to the minimum (the region with light-colored dots) is a good choice, in the sense that it yields a low-MSE fit Loader, 1999). Depending on the type of application, we can choose between one value which yields a low-bias fit (with more degrees of freedom) or a low-variance fit (with fewer degrees of freedom). Since we want to estimate a natural phenomenon behavior, we propose to select from the flattening region the f value which yields the simplest LOWESS fit (the one with fewest effective degrees of freedom). The biggest dot in each M-plot indicates the selected f value. The detection of the flattening region is made by searching points for which the derivative of the MSE curve is small. We check for each sequence of three points near the minimum whether the difference between the MSE values is small. If so, these points are considered as belonging to the flattening region.
The f parameter selection method can be summarized in the following discrete and constrained optimization problem. Consider a sequence of l different values for f, {f 1 , f 2 , ... , f l }, and denoted by μ f k , the LOWESS fit yielded by using the value f k for the f parameter. Also, let: If the minimum of the M-plot curve is far away of the point corresponding to the second lowest MSE estimate, the previous problem has no solution. In that case, the f value that yields the fit with lowest MSE estimate is selected. Specifically, the f parameter value is selected by solving the following problem: . HRCp such that µ µ ∈

APPLICATION ON INTESTINAL METAPLASIA DATA
To investigate the effects of the proposed methods, we preprocessed the data described in the section Intestinal Metaplasia Database by using all discussed methods and compared the identified differentially expressed genes. First, we applied the normexp method with offset value of for removing the background influence. Then, we compute the M t and A t values both by the conventional estimation methods, defined in Eq. (5) and (6), and by the proposed estimation methods, defined in Eq. (3) and (4). The LOWESS withinslide normalization was carried out as discussed in the section Optimal Selection of the LOWESS Parameters. For comparison purpose, the f smoothing parameter was selected both by the OLIN method (considered by us as a conventional approach) and by the proposed method, discussed in the section LOWESS Smoothing Parameter Selection. Since data from all microarray slides present overdispersion, we used the HRCp criterion as cost function of our selection method.
Therefore, the following four preprocessing procedures were applied separately to the original data: 1. Conventional estimation for M t and A t and LOWESS withinslide normalization using f parameter selected by OLIN; 2. Improved estimation of M t and A t and LOWESS within-slide normalization using f parameter selected by OLIN; 3. Conventional estimation of M t and A t and LOWESS withinslide normalization using f parameter selected by the proposed method; 4. Improved estimation of M t and A t and LOWESS within-slide normalization using parameter selected by the proposed method. Figure 4 shows the distribution of the optimal values for the LOWESS f parameter, according to the proposed selection method with HRCp criterion, for the entire database, separated by normal and intestinal metaplasia conditions (both, hybridized against a pool of normal tissues). In the first plot, the LOWESS curve was fitted on the M t and Â t estimates and, in the second plot, on the  M t and  A t estimates. The average of the selected f values was close to 0.5.
As expected from a method that neither takes into account heteroskedasticity of the data nor attempts to make a good balance between bias and variance, the OLIN method selected the smallest evaluated value (0.2) for most of the slides. Same results were obtained when the M t and A t values were estimated by the conventional and by the proposed estimator. Such behavior has been reported in the literature, implying that the optimal f values according OLIN are usually close to the default one (Chiogna et al., 2009).
After preprocessing the data, a two-sample t-test assuming unequal variance was performed for each spotted gene to determine whether its expression is statistically different between gastric tissues in normal and intestinal metaplasia groups. However, since we are interested in directly assessing the impact of each proposed method on the t-statistics and p-values rather than making inference about differential expression, the comparative study was performed before applying a multiple testing correction.

Comparison of the Results
Results of a pairwise comparison among the p-values and t-statistics obtained by the four preprocessing methods are shown in Figure 5. In the left-column plots, we compare the p-values and, in the right-column plots, we show the changes in the difference between the group means (the absolute value of the t-statistic numerator) and in the within-group variability (the denominator of the t-statistic). Only genes with p-value less than 5% were considered. September 2019 | Volume 10 | Article 855 Frontiers in Genetics | www.frontiersin.org The left-column plots show that most of the points are distributed around the 45-degree line. Thus, the p-values and, consequently, the total number of differentially expressed genes, even at a lower significance level, were similar among the four methods.
The first-and second-row plots show how p-values and t-statistics were affected by estimating the M t and A t values with the proposed method, which takes into account the pixellevel uncertainties. The genes represented by blue plus signs were identified as differentially expressed only when using the proposed estimator for the M t and A t values.
The genes represented by green crosses were identified as differentially expressed only when using the conventional estimator for the M t and A t values.
When the LOWESS f parameter is selected by OLIN (first-row plots), it is clear that the within-group variability decreases when using the proposed estimators for the M t and A t values. When the LOWESS parameter is selected by our method (second-row plots), there is still a reduction in the within-group variability. However, this impact is less clear because of the variability introduced when the LOWESS f parameter is selected by our method.
The third-and fourth-row plots compare p-values and t-statistics obtained by OLIN and the proposed approach for selecting the LOWESS f parameter. The genes represented by blue plus signs were identified as differentially expressed only when f was selected by the proposed method. The genes represented by green crosses were identified as differentially expressed only when selecting f by OLIN. It is clear that, for most genes, both within-group variabilities increased, implying that the normalization procedure was more conservative, and thus, more potentially relevant information is retained. In addition, for many genes, the increase in the within-group variability was counterbalanced by an increase in the distance between the groups. Such effect is even most pronounced when the proposed estimator for the M t and A t values are used. Thus, their respective p-values reduced enough to consider them as differentially expressed genes.
The diagrams in Figure 6 show a comparison of the methods with respect to the total number of genes with p-value less than 5%. On the left, the p-values were not corrected for multiple tests, while on the right, the p-values were adjusted by the false discovery rate (FDR) correction (Benjamini and Hochberg, 1995).
Note that the four methodologies are quite different in terms of which genes were identified as differentially expressed. As a consequence of the more conservative (milder) noise reduction performed in the LOWESS withinslide normalization procedure with f parameter selected by our method, fewer genes are identified as differentially expressed. However, regardless of the normalization method, more genes could be identified as differentially expressed when the M t and A t values were estimated by the proposed estimation method that incorporates pixel-level variability. Given that both proposed methods make the analysis more robust by incorporating and preserving information neglected by the conventional methods, we can argue that they are contributing to the reduction of both false-positive and falsenegative rates.

Validation Analysis
To check the consistency of our analysis, we compared our results with those reported in the literature. Out of the genes which are associated with intestinal metaplasia according to the Gene Expression Omnibus platform (Edgar et al., 2002) of the NCBI (National Center for Biotechnology Information) website, 80 spotted genes (corresponding to 63 unique genes) have p-value (before FDR correction) less than 5%, and 35 spotted genes (corresponding to 29 unique genes) have p-value (after FDR correction) less than 5%. These findings are summarized respectively in Tables 1, 2. In addition, Figure 7 compares the total number of validated genes identified by each method with p-value less than 5% (before FDR correction).
Greater differences in inference were observed among the genes whose p-value is close to the significance level. These genes have a more subtle differential expression, which can be easily damaged by measurement errors and poor estimation and normalization methods. Thus, the more accurate and careful analysis provided by the proposed methods is especially important for making decisions on the differential expression of these more sensitive genes.
Two replicates of the HSPB1 gene could not be identified as differentially expressed when using both the conventional estimators for the M t and A t values and our selection method for the LOWESS f parameter. Thus, the estimation of the M t and A t values by the proposed estimators was crucial in determining the differential expression of the HSPB1 gene.
The genes PTEN, CTNNB1, MLH1, CXCR4, and CXCR1 could only be identified as differentially expressed when the LOWESS parameter was selected by our proposed method. Particularly, the gene CXCR4 only was determined as differentially expressed when the improved estimators for the M t and A t values were also used. In contrast, the gene KRT14 was no longer identified as differentially expressed when the LOWESS f parameter was selected by our proposed method. In the following, we briefly describe the association of those genes with intestinal metaplasia of the stomach according to the literature data: • HSPB1 (heat-shock protein beta-1, also known as HSP27heat-shock protein 27): It has a protective role against stress-induced cell damage, and its expression has been considered critical for mucosal protection in the stomach (Ebert et al., 2005). Also, it has been reported as down-regulated in esophageal adenocarcinoma (Lv et al., 2019). • PTEN (phosphatase and tensin homolog): It has been identified as overexpressed in intestinal metaplasia and is a known marker for tumorigenesis and progression of gastric carcinoma (Yang et al., 2003). • CTNNB1 (beta-catenin 1): It is a canonical oncogene that has been identified as overexpressed in intestinal metaplasia and gastric adenocarcinomas (Werner et al., 2001;Huang et al., 2018). • MLH1 (mutL homolog 1): Its expression has been reported as absent or downregulated in intestinal metaplasia, dysplasia, and gastric cancers (Takeda et al., 2012;Hu et al., 2018). • CXCR4 (chemokine receptor type 4): Its expression has been associated with the staging of gastric cancer, being reduced in the majority of gastrointestinal tumors and significantly higher in patients with advanced stages of gastric cancer (Shibuta et al., 1997;Hannelien et al., 2012;Nikzaban et al., 2014). • CXCR1 (C-X-C motif chemokine receptor 1): It has been reported to be strongly expressed in gastric carcinoma (Eck et al., 2003;Hannelien et al., 2012). • KRT14 (keratin 14): It is a squamous cell marker that is downregulated by CDX2 transfection (Liu et al., 2007). In addition, although it has been determined as significantly overexpressed in intestinal metaplasia by our analysis when the parameter was selected by OLIN, it has been reported as down-regulated in esophageal adenocarcinoma when compared to normal esophagus (Lv et al., 2019).

Genes Involved in Cancer
By performing a gene enrichment analysis, we identified, at a significance level of 5% (after FDR correction), 31 differentially expressed genes that are involved in cancer. Their respective p-values and fold changes are shown in Table 3. We remark that their association with intestinal metaplasia has not been clearly demonstrated yet. Thus, further investigation has to be done to confirm such conclusions. Particularly, two replicates of the CCND1 gene and the LAMB2 gene were identified as differentially expressed only by the conventional approaches, suggesting that they may be false positives. Next, we briefly describe their association with cancer: • CCND1 (cyclin D1): In contrast to its underexpression identified by the conventional analyses, it has been frequently reported as overexpressed in intestinal metaplasia, human neoplasias, and several tumors (Hosokawa and Arnold, 1998;Franchi et al., 2015). • LAMB2 (laminin subunit beta 2): Although its expression has been associated with some carcinomas, ts expression is tightly regulated in normal human tissues and in disease (Wewer et al., 1994;Ljubimova et al., 2006).

DISCUSSIONS
Faced with the growing trend of multi-omics data integration in the midst of a replication crisis, improved microarray data analyses are crucial to identifying more reliable results (Ritchie et al., 2015a).

12
September 2019 | Volume 10 | Article 855 Frontiers in Genetics | www.frontiersin.org TABLE 1 | Genes reported in the literature as associated with intestinal metaplasia of the stomach that were identified as differentially expressed in our analysis at a significance level of 5% (after FDR correction). Given that several pixel-level summary statistics are readily available in microarray databases, but are usually discarded in conventional approaches, we propose an improved estimation method for the M t and A t values, which takes into account the pixel-level variability. Specifically, we applied the multivariate delta method to derive estimators for the expected values of M t and A t , considering their Taylor's expansion up to the second-order terms. The conventional estimators, nonetheless, approximate the expected values considering only the zeroth-order term. Since the functions that define M t and A t are analytic (they are combinations of logarithmic function through addition or subtraction), the higher the number of terms of the Taylor expansion, the better the approximation of the function. Thus, we expect that the proposed estimators provide a better quantification of the hybridization signal. Also, by using these improved estimators, pixel-level dispersion can play an essential role in the analysis, increasing reliability.
To minimize the propagation of errors, the M t and A t values have to be properly normalized. Thus, we also propose a method for selecting the LOWESS smoothing parameter f that provides an optimal bias-variance compromise, considering some specific characteristics of microarray experiments, such as heteroskedasticity. This optimal normalization method leads to a more parsimonious correction of the systematic biases and, consequently, to greater preservation of the biological variation of interest.
By using the proposed methods, more variability information is considered and retained, improving inferences and preventing false conclusions. Thus, we expect to perform a more conservative analysis, where possibly fewer but more reliable differentially expressed genes are identified. In other words, we expect a reduction in both the false-positive and false-negative error rates.
Besides the theoretical support, relevant empirical observations could be drawn by a comparative study between the methods using real intestinal metaplasia microarray data. The results shows that inferences on differential gene expression were moderately affected by the incorporation of the pixel-level variability in the estimation of the M t and A t values and significantly affected by the LOWESS within-slide normalization using a smoothing parameter selected by the method. Both proposed methods tend to increase the within-group variability (the denominator of the t-statistic). However, for many genes, such increase occurred along with an increase in the difference between the group means (the absolute value of the t-statistic numerator), significantly reducing their respective p-values. Thus, many genes were identified as differentially expressed only when the proposed methods were used and some of them have been validated by other studies.
It is important to remark that most of the genes reported in the literature as differentially expressed in intestinal metaplasia were validated with a very strong association with the disease. Thus, these genes are probably more robust to difference approaches for estimating and normalizing the gene expression levels. On the other hand, genes sensitive to methods that address essential uncertainties in measurements are precisely those plagued with major reproducibility issues. Measurement error is one of the most damaging sources of error and has been neglected in many published analyses, thereby increasing uncertainty in parameter estimates and even inflating the estimates of effect sizes (Loken and Gelman, 2017). Thus, particularly for those sensitive genes, a more robust analysis is needed so that false conclusions are not made.
In this paper, we focused on gene expression from twocolor microarray data, but it is possible to use the same ideas to improve estimation and normalization of any fluorescent signal quantified by microarray image analysis. Also, the proposed methods could be adapted for oligonucleotide (one-color) microarray data. Particularly, the cyclic LOWESS normalization method (Bolstad et al., 2003) could be extended by just considering that the M t and A t values are defined by comparing pairs of arrays instead of pairs of channels and that the LOWESS normalization is applied to all distinct combination of two arrays. Although not so straightforward, it is also possible to adapt our methods to handle next-generation sequencing (NGS) data. Recently, Law et al. (Law et al., 2014) showed that RNA-Seq counts after log transformation and normalization by sequencing depth (log-counts per million, or log-cpm) can be properly analyzed by methods based on the normal distribution if a precision weight for each observation is taken into account. It was used to adapt all methods in the limma package (initially developed for microarrays) to also handle RNA-Seq and other sequence count data (Ritchie et al., 2015b). Therefore, considering the current need for accounting and propagating measurement uncertainties through analyses of NGS data (O'Rawe et al., 2015), a possible future work is to adapt our ideas to improve transcriptome profiling from RNA-Seq data. Specifically, one could investigate whether it is possible to use the delta method for incorporating a measure of uncertainty Frontiers in Genetics | www.frontiersin.org TABLE 3 | Genes belonging to the "pathways in cancer" category identified as differentially expressed between normal and intestinal metaplasia groups at a significance level of 5% (after FDR correction). to each base call, usually provided by base-calling algorithms, into the log-cpm estimator, leading to a more accurate gene expression quantification from RNA-Seq data.

DATA AVAILABILITY
The omicsMA R package contains the source code of the proposed methods and part of the metaplasia dataset analyzed in this study. It was implemented using R, version 3.5.1, and depends on the locfit (Loader, 2013), maigesPack (Esteves et al., 2016), and limma (Ritchie et al., 2015b) R packages. The omicsMA R package is available at https://github.com/adele/ omicsMA, and the latest release is available at https://github. com/adele/omicsMA/releases/latest.

ETHICS STATEMENT
This study was carried out in accordance with the recommendations of the international guidelines for investigations involving human beings with written informed consent from all subjects. All subjects gave written informed consent in accordance with the Declaration of Helsinki. The protocol was approved by the Ethics Institutional Committee of the A.C. Camargo Cancer Center (process number 1023/07).

AUTHOR CONTRIBUTIONS
AR and RH conceived of the presented ideas. AR derived the models, implemented the methods, and analyzed the data. AR wrote the manuscript with support from RH and JS. All authors discussed the results and contributed to the final manuscript. RH and JS supervised the project.