Robustification of Linear Regression and Its Application in Genome-Wide Association Studies

Alamin, Md.; Sultana, Most. Humaira; Xu, Haiming; Mollah, Md. Nurul Haque

doi:10.3389/fgene.2020.00549

ORIGINAL RESEARCH article

Front. Genet., 08 June 2020

Sec. Statistical Genetics and Methodology

Volume 11 - 2020 | https://doi.org/10.3389/fgene.2020.00549

Robustification of Linear Regression and Its Application in Genome-Wide Association Studies

Md. Alamin^1,2

Most. Humaira Sultana¹

Haiming Xu¹^*

Md. Nurul Haque Mollah²^*

¹Institute of Crop Science and Institute of Bioinformatics, College of Agriculture and Biotechnology, Zhejiang University, Hangzhou, China
²Bioinformatics Lab, Department of Statistics, University of Rajshahi, Rajshahi, Bangladesh

Regression analysis is one of the most popular statistical techniques that attempt to explore the relationships between a response (dependent) variable and one or more explanatory (independent) variables. To test the overall significance of regression, F-statistic is used if the parameters are estimated by the least-squares estimators (LSEs), while if the parameters are estimated by the maximum likelihood estimators (MLEs), the likelihood ratio test (LRT) statistic is used. However, both procedures produce misleading results and often fail to provide good fits to the reasonable space of the dataset in the presence of outlying observations. Moreover, outliers occur very frequently in any real datasets as well as in the molecular OMICS datasets. Hence, an effort is made in this study to robustify MLE based regression analysis by maximizing the β-likelihood function. The tuning parameter β is selected by cross-validation. For β = 0, the proposed method reduces to the classical MLE based regression analysis. We inspect the performance of the proposed method using both synthetic and real data analysis. The results of simulations indicate that the proposed method performs better than traditional methods in both outliers and high leverage points to estimate the parameters and mean square errors. The results of relative efficiency analysis show that the proposed estimator is relatively less affected than the popular estimators, including S, MM, and fast-S for normal error distribution in case high dimension and outliers. Also, real data analysis results demonstrated that the proposed method shows robust properties with respect to data contaminations, overcome the drawback of the traditional methods. Genome-wide association studies (GWAS) by the proposed method identify the vital gene influencing hypertension and iron level in the liver and spleen of mice. Furthermore, we have identified 15 and 21 significant SNPs for chalkiness degree and chalkiness percentage, respectively, by GWAS based on the proposed method. The variant of the SNPs might be provided the new resources for grain quality traits and could be used for further molecular and physiological analysis to enhance the better quality of rice grain. These results offer an important basis for further understanding of the robust regression analysis, which might be applied in various fields, including business, genetics, and bioinformatics.

Introduction

Regression is the mathematical measure for the average relationship between two or more variables in terms of the original units at the data. It is a popular statistical technique for predicting the response variable for fixed value (s) of the independent variable (s). The purpose of linear regression analysis is to fit the straight line to the observed data. It is widely used in multi-disciplinary research areas. Let us consider a p variables linear regression model as follows:

\begin{array}{l} y_{i} = γ_{0} + \sum_{j = 1}^{p} γ_{j} x_{i j} + u_{i} & (1) \end{array}

where y_i (i = 1, 2, …, n) is the ith observation of the response variable; γ₀ is the intercept term; γ_j (j = 1, 2, …, p) are the regression parameters; x_ij (j = 1, 2, …, p) are the ith observation of p explanatory variables that are assumed to be independent of each other and u_i is the random error term corresponding to ith observation. We assumed that u_i follows normally and identically distributed with mean zero and variance σ². Under this assumption, the density function of y_i is given by:

\begin{array}{l} f_{θ} (y_{i}) = {(2 π σ^{2})}^{-^{1} /_{2}} exp {- \frac{1}{2 σ^{2}} {(y_{i} - γ_{0} - \sum_{j = 1}^{p} γ_{j} x_{i j})}^{2}} & (2) \end{array}

and the average log-likelihood function for the model parameter $θ = (γ_{0}, γ_{1}, \dots, γ_{p}, σ^{2})$ in Equation (2) is given by:

\begin{array}{l} \begin{matrix} \bar{L} (θ | Y, X) = \frac{1}{n} \sum_{i = 1}^{n} log f_{θ} (y_{i}) = \frac{1}{n} log L (θ | Y, X) \\ = - \frac{1}{2 n} log (2 π σ^{2}) - \frac{1}{2 n σ^{2}} {(Y - X γ)}^{'} (Y - X γ) \end{matrix} & (3) \end{array}

where $γ = {(γ_{0}, γ_{1}, \dots, γ_{p})}^{'}$ , $Y = {(y_{1}, y_{2}, \dots, y_{n})}^{'}$ and X = [x_ij] is the matrix of order n × (p + 1) with x_i0 = 1 for i = 1, 2, …, n. Here the problem is to test the significance of the regression parameters with the following hypothesis:

\begin{array}{l} H_{0} : γ_{1} = γ_{2} \dots = γ_{p} vs . H_{1} : H_{0} is not true & (4) \end{array}

To test the null hypothesis (H₀) against the alternative hypothesis (H₁), the F-statistic is used if the parameters are estimated by the least-squares estimators (LSEs). On the other hand, if the parameters are estimated by the maximum likelihood estimators (MLEs), the likelihood ratio test (LRT) statistic is used to test the null hypothesis. Because of our interest in the improvement of classical MLE based regression analysis, let us first introduce LRT statistic as follows:

\begin{array}{l} χ_{0}^{2} = - 2 log λ = - 2 log \frac{L ({\tilde{θ}}_{0} | Y)}{L ({\hat{θ}}_{1} | Y)} = 2 n {L ({\hat{θ}}_{1} | Y) - L ({\tilde{θ}}_{0} | Y)} \\ = log ({\tilde{σ}}^{2} / {\hat{σ}}^{2}) & (5) \end{array}

which follows ~χ²−distribution with 1 degree of freedom, where ${\tilde{θ}}_{0} = ({\tilde{γ}}_{0}, {\tilde{σ}}^{2})$ and ${\hat{θ}}_{1} = (\hat{γ}, {\hat{σ}}^{2})$ with $\hat{γ} = ({\hat{γ}}_{0}, {\hat{γ}}_{1}, \dots, {\hat{γ}}_{p})$ are the MLEs of $θ_{0} = (γ_{0}, σ^{2})$ and $θ_{1} = (γ, σ^{2})$ with γ = (γ₀, γ₁,…, γ_p) under H₀ and H₁, respectively. Here:

\begin{array}{l} {\tilde{γ}}_{0} = ȳ, & (6) \end{array}

\begin{array}{l} {\tilde{σ}}^{2} = \frac{1}{n} \sum_{i = 1}^{n} {(y_{i} - {\tilde{γ}}_{0})}^{2} = \frac{1}{n} {(Y - γ_{0} 1_{1})}^{'} (Y - γ_{0} 1_{1}), & (7) \end{array}

\begin{array}{l} \hat{γ} = {(X^{'} X)}^{- 1} X^{'} Y, & (8) \end{array}

\begin{array}{l} and {\hat{σ}}^{2} = \frac{1}{n} {(Y - X \hat{γ})}^{'} (Y - X \hat{γ}) & (9) \end{array}

Note that the least squares estimator (LSE) and maximum likelihood estimator (MLE) of γ are exactly the same, but not for σ². The LSE of σ² has divisor (n - p) whereas the MLE of σ² has divisor n instead of (n–p). However, it is well-known that both classical LSE and MLE for the regression parameters are very much sensitive to outliers. So, regression analysis by anyone of classical LSE and MLE produces misleading results in the presence of outlying observations.

To overcome this problem, there exists several robust regression approaches like M-Huber, M-Hampel, M-Tukey, LTS, LMS, MM, S, fast-S, and so on in the literature. The major role of robust regression analysis is to provide an excellent fit to the reasonable space of the data. Generally, efficiency and breakdown point is used to investigate the performance of robust techniques. Efficiency can tell us how well-robust techniques perform relative to least squares on clean data. High efficiency is mostly desired on estimation. The breakdown point is a measure for the efficiency of the estimator when the sample contains a significant fraction of outliers (Hampel, 1975). Some robust regression estimates asymptotic breakdown point is a maximum of 50%. Rousseeuw (1984) introduced the LMS (least median of squares) and least trimmed squares (LTS) estimators for robust estimation. Rousseeuw and Yohai (1984) introduced S-estimator, which is an equivariant estimator for linear regression with asymptotic breakdown point 0.5. MM-estimates (Yohai, 1987) and the τ-estimates (Yohai and Zamar, 1988) are also the more efficient estimates with asymptotic breakdown point 0.5. A modification of the subsampling algorithm for the LTS estimator was proposed by Rousseeuw and Van Driessen (2006), which is called fast-LTS that considerably improves its performance. Moreover, the fast-LTS estimator is much faster than the approximating algorithm for the LTS estimator. Recently, Salibian-Barrera and Yohai (2006) proposed a fast-S algorithm using Rupperts SURREAL algorithm (Ruppert, 1992) similar to the fast-LTS of Rousseeuw and Van Driessen (2006). However, most of the robust approaches have some drawbacks. For example, the robustness performance of those approaches gradually becomes weaker if the number of explanatory variables increases in the linear regression model (Rousseeuw and Yohai, 1984; Salibian-Barrera and Yohai, 2006).

Therefore, in this paper, a new robust regression method by maximizing the β-likelihood function is proposed, and its performance is compared with the existing approaches including LS, M-Huber (Huber, 1973), M-Hampel (Hampel, 1974), M-Tukey (Beaton and Tukey, 1974), LTS, LMS, MM, S, and as well as fast-S (Salibian-Barrera and Yohai, 2006). Simulation and real data analysis were carried out in this study to inspect the performance of the proposed method. Mean square error (MSE) was also estimated to investigate the performance of the proposed methods when the number of explanatory variables is increased in the regression model with the absence and presence of outliers. Also, relative efficiency analysis was carried out to check the efficiency of the proposed estimator compares with other popular estimators such as S, MM, and fast-S for normal error distribution in case high dimension and outliers in this study. Moreover, real data analysis was performed to confirm the performance of the proposed method. Furthermore, we have applied GWAS for testing the performance of our proposed method for the identification of important genes/SNPs related to grain quality in rice.

Methods and Materials

Robustification of MLE Based Linear Regression Analysis by Maximizing β-Likelihood Function

The β-likelihood function for the model parameter $θ = (γ_{0}, γ_{1}, \dots, γ_{p}, σ^{2})$ of Equation (2) is defined as:

\begin{array}{l} L_{β} (θ | Y) = \frac{1}{β} {\frac{1}{n l_{β} (θ)} \sum_{i = 1}^{n} f_{θ}^{β} (y_{i}) - 1} & (10) \end{array}

where

\begin{array}{l} l_{β} (θ) = {\int f_{_{θ}}^{1 + β} (y) d y}^{^{β} /_{1 + β}} = {(1 + β)}^{-^{β} /_{2 (1 + β)}} \\ {(2 π σ^{2})}^{-^{β^{2}} /_{2 (1 + β)}} . & (11) \end{array}

We call L_β (θ | Y) in Equation (10) as the β-likelihood function, since it reduces to the average of log-likelihood function (3) for β → 0. The β-likelihood function (10) is induced from the β-divergence (Mollah et al., 2010). This divergence is also known as density power divergence (Basu et al., 1998; Mihoko and Eguchi, 2002). It measures the discrepancy between two density functions. The minimizer of β-divergence is equivalent to maximizer of β-likelihood function. To robustify linear regression analysis, we estimate the model parameter vector θ by maximizing β-likelihood function under hypotheses H₀ and H₁, respectively. To test H₀ against H₁ as defined in equation (4), the proposed test statistic (criterion) is defined by:

\begin{array}{l} χ_{β}^{2} = 2 n {L_{β} ({\hat{θ}}_{1 β}) - L_{β} ({\tilde{θ}}_{0 β})}, & (12) \end{array}

which we call β-test statistic, where ${\tilde{θ}}_{0 β} = ({\tilde{γ}}_{0 β}, {\tilde{σ}}_{β}^{2})$ and ${\hat{θ}}_{1 β} = ({\hat{γ}}_{β}, {\hat{σ}}_{β}^{2})$ with ${\hat{γ}}_{β} = ({\hat{γ}}_{0 β}, {\hat{γ}}_{1 β}, \dots, {\hat{γ}}_{p β})$ are the maximum β-likelihood estimators (β-MLE) of $θ_{0} = (γ_{0}, σ^{2})$ and $θ_{1} = (γ, σ^{2})$ with γ = (γ₀, γ₁, …, γ_p) under H₀ and H₁, respectively. Under H₀, the estimates ${\tilde{θ}}_{0 β} = ({\tilde{γ}}_{0 β}, {\tilde{σ}}_{β}^{2})$ of $θ_{0} = (γ_{0}, σ^{2})$ are obtained iteratively as follow:

\begin{array}{l} {\tilde{γ}}_{0 β, t + 1} = \frac{\sum_{i = 1}^{n} w_{β} (y_{i} | {\tilde{θ}}_{0 β, t}) y_{i}}{\sum_{i = 1}^{n} w_{β} (y_{i} | {\tilde{θ}}_{0 β, t})} = {W^{'}}_{0 β, t} Y {({W^{'}}_{0 β, t} 1)}^{- 1}, & (13) \end{array}

\begin{array}{l} {\tilde{σ}}_{_{β, t + 1}}^{2} = \frac{\sum_{i = 1}^{n} w_{β} (y_{i} | {\tilde{θ}}_{0 β, t}) {(y_{i} - {\tilde{γ}}_{0 β, t})}^{2}}{\sum_{i = 1}^{n} w_{β} (y_{i} | {\tilde{θ}}_{0 β, t})} \\ = {W^{'}}_{0 β, t} [(Y - {\tilde{γ}}_{0 β, t} 1) # (Y - {\tilde{γ}}_{0 β, t} 1)] {({W^{'}}_{0 β, t} 1)}^{- 1} & (14) \end{array}

where $W_{0 β} = {[w_{β} (y_{i} | θ_{0})]}_{n \times 1} and 1 = (1, 1, \dots {, 1)}^{'} .$ Here we call:

\begin{array}{l} w_{β} (y_{i} | θ_{0}) = exp {- \frac{β}{2 σ^{2}} {(y_{i} - γ_{0})}^{2}} & (15) \end{array}

as the β-weight function under H₀. It produces smaller weights for outlying/contaminated observations. The notation “#” denotes the Hadamard product.

Under H₁, the estimate ${\hat{θ}}_{1 β} = ({\hat{γ}}_{β}, {\hat{σ}}_{β}^{2})$ of $θ_{1} = (γ, σ^{2})$ are obtained iteratively as follows:

\begin{array}{l} {\hat{γ}}_{β, t + 1} = {(X^{'} X_{β, t})}^{- 1} {X^{'}}_{β, t} Y, & (16) \end{array}

\begin{array}{l} {\hat{σ}}^{2}_{β, t + 1} = {W^{'}}_{1 β, t} [(Y - X {\hat{γ}}_{β, t}) # (Y - {\hat{γ}}_{β, t})] {({W^{'}}_{1 β, t} 1)}^{- 1}, & (17) \end{array}

where $X_{β} = {[X # W_{1 β} {1^{'}}_{1}]}_{n \times (p + 1)}, 1_{1} = {(1, 1, \dots, 1)}^{'} and W_{1 β} = {[w_{β} (y_{i} | θ_{1})]}_{n \times 1}$ .

Here $w_{β} (y_{i} | θ_{1}, X) = exp {- \frac{β}{2 σ^{2}} {(y_{i} - γ_{0} - \sum_{j = 1}^{p} γ_{j} x_{i j})}^{2}}$ is the β-weight function under H₁. This weight function produces smaller weights for outlying/contaminated observations. The proposed test statistic (12) reduces to the classical MLE based test statistic (5).

Hypothesis Testing

To test the null hypothesis (H₀) against the alternative hypothesis (H₁) as defined in (4) from the robustness point of view, we can compute p-values using asymptotic χ² distribution as defined in (5) by replacing ${\tilde{σ}}^{2}$ and ${\hat{σ}}^{2}$ by the proposed estimates ${\tilde{σ}}_{β}^{2}$ and ${\hat{σ}}_{β}^{2}$ , respectively. However, we propose to use permutation-based p-values to test whether H₀ is true or false. To compute permutation-based p-values, first we compute the value of $χ_{β}^{2}$ as defined in (12) based on the original dataset. Then we permute the values of the response variable N times and each time we compute $χ_{β}^{2}$ . Then we compute the p-values (p) for testing H₀ against H₁ using the following formula:

\begin{array}{l} p = \sum_{k = 1}^{N} I_{[{\hat{χ}}_{β}^{2}_{(k)} \leq {\hat{χ}}_{β}^{2}]} / N, & (18) \end{array}

where ${\hat{χ}}_{β}^{2}$ denotes the estimate of $χ_{β}^{2}$ with the original dataset and ${\hat{χ}}_{β}^{2} (k)$ denotes the estimate of $χ_{β}^{2}$ with the kth permuted values of the response variable in the dataset. Note that, for β → 0, $χ_{β}^{2}$ reduces to the approximate χ²distribution.

Robustness of the Estimators

Suppose G be the distribution function of g, and then we can view the β-estimator (θ_β) of θ as an M-estimator, which is a function of G defined by:

\begin{array}{l} θ_{β} [G] = \underset{θ}{argmax} {\int Ω_{β} (y; θ) d G (y)} & (19) \end{array}

where $Ω_{β} (y; θ) = [\frac{1}{β l_{β} (θ)} f_{θ}^{β} (y) - 1]$ and $f_{θ} (y) = N (γ_{0} + \sum_{j = 1}^{p} γ_{j} x_{i j}, σ^{2})$ . Then the influence function (IF) for the β-estimator at y under the distribution function G is defined as:

\begin{array}{l} IF (y; θ_{β}, G) = lim_{ε \to 0} {θ_{β} [(1 - ε) G + ε Δ_{y}] - θ_{β} [G]} / ε & (20) \end{array}

where, Δ_y is the probability measure that puts mass one at the point y. An estimator is supposed to be B-robust if its influence function is a bounded function of y (Hampel et al., 1986). Since the β-estimator fulfills the properties of M-estimator, the influence function for the proposed estimator also can be written as:

\begin{array}{l} IF (y; θ_{β}, G) = H {(Ψ_{β}, G)}^{- 1} Ψ_{β} (y; θ_{β} [G]) & (21) \end{array}

where Ψ_β (y; θ) = δΩ_β (y; θ)/δθ is the estimating function for the β-estimator and

\begin{array}{l} H (Ψ_{β}, G) = - \int {[\frac{δ Ψ_{β} (y; θ)}{δ θ}]}_{θ = θ [G]} d G (y) & (22) \end{array}

is a matrix which does not rely on y; thus, the B-robustness is equivalent to the boundedness of the estimating function for the M-estimator as well as the β-estimator (Mihoko and Eguchi, 2002). To show the finiteness of the estimating function Ψ_β(y; θ) for the β-estimator, let us consider the general form of two variables linear regression models for estimating function as defined by:

\begin{array}{l} Ψ_{β} (y; θ) = [f_{θ}^{β} (y) \times \frac{δ {[l_{β} (θ)]}^{- 1}}{δ θ}] \\ + [{[l_{β} (θ)]}^{- 1} f_{θ}^{β} (y) \frac{δ log [f_{θ} (y)]}{δ θ}] & (23) \end{array}

Obviously, in the estimating function (23), (i) l_β(θ) is independent on observation (ii) $f_{θ}^{β} (y) \to 0$ for β > 0 and y → ∞, and (iii) log[ f_θ(y)] → ∞ for y → ∞. Thus, we observe that the boundedness of the estimating function depends only on the second term of the right-hand side of equation (23). We also observe that second term will be bounded for any y when β > 0. Thus, we can conclude that all components of estimating function (23) are bounded to any y when β > 0. Therefore, our proposed β-estimators are B-robust against outliers. However, the robustness performance of the proposed method depends on the value of the tuning parameter β and the initialization of the regressions parameters.

To obtain high-breakdown estimators with 50% breakdown points using Equations (13–17), we are proposing the following procedure for initialization of the regression parameters. Let $D = {z_{i} | z_{i} = (y_{i}, x_{i}), i = 1, 2, \dots, n}$ be the entire data space, where y_iis the ith component of Y and x_i is the ith row of X as defined in Equation (3). Let us define a subset $D_{o} \subset D$ based on Euclidean distance as follows:

\begin{array}{l} D_{o} = {z_{i} \in D | d_{i} = ∥ z_{i} - z_{m d} ∥ < ξ_{p} i = 1, 2, \dots, n} & (24) \end{array}

where ξ_p is the pth percentile of {d₁, d₂,…, d_n} with p = 50. The notation $‡_{m d}$ is the median vector as defined by:

\begin{array}{l} z_{md} = {(\underset{i = 1, 2, \dots, n}{median} (y_{i}), \underset{i = 1, 2, \dots, n}{median} (x_{1 i}), \dots, \underset{i = 1, 2, \dots, n}{median} (x_{p i}))}^{T} & (25) \end{array}

Then the appropriate initial values of θ for the proposed estimators (13–14) and (16–17) can be obtained by the classical estimators (6–7) and (8–9), respectively, using the reduced dataset $D_{o}$ . The proposed estimators of θ can resist the effect of 50% breakdown points if we compute the initial values of the location parameter vector μ and scatter matrix V by a good part of the dataset. If the initial value of θ close to the mean vector μ_i, then β-weight function produces larger weights (close to 1) for each data vector of the data cluster having mean the vector μ_i and the covariance matrix Vi and smaller weights (close to 0) for the data vectors of the other clusters, respectively. Thus, iterative Equations 16–17 can produce $({\hat{μ}}_{i, β}, {\hat{V}}_{i, β})$ for (μ_i, V_i) considering data vectors in other clusters as outliers. Similarly, other pairs of mean vector and covariance matrix can be estimated sequentially. A detailed discussion can be found in Mollah et al. (2010).

Breakdown points of the proposed estimates depend on the value of the tuning parameter β and initialization of the parameters in Equations 13–14. However, both factors work through the β-weight function as defined in Equation 15. To discuss in detail, let us rewrite the β-weight function as follows, which lies between 0 and 1. It produces a smaller weight (≥ 0) if the data vector (x) is unusual or contaminated by outliers and the more substantial weight (≤ 1) if x is usual/uncontaminated. Generally, a data vector (x) is known as unusual/contaminated if it is far away from the mean vector μ. The β-weight function is the negative exponential function of the squared Mahalanobis Distance (MD) defined by MD = (x − μ)′V⁻¹(x − μ) ≥ 0 between the data vector x and the mean vector μ. From equation 15, it is seen that the β-weight decreases (increases) when MD increases (decreases). Estimates become robust by putting the smaller weight corresponding to the unusual/contaminated data vector and larger weight corresponding to the usual/uncontaminated data vectors in Equations 16–17.

Selection of the Tuning Parameter β

The value of β acts a vital role in the performance of the proposed method. A larger β increases the robustness but decreases the efficiency of estimators and vice-versa for the smaller β. Therefore, an appropriate value of β may control the trade-off between the robustness and efficiency of estimators. To find an appropriate β for the maximum β-likelihood estimators used the negative of the β-likelihood function (10) with a defined value β₀ of β as a measure for evaluation of the β-estimators. In this study, we also use the same measure for β selection using cross-validation (Friedman et al., 2001). To determine the measure for β selection using K-fold cross-validation, let us split the entire dataset $D = {(y_{i}, x_{i}); i = 1, 2, \dots, n}$ into k subsets $D_{1}, D_{2}, {\dots, D}_{K}$ and let $D^{- k} = {(y_{i}, x_{i}) | (y_{i}, x_{i}) \notin D_{k}}, k = 1, 2, \dots, K$ . Then the estimate for β selection by K-fold cross-validation can be defined by:

\begin{array}{l} D_{β_{0}} (β) = \frac{1}{K} \sum_{k = 1}^{K} {L^{^{'}}}_{β_{0}} ({\hat{θ}}_{β} \ D_{k}) & (26) \end{array}

\begin{array}{l} where {L_{β_{0}}}^{'} ({\hat{θ}}_{β} \ D_{k}) = \frac{1}{β} [1 - \frac{1}{n_{k} {l_{β}}_{0} ({\hat{θ}}_{β})} \sum_{y \in D_{k}} \\ {f (y \ {\hat{θ}}_{β})}^{β_{0}}] & (27) \end{array}

\begin{array}{l} with l_{β_{0}^{}} ({\hat{θ}}_{β}) = \frac{1}{1 + β_{0}} [\int {f (y | {\hat{θ}}_{β})}^{1 + β_{0}}] & (28) \end{array}

Here ${\hat{θ}}_{β} = ({\hat{γ}}_{β}, {\hat{σ}}_{β}^{2})$ are estimated iteratively using Equations 16–17 based on the sub-dataset $D^{- k}$ and n_k is the number of observations in the sub-unit D_k. We choose a suitable β by the minimizer of D_β₀(β) for β. If the minimizer of D_β₀(β) find β > 0, it indicates that the dataset is contaminated by outliers. If it shows β = 0, which indicates that the dataset is not contaminated by outliers and the proposed method reduces to the MLE based regression approach. See Mollah et al. (2007, 2010) for more discussion about β selection by cross-validation.

Simulated Data

We have considered three different scenarios to investigate the performance of the proposed methods compared with most popular robust methods such as M-Huber, M-Hampel, M-Tukey, LMS, LTS, MM (Yohai, 1987), S (Rousseeuw and Yohai, 1984) and fast-S (Salibian-Barrera and Yohai, 2006) using simulation data. The details procedures for generating simulated data are given below:

Scenario-1: To investigate the influence of outlying observations in our proposed method in comparison with some popular existing methods as early mentioned, first we have generated a clean dataset that contains n = 100 pairs of observations (y_i, x_1i), i = 1, 2,…, 100 satisfying two-variable regression model with p = 1 in equation (1) by taking γ₀ = 1.0, γ₁ = 6.0 and assuming u ~ NID (0,σ² = 0.5). Then we contaminated few observations in y (response/dependent variable) artificially multiplying by 3 to 5 folds from random position to obtain three contaminated datasets having outlying observations 15, 30, and 45% in y, respectively. We have replicated each case 100 times to generate 100 datasets for each case of outliers and applied all methods, including the proposed method, to estimate the regression parameters in each case of outliers. To investigate the performance, we computed mean square error (MSE) defined as $M S E ({\hat{γ}}_{i}) = E_{γ} [{({\hat{γ}}_{i} - γ_{i})}^{2}] = V a r ({\hat{γ}}_{i}) + {[E ({\hat{γ}}_{i}) - γ_{i}]}^{2} = \sum_{j = 1}^{100} {({\hat{γ}}_{i j} - γ_{i})}^{2} / 100$ , where ${\hat{γ}}_{i j}$ is the estimate of the regression parameter γ_ij in the jth replication and i = 0, 1. To investigate the performance with two or more explanatory variables, we have generated 100 observations for the response variable (y) as before by using the linear regression model (1) with the number of explanatory variables p = 1, 10, 20, 30, and 40 and assuming u ~ NID (0,σ² = 0.5) in each case. Then we contaminated few observations in y (response/dependent variable) artificially multiplying by 3 to 5 folds from random position to obtain three contaminated datasets having outlying observations 15%, 30%, and 45% in y, respectively as before. We replicated each case 100 times as before to generate 100 datasets for each case and applied all methods, including the proposed method, to estimate the regression parameters in each case. Then MSE of the estimates have computed for the regression parameters $γ = {(γ_{0}, γ_{1} {, \dots, γ}_{p})}^{'}$ based on 100 replications. The average MSE of the estimates $\hat{γ}$ is calculated using $M S E_{p} (\hat{γ}) = \sum_{j = 1}^{100} {\sum_{i = 0}^{p} ({\hat{γ}}_{i j} - γ_{i})}^{2} / 100 (p + 1)$ , for p = 1, 10, 20, 30, and 40, respectively, where ${\hat{γ}}_{i j}$ is the estimates of γ_i in the jth replication.

Scenario 2: To investigate the influence of high leverage points (HLPs) in our proposed method in comparison some popular existing methods as early mentioned, first we have generated a clean dataset that contains n = 100 pairs of observations (y_i, x_1i), i = 1, 2,…, 100 satisfying two variables regression model with p = 1 in Equation (1) as before by taking γ₀ = 1.0, γ₁ = 6.0 and assuming u ~ NID (0, σ² = 0.5). Then we contaminated few observations in x₁ (independent variable) artificially multiplying by 3 to 5 folds from random position to obtain four contaminated datasets having HLPs 5, 15, 30, and 45% in x₁, respectively. We have replicated each case 100 times to generate 100 datasets for each case of HLPs and applied all methods, including the proposed method, to estimate the regression parameters in each case of HLPs.

Scenario 3: Data were generated based on linear model (1) with p > 1 independent and identically distributed N (0, 1) predictors (p = 2, 8) and sample sizes n = 20, 50, 100, 200, 500, where all slopes and intercepts were set to 0 and error follows normal. In this case, 500 datasets were generated for each combination of p, n, and error distribution. The relative efficiency (RE) is calculated by using equation:

\begin{array}{l} R E = \frac{\sum_{j = 1}^{500} {‖ {\hat{γ}}_{j}^{M L E} ‖}^{2}}{\sum_{j = 1}^{500} {‖ {\hat{γ}}_{j} ‖}^{2}}, \end{array}

the ratio of total mean squared errors. It is noted that, ${\hat{γ}}_{j}^{M L E}$ is the distribution-specific maximum likelihood estimator (MLE) with the clean dataset and ${\hat{γ}}_{j}$ is the study estimator.

Real Datasets

We have used three freely available real datasets and one rice SNPs dataset generated in our laboratory for association analysis to explore the performance of the proposed method in comparison of other methods:

(i) Dataset 1, education-related dataset studied by Coleman et al. (1966). The data set contained information on 20 schools from the Mid-Atlantic and New England states. Variables are described as X1: staff salaries per pupil, X2: percent of white-collar fathers, X3: socioeconomic status composite deviation, X4: mean teacher's verbal test score, X5: mean mother's educational level (one unit is equal to two school years), and Y: verbal mean test score (all sixth graders). This dataset was contaminated by HLPs. To investigate the performance of the algorithms, we re-contaminated this dataset by additional HLPs and outliers. The data set displayed in Supplementary Table S1. In Supplementary Table S1, aa(bb) means aa is replaced by bb for recontamination of the dataset.

(ii) Dataset 2, mice hypertension data previously investigated by Sugiyama et al. (2001) which is available in the R/qtl package (Broman et al., 2003). This dataset was analyzed to investigate the genetic control of salt-induced hypertension in male mice obtained by backcrossing between the salt-sensitive c57BL/6J and the non-salt-sensitive A/J (A) inbred mouse strains.

(iii) Dataset 3, iron status for liver and spleen dataset in mice studied by Grant et al. (2006). A total of 284 F₂ mice were bred from F₁ with either C57BL/6J/Ola or SWR/Ola strains, and this data are available in R/qtlbook package as the iron data. The iron level (in μg/g) in the liver and spleen are the two phenotypes, and the ratio of males and females are almost equal in this dataset (Grant et al., 2006). We have used log₁₀ values for analysis of the data as the distribution was skewed from normality according to the previous study (Grant et al., 2006).

(iv) Dataset 4, a total of 138 RILs (recombinant inbred lines) population obtained from Xieyou9308, super hybrid rice, was used for the association study of both rice quality traits chalkiness degree (CD) and chalkiness percentage (CP) in this study. The details about field experiments and genotypes of DNA resequencing are described in the materials and method section in our previous study (Xu et al., 2015). We have obtained 701,867 SNPs from DNA resequencing. Then we remove the low-frequency SNPs (minor allele frequency <5%) and pruning LD correlated SNPs (r² > 0.4) using Plink. Subsequently, we have applied a generalized multifactor dimensionality reduction procedure to screen potential SNPs associated with traits (Zhu et al., 2013). Finally, we applied our proposed, LS, RLS, and fast-S methods to identify the important SNPs associated with CD and CP. The simple linear regression model was considered, where we included single SNPs in the model to estimate the effect of each SNPs, and the chi-square test was used for testing the significance of the SNPs.

Annotation of SNPs and Genes

We have used CARMO (comprehensive annotation of rice multi-omics data) for SNP annotation (Wang et al., 2015). Candidate genes of the identified SNPs were derived from the RAP-DB (http://rapdb.dna.affrc.go.jp/) and Oryzabase (https://shigen.nig.ac.jp/rice/oryzabase/). Moreover, we used different webpages such as PubMed, Google Scholar, Uniport, and Web of Science to know the functions of the identified genes.

Results

Simulation Data Analysis Results

Performance Investigation in the Presence of Outliers Using Simulation Scenario 1

Figure 1A represents the fitted regression lines by all methods, as mentioned in the method section with our proposed method in the absence of outliers. The straight lines represented by a solid line with black color, dash-dot line with pink color, dot line with yellow color, dot-dash line with gray color, a solid line with purple color, dot-dash line with orange color, a solid line with cyan color, dash-dot line with magenta color, dot-dot line with blue color, the solid line with red color, and dot-dash line with green color for LS, M-Huber, M-Hampel, M-Tukey, LMS, LTS, MM, S, the fast-S, proposed method and True line, respectively. We observe that all methods including the proposed method produce reasonable line similar to the true line (dot-dash line with green color), where the true line has drawn using the actual parameter γ₀ = 1.0 and γ₁ = 6.0 (Figure 1A). Figures 1B–D shows the fitted regression line with these three contaminated datasets by all methods, including our proposed method, and the descriptions of lines in these figures are the same as Figure 1A. Results showed that the proposed method, as well as LMS, LTS, MM, S, and fast-S, produced straight lines similar to the real line, while the other methods fail to produce reasonable lines for all of the three contaminated datasets (Figures 1B–D).

FIGURE 1

Figure 1. Simulation results of the two-variable regression model to fit regression lines. (A–D) The fitted regression line in absence and presence of 15, 30, and 45% outliers, respectively. LS, least-squares estimator; M-Huber, M-Huber estimator; M-Hampel, M-Hampel estimator; M-Tukey, M-Tukey estimator; LMS, least median of squares estimator; LTS, least trimmed squares estimator; MM, MM-estimator; S, S-estimator; fast-S, fast-S estimator; Proposed, Proposed estimator.

Besides, mean square error (MSE) has used to test the efficiency of the estimates and accordingly 100 datasets generated by replication for each case as mentioned above in this study. MSE results show that the proposed method's MSEs are smaller than other method's MSEs in most of the cases (Supplementary Table S2). An estimator having smaller MSE is said to be a good estimator. These results indicated that the proposed method's estimators are efficient estimators. The estimated values of the regression parameters and the MSE of the estimates are given in Supplementary Table S2. However, the performance of the proposed method depends on the value of the tuning parameter β. We observe that the proposed method produces reasonable estimates for any values of β within the range [0, 0.25] and [0.05, 0.25] for all uncontaminated and contaminated datasets in this setting, respectively (Figures 2A,B). Since smaller β decreases the power of robustness but increases the efficiency of the estimates and vice-versa for the larger β (Basu et al., 1998), we select β = 0.2 to control the trade-off between robustness and efficiency of the estimates for all datasets in this setting. Additionally, the box plot results show that the estimated values of β were nearby zero and >0.14 in the absence and presence of outliers, respectively (Figures 2C,D). Since the β-MLE estimators with β = 0 reduces to the classical MLE, the above results indicated that the proposed estimators reduce to the classical MLE in the absence of outliers.

FIGURE 2

Figure 2. β-selection by cross-validation (CV) and the box plot of beta values. (A,B) Beta-selection results using simulation data in the absence and presence of outlier, respectively. (C,D) Box plot of beta values without and with outliers, respectively.

Furthermore, we have investigated the performance of the proposed method when the number of explanatory variables is increased in the regression model. Results showed that LS, M-Tukey, and the proposed method produce smaller MSE than other methods for each case of explanatory variables in the absence of outliers (Figure 3A). In the case of 15% outliers, results show that only MM, fast-S, and the proposed method produced almost stable and smaller MSE than other methods in each case of explanatory variables increased in the model (Figure 3B). However, in the case of 30 and 45% outliers in the datasets, only the proposed method produces stable and smaller MSE than other methods in each case of explanatory variables (Figures 3C,D). MSE values for each case of explanatory variables are given in Supplementary Table S3. These results indicate that the proposed method shows better performance than traditional methods (M-Huber, M-Hampel, M-Tukey, LTS, LMS, MM, S, and fast-S) in case of both absence and presence of outliers.

FIGURE 3

Figure 3. Plots of MSE with respect to the number of explanatory variables. (A–D) In the absence and presence of 15, 30, and 45% outliers, respectively. LS, least-squares estimator; M-Huber, M-Huber estimator; M-Hampel, M-Hampel estimator; M-Tukey, M-Tukey estimator; LMS, least median of squares estimator; LTS, least trimmed squares estimator; MM, MM-estimator; S, S-estimator; fast-S, fast-S estimator; Proposed, Proposed estimator.

Performance Investigation in the Presence of High Leverage Points (HLP) Using Simulation Scenario 2

In this study, the performance of the proposed method was investigated in the presence of 5, 15, 30, and 45% contaminated high leverage points (HLP) in the dataset. Then all of the methods, including the proposed method, were applied to fit the regression line for each case of contamination. Results showed that the proposed method including LMS, LTS, MM, S, and fast-S produced straight lines similar to the true line in the presence of 5 and 15% HLP in the dataset, while the other methods failed to produce reasonable lines in these cases, where the true line is drawn using the true values of the parameter γ₀ = 1.0 and γ₁ = 6.0 in the regression model (Figures 4A,B). Interestingly, results showed that only the proposed method produces reasonable lines similar to the true line, while all the other methods fail to produce reasonable lines in cases of 35 and 45% HLPs (Figures 4C,D). Similar to the previous section, estimator efficiency was also investigated by generating 100 datasets with replications for each case, as stated above. The average estimated values of the regression parameters and the MSE of the estimates are given in Supplementary Table S4. Results showed that the MSEs of the proposed estimates are smaller than other estimates in most of the cases, indicating that the proposed method performed better also in cases of HLPs in the datasets (Supplementary Table S4).

FIGURE 4

Figure 4. Simulation results for two-variable regression analyses in the presence of high leverage points. (A–D) The fitted regression line in the presence of 5, 15, 30, and 45% high leverage points, respectively. LS, least-squares estimator; M-Huber, M-Huber estimator; M-Hampel, M-Hampel estimator; M-Tukey, M-Tukey estimator; LMS, least median of squares estimator; LTS, least trimmed squares estimator; MM, MM-estimator; S, S-estimator; fast-S, fast-S estimator; Proposed, Proposed estimator.

Performance Investigation Using Relative Efficiency in Case of Simulation Scenario 3

To examine the performance of the proposed method in point of finite sample efficiency and robustness, the proposed method results were compared with the most well-known estimator S, MM, and fast-S. Details about data are given in the method and material section. Figure 5 displays efficiency results for linear regression with p = 2 and p = 8, respectively. The overall summary from the results is that the proposed estimator is highly efficient, even with sample sizes as small as n = 20 or n = 500. It is worth mentioning that the proposed estimator is asymptotically more efficient than S, MM, and fast-S estimators. We also investigate the efficiencies in case of data contamination, and results show that S, MM, and fast-S estimators less efficient than the proposed estimator (Figure 5). Therefore, we conclude that the efficiencies of the proposed estimator for normal error distribution are less affected by high dimension and outliers than the three popular estimators such as S, MM, and fast-S.

FIGURE 5

Figure 5. Efficiency with respect to maximum likelihood estimator (MLE) for p = 2 and p = 8 variables regression model with normal errors for sample sizes n = 20, 50, 100, 200, and 500. (A,B) Efficiency concerning MLE for p = 2 in the absence of outliers and the presence of 10% outliers, respectively. (C,D) Efficiency concerning MLE for p = 8 in the absence of outliers and the presence of 10% outliers, respectively. LS, least-squares; S, S-estimator; fast-S, fast-S estimator; MM, MM-estimator; Proposed, Proposed estimator.

Real Data Analysis Results

Performance Investigation Based on Coleman Data

We have used the real dataset drawn from a population studied by Coleman et al. (1966) to reveal the performance of the proposed method in comparison with LS, RLS, and fast-S method. Details information of this dataset is given in the method and material section. Mosteller and Tukey (1977) analyzed this dataset using LS and RLS (reweighted LS), and the results are also described by Rousseeuw and Leroy (1987). They reported that RLS showed better performance than LS. To compare their results with the fast-S and our proposed method, we applied both algorithms in the original dataset. Then we have applied LS, RLS, fast-S, and the proposed method on the contaminated dataset. To apply our proposed method, first we have selected the appropriate β by cross-validation for both datasets. Figures 6A,B shows the cross-validation results for original and contaminated datasets, respectively. Using “one standard error rule” we select β = 0.2 for both datasets. Then we applied our proposed method with β = 0.2 in both datasets. The proposed method produces β-weight for each data point of both datasets. Figures 6C,D shows the β-weights for each data point of original and contaminated datasets, respectively. From Figure 6C, we observe that data points 3, 13 and 19 are contaminated by outliers/HLPs in the original dataset, while the β-weight plot in Figure 6D indicates data points 3, 4, 7, 10, 12, 16, and 18 are contaminated by outliers/HLPs in the contaminated dataset. The results for both datasets are given in Table 1. We observe that the LS method shows only X3 and X4 are significant with p < 0.05 for the original dataset, while RLS, fast-S, and the proposed method showed all explanatory variables are significant for regression analysis with p < 0.05.

FIGURE 6

Figure 6. β-selection by cross-validation (CV) and β-weight plot. (A,B) β-selection by cross-validation (CV) for original and recontaminated datasets, respectively. (C,D) β-weight plot for original and recontaminated data, respectively.

TABLE 1

Table 1. Results for coleman data by LS, RLS, fast-S, and the proposed method.

It should be noted here that results for LS and RLS methods are given in Rousseeuw and Leroy (1987). However, in the case of the contaminated dataset, only the proposed method showed that all explanatory variables are significant for regression analysis with a p < 0.05. Thus, it could be concluded that the proposed method is more stable than LS, RLS, and fast-S methods for regression analysis.

Detection of Essential Genes (QTLs) Influencing Hypertension of Mice Real Dataset

We also investigate the performance of our proposed method for detecting important genes influencing hypertension of mice in comparison of traditional interval mapping (IM) (Lander and Botstein, 1989) and HK-Reg (Haley and Knott, 1992) model using the real dataset of Sugiyama et al. (2001). Figure 7A represents the blood pressure (hypertension) of 250 male progeny backcrosses to B6. Figure 7B shows the scatter plot of the contaminated dataset with outliers (+). Figure 7C shows that two QTLs on chromosome 1 and two QTL on chromosome 4 are statistically highly significant, and one QTL on each of chromosomes 2, 5, 6, 8, and 15 are suggestive by all three methods for the uncontaminated real dataset. For more clearly observe the results, we draw Figures 7E,F, which showed the same result as in Figures 7C,D for the specific chromosomes. However, in the presence of outliers, almost similar results are obtained only by the proposed method, as shown in Figure 7D, which indicates that the proposed method significantly improves the performance over the traditional IM and HK- Regression methods.

FIGURE 7

Figure 7. Mouse Genome data analysis results for the backcross population. (A,B) Scatter plot of real traits (hypertension of mice) with the indexes of their sampling units in the absence and presence of phenotypic outliers, respectively. (C,D) LOD scores by the IM, HK-Regression, and the proposed methods in the absence and presence of 10% outliers, respectively. (E–F) Specific chromosomes LOD scores by the IM, HK-Regression, and the proposed methods in the absence and presence of 10% outliers, respectively.

Identification of Vital QTLs Regulating the Basal Liver and Spleen Iron Level in Mice

We have used another mice data set to investigate the performance of the proposed method to compare with the traditional methods (IM and HK-Reg) to identify the crucial genes (QTLs) controlling the iron status in mice. Scatter plot of the non-heme iron level in the liver of 284 mice in absence and contaminated dataset with 10% outliers (+) represent by Figures 8A,B, respectively. QTL search results show that two highly significant (P < 0.01) polymorphic loci on chromosomes 2 and 16 and four suggestive QTLs on chromosomes 8, 11, 15, and 19 were identified by interval mapping (IM), HK-Reg and proposed method when using the original data (Figure 8C). We have artificially contaminated this dataset by 10% outliers to investigate the performance of our proposed method compared with existing methods. Results show that IM gave misleading results and failed to detect QTL positions correctly in the presence of outliers (Figure 8D). HK-Reg method was also unable to detect the QTL position in the absence of outliers (Figure 8D). However, the proposed method gives almost similar results in the presence of outliers for the iron level in the liver (Figure 8D).

FIGURE 8

Figure 8. Genomic scan for the iron status in the liver of combined male and female sexes. (A,B) Scatter plot of real traits (iron level in the liver) with the indexes of their sampling units in the absence and presence of phenotypic outliers, respectively. (C,D) LOD scores by the IM, HK-Reg, and the proposed methods in the absence and presence of 10% outliers, respectively.

Figures 9A,B shows the scatter plot of the non-heme iron level in the spleen of 284 mice in the absence and contaminated dataset with 10% outliers (+), respectively. We have applied the same procedure to generate the artificial dataset, as mention above. QTL analysis results show that one locus in each chromosome 8 and 9 was highly significant (p < 0.01) and one suggestive QTL on chromosomes 2 were identified by IM, HK-Reg and proposed method in the absence of outliers but only proposed method identified the original position of these QTLs in the presence of 10% outliers for spleen variable (Figures 9C,D). We have only considered the identified QTLs and their positions in Figures 9E,F to observe the results clearly demonstrated in Figures 9C,D. These results suggest that the proposed method considerably improves the performance over the traditional IM and HK- Regression methods for identifying the QTL determine the basal iron status in the liver and spleen. All of the real data (education, mice hypertension, and iron data) analysis results show that the proposed method demonstrates robust properties with respect to data contaminations, overcome the weakness of the traditional methods, and identify the vital gene influencing hypertension and iron level in the liver and spleen of mice.

FIGURE 9

Figure 9. Genomic scan for the iron status in the spleen of combined male and female sexes. (A,B) Scatter plot of real traits (iron level in spleen) with the indexes of their sampling units in the absence and presence of phenotypic outliers, respectively. (C,D) LOD scores by the IM, HK-Reg, and the proposed methods in the absence and presence of 10% outliers, respectively. (E,F) Specific chromosomes LOD scores by the IM, HK-Reg, and the proposed methods in the absence and presence of 10% outliers, respectively.

Identification of Important SNPs Associated With Rice Quality Traits

Chalkiness is an essential factor influencing the grain character of rice and principally governed by genetic elements (Zhu et al., 2018). Chalkiness percentage (CP), the ratio of total grains that is chalky, and chalkiness degree (CD), the proportion of the whole area of a kernel, are two key characteristics of rice grain. We have compared the results of the proposed method with the most well-known least squares (LS), reweighted least square (RLS), and a robust method, namely fast-S, to investigate the performance. A total of 15 significant SNPs that were significantly associated with the chalkiness degree identified by our proposed method, whereas only 10 SNPs identified by both LS and RLS method with p < 1.4 × 10⁻⁵ (Table 2). However, only three significant SNP identified by the fast-S method for the CD variable (Table 3). The genomic position, SNP location, annotation, and gene symbol synonym of each SNP for CD variable is given in Supplementary Table S5. Results showed that five SNPs corresponded to genes, which are encoded a hypothetical protein. SNP rs34767210 lies in the intronic region of the candidate gene OsRH3, which is encoded similar to ATP-dependent RNA helicase. Another SNP rs257317 lies in the exonic region of the candidate gene OsPGIP4, which encodes polygalacturonase-inhibiting proteins. Furthermore, another SNP rs24313516 lies in the upstream part of the candidate gene OsMOC1 (Os06g0610300), which encodes hypothetical protein.

TABLE 2

Table 2. SNPs identification for chalkiness degree (CD) of rice controlling by only additive effects using LS, RLS, fast-S, and the proposed method.

TABLE 3

Table 3. SNPs identification for chalkiness percentage (CP) of rice controlling by only additive effects using LS, RLS, fast-S, and the proposed method.

For the CP variable, a total of 21 significant SNPs identified by our proposed method and RLS method, whereas the LS method identified 20 SNPs, but the fast-S method identified only 13 SNPs with P < 1.6 × 10⁻⁵ in this study (Table 3). Identified SNPs and their related information are given in Supplementary Table S6. The identified SNPs were located in different regions of different genes. Among those SNPs, five SNPs lie in the intergenic region, which was encoded hypothetical protein (Supplementary Table S6). More importantly, six SNPs were corresponding to the specific genes and lay in upstream, downstream, intronic and intergenic regions, and other SNPs were involved in different functions (Supplementary Table S6). For example, SNP rs154937 lies in the upstream region corresponding to the variant of Os05g0102600, which is encoded zinc finger (Znf) containing protein. SNPs rs26357917 is the variant of genes Os07g0634900 encoded MYB transcription factors (TFs) family, which is, lies in the downstream region. Another SNP rs25982235 lies in the intronic region of the corresponding gene OsDEP1 (Os07g0627000), which is encoded chloroplastic alpha-glucanotransferase involved in maltotriose metabolism (https://www.uniprot.org/uniprot/Q8LI30). SNP rs16790346 is a variant of the PERSISTENT TAPETAL CELL 1 gene that lies in the intergenic region. SNP rs25558512 lies in the intergenic region of the candidate gene Os07g0618800, which encodes a calmodulin-like protein. Also, SNP rs26127356 lies in the upstream of the OsRNS8 (Os07g0629900), which encode Ribonuclease (RNases) T2 domain-containing protein. Two SNPs rs26502245 and rs490822 lie in the exonic and intergenic region of corresponding genes Os02g0656100 and Os05g0108450, respectively, were common in both CD and CP variables. The above results indicated that several mechanisms might be involved for controlling rice grain quality, including Znf protein and transcription factors.

Discussion

This paper proposed a new robust regression analysis approach by maximizing the β-likelihood function. The β-likelihood function is a limiting case of the log-likelihood function, and it is induced from β-divergence. The value of the important parameter β plays a critical role in the performance of the proposed method. An appropriate value for the tuning parameter β is selected by CV. The CV procedure finds β = 0 in the absence of outliers, while in the presence of outlying observations, it produces β > 0. The robustness properties of the proposed method are discussed using the influence function. Simulation and real data analysis were used to evaluate the performance of proposed β-estimators. Simulation results exhibit that the proposed method executes better than popular robust methods (i.e., M-Huber, M-Hampel, M-Tukey, LTS, LMS, MM, S, and fast-S) in case of both outliers and high leverage points. MSE of the proposed method is smaller than other methods for the small and large number of explanatory variables indicated that the proposed method's estimators are efficient estimators. Because it is said that an estimator having smaller MSE is a good estimator than others. Relative efficiency analysis results show that the proposed method outperforms than other robust methods (S, MM, and fast-S) in case of high dimension and outliers.

We performed GWAS on the different types of genome datasets for checking the performance of the proposed method to identify the biomarker genes/SNPs. Genome (mice hypertension and iron data) and education data analysis results demonstrated that the proposed method shows robust properties with respect data contaminations, overcome the drawback of the traditional methods and identify the essential variables related to education and vital gene influencing hypertension of mice and iron level in liver and spleen. Moreover, we introduced a new approach for GWAS to identify essential SNPs responsible for important agronomic traits. Grain quality consists of different sophisticated features such as appearance character, milled feature, nutritional characters, and cooking attribute, where grain appearance is correctly connected by grain shape and chalkiness (Zhu et al., 2018). High CP and CD are leading barriers to develop rice grain characteristics in China (Feng et al., 2017). Many QTLs have been identified, but only several rice chalkiness responsible QTLs were cloned or fine mapped due to the genetics difficulty and volatility of chalky quality till now (Zhu et al., 2018).

GWAS demonstrated that identified SNPs have different functions, including chalkiness specific function in this study. For example, SNPs rs34767210, rs257317, and rs24313516 corresponding to genes OsRH3, OsPGIP4, and MOC1 (MONOCULM 1), respectively for CP. Research showed that different essential functions, including ribosome biogenesis, translation origination, RNA synthesis, alteration, segmentation, and degradation governed by DEAD-box RNA helicases (Linder and Jankowsky, 2011). PGIPs (Polygalacturonase-inhibiting proteins), which are usually leucine-rice repeat (LRR) proteins, are assumed to perform vital roles for improving the resistance against a bacterial pathogen and also protection rice vs. fungi (Feng et al., 2016). MOC1, first identified gene controlling tiller number, regulated the origination and extension of auxiliary meristems at both vegetative and generative periods (Xu et al., 2012). Consequently, tiller and panicle branches affected by the loss of function MOC1 in rice (Xu et al., 2012).

We have identified six important SNPs associated with the chalkiness percentage in this study. Those SNPs were corresponding to different genes encoding Znf protein, TFs, calmodulin-like protein, and RNases T2 domain-containing protein. Research showed that Znf proteins have diverse functions, including gene transcription, translation, cell adhesion, protein folding, RNA packaging, chromatin remodeling, zinc identifying and lipid binding, and many more (Laity et al., 2001). A study showed that MYB protein regulating the hormonal reactions throughout seed sprouting and development (Ambawat et al., 2013). A recent study showed that the PTC1 gene regulating the tapetal growth and pollen formation in rice (Li et al., 2011). A target gene's expression could be activated or suppressed governing by plant calmodulin-like proteins as it binds DNA straightaway and functioning as TF (Hoeflich and Ikura, 2002). Another recent study demonstrated that OsRNS8 belongs to the S-RNases subfamilies whose genes related to tissue and stress (MacIntosh et al., 2010).

In this study, we have proposed an MLE-based robust regression method by maximizing the β-likelihood function. The selection of the tuning parameter β plays a crucial role in the performance of the proposed method. Note that the proposed method reduces to the classical MLE-based regression analysis for β = 0 in case of absence of outliers in the data, and the value of the tuning parameter β will be higher than zero in the presence of outliers in the data, which will be selected by the cross-validation procedures. Thus, a suitable value of β might assure the trade-off between the robustness and efficiency of estimators. It might be recognized that the standard linear regression procedure would be more beneficial than any other regression method in case of the absence of the outliers in the data, but any robust regression methods would be better if there are any outliers in the datasets. More importantly, it is very demanding and time consuming for the researchers to use an appropriate regression method if datasets have outliers. In this context, our proposed robust method would be appropriate than any other regression methods.

The efficiency of an estimator depends on the MSE value, and an estimator having smaller MSE than other estimators is said to be a good estimator. Simulation results suggested that the proposed method gives efficient estimators than other methods mentioned here in case of outliers and HLPs, elsewhere it shows similar performance in comparison with traditional methods. Also, our real data analysis including education, mice hypertension, and iron data showed that the proposed method gives efficient estimators and demonstrates robust properties with respect data contaminations, overcome the weakness of the traditional methods, and identify the vital gene influencing hypertension and iron level in liver and spleen of mice. Using the rice SNPs data, we have identified some highly significant SNPs associated with grain chalkiness traits in rice in this study. The identified SNPs might be useful for rice breeders to characterize the genes responsible for grain quality traits. Further validation of these SNPs might confirm their functions and provide the new resources for grain quality-related genes in rice. These results showed that the performance of the proposed method was better than other methods in both the absence and presence of outliers for artificial and real data analysis. Taken together, we conclude that the proposed method offers an important basis for further understanding of the robust regression analysis, which might be applied in various fields, including business, genetics, and bioinformatics.

Data Availability Statement

The plant SNP data generated for this article can be found here: https://figshare.com/s/1a3e7a233820d05fcdfe publicly available datasets were also used in this study. Information about these can be found in the article.

Author Contributions

MM conceptualization of the proposed method. MA and MM developed the method. HX provided the SNPs data, managed and organized the project. MA and MS performed the simulation and real data analysis. MA wrote the manuscript. MM and HX critically revised this manuscript. All authors have read and approved the final manuscript.

Funding

The work was financially supported by grants from the National Key Research and Development Program (2016YFC1303300), NSFC (31671570, 31871707), 111 Project (BP2018021), and Rajshahi University Research Project: A-1416/5/52/RU/Science-11/18-19.

Conflict of Interest

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Acknowledgments

We are grateful to the reviewers and editors for their valuable comments that help us to improve the manuscript. The authors are thankful to the International Postdoctoral Exchange Fellowship Program (Talent-Introduction Program) by China Postdoc Council (OCPC) and Zhejiang University, China for providing financial support during this research. We acknowledge the Dept. of Statistics, Rajshahi University, for providing us the facilities to develop this algorithm. We also acknowledge the Institute of Bioinformatics, Zhejiang University, for providing us the real SNP dataset to demonstrate the algorithm.

Supplementary Material

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fgene.2020.00549/full#supplementary-material

References

Ambawat, S., Sharma, P., Yadav, N. R., and Yadav, R. C. (2013). MYB transcription factor genes as regulators for plant responses: an overview. Physiol. Mol. Biol. Plants 19, 307–321. doi: 10.1007/s12298-013-0179-1

PubMed Abstract | CrossRef Full Text | Google Scholar

Basu, A., Harris, I. R., Hjort, N. L., and Jones, M. C. (1998). Robust and efficient estimation by minimising a density power divergence. Biometrika 85, 549–559. doi: 10.1093/biomet/85.3.549

CrossRef Full Text | Google Scholar

Beaton, A. E., and Tukey, J. W. (1974). The Fitting of power series, meaning polynomials, illustrated on band-spectroscopic data. Technometrics 16, 147–185. doi: 10.1080/00401706.1974.10489171

CrossRef Full Text | Google Scholar

Broman, K. W., Wu, H., Sen, S., and Churchill, G. A. (2003). R/qtl: QTL mapping in experimental crosses. Bioinformatics 19, 889–890. doi: 10.1093/bioinformatics/btg112

PubMed Abstract | CrossRef Full Text | Google Scholar

Coleman, J. S., Campbell, E., Hobson, C., McPartland, J., Mood, A., Weinfeld, F., et al. (1966). Equality of educational opportunity, 2 Vol. Washington, DC: US Government Printing Office, Superintendent of Documents Catalog No. FS 5, 38001.

Feng, C. S., Zhang, X., Wu, T., Yuan, B., Ding, X. H., Yao, F. Y., et al. (2016). The polygalacturonase-inhibiting protein 4 (OsPGIP4), a potential component of the qBlsr5a locus, confers resistance to bacterial leaf streak in rice. Planta 243, 1297–1308. doi: 10.1007/s00425-016-2480-z

PubMed Abstract | CrossRef Full Text | Google Scholar

Feng, F., Li, Y., Qin, X., Liao, Y., and Siddique, K. H. M. (2017). Changes in rice grain quality of indica and japonica type varieties released in China from 2000 to 2014. Front. Plant Sci. 8:1863. doi: 10.3389/fpls.2017.01863

PubMed Abstract | CrossRef Full Text | Google Scholar

Friedman, J., Hastie, T., and Tibshirani, R. (2001). The Elements of Statistical Learning. Springer series in statistics New York, NY. doi: 10.1007/978-0-387-21606-5

CrossRef Full Text | Google Scholar

Grant, G. R., Robinson, S. W., Edwards, R. E., Clothier, B., Davies, R., Judah, D. J., et al. (2006). Multiple polymorphic loci determine basal hepatic and splenic iron status in mice. Hepatology 44, 174–185. doi: 10.1002/hep.21233

PubMed Abstract | CrossRef Full Text | Google Scholar

Haley, C. S., and Knott, S. A. (1992). A Simple regression method for mapping quantitative trait loci in line crosses using flanking markers. Heredity 69, 315–324. doi: 10.1038/hdy.1992.131

PubMed Abstract | CrossRef Full Text | Google Scholar

Hampel, F. R. (1974). The influence curve and its role in robust estimation. J. Am. Stat. Assoc. 69, 383–393. doi: 10.1080/01621459.1974.10482962

CrossRef Full Text | Google Scholar

Hampel, F. R. (1975). Beyond location parameters: robust concepts and methods. Bull. Int. Stat. Inst. 46, 375–382.

Hampel, F. R., Ronchetti, E. M., Rousseeuw, P. J., and Stahel, W. A. (1986). Robust statistics. New York, NY: Wiley Online Library.

Google Scholar

Hoeflich, K. P., and Ikura, M. (2002). Calmodulin in action: diversity in target recognition and activation mechanisms. Cell 108, 739–742. doi: 10.1016/S0092-8674(02)00682-7

PubMed Abstract | CrossRef Full Text | Google Scholar

Huber, P. J. (1973). 1972 Wald Memorial lectures - robust regression - asymptotics, conjectures and monte-carlo. Ann. Stat. 1, 799–821. doi: 10.1214/aos/1176342503

CrossRef Full Text | Google Scholar

Laity, J. H., Lee, B. M., and Wright, P. E. (2001). Zinc finger proteins: new insights into structural and functional diversity. Curr. Opin. Struc. Biol. 11, 39–46. doi: 10.1016/S0959-440X(00)00167-6

PubMed Abstract | CrossRef Full Text | Google Scholar

Lander, E. S., and Botstein, D. (1989). Mapping mendelian factors underlying quantitative traits using rflp linkage maps. Genetics 121, 185–199.

PubMed Abstract | Google Scholar

Li, H., Yuan, Z., Vizcay-Barrena, G., Yang, C. Y., Liang, W. Q., Zong, J., et al. (2011). Persistent tapetal cell1 encodes a PHD-finger protein that is required for tapetal cell death and pollen development in rice. Plant Physiol. 156, 615–630. doi: 10.1104/pp.111.175760

PubMed Abstract | CrossRef Full Text | Google Scholar

Linder, P., and Jankowsky, E. (2011). From unwinding to clamping - the DEAD box RNA helicase family. Nat. Rev. Mol. Cell Biol. 12, 505–516. doi: 10.1038/nrm3154

PubMed Abstract | CrossRef Full Text | Google Scholar

MacIntosh, G. C., Hillwig, M. S., Meyer, A., and Flagel, L. (2010). RNase T2 genes from rice and the evolution of secretory ribonucleases in plants. Mol. Genet. Genomics 283, 381–396. doi: 10.1007/s00438-010-0524-9

PubMed Abstract | CrossRef Full Text | Google Scholar

Mihoko, M., and Eguchi, S. (2002). Robust blind source separation by beta divergence. Neural Comput. 14, 1859–1886. doi: 10.1162/089976602760128045

PubMed Abstract | CrossRef Full Text | Google Scholar

Mollah, M. N., Sultana, N., Minami, M., and Eguchi, S. (2010). Robust extraction of local structures by the minimum beta-divergence method. Neural Netw. 23, 226–238. doi: 10.1016/j.neunet.2009.11.011

PubMed Abstract | CrossRef Full Text | Google Scholar

Mollah, M. N. H., Eguchi, S., and Minami, M. (2007). Robust prewhitening for ICA by minimizing β-divergence and its application to fastICA. Neural Process. Lett. 25, 91–110. doi: 10.1007/s11063-006-9023-8

CrossRef Full Text | Google Scholar

Mosteller, F., and Tukey, J. W. (1977). Data analysis and regression: a second course in statistics. Addison-Wesley Series in Behavioral Science: Quantitative Methods.

Google Scholar

Rousseeuw, P., and Yohai, V. (1984). “Robust regression by means of S-estimators,” in Robust and Nonlinear Time Series Analysis, eds P. Bickel, P. Diggle, S. Fienberg, U. Gather, and S. Zeger (New York, NY: Springer), 256–272. doi: 10.1007/978-1-4615-7821-5_15

CrossRef Full Text | Google Scholar

Rousseeuw, P. J. (1984). Least median of squares regression. J. Am. Stat. Assoc. 79, 871–880. doi: 10.1080/01621459.1984.10477105

CrossRef Full Text | Google Scholar

Rousseeuw, P. J., and Van Driessen, K. (2006). Computing LTS regression for large data sets. Data Min. Knowl. Disc. 12, 29–45. doi: 10.1007/s10618-005-0024-4

CrossRef Full Text | Google Scholar

Rousseeuw, P. J., and Leroy, A. M. (1987). Robust Regression and Outlier Detection. New York, NY: Wiley Online Library. doi: 10.1002/0471725382

PubMed Abstract | CrossRef Full Text | Google Scholar

Ruppert, D. (1992). Computing S estimators for regression and multivariate location/dispersion. J. Comput. Grap. Stat. 1, 253–270. doi: 10.2307/1390719

CrossRef Full Text | Google Scholar

Salibian-Barrera, M., and Yohai, V. J. (2006). A fast algorithm for S-regression estimates. J. Comput. Grap. Stat. 15, 414–427. doi: 10.1198/106186006X113629

CrossRef Full Text | Google Scholar

Sugiyama, F., Churchill, G. A., Higgins, D. C., Johns, C., Makaritsis, K. P., Gavras, H., et al. (2001). Concordance of murine quantitative trait loci for salt-induced hypertension with rat and human loci. Genomics 71, 70–77. doi: 10.1006/geno.2000.6401

PubMed Abstract | CrossRef Full Text | Google Scholar

Wang, J., Qi, M., Liu, J., and Zhang, Y. (2015). CARMO: a comprehensive annotation platform for functional exploration of rice multi-omics data. Plant J 83, 359–374. doi: 10.1111/tpj.12894

PubMed Abstract | CrossRef Full Text | Google Scholar

Xu, C., Wang, Y., Yu, Y., Duan, J., Liao, Z., Xiong, G., et al. (2012). Degradation of monoculm 1 by APC/CTAD1 regulates rice tillering. Nat. Commun. 3:750. doi: 10.1038/ncomms1743

PubMed Abstract | CrossRef Full Text | Google Scholar

Xu, H. M., Jiang, B. B., Cao, Y. J., Zhang, Y. X., Zhan, X. D., Shen, X. H., et al. (2015). Detection of epistatic and gene-environment interactions underlying three quality traits in rice using high-throughput genome-wide data. Biomed. Res. Int. 2015:135782. doi: 10.1155/2015/135782

PubMed Abstract | CrossRef Full Text | Google Scholar

Yohai, V. J. (1987). High breakdown-point and high-efficiency robust estimates for regression. Ann. Stat. 15, 642–656. doi: 10.1214/aos/1176350366

CrossRef Full Text | Google Scholar

Yohai, V. J., and Zamar, R. H. (1988). High breakdown-point estimates of regression by means of the minimization of an efficient scale. J. Am. Stat. Assoc. 83, 406–413. doi: 10.1080/01621459.1988.10478611

CrossRef Full Text | Google Scholar

Zhu, A., Zhang, Y., Zhang, Z., Wang, B., Xue, P., Cao, Y., et al. (2018). Genetic dissection of qPCG1 for a quantitative trait locus for percentage of chalky grain in rice (oryza sativa l.). Front. Plant Sci. 9:1173. doi: 10.3389/fpls.2018.01173

PubMed Abstract | CrossRef Full Text | Google Scholar

Zhu, Z. X., Tong, X. R., Zhu, Z. H., Liang, M. M., Cui, W. Y., Su, K. K., et al. (2013). Development of GMDR-GPU for gene-gene interaction analysis and its application to WTCCC GWAS data for type 2 diabetes. PLoS ONE 8:e61943. doi: 10.1371/journal.pone.0061943

PubMed Abstract | CrossRef Full Text | Google Scholar

Keywords: linear regression, robustness, β-divergence method, β-selection, β-test statistic, permutation test, GWAS, genes and SNPs

Citation: Alamin M, Sultana MH, Xu H and Mollah MNH (2020) Robustification of Linear Regression and Its Application in Genome-Wide Association Studies. Front. Genet. 11:549. doi: 10.3389/fgene.2020.00549

Received: 02 October 2019; Accepted: 07 May 2020;
Published: 08 June 2020.

Edited by:

Mariza De Andrade, Mayo Clinic, United States

Reviewed by:

Tian Qing Zheng, Chinese Academy of Agricultural Sciences, China
Yousaf Hayat, University of Agriculture, Peshawar, Pakistan
Daolong Wang, Syngenta, United States

Copyright © 2020 Alamin, Sultana, Xu and Mollah. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

*Correspondence: Haiming Xu, aG14dUB6anUuZWR1LmNu; Md. Nurul Haque Mollah, bW9sbGFoLnN0YXQuYmlvQHJ1LmFjLmJk

Disclaimer: All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article or claim that may be made by its manufacturer is not guaranteed or endorsed by the publisher.