Edited by: Jason W. Osborne, Old Dominion University, USA
Reviewed by: Matt Jans, The United States Census Bureau, USA; Avi Allalouf, National Institute for Testing and Evaluation, Israel
*Correspondence: W. Holmes Finch, Department of Educational Psychology, Ball State University, Muncie, IN 47304, USA. e-mail:
This article was submitted to Frontiers in Quantitative Psychology and Measurement, a specialty of Frontiers in Psychology.
This is an open-access article distributed under the terms of the
The presence of outliers can be very problematic in data analysis, leading statisticians to develop a wide variety of methods for identifying them in both the univariate and multivariate contexts. In the case of the latter, perhaps the most popular approach has been the Mahalanobis distance, where large values suggest an observation that is unusual compared to the center of the data. However, researchers have identified problems with the application of this metric such that its utility may be limited in some situations. As a consequence, other methods for detecting outlying observations have been developed and studied. However, a number of these approaches, while apparently robust and useful, have not made their way into general practice in the social sciences. Thus, the goal of this study was to describe some of these methods and demonstrate them using a well-known dataset from a popular multivariate textbook widely used in the social sciences. Results demonstrated that the methods do indeed result in datasets with very different distributional characteristics. These results are discussed in light of how they might be used by researchers and practitioners.
The presence of outliers is a ubiquitous and sometimes problematic aspect of data analysis. They can result from a variety of processes, including data recording and entry errors, obtaining samples from other than the target population, and sampling unusual individuals from the target population itself (Kruskal,
A number of authors have sought to precisely define what constitutes an outlier (e.g., Evans,
In the multivariate context, the most commonly recommended approach for outlier detection is the Mahalanobis Distance (
Outliers can have a dramatic impact on the results of common multivariate statistical analyses. For example, they can distort correlation coefficients (Marascuilo and Serlin,
While outliers can be problematic from a statistical perspective, it is not always advisable to remove them from the data. When these observations are members of the target population, their presence in the dataset can be quite informative regarding the nature of the population (e.g., Mourão-Miranda et al.,
Given the negative impact that outliers can have on multivariate statistical methods, their accurate detection is an important matter to consider prior to data analysis (Tabachnick and Fidell,
When thinking about the impact of outliers, perhaps the key consideration is the breakdown point of the statistical analysis in question. The breakdown point can be thought of as the minimum proportion of a sample that must consist of outliers before they have a notable impact on the statistic of interest. In other words, if a statistic has a breakdown point of 0.1, then 10% of the sample could consist of outliers without markedly impacting the statistic. However, if the next observation beyond this 10% were also an outlier, the statistic in question would then be impacted by its presence (Maronna et al.,
While the breakdown point is typically thought of as a characteristic of a statistic, it can also be a characteristic of a statistic in conjunction with a particular method of outlier detection. Thus, if a researcher calculates the sample mean after removing outliers using a method such as
Another important property for a statistical measure of location (e.g., mean) is that it exhibit both location and scale equivariance (Wilcox,
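To make these properties concrete, they can be stated formally. A location estimator $\hat{\theta}$ (notation ours) is location and scale equivariant if, for any constants $a$ and $b$,

$$\hat{\theta}(aX_1 + b,\, aX_2 + b,\, \ldots,\, aX_n + b) = a\,\hat{\theta}(X_1, X_2, \ldots, X_n) + b.$$

That is, shifting every observation by $b$ shifts the estimate by $b$, and rescaling every observation by $a$ rescales the estimate by $a$; both the sample mean and the sample median satisfy this property.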
Following is a description of several approaches for outlier detection. For the most part, these descriptions are presented conceptually, including technical details only when they are vital to understanding how the methods work. References are provided for the reader who is interested in learning more about the technical aspects of these approaches. In addition to these descriptions, Table
| Method | Equation/Approach | Reference | Strengths | Weaknesses |
|---|---|---|---|---|
| Mahalanobis ($D$) | $D_i^2 = (x_i - \bar{x})'\,S^{-1}\,(x_i - \bar{x})$ | | Intuitively easy to understand; easy to calculate; familiar to other researchers | Sensitive to outliers; assumes data are continuous |
| MVE | Identify subset of data contained within the ellipsoid that has minimized volume | Rousseeuw and Leroy | Yields mean with maximum possible breakdown point | May remove as much as 50% of sample |
| MCD | Identify subset of data that minimizes the determinant of the covariance matrix | Rousseeuw and van Driessen | Yields mean with maximum possible breakdown point | May remove as much as 50% of sample |
| MGV | Calculate a MAD-standardized version of each point's distance from the center of the data | Wilcox | Typically removes fewer observations than either MVE or MCD | Generally does not have as high a breakdown point as MVE or MCD |
| P1 | Identify the multivariate center of data using MCD or MVE and then determine its relative distance from this center (depth); use the MGV criteria based on this depth to identify outliers | Donoho and Gasko | Approximates an affine equivariant outlier detection method; may not exclude as many cases as MVE or MCD | Will not typically lead to a mean with the maximum possible breakdown point |
| P2 | Identify all possible lines between all pairs of observations in order to determine depth of each point | Donoho and Gasko | Some evidence that this method is more accurate than P1 in terms of identifying outliers | Extensive computational time, particularly for large datasets |
| P3 | Same approach as P1 except that the criterion for identifying outliers is the second (quartile-based) rule described in the text | Donoho and Gasko | May yield a mean with a higher breakdown point than other projection methods | Will likely lead to exclusion of more observations as outliers than will other projection approaches |
The most commonly recommended approach for multivariate outlier detection is the Mahalanobis distance, $D$. For observation $i$ it takes the form

$$D_i^2 = (x_i - \bar{x})'\,S^{-1}\,(x_i - \bar{x})$$

where $x_i$ is the vector of scores for observation $i$ on the $p$ variables, $\bar{x}$ is the vector of sample means, and $S$ is the sample covariance matrix. Larger values of $D_i^2$ indicate observations lying farther from the multivariate center of the data.
A number of recommendations exist in the literature for identifying when this value is large; i.e., when an observation might be an outlier. The approach used here will be to compare $D_i^2$ to the quantiles of a $\chi^2$ distribution with degrees of freedom equal to the number of variables, with observations exceeding a conservative quantile (e.g., 0.999) flagged as potential outliers.
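As a brief illustration (a sketch of this criterion, not code taken from the original article), the comparison can be carried out in R with base functions, where `X` stands for a numeric data matrix:

```r
# Squared Mahalanobis distance of each row of X from the mean vector,
# based on the sample covariance matrix
d2 <- mahalanobis(X, colMeans(X), cov(X))

# Flag observations exceeding a conservative chi-square quantile
# (0.999 here) with df equal to the number of variables
outliers <- which(d2 > qchisq(0.999, df = ncol(X)))
```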
One of the earliest alternative approaches to outlier detection was the Minimum Volume Ellipsoid (MVE), developed by Rousseeuw and Leroy. The general strategy underlying MVE is to repeatedly sample subsets containing roughly half of the observations and to calculate the volume of the ellipsoid created by each. The final sample to be used in further analyses is that which yields the smallest ellipsoid. An example of such an ellipsoid based on MVE can be seen in Figure
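To make this concrete, here is a minimal sketch using `cov.rob()` from the MASS package (the same function employed in the Appendix); `full.data` is assumed to be a numeric data frame:

```r
library(MASS)

# Fit the minimum volume ellipsoid; nsamp = "best" searches candidate
# subsets exhaustively up to the function's internal limit
mve.fit <- cov.rob(full.data, method = "mve", nsamp = "best")

# Rows inside the minimum volume ellipsoid (the retained cases)
mve.keep <- full.data[mve.fit$best, ]

# Robust squared distances based on the MVE center and covariance
rd2 <- mahalanobis(full.data, mve.fit$center, mve.fit$cov)
```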
The minimum covariance determinant (MCD) approach to outlier detection is similar to the MVE in that it searches for a portion of the data that eliminates the presence and impact of outliers. However, whereas MVE seeks to do this by minimizing the volume of an ellipsoid created by the retained points, MCD does it by minimizing the determinant of the covariance matrix, which is an estimate of the generalized variance in a multivariate set of data (Rousseeuw and van Driessen,
As with MVE, the logistics of searching every possible subset of the data make an exhaustive search impractical for all but the smallest samples, so in practice an approximate search over many randomly selected subsets is used, with the retained subset being the one that minimizes the determinant of the covariance matrix.
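The corresponding sketch for MCD, again via `cov.rob()` as in the Appendix:

```r
library(MASS)

# Fit the minimum covariance determinant estimator
mcd.fit <- cov.rob(full.data, method = "mcd", nsamp = "best")

# Retained (non-outlying) cases identified by the MCD search
mcd.keep <- full.data[mcd.fit$best, ]
```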
One potential difficulty with both MVE and MCD is that they tend to identify a relatively large number of outliers when the variables under examination are not independent of one another (Wilcox,
As with MVE and MCD, MGV is an iterative procedure. In the first step, the most centrally located points in the sample are identified and form an initial cluster. The remaining observations are then added to this cluster one at a time, with each point receiving a depth value, $d_i$, that reflects how much its inclusion increases the generalized variance of the cluster. Distances in this procedure are standardized using the median absolute deviation (MAD),

$$\mathrm{MAD} = \mathrm{median}\{|X_1 - M_x|,\, |X_2 - M_x|,\, \ldots,\, |X_n - M_x|\}$$

In other words, MAD, the median absolute deviation, is the median of the deviations between each individual data point and the median of the dataset, $M_x$. Once every observation has received a depth value, points for which

$$d_i > M_d + \sqrt{\chi^2_{0.975,\,p}}\left(\frac{\mathrm{MAD}_d}{0.6745}\right)$$

would be considered outliers, where $M_d$ and $\mathrm{MAD}_d$ are the median and median absolute deviation of the depth values, $p$ is the number of variables, and $\chi^2_{0.975,\,p}$ is the 0.975 quantile of the chi-square distribution with $p$ degrees of freedom.
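As a small illustration of the MAD standardization (a simplified sketch in base R, not the full MGV algorithm; note that R's `mad()` applies the consistency constant 1.4826 ≈ 1/0.6745 by default):

```r
# MAD-standardized squared distance of each observation from the
# vector of column medians
mad.dist <- function(X) {
  X <- as.matrix(X)
  ctr <- apply(X, 2, median)  # column medians
  scl <- apply(X, 2, mad)     # column MADs, rescaled by 1.4826
  Z <- sweep(sweep(X, 2, ctr, "-"), 2, scl, "/")
  rowSums(Z^2)
}
```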
Another alternative for identifying multivariate outliers is based on the notion of the depth of one data point among a set of other points. The idea of depth was described by Tukey (
For the purposes of this explanation, we will avoid presenting the mathematical equations that underlie the projection-based outlier detection approach. The interested reader is encouraged to refer to Wilcox (
1. A line is drawn connecting the multivariate center and the point in question, $X_i$.
2. A line perpendicular to the line in 1 is then drawn from each of the other observations.
3. The location where the line in 2 intersects with the line in 1 is the projected depth of that observation along the line.
4. Steps 1–3 are then repeated such that each of the observations in the dataset takes a turn serving as the point in question.
5. For a given observation, each of its resulting depth values is then compared against an outlier criterion, as described below; a concrete sketch of the projection in steps 1–3 follows this list.
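Geometrically, steps 1–3 amount to an orthogonal projection of every observation onto the line through the multivariate center and $X_i$. A minimal R sketch (the function name and simplifications are ours, not Wilcox's `outpro()`):

```r
# Project all rows of X onto the line through 'center' and X[i, ];
# returns each point's signed position along that line
project.onto <- function(X, center, i) {
  X <- as.matrix(X)
  dir <- as.numeric(X[i, ] - center)  # direction defined by point i
  dir <- dir / sqrt(sum(dir^2))       # unit-length direction vector
  Xc <- sweep(X, 2, center)           # center the data
  as.numeric(Xc %*% dir)              # projected positions
}
```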
As mentioned earlier, there is an alternative approach to the projection method, which is not based on finding the multivariate center of the distribution. Rather, all possible lines connecting each pair of observations are considered, and the depth of each point is determined relative to these lines.
The literature on multivariate outlier detection using the projection-based method includes two different criteria against which an observation can be judged as an outlier. The first of these is essentially identical to that used for the MGV approach described above. The second is a quartile-based rule under which an observation is declared an outlier when

$$d_i > q_2 + \sqrt{\chi^2_{0.975,\,p}}\,(q_2 - q_1)$$

where $q_1$ and $q_2$ are estimates of the lower and upper quartiles of the depth values.
The primary goal of this study was to describe alternatives to the Mahalanobis distance ($D$) for the detection of multivariate outliers, and to demonstrate their use with a well-known dataset from the social sciences.
The Women’s Health and Drug study that is described in detail in Tabachnick and Fidell provided the data for this demonstration. The three variables examined here were the number of visits to health professionals (TIMEDRS), attitudes toward the use of medication (ATTDRUG), and attitudes toward housework (ATTHOUSE).
In order to explore the impact of the various outlier detection methods included here, a variety of statistical analyses were conducted subsequent to the application of each approach. In particular, distributions of the three variables were examined for the datasets created by the various outlier detection methods, as well as for the full dataset. The strategy in this study was to remove all observations that were identified as outliers by each method, thus creating datasets for each approach that included only those observations not deemed to be outliers. It is important to note that this is not typically recommended practice, nor is it being suggested here. Rather, the purpose of this study was to demonstrate the impact of each method on the data itself. Therefore, rather than take the approach of examining each outlier carefully to ascertain whether it was truly part of the target population, the strategy was to remove those cases identified as outliers prior to conducting statistical analyses. In this way, it was hoped that the reader could clearly see the way in which each detection method worked and how this might impact resulting analyses. In terms of the actual data analysis, the focus was on describing the resulting datasets. Therefore, distributional characteristics of each variable within each method were calculated, including the mean, median, standard deviation, skewness, kurtosis, and first and third quartiles. In addition, distributions of the variables were examined using boxplots. Finally, in order to demonstrate the impact of these approaches on relational measures, Pearson’s correlation coefficient was estimated between each pair of variables. All statistical analyses, including the identification of outliers, were carried out using the R software package, version 2.12.1 (R Foundation for Statistical Computing,
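For reference, a compact sketch of the descriptive summary computed for each retained dataset (the helper name is ours, and skewness and kurtosis are taken from the e1071 package as one of several possibilities):

```r
library(e1071)  # skewness() and kurtosis()

# Distributional characteristics of each variable in a data frame
describe.vars <- function(dat) {
  t(sapply(dat, function(x) c(
    mean = mean(x), median = median(x),
    q1 = unname(quantile(x, 0.25)), q3 = unname(quantile(x, 0.75)),
    sd = sd(x), skew = skewness(x), kurt = kurtosis(x))))
}

round(describe.vars(full.data), 2)  # summary for the full dataset
round(cor(full.data), 2)            # Pearson correlations
```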
An initial examination of the full dataset using boxplots appears in Figure
| Statistic | Variable | Full | D | MCD | MVE | MGV | P1 | P2 | P3 |
|---|---|---|---|---|---|---|---|---|---|
| Mean | TIMEDRS | 7.90 | 6.67 | 2.45 | 3.37 | 7.64 | 5.27 | 5.17 | 5.27 |
| | ATTDRUG | 7.69 | 7.67 | 7.69 | 7.68 | 7.68 | 7.65 | 7.64 | 7.65 |
| | ATTHOUSE | 23.53 | 23.56 | 23.36 | 23.57 | 23.50 | 23.54 | 23.54 | 23.54 |
| Median | TIMEDRS | 4.00 | 4.00 | 2.00 | 3.00 | 2.00 | 2.00 | 2.00 | 2.00 |
| | ATTDRUG | 8.00 | 8.00 | 8.00 | 8.00 | 7.00 | 7.00 | 7.00 | 7.00 |
| | ATTHOUSE | 24.00 | 24.00 | 23.00 | 23.00 | 21.00 | 21.00 | 21.00 | 21.00 |
| First quartile | TIMEDRS | 2.00 | 2.00 | 1.00 | 2.00 | 2.00 | 2.00 | 2.00 | 2.00 |
| | ATTDRUG | 7.00 | 7.00 | 7.00 | 7.00 | 7.00 | 7.00 | 7.00 | 7.00 |
| | ATTHOUSE | 21.00 | 21.00 | 21.00 | 21.00 | 21.00 | 21.00 | 21.00 | 21.00 |
| Third quartile | TIMEDRS | 10.00 | 9.00 | 4.00 | 5.00 | 9.50 | 7.00 | 7.00 | 7.00 |
| | ATTDRUG | 9.00 | 8.25 | 8.00 | 8.00 | 8.00 | 8.00 | 8.00 | 8.00 |
| | ATTHOUSE | 27.00 | 26.25 | 26.00 | 26.00 | 26.50 | 26.00 | 26.75 | 26.00 |
| SD | TIMEDRS | 10.95 | 7.35 | 1.59 | 2.22 | 10.23 | 4.72 | 4.58 | 4.72 |
| | ATTDRUG | 1.16 | 1.16 | 0.89 | 0.81 | 1.15 | 1.56 | 1.52 | 1.56 |
| | ATTHOUSE | 4.48 | 4.22 | 3.67 | 3.40 | 4.46 | 4.25 | 4.26 | 4.25 |
| Skewness | TIMEDRS | 3.23 | 2.07 | 0.23 | 0.46 | 3.15 | 1.12 | 1.08 | 1.12 |
| | ATTDRUG | −0.12 | −0.11 | −0.16 | 0.01 | −0.12 | −0.09 | −0.09 | −0.09 |
| | ATTHOUSE | −0.45 | −0.06 | −0.03 | 0.06 | −0.46 | −0.03 | −0.03 | −0.03 |
| Kurtosis | TIMEDRS | 15.88 | 7.92 | 2.32 | 2.49 | 15.77 | 3.42 | 3.29 | 3.42 |
| | ATTDRUG | 2.53 | 2.51 | 2.43 | 2.31 | 2.54 | 2.51 | 2.51 | 2.51 |
| | ATTHOUSE | 4.50 | 2.71 | 2.16 | 2.17 | 4.54 | 2.69 | 2.68 | 2.69 |
| Method | Variable | TIMEDRS | ATTDRUG | ATTHOUSE |
|---|---|---|---|---|
| Full | TIMEDRS | 1.00 | 0.10 | 0.13 |
| | ATTDRUG | 0.10 | 1.00 | 0.03 |
| | ATTHOUSE | 0.13 | 0.03 | 1.00 |
| D | TIMEDRS | 1.00 | 0.07 | 0.08 |
| | ATTDRUG | 0.07 | 1.00 | 0.02 |
| | ATTHOUSE | 0.08 | 0.02 | 1.00 |
| MCD | TIMEDRS | 1.00 | 0.25 | 0.19 |
| | ATTDRUG | 0.25 | 1.00 | 0.26 |
| | ATTHOUSE | 0.19 | 0.26 | 1.00 |
| MVE | TIMEDRS | 1.00 | 0.33 | 0.05 |
| | ATTDRUG | 0.33 | 1.00 | 0.32 |
| | ATTHOUSE | 0.05 | 0.32 | 1.00 |
| MGV | TIMEDRS | 1.00 | 0.07 | 0.10 |
| | ATTDRUG | 0.07 | 1.00 | 0.02 |
| | ATTHOUSE | 0.10 | 0.02 | 1.00 |
| P1 | TIMEDRS | 1.00 | 0.06 | 0.10 |
| | ATTDRUG | 0.06 | 1.00 | 0.03 |
| | ATTHOUSE | 0.10 | 0.03 | 1.00 |
| P2 | TIMEDRS | 1.00 | 0.04 | 0.11 |
| | ATTDRUG | 0.04 | 1.00 | 0.03 |
| | ATTHOUSE | 0.11 | 0.03 | 1.00 |
| P3 | TIMEDRS | 1.00 | 0.06 | 0.10 |
| | ATTDRUG | 0.06 | 1.00 | 0.03 |
| | ATTHOUSE | 0.10 | 0.03 | 1.00 |
Given these distributional issues, the researcher working with this dataset would be well advised to investigate the possibility that outliers are present. For this example, we can use R to calculate $D^2$ for each observation and compare it to the $\chi^2$ criterion described above.
As discussed previously, there are some potential problems with using $D$ in this context, which motivates the application of the alternative detection methods described above.
Finally, in order to ascertain how the various outlier detection methods impacted relationships among the variables we estimated correlations for each approach, with results appearing in Table
The purpose of this study was to demonstrate seven methods of outlier detection designed especially for multivariate data. These methods were compared based upon distributions of individual variables, and relationships among them. The strategy involved first identification of outlying observations followed by their removal prior to data analysis. A brief summary of results for each methodology appears in Table
| Method | Outliers removed | Impact on distributions | Impact on correlations | Comments |
|---|---|---|---|---|
| D | 13 | Reduced skewness and kurtosis when compared to full data set, but did not fully eliminate them. Reduced variation in TIMEDRS | Comparable correlations to the full dataset | Resulted in a sample with somewhat less skewed and kurtotic variables, though they did remain clearly non-normal in nature. The correlations among the variables remained low, as with the full dataset |
| MVE | 230 | Largely eliminated skewness and greatly lowered kurtosis in TIMEDRS. Also reduced kurtosis in ATTHOUSE when compared to full data. Greatly lowered both the mean and standard deviation of TIMEDRS | Resulted in markedly higher correlations for two pairs of variables than was seen with the other methods, except for MCD | Reduced the sample size substantially, but also yielded variables with distributional characteristics much more favorable to use with common statistical analyses; i.e., very little skewness or kurtosis. In addition, correlation coefficients were generally larger than for the other methods, suggesting greater linearity in relationships among the variables |
| MCD | 230 | Very similar pattern to that displayed by MVE | Yielded relatively higher correlation values than any of the other methods, except MVE, and no very low values | Provided a sample with very similar characteristics to that of MVE |
| MGV | 2 | Yielded distributional results very similar to those of the full dataset | Very similar correlation structure to that found in the full dataset and for D | Identified very few outliers, leading to a sample that did not differ meaningfully from the original |
| P1 | 40 | Resulted in lower mean, standard deviation, skewness, and kurtosis values for TIMEDRS when compared to the full data and D | Very comparable correlation results to the full dataset, as well as D | Appears to find a “middle ground” between MVE/MCD and D |
| P2 | 43 | Very similar results to P1 | Very similar results to P1 | Provided a sample yielding essentially the same results as P1 |
| P3 | 40 | Identical results to P1 | Identical results to P1 | In this case, resulted in an identical sample to that of P1 |
There is not one universally optimal approach for identifying outliers, as each research problem presents the data analyst with specific challenges and questions that might be best addressed using a method that is not optimal in another scenario. This study helps researchers and data analysts to see the range of possibilities available to them when they must address outliers in their data. In addition, these results illuminate the impact of using the various methods for a representative dataset, while the R code in the Appendix provides the researcher with the software tools necessary to use each technique. A major issue that researchers must consider is the tradeoff between a method with a high breakdown point (i.e., one that is impervious to the presence of many outliers) and the desire to retain as much of the data as possible. From this example, it is clear that the methods with the highest breakdown points, MCD and MVE, retained data that more clearly conformed to the normal distribution than did the other approaches, but at the cost of approximately half of the original data. Thus, researchers must consider the purpose of their efforts to detect outliers. If they are seeking a “clean” set of data upon which they can run a variety of analyses with little or no fear of outliers having an impact, then methods with a high breakdown point, such as MCD and MVE, are optimal. On the other hand, if an examination of outliers reveals that they are from the population of interest, then a more careful approach to dealing with them is necessary. Removing such outliers could result in a dataset that is more tractable with respect to commonly used parametric statistical analyses but less representative of the general population than is desired. Of course, the converse is also true in that a dataset replete with outliers might produce statistical results that are not generalizable to the population of real interest, when the outlying observations are not part of this population.
There are a number of potential areas for future research in outlier detection. Certainly, future work should use these methods with other extant datasets having different characteristics than the one featured here. For example, the current set of data consisted of only three variables. It would be interesting to compare the relative performance of these methods when more variables are present. Similarly, examining them with much smaller samples would also be useful, as the current sample is fairly large compared to many that appear in social science research. In addition, a simulation study comparing these methods with one another would also be warranted. Specifically, such a study could be based upon the generation of datasets with known outliers and known distributional characteristics of the non-outlying cases. The various detection methods could then be used and the resulting retained datasets compared to the known non-outliers in terms of these various characteristics. Such a study would be quite useful in informing researchers regarding approaches that might be optimal in practice.
This study should prove helpful to those faced with a multivariate outlier problem in their data. Several methods of outlier detection were demonstrated and great differences among them were observed, in terms of the characteristics of the observations retained. These findings make it clear that researchers must be very thoughtful in their treatment of outlying observations. Simply relying on Mahalanobis Distance because it is widely used might well yield statistical results that continue to be influenced by the presence of outliers. Thus, other methods described here should be considered as viable options when multivariate outliers are present. In the final analysis, such an approach must be based on the goals of the data analysis and the study as a whole. The removal of outliers, when done, must be carried out thoughtfully and with purpose so that the resulting dataset is both representative of the population of interest and useful with the appropriate statistical tools to address the research questions.
The author declares that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.
```r
library(MASS)  # provides cov.rob(); mahalanobis() is in base R (stats)

# Mahalanobis distance for each observation in full.data
mahalanobis.out <- mahalanobis(full.data, colMeans(full.data), cov(full.data))

# Minimum covariance determinant (MCD)
mcd.output <- cov.rob(full.data, method = "mcd", nsamp = "best")
mcd.keep <- full.data[mcd.output$best, ]

# Minimum volume ellipsoid (MVE)
mve.output <- cov.rob(full.data, method = "mve", nsamp = "best")
mve.keep <- full.data[mve.output$best, ]

# Minimum generalized variance (MGV); outmgv() is among Wilcox's functions
mgv.output <- outmgv(full.data, y = NA, outfun = outbox)
mgv.keep <- full.data[mgv.output$keep, ]

# Projection methods P1-P3; outpro() is among Wilcox's functions
projection1.output <- outpro(full.data, cop = 2)
projection1.keep <- full.data[projection1.output$keep, ]

projection2.output <- outpro(full.data, cop = 3)
projection2.keep <- full.data[projection2.output$keep, ]

projection3.output <- outpro(full.data, cop = 4)
projection3.keep <- full.data[projection3.output$keep, ]
```
*Note that for the projection methods, the center of the distribution may be determined using one of four possible approaches. The choice of method is specified in the `cop` argument of `outpro()` (e.g., `cop = 2`, `cop = 3`, and `cop = 4` above).
**The function for obtaining the Mahalanobis distance (`mahalanobis()`) is included in base R, and the MCD and MVE methods are implemented in the `cov.rob()` function of the MASS library; the `outmgv()` and `outpro()` functions are part of the set of R functions made freely available by Wilcox.