A peer-reviewed electronic journal. ISSN 1531-7714

Copyright is retained by the first or sole author, who grants right of first publication to |

Osborne, Jason W. & Amy Overbay (2004). The power of outliers (and why researchers should always check for them). Practical Assessment, Research & Evaluation, 9(6). Retrieved December 20, 2014 from http://PAREonline.net/getvn.asp?v=9&n=6 .
Jason W. Osborne and Amy Overbay
The presence of outliers can lead to inflated error rates and substantial distortions of parameter and statistic estimates when using either parametric or nonparametric tests (e.g., Zimmerman, 1994, 1995, 1998). Casual observation of the literature suggests that researchers rarely report checking for outliers of any sort. This inference is supported empirically by Osborne, Christiansen, and Gunter (2001), who found that authors reported testing assumptions of the statistical procedure(s) used in their studies--including checking for the presence of outliers--only 8% of the time. Given what we know of the importance of assumptions to accuracy of estimates and error rates, this in itself is alarming. There is no reason to believe that the situation is different in other social science disciplines.
Although definitions vary, an outlier is generally considered to be a data point that is far outside the norm for a variable or population (e.g., Jarrell, 1994; Rasmussen, 1988; Stevens, 1984). Hawkins described an outlier as an observation that “deviates so much from other observations as to arouse suspicions that it was generated by a different mechanism” (Hawkins, 1980, p. 1). Outliers have also been defined as values that are “dubious in the eyes of the researcher” (Dixon, 1950, p. 488) and contaminants (Wainer, 1976). Wainer (1976) also introduced the concept of the “fringelier,” referring to “unusual events which occur more often than seldom” (p. 286). These points lie near three standard deviations from the mean and hence may have a disproportionately strong influence on parameter estimates, yet are not as obvious or easily identified as ordinary outliers due to their relative proximity to the distribution center. As fringeliers are a special case of outlier, for much of the rest of the paper we will use the generic term “outlier” to refer to any single data point of dubious origin or disproportionate influence.
Outliers can have deleterious effects on statistical analyses. First, they generally serve to increase error variance and reduce the power of statistical tests. Second, if non-randomly distributed they can decrease normality (and in multivariate analyses, violate assumptions of sphericity and multivariate normality), altering the odds of making both Type I and Type II errors. Third, they can seriously bias or influence estimates that may be of substantive interest (for more information on these issues, see Rasmussen, 1988; Schwager & Margolin, 1982; Zimmerman, 1994). Screening data for univariate, bivariate, and multivariate outliers is simple in these days of ubiquitous computing. The consequences of not doing so can be substantial.
Outliers can arise from several different mechanisms or causes. Anscombe (1960) sorts outliers into two major categories: those arising from errors in the data, and those arising from the inherent variability of the data. Not all outliers are illegitimate contaminants, and not all illegitimate scores show up as outliers (Barnett & Lewis, 1994). It is therefore important to consider the range of causes that may be responsible for outliers in a given data set. What should be done about an outlying data point is at least partly a function of the inferred cause.
Environmental conditions can motivate over-reporting or mis-reporting, for example when an attractive female researcher interviews male undergraduates about attitudes on gender equality in marriage. Depending on the details of the research, one of two things can happen: inflation of all estimates, or production of outliers. If all subjects respond the same way, the distribution will shift upward, not generally causing outliers. However, if only a small subsample of the group responds this way to the experimenter, or if multiple researchers conduct interviews, then outliers can be created.
In other words, there is only about a 1% chance you will get an outlying data point from a normally-distributed population; this means that, on average, about 1% of the cases in a sample should be expected to be outliers.
In the case that outliers occur as a function of the inherent variability of the data, opinions differ widely on what to do. Due to the deleterious effects on power, accuracy, and error rates that outliers and fringeliers can have, it might be desirable to use a transformation or recoding/truncation strategy to both keep the individual in the data set and at the same time minimize the harm to statistical inference (for more on transformations, see Osborne, 2002).
close friends. Is it possible? Yes. Is it likely? Not generally, given any reasonable definition of “close friends.” So this data point could represent either motivated mis-reporting, an error of data recording or entry (it wasn’t), a protocol error reflecting a misunderstanding of the question, or something more interesting. This extreme score might shed light on an important principle or issue. Before discarding outliers, researchers need to consider whether those data contain valuable information that may not necessarily relate to the intended study, but has importance in a more global sense.
There is as much controversy over what constitutes an outlier as over whether to remove them or not. Simple rules of thumb (e.g., data points three or more standard deviations from the mean) are good starting points. Some researchers prefer visual inspection of the data. Others (e.g., Lornez, 1987) argue that outlier detection is merely a special case of the examination of data for influential data points. Simple rules such as
Bivariate and multivariate outliers are typically measured using either an index of influence or leverage, or distance. Popular indices include Mahalanobis’ distance and Cook’s D.
For ANOVA-type paradigms, most modern statistical software will produce a range of statistics, including standardized residuals. In ANOVA the biggest issue after screening for univariate outliers is the issue of within-cell outliers, or the distance of an individual from the subgroup. Standardized residuals represent the distance from the sub-group, and thus are effective in assisting analysts in examining data for multivariate outliers. Tabachnick and Fidell (2000) discuss data cleaning in the context of other analyses.
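The three-standard-deviation rule of thumb mentioned above can be sketched in a few lines of Python. This is only an illustration, not the screening procedure used in this paper: the function name `flag_outliers`, the cutoff default, and the `friends` data are all hypothetical.

```python
from statistics import mean, stdev


def flag_outliers(values, cutoff=3.0):
    """Return (index, value, z) for every point at least `cutoff` SDs from the mean.

    The mean and SD are computed from the same data being screened, so a very
    extreme point inflates the SD it is judged against.
    """
    m = mean(values)
    s = stdev(values)  # sample standard deviation (n - 1 denominator)
    flagged = []
    for i, x in enumerate(values):
        z = (x - m) / s
        if abs(z) >= cutoff:
            flagged.append((i, x, z))
    return flagged


# Hypothetical data: 30 plausible counts of close friends plus one report of 100.
friends = [3, 4, 5, 2, 6, 4, 3, 5, 4, 6,
           2, 3, 5, 4, 4, 3, 5, 6, 2, 4,
           3, 5, 4, 4, 3, 5, 6, 2, 4, 3,
           100]

for index, value, z in flag_outliers(friends):
    print(f"case {index}: value={value}, z={z:.2f}")
```

A rule like this is a starting point for inspection rather than a decision; whether a flagged case is removed, recoded, or kept still depends on its inferred cause.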
There is a great deal of debate as to what to do with identified outliers. A thorough review of the various arguments is not possible here. We argue that what to do depends in large part on why an outlier is in the data in the first place. Where outliers are illegitimately included in the data, it is only common sense that those data points should be removed (see also Barnett & Lewis, 1994). Few should disagree with that statement. When the outlier is either a legitimate part of the data or the cause is unclear, the issue becomes murkier. Judd and McClelland (1989) make several strong points for removal even in these cases in order to get the most honest estimate of population parameters possible (see also Barnett & Lewis, 1994). However, not all researchers feel that way (see Orr, Sackett, & DuBois, 1991). This is a case where researchers must use their training, intuition, reasoned argument, and thoughtful consideration in making decisions.
However, transformations may not be appropriate for the model being tested, or may affect its interpretation in undesirable ways. Taking the log of a variable makes a distribution less skewed, but it also alters the relationship between the original variables in the model. For example, if the raw scores originally related to a meaningful scale, the transformed scores can be difficult to interpret (Newton & Rudestam, 1999; Osborne, 2002). Also problematic is the fact that many commonly used transformations require non-negative data, which limits their applications. For this reason, many researchers turn to other methods to accommodate outlying values.
One alternative to transformation is truncation, wherein extreme scores are recoded to the highest (or lowest) reasonable score. For example, a researcher might decide that in reality, it is impossible for a teenager to have more than 15 close friends. Thus, all teens reporting more than this value (even 100) would be re-coded to 15. Through truncation the relative ordering of the data is maintained, and the highest or lowest scores remain the highest or lowest scores, yet the distributional problems are reduced.
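The truncation example (recoding reports above 15 close friends to 15) and a log transformation can be compared in a short Python sketch. The `reported` values and the helper name `truncate` are hypothetical, and adding 1 before taking the log is one common convention for handling zeros, not something prescribed in this paper.

```python
import math


def truncate(values, low=None, high=None):
    """Recode scores beyond the given bounds to the bounds themselves."""
    out = []
    for x in values:
        if high is not None and x > high:
            x = high
        if low is not None and x < low:
            x = low
        out.append(x)
    return out


# Hypothetical self-reported numbers of close friends.
reported = [2, 5, 8, 15, 22, 100]

# Truncation: every report above 15 is recoded to 15.
truncated = truncate(reported, high=15)

# Log transformation: log(x + 1) so that a report of 0 would still be defined.
logged = [math.log(x + 1) for x in reported]

print(truncated)
print([round(v, 2) for v in logged])
```

In this sketch truncation never reverses the order of two cases, although cases above the ceiling become tied with one another; the log transformation keeps every case distinct but puts the scores on a scale that no longer reads as a count of friends.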
robust against the presence of outliers” (Barnett & Lewis, 1994, p. 35). Certain parameter estimates, especially the mean and Least Squares estimations, are particularly vulnerable to outliers, or have “low breakdown” values. For this reason, researchers turn to robust or “high breakdown” methods to provide alternative estimates for these important aspects of the data.
A common robust estimation method for univariate distributions involves the use of a trimmed mean, which is calculated by temporarily eliminating extreme observations at both ends of the sample (Anscombe, 1960). Alternatively, researchers may choose to compute a Winsorized mean, for which the highest and lowest observations are temporarily censored, and replaced with adjacent values from the remaining data (Barnett & Lewis, 1994).
Assuming that the distribution of prediction errors is close to normal, several common robust regression techniques can help reduce the influence of outlying data points. The least trimmed squares (LTS) and the least median of squares (LMS) estimators are conceptually similar to the trimmed mean, helping to minimize the scatter of the prediction errors by eliminating a specific percentage of the largest positive and negative outliers (Rousseeuw & Leroy, 1987), while Winsorized regression smoothes the Y-data by replacing extreme residuals with the next closest value in the dataset (Lane, 2002).
Many options exist for analysis of non-ideal variables. In addition to the above-mentioned options, analysts can choose from non-parametric analyses, as these types of analyses have few if any distributional assumptions, although research by Zimmerman and others (e.g., Zimmerman, 1995) does point out that even non-parametric analyses suffer from outlier cases.
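The trimmed mean and the Winsorized mean described above can be illustrated with a minimal Python sketch. The function names, the choice of trimming one observation from each end, and the `scores` data are assumptions made for the example.

```python
from statistics import mean


def trimmed_mean(values, k):
    """Mean after dropping the k smallest and k largest observations."""
    if k < 0 or 2 * k >= len(values):
        raise ValueError("k must leave at least one observation")
    s = sorted(values)
    return mean(s[k:len(s) - k])


def winsorized_mean(values, k):
    """Mean after replacing the k smallest and k largest observations
    with the nearest remaining value at each end."""
    if k < 0 or 2 * k >= len(values):
        raise ValueError("k must leave at least one observation")
    s = sorted(values)
    n = len(s)
    low, high = s[k], s[n - k - 1]
    return mean([low] * k + s[k:n - k] + [high] * k)


# Hypothetical scores with one extreme value.
scores = [2, 3, 4, 4, 5, 5, 6, 7, 8, 100]

print(mean(scores))                # ordinary mean, pulled upward by 100
print(trimmed_mean(scores, 1))     # drops 2 and 100
print(winsorized_mean(scores, 1))  # recodes 2 to 3 and 100 to 8
```

Both estimators keep the sample centered near the bulk of the data; the Winsorized version retains the original sample size, which may matter when the estimate feeds into later calculations.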
The rest of this paper is devoted to a demonstration of the effects of outliers and fringeliers on the accuracy of parameter estimates, and Type I and Type II error rates. In order to simulate a real study where a researcher samples from a particular population, we defined our population as the 23,396 subjects in the data file from the National Education Longitudinal Study of 1988 produced by the National Center for Educational Statistics with complete data on all variables of interest. For the purposes of the analyses reported below, this population was sorted into two groups: “normal” individuals whose scores on relevant variables were between
In order to simulate the normal process of sampling from a population, but standardize the proportion of outliers in each sample, one hundred samples of N=50, N=100, and N=400 each were randomly sampled (with replacement between each sampling) from the population of “normal” subjects. Then an additional 4% were randomly selected from the separate pool of outliers, bringing each sample to N=52, N=104, or N=416, respectively. This procedure produced samples that could easily have been drawn at random from the full population.
The following variables were calculated for each of the analyses below:
*Accuracy* was assessed by checking whether the original or cleaned correlation was closer to the population correlation. In these calculations the absolute difference was examined.
*Error rates* were calculated by comparing the outcome from a sample to the outcome from the population. If a particular sample yielded a different conclusion than was warranted by the population, that was considered an error of inference.
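The sampling design and the accuracy check can be sketched in Python. This is not the authors' code and it does not use the NELS data: the population here is synthetic, the |z| >= 3 cutoff used to split it into "normal" and outlier pools is an assumption, and every name (`is_outlier`, `draw_sample`, `pearson_r`) is hypothetical.

```python
import math
import random
from statistics import fmean, pstdev


def pearson_r(pairs):
    """Pearson correlation for a list of (x, y) pairs."""
    xs = [p[0] for p in pairs]
    ys = [p[1] for p in pairs]
    mx, my = fmean(xs), fmean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in pairs)
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


rng = random.Random(2004)

# Synthetic stand-in for the population: two variables correlated about .5.
population = []
for _ in range(20000):
    x = rng.gauss(0, 1)
    y = 0.5 * x + rng.gauss(0, math.sqrt(0.75))
    population.append((x, y))

pop_mx, pop_sx = fmean(p[0] for p in population), pstdev(p[0] for p in population)
pop_my, pop_sy = fmean(p[1] for p in population), pstdev(p[1] for p in population)


def is_outlier(pair, cutoff=3.0):
    """Assumed rule: a case is an outlier if either score has |z| >= cutoff."""
    zx = (pair[0] - pop_mx) / pop_sx
    zy = (pair[1] - pop_my) / pop_sy
    return abs(zx) >= cutoff or abs(zy) >= cutoff


normal_pool = [p for p in population if not is_outlier(p)]
outlier_pool = [p for p in population if is_outlier(p)]
rho = pearson_r(population)  # the "known" population correlation


def draw_sample(n, outlier_frac=0.04):
    """n cases from the normal pool plus an extra 4% from the outlier pool."""
    base = [rng.choice(normal_pool) for _ in range(n)]
    extra = [rng.choice(outlier_pool) for _ in range(round(n * outlier_frac))]
    return base + extra


# Accuracy: how often is the cleaned correlation closer to rho than the original?
closer = 0
for _ in range(100):
    sample = draw_sample(50)
    cleaned = [p for p in sample if not is_outlier(p)]
    if abs(pearson_r(cleaned) - rho) < abs(pearson_r(sample) - rho):
        closer += 1

print(f"cleaned correlation was more accurate in {closer} of 100 samples")
```

Because the data and cutoff are invented, the resulting percentage should not be compared with the figures reported in this paper; the sketch only shows the mechanics of building samples with a fixed share of outliers and scoring accuracy by absolute difference from the population value.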
The first example looks at simple zero-order correlations. The goal was to see the effect of outliers on two different types of correlations: correlations close to zero (to demonstrate the effects of outliers on Type I error rates), and correlations that were moderately strong (to demonstrate the effects of outliers on Type II error rates). Toward this end, two different correlations were identified for study in the NELS data set: the correlation between locus of control and family size (
Correlations were then calculated in each sample, both before removal of outliers and after. For our purposes,
As Table 1 demonstrates, outliers had adverse effects upon correlations. In all cases, removal of the outliers had significant effects upon the magnitude of the correlations, and the cleaned correlations were more accurate (i.e., closer to the known population correlation) 70 - 100% of the time. Further, in most cases the incidence of errors of inference was lower with cleaned than uncleaned data.
The second example deals with analyses that look at group mean differences, such as t-tests and ANOVA. For the purpose of simplicity, these analyses are simple t-tests, but these results would generalize to any ANOVA. For these analyses two different conditions were examined: when there were no significant differences between the groups in the population (sex differences in socioeconomic status (SES) produced a mean group difference of 0.0007 with a SD of 0.80 and with 24501
For these analyses, t-tests were calculated in each sample, both before removal of outliers, and after. For our purposes, t-tests looking at SES should not produce significant group differences, whereas t-tests looking at mathematics achievement test scores should. Two different issues were examined: mean group differences and the magnitude of the
The results in Table 2 illustrate the effects of outliers on t-tests and ANOVAs. Removal of outliers produced a significant change in the mean differences between the two groups when the groups were equal in the population, but tended not to when there were strong group differences. Removal of outliers produced significant change in the
The presence of outliers in one or both cells, surprisingly, failed to produce any differential effects. The expectation had been that the presence of outliers in a single cell would increase the incidence of Type I errors. Why this effect was not shown could have to do with the type of outliers in these analyses, or other factors, such as the absolute equality of the two groups on SES, which may not reflect the situation most researchers face.
Although some authors argue that removal of extreme scores produces undesirable outcomes, they are in the minority, especially when the outliers are illegitimate. When the data points are suspected of being legitimate, some authors (e.g., Orr, Sackett, & DuBois, 1991) argue that data are more likely to be representative of the population as a whole if outliers are not removed. Conceptually, there are strong arguments for removal or alteration of outliers. The analyses reported in this paper also empirically demonstrate the benefits of outlier removal. Both correlations and t-tests tended to show significant changes in statistics as a function of removal of outliers, and in the overwhelming majority of analyses accuracy of estimates was enhanced. In most cases errors of inference were significantly reduced, a prime argument for screening and removal of outliers. Although these were two fairly simple statistical procedures, it is straightforward to argue that the benefits of data cleaning extend to simple and multiple regression, and to different types of ANOVA procedures. There are other procedures outside these, but the majority of social science research utilizes one of these procedures. Other research (e.g., Zimmerman, 1995) has dealt with the effects of extreme scores in less commonly-used procedures, such as nonparametric analyses.
Anscombe, F. J. (1960). Rejection of outliers.
Barnett, V., & Lewis, T. (1994).
Brewer, C. S., Nauenberg, E., & Osborne, J. W. (1998, June).
Dixon, W. J. (1950). Analysis of extreme values.
Evans, V. P. (1999). Strategies for detecting outliers in regression analysis: An introductory primer. In B. Thompson (Ed.),
Hamilton, L. C. (1992).
Hawkins, D. M. (1980).
Huck, S. W. (2000).
Iglewicz, B., & Hoaglin, D. C. (1993).
Jarrell, M. G. (1994). A comparison of two procedures, the Mahalanobis Distance and the Andrews-Pregibon Statistic, for identifying multivariate outliers.
Judd, C. M., & McClelland, G. H. (1989).
Lane, K. (2002, February).
Lornez, F. O. (1987). Teaching about influence in simple regression.
Miller, J. (1991). Reaction time analysis with outlier exclusion: Bias varies with sample size.
Newton, R. R., & Rudestam, K. E. (1999).
Orr, J. M., Sackett, P. R., & DuBois, C. L. Z. (1991). Outlier detection and treatment in I/O Psychology: A survey of researcher beliefs and an empirical illustration.
Osborne, J. W. (2002). Notes on the use of data transformations.
Osborne, J. W., Christiansen, W. R. I., & Gunter, J. S. (2001).
Rasmussen, J. L. (1988). Evaluating outlier identification tests: Mahalanobis D Squared and Comrey D.
Rousseeuw, P., & Leroy, A. (1987).
Sachs, L. (1982).
Schwager, S. J., & Margolin, B. H. (1982). Detection of multivariate outliers.
Stevens, J. P. (1984). Outliers and influential data points in regression analysis.
Tabachnick, B. G., & Fidell, L. S. (2000). Using multivariate statistics, 4
Van Selst, M., & Jolicoeur, P. (1994). A solution to the effect of sample size on outlier elimination.
Wainer, H. (1976). Robust statistics: A survey and some prescriptions.
Zimmerman, D. W. (1994). A note on the influence of outliers on parametric and nonparametric tests.
Zimmerman, D. W. (1995). Increasing the power of nonparametric tests by detecting and downweighting outliers.
Zimmerman, D. W. (1998). Invalidation of parametric and nonparametric statistical tests by concurrent violation of two assumptions.

Descriptors: Regression; Outliers; Residuals; Robustness