TU-L0022 - Statistical Research Methods D, Lecture, 2.11.2021-6.4.2022
Outliers (5:55)
This video is about outliers, identifying why they occur, and their effect on regression results. It also explains what to do with outliers.
Another feature that people typically check in their data is the presence of outliers. Outliers are influential observations, or observations that are very different from the other observations. While the absence of outliers is not an assumption of regression analysis, there are sometimes reasons to delete them. So you have to understand why you have an outlier.
Let's take a look at outliers. Here we have the Prestige dataset and a regression line for the effect of education on prestige, and it's a nice, clean regression line. The observations are homoscedastic, they are spread evenly around the regression line, and there are no problems. What happens if we have one observation that is very far apart from the others, an outlier? What will that outlier do?
The outlier will pull the regression line toward itself. With the outlier included in the data, the slope of the regression line is a bit smaller, and the line no longer goes through the middle of the remaining observations; instead it runs too low at one end and too high at the other. So we clearly don't want the outlier there. But before we decide what to do with the outlier, we have to consider the different mechanisms that could have produced it: what is this observation really about? It could be a data entry mistake: the occupation's prestige really should be 70, but somebody wrote 17 into our dataset. Or the observation could come from outside our population: if these were companies and we did a survey of small technology companies, we could accidentally send the survey to a large technology company, which is outside our population and therefore not part of our sample. Or it could be a case that is genuinely unique. If we are studying the growth of small technology-based companies, then for example Supercell, the Finnish game developer that makes billions of euros of revenue from games on the App Store, is an outlier. While it technically is a small and young technology-based company, it is so different from the other companies in its performance that we probably don't want to include it, because our regression model typically aims to explain the bulk of the data, where most of the observations are.
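To make that pull concrete, here is a minimal illustrative sketch (not from the lecture, and using simulated data in place of the Prestige dataset) that fits an ordinary least squares line with and without one extreme observation and compares the slopes:

```python
import numpy as np

rng = np.random.default_rng(42)

# Simulated "clean" data: prestige rises with education, plus noise
education = rng.uniform(8, 16, size=50)
prestige = 10 + 3.0 * education + rng.normal(0, 4, size=50)

# Ordinary least squares fit: np.polyfit returns (slope, intercept) for deg=1
slope_clean, intercept_clean = np.polyfit(education, prestige, deg=1)

# Add one outlier: a highly educated occupation whose prestige of 70 was mistyped as 17
education_out = np.append(education, 16.0)
prestige_out = np.append(prestige, 17.0)
slope_out, intercept_out = np.polyfit(education_out, prestige_out, deg=1)

print(f"Slope without the outlier: {slope_clean:.2f}")
print(f"Slope with the outlier:    {slope_out:.2f}")  # pulled down toward the outlier
```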
So outliers could be observations that are truly unique, which may be worth studying separately as case studies; they could be data entry mistakes; or they could be observations that don't belong to our population and were included in the sample accidentally. The effect of an outlier depends on two things. The first is the residual: how far the outlier is from the regression line. The outlier pulls the regression line toward itself, and the strength of that pull is related to the residual. Because regression minimizes the sum of squared residuals, an observation with a very large residual pulls the regression line very strongly; it is the square of the residual that matters. The second concept is leverage: if we pull the regression line near the ends of the data, where there are few observations, we have a lot more leverage and the regression line moves more than if we pull it from the middle, where there are lots of observations. Pulling the regression line from the very middle has essentially zero leverage, and such an outlier wouldn't really matter. So we check both leverage and residuals when we do outlier diagnostics.
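As a sketch of what such diagnostics can look like in practice (the tooling here is an assumption, not part of the lecture), statsmodels exposes leverage values, studentized residuals, and Cook's distance, which combines residual size and leverage into a single influence measure, for a fitted OLS model:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(42)
x = rng.uniform(8, 16, size=50)
y = 10 + 3.0 * x + rng.normal(0, 4, size=50)
# Plant one influential point at the edge of the x range
x = np.append(x, 16.0)
y = np.append(y, 17.0)

X = sm.add_constant(x)                 # design matrix with an intercept column
results = sm.OLS(y, X).fit()
influence = results.get_influence()

leverage = influence.hat_matrix_diag                   # high near the ends of x
student_resid = influence.resid_studentized_external   # residual scaled by its std. error
cooks_d, _ = influence.cooks_distance                  # residual and leverage combined

# Print the three most influential observations
for i in np.argsort(cooks_d)[::-1][:3]:
    print(f"obs {i}: leverage={leverage[i]:.2f}, "
          f"studentized residual={student_resid[i]:.2f}, Cook's D={cooks_d[i]:.2f}")
```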
When we identify outliers, there are three important steps in the process, and Deephouse's article is a really good example of how to deal with outliers. First, you report how you identified the outliers: Deephouse used residuals and identified banks with large residuals. Second, you analyze the outliers: is the outlier a data entry mistake, is it a company that shouldn't be in the sample, or is it a unique case that is not representative of the other banks, even if it technically belongs to the population? Deephouse identified two banks that were merging, and banks that are merging are probably a quite different observation from the others, so they decided to drop that observation from the sample. The third step is to explain what you did and what the outcome was. They explained the effect of dropping the outlier and concluded that it didn't really make a difference whether that observation was included in the sample or not. That's a very good example.
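A minimal sketch of that three-step workflow in code, using simulated data as an assumption because Deephouse's bank data and model are not reproduced here: flag cases with large studentized residuals, inspect them, and report how the estimates change when they are dropped.

```python
import numpy as np
import statsmodels.api as sm

# Illustrative data only; one aberrant case is planted by hand
rng = np.random.default_rng(1)
x = rng.normal(size=100)
y = 2.0 + 0.5 * x + rng.normal(scale=0.5, size=100)
y[10] += 5.0

X = sm.add_constant(x)
full_fit = sm.OLS(y, X).fit()

# Step 1: report how outliers were identified (here: |studentized residual| > 3)
resid = full_fit.get_influence().resid_studentized_external
flagged = np.where(np.abs(resid) > 3)[0]
print("Flagged observations:", flagged)

# Step 2: analyze each flagged case by hand
# (data entry mistake? outside the population? a genuinely unique case?)

# Step 3: explain what was done and whether it changed the conclusions
keep = np.setdiff1d(np.arange(len(y)), flagged)
reduced_fit = sm.OLS(y[keep], X[keep]).fit()
print("Slope with all cases:   ", round(full_fit.params[1], 3))
print("Slope without outliers: ", round(reduced_fit.params[1], 3))
```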
If you want to read more about outliers and good practices, I recommend this paper by Aguinis and his students. They describe how you can identify outliers in regression analysis, structural regression models, and multilevel models, and how you can deal with them. Sometimes outliers are problematic; sometimes they are data entry mistakes that can be fixed; and sometimes outliers are truly interesting cases that you should study separately. So that's what the Deephouse paper did.