TU-L0022: Outliers (5:55)

This video covers outliers: why they occur, how they affect a regression, and what to do about them.

Another feature that people typically check in their data is the presence of outliers. Outliers are influential observations, or observations that differ markedly from the other observations. The absence of outliers is not an assumption of regression analysis, but there are sometimes reasons to delete them. So you have to understand why you have an outlier.

Let's take a look at outliers. Here we have the Prestige data set. We have a regression line for the effect of education on prestige, and it's a nice, clean regression line. The observations are homoscedastic and spread evenly around the regression line, and there are no problems. What happens if we have one observation that is very far from the others? We have an outlier here. So what will that outlier do?

The outlier will pull the regression line toward itself. With the outlier included in the data, the slope of the regression line is a bit smaller, and the line no longer goes through the middle of the remaining observations; instead it runs too low here and too high here. So clearly we don't want to have the outlier here. But before we decide what to do with it, we have to consider the different mechanisms that could have produced it. What is this observation really about? It could be a data entry mistake: the occupation's prestige really should be 70, but somebody wrote 17 in our data set. Or, if these were companies, it could be a company that is outside our population. If we do a survey of small technology companies, we can accidentally send the survey to a large technology company. The large technology company is outside our population, so it should not be part of our sample. Or it could be a case that is truly unique. If we're studying the growth of small technology-based companies, then for example Supercell, a Finnish game developer that makes billions of euros of revenue from games on the App Store, is an outlier. While they technically are a small and young technology-based company, they are so different from other companies in their performance that, because a regression model typically aims to explain the bulk of the data, where most of the observations are, including that particular outlier is something we probably don't want to do.
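The pull of a single mistyped value can be sketched with made-up numbers. The data below are synthetic and only loosely resemble an education-prestige relationship (they are not the Prestige data set), and the helper name `fit_line` is an illustrative assumption. One extra observation with 16 years of education has its prestige entered as 17 instead of 70.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for education (years) and prestige scores.
education = np.linspace(6, 16, 40)
prestige = 5.0 * education - 10.0 + rng.normal(0, 5, size=education.size)


def fit_line(x, y):
    """Return (intercept, slope) of the least-squares line."""
    slope, intercept = np.polyfit(x, y, deg=1)
    return intercept, slope


clean_intercept, clean_slope = fit_line(education, prestige)

# Add one case whose prestige should have been 70 but was typed as 17.
x_out = np.append(education, 16.0)
y_out = np.append(prestige, 17.0)
out_intercept, out_slope = fit_line(x_out, y_out)

print(f"slope without the mistyped case: {clean_slope:.2f}")
print(f"slope with the mistyped case:    {out_slope:.2f}")
```

Because the mistyped case sits at the high end of education and far below the line, it tilts the line downward, so the second slope is smaller than the first.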

So outliers can be truly unique observations that may be worth studying separately as case studies, they can be data entry mistakes, or they can be observations that don't belong to our population and were included in the sample accidentally. The effect of an outlier depends on two things. The first is the residual: how far is the outlier from the regression line? The outlier pulls the regression line toward itself, and the strength of that pull is related to the residual. We want to minimize the sum of squared residuals, so an observation with a very large residual pulls the regression line very strongly, because it is the square of the residual that matters. The second concept is leverage. If we pull the regression line here, where there are few observations, we have a lot more leverage and the regression line moves more than if we pull it from the middle here, where there are lots of observations. Pulling the regression line from the middle has almost no leverage, and the outlier wouldn't really matter. So we check both leverage and residuals when we do outlier diagnostics.
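The difference between pulling from the middle and pulling from the edge can be sketched numerically. The example below is a toy setup, not the lecture's data: 21 points lie exactly on a line with slope 3, and the same error of 30 units is applied once to the middle observation and once to the last one. Leverage is computed as the diagonal of the hat matrix, a standard definition; the function names `leverage` and `slope` are illustrative assumptions.

```python
import numpy as np


def leverage(x):
    """Diagonal of the hat matrix H = X (X'X)^-1 X' for a line with intercept."""
    X = np.column_stack([np.ones(len(x)), x])
    H = X @ np.linalg.inv(X.T @ X) @ X.T
    return np.diag(H)


def slope(x, y):
    """Slope of the least-squares line."""
    return np.polyfit(x, y, deg=1)[0]


x = np.arange(1.0, 22.0)   # 21 evenly spaced points, mean 11
y = 2.0 + 3.0 * x          # points exactly on a line with slope 3

h = leverage(x)

# Apply the same 30-unit error in two different places.
y_mid = y.copy()
y_mid[10] -= 30.0          # middle observation, x = 11
y_edge = y.copy()
y_edge[-1] -= 30.0         # last observation, x = 21

slope_mid = slope(x, y_mid)
slope_edge = slope(x, y_edge)

print(f"leverage in the middle: {h[10]:.3f}, at the edge: {h[-1]:.3f}")
print(f"slope with error in the middle: {slope_mid:.3f}")
print(f"slope with error at the edge:   {slope_edge:.3f}")
```

In this setup the error in the middle shifts the whole line down slightly but leaves the slope at 3, while the same error at the edge tilts the line. Even the middle point has a small nonzero leverage of 1/n, which is why "almost no leverage" is the safer wording.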

When we deal with outliers, there are three important steps in the process, and Deephouse's article is a really great example of how to handle them. First, you report how you identified the outliers. Deephouse used residuals: they identified banks with large residuals. Second, you analyze the outliers. What is the outlier like? Is it a data entry mistake, a company that shouldn't be in the sample, or a unique case that is not representative of the other banks, even if it technically belongs to the population? They identified two banks that were merging, and banks that are merging are probably quite different from the other observations, so they decided to drop that observation from the sample. The third step is to explain what you did and what the outcome was. They explained the effect of dropping the outlier and concluded that it didn't really make a difference whether that observation was included in the sample or not. That's a very good example.
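The three steps can be sketched in code under stated assumptions. The data are synthetic with one planted unusual case, the cutoff of 3 residual standard deviations is an arbitrary illustrative choice, and nothing here reproduces the procedure or numbers of the Deephouse article. The second step is a judgment about what the flagged case actually is, so it appears only as a comment.

```python
import numpy as np


def fit(x, y):
    """Least-squares line; returns ([intercept, slope], residuals)."""
    X = np.column_stack([np.ones(len(x)), x])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta, y - X @ beta


rng = np.random.default_rng(1)
x = np.linspace(0, 10, 50)
y = 1.0 + 2.0 * x + rng.normal(0, 1, size=x.size)
y[45] += 20.0  # one planted unusual case

# Step 1: identify candidates by the size of their residuals.
beta_all, resid = fit(x, y)
resid_sd = np.sqrt(np.sum(resid**2) / (len(x) - 2))
flagged = np.flatnonzero(np.abs(resid / resid_sd) > 3)  # assumed cutoff

# Step 2: analyze each flagged case (data entry mistake, outside the
# population, or a unique case?). This is a judgment call, not a computation.

# Step 3: refit without the flagged cases and report what changed.
keep = np.ones(len(x), dtype=bool)
keep[flagged] = False
beta_kept, _ = fit(x[keep], y[keep])

print("flagged observations:", flagged.tolist())
print(f"slope with all cases:     {beta_all[1]:.3f}")
print(f"slope without flagged:    {beta_kept[1]:.3f}")
```

Reporting both sets of estimates side by side lets the reader see whether the decision to drop a case matters for the conclusions.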

If you want to read more about outliers and good practices, I recommend this paper by Aguinis and his students. They explain how to identify outliers in regression analysis, structural equation models and multilevel models, and how you can deal with them. Sometimes outliers are problematic; sometimes they are data entry mistakes, which can be fixed. Sometimes outliers are truly interesting cases that you should study separately. So that's what the Deephouse paper did.

TU-L0022 - Statistical Research Methods D, Lecture, 2.11.2021-6.4.2022
