TU-L0022 - Statistical Research Methods D, Lecture, 2.11.2021-6.4.2022
Regression diagnostics and analysis workflow (17:47)
This video introduces one possible workflow for regression analysis.
In this video, I will show you one possible workflow for
regression analysis. This workflow will address all the assumptions that
are empirically testable after regression analysis. There are, of
course, multiple different ways of testing assumptions. But this is the
way I like to do it. I'm using R for this example. All these tests and
diagnostics can be done with Stata as well. And most of them can be done
with SPSS.

A regression analysis workflow, like any other statistical analysis workflow, starts by stating a hypothesis that we want to test. Then we collect data for testing the hypothesis. After that, we explore the data, because it is important to understand the relationships in it. Then we estimate the first regression model, with the independent variables and the dependent variable. We check the results briefly to see what they look like, and then we proceed with diagnostics.
The diagnostics include various plots, and I prefer plots over statistical tests. The reason is that while you can, for example, run a test for heteroskedasticity, the test will only tell you whether there is a problem or not; it will not tell you the nature of the problem. It is much more informative to look at the actual distribution of the residuals to see what the heteroskedasticity problem is like. And if you just eyeball these graphs, you will basically identify the same thing that the test tells you. I don't generally use tests unless someone asks me to do so.
Then, when I have done the diagnostics, I figure out what the biggest problem is, and once I have fixed the biggest problem, I go back and fit another regression model. For example, I may identify some nonlinear relationships that I didn't think of in advance, or I may identify some outliers, or some heteroskedasticity. I go back and fit another regression model where I have fixed the problem, and then I do the diagnostics again. Once I'm happy, I conclude that that is my final model. After the diagnostics, I possibly do nested model tests against alternative models.
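As a small illustration of what a nested model test can look like in R (a sketch only; it uses the Prestige data that is introduced below):

```r
# Sketch of a nested model test: an F-test comparing a model
# with and without one predictor (Prestige data introduced later in this video)
library(car)  # provides the Prestige data

m_small <- lm(prestige ~ education + income, data = Prestige)
m_large <- lm(prestige ~ education + income + women, data = Prestige)
anova(m_small, m_large)  # tests whether adding 'women' improves the model
```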
Then comes the fun part: I interpret what the regression coefficients mean. I don't just state that there is some coefficient of 0.02; I explain what it means in my research context. That is the hard part of regression analysis.

To demonstrate regression diagnostics, we are going to use the Prestige dataset again. Our dependent variable is prestige this time, and we are going to use education, income, and the share of women as independent variables. That is our regression model. The regression estimates are here; we have gone through these estimates in a previous video, so I will not explain them in detail.
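A minimal sketch of this model in R, assuming the Prestige data from the car package (the variable names prestige, education, income, and women come from that dataset):

```r
# Fit the model discussed above on the Prestige data.
# Assumes the car package (which ships the Prestige data) is installed.
library(car)

m1 <- lm(prestige ~ education + income + women, data = Prestige)
summary(m1)  # the regression estimates discussed in the previous video
```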
Now I'm going to focus on checking the assumptions. So how do we know that the six regression assumptions hold? The assumptions are shown here. The first is that all relationships are linear; it is a linear model. The second is that observations are independent. Independence of observations comes from our research design, and in a cross-sectional study it is difficult to test; if you have a longitudinal study, you can do some checks for the independence of observations. The third is no perfect collinearity and non-zero variance of the independent variables.
What happens if two or more variables perfectly determine one another? If you have a categorical variable with three categories, then including all three dummies leads to this problem, because once you know two of the dummies, you know the value of the third. The other part is non-zero variance: if you are studying the effects of gender, for example, and you have no women in the sample, then you have no variance in gender. That is another way this assumption can be violated. We know that this is not a problem in our data, because if it were, we could not even estimate the regression model; the fact that we got regression estimates indicates that we don't have a problem with the third assumption. The other assumptions are a bit more problematic, because they are about the error term, and we cannot observe the error term.
The fourth assumption was that the error term has an expected value of zero given any values of the independent variables, the fifth that the error term has equal variance (this is the homoskedasticity assumption), and the sixth that the error term is normally distributed. The way we test these three assumptions about the error term is that we use the residuals as estimates of the error term. If an observation is far from the regression line in the population, so that it has a large value of the error term, then we can expect that it also has a large residual. So we can use the residuals as estimates of the error terms, and normally doing regression diagnostics means analyzing the residuals. That is quite natural: if you think of the residual as the part of the data that the model does not explain, and the idea of diagnostics is to check whether the model explains the data adequately, then it is quite natural to look at the part of the data the model does not explain for clues about what could go wrong.
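In R, the residuals and fitted values used in the diagnostics below can be extracted like this (a sketch, assuming the model m1 fitted above):

```r
# Different flavours of residuals used in the diagnostic plots below;
# the plotting functions pick the appropriate kind automatically.
raw_res  <- residuals(m1)    # raw residuals
std_res  <- rstandard(m1)    # standardized residuals
stu_res  <- rstudent(m1)     # studentized residuals
fit_vals <- fitted(m1)       # fitted (predicted) values
```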
I normally start with the normal Q-Q plot of the residuals. The normal Q-Q plot shows whether the residuals are normally distributed. It compares the residuals, here calculated based on standardized residuals; there are different kinds of residuals, but for an applied researcher it doesn't really matter to know them all. What's important is that your software will calculate the right kind of residual for you automatically when you do these plots. Then we compare the residuals against a normal distribution, and we can see here that they roughly correspond: we have a line here that indicates that the residuals are normally distributed. Here is a problem case: we have a chi-square distributed error term, and the residuals are further from the mean than they are supposed to be. And here we have the inverse case, a uniform distribution of the error term, which creates this kind of S shape in the normal Q-Q plot. While the normality of the error term is not an important assumption in regression analysis, I nevertheless do this plot because it is quick to do, it identifies outliers for me, and it gives me a kind of first look at the data.
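A sketch of how this plot can be produced in R for the model m1 fitted earlier; plot.lm's second panel is the normal Q-Q plot, and qqPlot from the car package is one alternative:

```r
# Normal Q-Q plot of the (standardized) residuals of m1
plot(m1, which = 2)

# Alternative from the car package, with a confidence envelope
library(car)
qqPlot(m1)
```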
Here, with the actual data, I can see that the residuals follow a normal distribution. I'm happy with this; it is an indication of a good-fitting model if we think about the sixth assumption. R labels these possible outliers: newsboys have a large negative residual, so newsboys are less prestigious than the model predicts, and farmers are more prestigious than the model predicts. Farmers don't make much money, and you don't need high education to be a farmer, but farmers are still appreciated a lot. That is another extreme case. The normal Q-Q plot shows that the residuals are roughly normally distributed, and that's a good thing. We conclude there are no problems, and then we start looking at more complicated plots.
The next plot is the residuals versus fitted plot. The idea of the residual versus fitted plot is that it allows us to check for nonlinearities and heteroskedasticity in the data. The fitted value is calculated based on the regression equation: we multiply the variables with the regression coefficients, and then we compare the residuals against the fitted values. Ideally, there is no pattern here; the residuals and fitted values are just spread out. That is an indication of a well-fitting model in this regard.
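Before looking at the example panels, here is a sketch of how this plot is drawn in R for our model m1:

```r
# Residuals versus fitted values for m1 (first panel of plot.lm);
# ideally the points show no pattern around the zero line.
plot(m1, which = 1)

# Equivalent plot built by hand from the fitted values and residuals
plot(fitted(m1), residuals(m1),
     xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)
```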
Here we have a heteroskedasticity problem. That plot contains data where the variation of the residual, and therefore of the error term, is a lot smaller here in the middle, and then it opens up to the left and to the right. This is a butterfly shape of residuals, and it is the worst kind of heteroskedasticity problem that you could have. But it is not very realistic, because it is difficult to think of what kind of process would generate this kind of data. Then here we have nonlinearity and some heteroskedasticity problems: this is a megaphone opening to the right, and it appears that there is slight nonlinearity. Here we have severe nonlinearity; the right shape is not a line but a curve. And this is a weird-looking dataset that has both a nonlinearity problem and a heteroskedasticity problem.
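As an illustration (simulated data, not the lecture's examples), this is roughly how a megaphone-shaped residual pattern arises when the error variance grows with the predictor:

```r
# Simulated sketch: residual spread that grows with the fitted value
# produces the "megaphone" shape described above
set.seed(1)
x <- runif(200, 0, 10)
y <- 2 + 0.5 * x + rnorm(200, sd = 0.2 * x)  # error variance grows with x
m_sim <- lm(y ~ x)
plot(fitted(m_sim), residuals(m_sim),
     xlab = "Fitted values", ylab = "Residuals")
```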
In these plots, we want to have something that looks like no pattern. Typically, in these diagnostic plots that plot the residual against something else, you are looking for the absence of a pattern. Our residual versus fitted plot looks like this. Again, the observations with high residuals in absolute value are labeled. We can see from the fitted values that there are only a few professions for which the model predicts high prestige, and most observations are between 30 and 70. What can we infer from this plot? We can infer that maybe the variance of the residuals decreases slightly to the right. We don't have many observations there, so we don't know whether the dispersion is actually the same and we just observe a couple of values from it. But it is possible that, if you compare the dispersion here with the dispersion there, it is slightly less, so we may have a heteroskedasticity problem.
So the fifth assumption may not hold. Whether that is severe enough to warrant using heteroskedasticity-robust standard errors is a bit unclear, because this is not a clear case where we should use those. Then we check for outliers. So far, we have been looking for evidence of heteroskedasticity and nonlinearity; we have found some evidence of heteroskedasticity, but not really of nonlinearity. Then we look for outliers as the final step, using the fourth plot.
The residual versus leverage plot tells us which observations are influential. We are looking here for observations that have high leverage and a high residual. We have general managers, who have high leverage and a high residual in absolute value. We want to look for observations with residuals that are large in absolute magnitude. Stata, for example, uses the squared residual in the corresponding plot, because that is always positive, which makes it easier to see which observations have large residuals; here we must look at both large negative and large positive values, so it is not quite as simple as it would be if this were the squared residual. Ministers have leverage, newsboys have a large residual, and general managers are here. Cook's distance is another measure of influence, and observations with a large Cook's distance are potential outliers.
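A sketch of these influence diagnostics in R for m1; plot.lm's fifth panel is the residuals versus leverage plot, and cooks.distance and influencePlot (from car) are common companions:

```r
# Residuals versus leverage with Cook's distance contours (fifth panel of plot.lm)
plot(m1, which = 5)

# Cook's distances for each observation; large values flag potential outliers
cooks <- cooks.distance(m1)
head(sort(cooks, decreasing = TRUE))  # the most influential occupations

# Alternative influence plot from the car package
library(car)
influencePlot(m1)
```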
As in the Deephouse paper, to deal with these outliers we look at why the prestige of one occupation would be different from the others. For example, general managers earn a lot of money, so their salaries are high, and therefore their predicted prestige should be high as well, because it depends on income. But they are less prestigious than the model predicts, which means that the model over-predicts their prestige because of their high income. That could be one reason to drop general managers, but you have to use your own judgment, because this is 102 observations: dropping one observation changes our sample size by about 1%, so that could be consequential.
Leverage is, conceptually, the distance from the center of mass of the data, and Cook's distance is another measure of influence. We identify outliers using this plot, and then we start looking at the final plot, which is the added-variable plot. The added-variable plot shows the relationship between the dependent variable and one independent variable at a time. This plot is interesting: it takes education, the focal independent variable, regresses it on the other independent variables, and takes the residual. This is the part of education that is not explained by income or the share of women. If you think about the Venn diagram presentation of regression analysis, this is the part of education that does not overlap with any of the other independent variables.
Then we take prestige, regress prestige on the other independent variables, and take the residual. So we take what is unique to prestige and what is unique to education after partialling out the influence of all the other variables in the model, and then we draw a line through those points. This is the regression line of prestige on education. One way to calculate the regression line is to regress both variables, independent and dependent, on all the other independent variables and then run a regression using just that one independent variable; it produces the exact same result as including education together with all the other variables directly in a multiple regression analysis. This plot allows us to look for nonlinearities and heteroskedasticity in a more refined manner. What we can identify from here is that the effect of income looks pretty weird. We want the observations to form a band around the regression line, and here you can see that it looks more like a bit of a curve: it goes up and then flattens out. We also have much more dispersion here than there.
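A sketch of how added-variable plots can be produced in R for m1, using avPlots from the car package; the manual construction below shows the same idea for income:

```r
# Added-variable (partial regression) plots: one panel per independent variable
library(car)
avPlots(m1)

# The same plot for income built by hand: residuals of prestige and of income,
# each regressed on the other independent variables, plotted against each other
res_y <- residuals(lm(prestige ~ education + women, data = Prestige))
res_x <- residuals(lm(income ~ education + women, data = Prestige))
plot(res_x, res_y, xlab = "income | others", ylab = "prestige | others")
abline(lm(res_y ~ res_x))  # slope equals the income coefficient in m1
```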
Now we have done the diagnostics. We did the normal Q-Q plot, then the residual versus fitted plot, the influence or outlier plot, and the added-variable plot, and now we must decide what we want to do with the model. One idea we could try is to use heteroskedasticity-robust standard errors, but our sample size is small and there is no clear evidence of a serious heteroskedasticity problem, so in this case I would probably use the conventional standard errors. Another idea is to drop general managers and see if the results change; even if we decide to keep general managers in our sample, that could work as a robustness check in the paper.
In Deephouse's paper, they estimated the same model with and without the one outlier observation and then compared the results. We should also consider a log transformation of income, because considering income in relative terms makes a lot more sense anyway. When you think of pay raises, for example, or you want to switch to a new job, you typically want to negotiate a salary increase relative to your current level. Also, how much additional salary increases your quality of life depends on your current salary level: if you give 1000 euros to somebody who makes 1000 euros per month, that's a big difference, but if you give 1000 euros to somebody who makes 5000 euros a month, it's a smaller difference. Income, company revenues, and those kinds of quantities we typically want to consider in relative terms, and to do that we use the log transformation.
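A sketch of what these follow-up ideas could look like in R, assuming m1 and the Prestige data from before; the row label general.managers and the sandwich and lmtest packages are assumptions for illustration:

```r
# Idea 1: log-transform income and refit
m2 <- lm(prestige ~ education + log(income) + women, data = Prestige)
summary(m2)

# Idea 2: robustness check, refit without the general managers observation
# (row name assumed to be "general.managers" in the Prestige data)
m3 <- update(m1, data = Prestige[rownames(Prestige) != "general.managers", ])
summary(m3)

# Idea 3: heteroskedasticity-robust standard errors (sandwich + lmtest packages)
library(sandwich)
library(lmtest)
coeftest(m1, vcov = vcovHC(m1, type = "HC3"))
```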