TU-L0022 - Statistical Research Methods D, Lecture, 25.10.2022-29.3.2023
Regression diagnostics and analysis workflow (17:47)
This video introduces one possible workflow for regression analysis.
In this video, I will show you one possible workflow for regression analysis. This workflow addresses all the assumptions that are empirically testable after regression analysis. There are, of course, multiple different ways of testing assumptions, but this is the way I like to do it. I'm using R for this example. All of these tests and diagnostics can be done with Stata as well, and most of them can be done with SPSS.
A regression analysis workflow, like any other statistical analysis workflow, starts by stating a hypothesis that we want to test. Then we collect some data for testing the hypothesis. After that, we explore the data, so that we understand the relationships. Then we estimate the first regression model, with the independent variables and the dependent variable. We check the results briefly, to see what they look like, and we proceed with diagnostics.
The diagnostics include various plots, and I prefer plots over statistical tests. The reason is that while you can, for example, run a test for heteroskedasticity, that test will only tell you whether there is a problem or not; it will not tell you the nature of the problem. It is much more informative to look at the actual distribution of the residuals to see what the heteroskedasticity problem is like. And if you just eyeball these graphs, you will basically identify the same things that the tests tell you. I don't generally use tests unless someone asks me to do so.
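For reference, here is a minimal sketch of what such a formal heteroskedasticity test could look like in R, using the Breusch-Pagan test from the lmtest package on a generic lm() fit; the model and data below are placeholders of my own, not the lecture's example, and the lecture itself relies on plots instead.

    # Hypothetical sketch: a formal heteroskedasticity test (not used in the lecture).
    library(lmtest)
    fit <- lm(mpg ~ wt + hp, data = mtcars)   # placeholder model, not the Prestige data
    bptest(fit)   # a small p-value suggests heteroskedasticity, but says nothing about its nature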
Then, when I have done the diagnostics, I figure out what the biggest problem is. For example, I may identify some nonlinear relationships that I didn't think of in advance, or I may identify some outliers, or some heteroskedasticity. Once I have fixed the biggest problem, I go back and fit another regression model, and then I do the diagnostics again. Once I'm happy, I conclude that that is my final model. After the diagnostics, I possibly do nested model tests against alternative models.
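As an aside, a nested model test could look something like the following sketch in R; the models and data below are hypothetical placeholders, not the lecture's example.

    # Hypothetical nested model comparison: m_small uses a subset of m_full's predictors.
    m_small <- lm(mpg ~ wt,      data = mtcars)
    m_full  <- lm(mpg ~ wt + hp, data = mtcars)
    anova(m_small, m_full)   # F test: do the additional predictors improve the fit?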
And then comes the fun part: I interpret what the regression coefficients mean. I don't just state that there is some coefficient of 0.02; I tell what it means in my research context. That is the hard part of regression analysis. To demonstrate the regression analysis and diagnostics, we are going to be using the Prestige dataset again. Our dependent variable is prestige this time, and we're going to be using education, income, and the share of women as independent variables. That is our regression model. The regression estimates are here; we have gone through these estimates in a previous video, so I will not explain them in detail.
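In R, fitting this model could look like the following sketch; I am assuming the Prestige data comes from the car package, which is the usual source of this dataset, although the video does not name the package.

    # Sketch: the regression model described above (Prestige data from the car package assumed).
    library(car)
    model <- lm(prestige ~ education + income + women, data = Prestige)
    summary(model)   # the regression estimates discussed in the previous video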
Instead, I'm going to focus now on checking the assumptions. So how do we know that the six regression assumptions hold? The assumptions are shown here. The first assumption is that all relationships are linear; it is a linear model. The second is that observations are independent. Independence of observations comes from our research design, and in a cross-sectional study it is difficult to test; if you have a longitudinal study, then you can do some checks for the independence of observations. The third assumption is no perfect collinearity and non-zero variances of the independent variables. Perfect collinearity happens when two or more variables perfectly determine one another. If you have a categorical variable with three categories, then including all three dummies leads to this problem, because once you know two of the dummies, you know the value of the third. The non-zero variance part is violated if, for example, you are studying the effects of gender and you have no women in the sample: then you have no variance in gender. That is another way this assumption can fail.
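As a hypothetical illustration of the dummy-variable case (a sketch of my own, not from the lecture), R's lm() cannot estimate all three dummy coefficients alongside the intercept and reports one of them as NA:

    # Hypothetical sketch of perfect collinearity with dummy variables,
    # using the occupation type variable (bc / prof / wc) in the Prestige data.
    d <- na.omit(Prestige)                 # type has a few missing values
    d$bc   <- as.numeric(d$type == "bc")
    d$prof <- as.numeric(d$type == "prof")
    d$wc   <- as.numeric(d$type == "wc")
    # The three dummies plus the intercept are perfectly collinear,
    # so one coefficient cannot be estimated and is shown as NA.
    coef(lm(prestige ~ bc + prof + wc, data = d))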
We know that this is not a problem in our data, because if it were, we could not even have estimated the regression model; the fact that we got regression estimates indicates that we don't have a problem with the third assumption. The other assumptions are a bit more problematic, because they are about the error term, and we cannot observe the error term.
The fourth assumption is that the error term has an expected value of zero given any values of the independent variables. The fifth is that the error term has equal variance; this is the homoskedasticity assumption. And the sixth is that the error term is normally distributed. The way we test these three assumptions about the error term is to use the residuals as estimates of the error term. If an observation is far from the regression line in the population, so that it has a large value of the error term, then we can expect that it also has a large residual. So we can use the residuals as estimates of the error terms, and doing regression diagnostics normally means analyzing the residuals. That is quite natural: if you think of the residual as the part of the data that the model doesn't explain, and the idea of diagnostics is to check whether the model explains the data adequately, then it is quite natural to look at the part of the data the model doesn't explain for clues about what could go wrong.
I normally start with the normal Q-Q plot of the residuals. The normal Q-Q plot assesses whether the regression residuals are normally distributed. It compares the residuals, calculated here from standardized residuals, against the normal distribution. There are different kinds of residuals, and for an applied researcher it doesn't really matter to know them all; what's important is that your software will calculate the right kind of residual for you automatically when you do these plots. We can see here that the residuals roughly correspond to the normal distribution: the points follow the line, which indicates that the residuals are normally distributed. Here is a problem case: we have a chi-squared distributed error term, and the residuals are further from the mean than they are supposed to be. And here we have the inverse case, a uniform distribution of the error term, which creates this kind of S shape in the normal Q-Q plot. While the normality of the error term is not an important assumption in regression analysis, I nevertheless do this plot because it is quick to do, it identifies outliers for me, and it gives me a first look at the data.
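If you want to see what these three reference shapes look like, a small simulation sketch of my own (not lecture code) reproduces them:

    # Sketch: simulated Q-Q shapes for the three cases described above.
    set.seed(1)
    par(mfrow = c(1, 3))
    x <- rnorm(200);      qqnorm(x, main = "Normal errors");      qqline(x)
    x <- rchisq(200, 2);  qqnorm(x, main = "Chi-squared errors"); qqline(x)
    x <- runif(200);      qqnorm(x, main = "Uniform errors");     qqline(x)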
Here, with the actual data, I can see that the residuals follow a normal distribution. I'm happy with this; it is an indication of a well-fitting model with respect to the sixth assumption. R labels these possible outliers. Newsboys have a large negative residual: newsboys are less prestigious than the model predicts. Farmers are more prestigious than the model predicts; farmers don't make much money and you don't need a high level of education to be a farmer, but farmers are still appreciated a lot. That is another extreme case. The normal Q-Q plot shows that the residuals are roughly normally distributed, and that's a good thing. We conclude that there are no problems, and then we start looking at more complicated plots.
The next plot is the residuals versus fitted plot. The idea of the residuals versus fitted plot is that it allows us to check for nonlinearities and heteroskedasticity in the data. The fitted value is calculated from the regression equation: we multiply the variables by the regression coefficients, and then we plot the residuals against the fitted values. Ideally there is no pattern here and the residuals are just spread out across the fitted values; that is an indication of a model that fits well in this regard. Here we have a heteroskedasticity problem.
That plot contains data where the variation of the residual, and therefore of the error term, is much smaller here in the middle and then opens up to the left and to the right. This is a butterfly shape of residuals, and it is the worst kind of heteroskedasticity problem that you could have. It is not very realistic, though, because it is difficult to think of what kind of process would generate this kind of data. Then here we have nonlinearity and some heteroskedasticity problems: this is a megaphone opening to the right, and it appears that there is slight nonlinearity as well. Here we have severe nonlinearity; the right shape is not a line but a curve. And this is a weird-looking dataset that has both a nonlinearity problem and a heteroskedasticity problem.
In these plots, we want to see something that looks like no pattern. Typically, in diagnostic plots that plot the residual against something else, you are looking for the absence of a pattern. Our residuals versus fitted plot looks like that. The observations with residuals that are large in absolute value are again marked. We can also see from the fitted values that there are only a few professions for which the model predicts high prestige; most observations are between 30 and 70. What can we infer from this plot? We can infer that maybe the variance of the residuals decreases slightly toward the right. We don't have many observations there, so we don't know whether the dispersion is actually the same and we just happen to observe two values from it. But it is possible that the dispersion here is slightly less than the dispersion there, so we may have a heteroskedasticity problem.
The fifth assumption may therefore not hold. Whether the problem is severe enough to warrant using heteroskedasticity-robust standard errors is a bit unclear, because this is not a clear case where we should use them.
Then we check for outliers. So far, we have been looking for evidence of heteroskedasticity and nonlinearity; we have found evidence of heteroskedasticity, but not really of nonlinearity. As the final step, we look for outliers using the fourth plot, the residuals versus leverage plot, which tells us which observations are influential. We are looking for observations that have both high leverage and a high residual. We have general managers, who have high leverage and a residual that is large in absolute value. We want to look for observations whose residuals are large in absolute magnitude. Stata, for example, uses the squared residual in this plot, because the squared residual always increases with the magnitude of the residual, which makes it easier to see which observations have large residuals. Here we must look for both large negative and large positive values, so it is not quite as simple as it would be with squared residuals. Ministers have high leverage, newsboys have a large residual, and general managers are here. Cook's distance is another measure of influence, and observations with a large Cook's distance are potential outliers. As in the Deephouse paper, to deal with these outliers we look at why the prestige of one occupation would be different from the others.
So, for example, general managers earn a lot of money, so their salaries are high, and therefore their predicted prestige should be high as well, because it depends on income. But their actual prestige is less than what the model predicts, which means that the model over-predicts their prestige because of their high income. That could be one reason to drop general managers, but you have to use your own judgment: this is 102 observations, so dropping one observation reduces our sample size by approximately 1%, and that could be consequential.
Leverage is, conceptually, the distance from the center of mass of the data, and Cook's distance is another measure of influence. Once we have identified outliers using this plot, we move on to the final plot, which is the added-variable plot. The added-variable plot quantifies the relationship between the dependent variable and one independent variable at a time. This plot is interesting: it takes education, the focal independent variable, regresses it on the other independent variables, and keeps the residual. That residual is the part of education that is not explained by income or the share of women. If you think about the Venn diagram presentation of regression analysis, this is the part of education that does not overlap with any of the other independent variables.
Then we do the same for prestige: we regress prestige on the other independent variables and take the residual, so we have what is unique to prestige and what is unique to education after partialling out the influence of all the other variables in the model. Then we draw a line through those points, and that line is the regression line of prestige on education. One way to calculate a regression coefficient is to regress both the dependent variable and the focal independent variable on all the other independent variables, and then run a regression using just those two residuals; it produces the exact same result as including education together with all the other variables directly in a multiple regression analysis. This plot allows us to look for nonlinearities and heteroskedasticity in a more refined manner. What we can identify here is that the effect of income looks pretty weird. We want the observations to form a band around the regression line, and here it looks more like a curve: it goes up and then flattens out a bit. We also have much more dispersion here than over there.
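In R, added-variable plots are typically drawn with the car package, and the partialling-out logic described above can be checked by hand; a sketch:

    # Sketch: added-variable plots for each independent variable (car package).
    library(car)
    avPlots(model)

    # The same partialling-out idea by hand for education: regress both prestige and
    # education on the other predictors, then regress residual on residual.
    # The slope equals the education coefficient from the multiple regression.
    e_y <- resid(lm(prestige  ~ income + women, data = Prestige))
    e_x <- resid(lm(education ~ income + women, data = Prestige))
    coef(lm(e_y ~ e_x))["e_x"]   # same as coef(model)["education"]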
Now we have done the diagnostics: we did the normal Q-Q plot, the residuals versus fitted plot, the influence or outlier plot, and the added-variable plot. Now we must decide what we want to do with the model. One idea we could try is to use heteroskedasticity-robust standard errors, but our sample size is small and there is no clear evidence of a serious heteroskedasticity problem, so in this case I would probably use the conventional standard errors. Another idea is to drop general managers and see if the results change; even if we decide to keep general managers in our sample, that could work as a robustness check in the paper.
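Both ideas are straightforward to sketch in R; the sandwich and lmtest packages, the HC3 variant, and the general.managers row name are assumptions on my part, not stated in the lecture:

    # Sketch: heteroskedasticity-robust (HC3) standard errors for the same model.
    library(sandwich)
    library(lmtest)
    coeftest(model, vcov = vcovHC(model, type = "HC3"))

    # Sketch: robustness check, refitting the model without general managers.
    model_nogm <- update(model,
                         data = Prestige[rownames(Prestige) != "general.managers", ])
    summary(model_nogm)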
In Deephouse's paper, they estimated the same model with and without the one outlier observation and then compared the results. We should also consider a log transformation of income, because considering income in relative terms makes a lot more sense anyway. When you think of pay raises, for example, or you want to switch to a new job, you typically negotiate a salary increase relative to your current level. Also, how much an additional amount of salary increases your quality of life depends on your current salary level: if you give 1000 euros to somebody who makes 1000 euros per month, that's a big difference, but if you give 1000 euros to somebody who makes 5000 euros a month, it's a smaller difference. Income, company revenues, and similar quantities we typically want to consider in relative terms, and to do that we use the log transformation.
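A sketch of that final idea: refit the model with log-transformed income and repeat the diagnostics. Using the natural log here is my assumption; the video does not specify the base.

    # Sketch: refit with log-transformed income, then repeat the diagnostics.
    model_log <- lm(prestige ~ education + log(income) + women, data = Prestige)
    summary(model_log)
    par(mfrow = c(2, 2))
    plot(model_log)   # check whether the curvature in the income effect has gone away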