TU-L0022 - Statistical Research Methods D, Lecture, 25.10.2022-29.3.2023
Multicollinearity (19:37)
This video explains multicollinearity, another misunderstood feature of regression analysis.
Multicollinearity is another commonly misunderstood feature of regression analysis. Multicollinearity refers to a scenario where the independent variables are highly correlated. It is quite common to see studies run diagnostics to detect multicollinearity and then drop some variables from the model based on statistics that indicate that multicollinearity could be a problem. There are quite a lot of problems with that approach.
Let's take a look at Hekman's paper. They identified that customer race and customer gender were highly correlated with physician race and physician gender, and therefore they decided to drop customer gender and customer race from the analysis, because these variables were correlated at more than 0.9 and that caused a multicollinearity situation.
So what is this issue about, and why would one want to drop variables? Multicollinearity relates to the sampling variance of the OLS estimates, or more generally, of any estimator that estimates a linear model.
To understand multicollinearity, let's take a look at the variance of the OLS estimates. The variance of the OLS estimates is given by the equation shown here, and the same equation is also used for estimating the standard errors. This equation tells us that the variance of an estimate depends on how well the other independent variables explain the focal independent variable, whose coefficient's variance we are interested in.
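The equation referred to here is the standard textbook formula for the sampling variance of an OLS coefficient; the slide itself is not reproduced in this transcript, so the formula is written out below in its usual notation:

```latex
\operatorname{Var}(\hat{\beta}_j) \;=\; \frac{\sigma^2}{\mathrm{SST}_j \,(1 - R_j^2)}
```

Here σ² is the error variance, SST_j is the total sample variation in the focal variable x_j, and R_j² is the R square from regressing x_j on all the other independent variables.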
So when this R square goes up, the variance of the regression coefficient increases. The reason is that when this R square goes up, 1 - R square approaches 0, and when you multiply something by a quantity that approaches 0, the product approaches 0 as well; and when you divide by something that approaches zero, you get a large number. So when this R square increases, the focal variable is increasingly redundant in the model: it provides the same information as the other variables, and the standard error will increase.
In other words, when our independent variables are more correlated, the estimates will be less efficient and less precise, and the standard errors will be larger, because the standard error estimates the precision of the estimates. So is that a problem? Well, that depends.
Let's take an example of what happens when we have two highly correlated independent variables and what it means for the regression analysis results. We should expect that when two variables are highly correlated, the regression results will be very imprecise: if we repeated the study over and over many times, the dispersion of the estimates over the repeated samples would be large.
Here we have the correlation between x1 and x2 at 0.9, which is modeled based on Hekman's paper. Let's assume that the correlation between x1 and y varies between 0.43 and 0.52; this kind of dispersion could easily be the result of a small sample. Let's assume that 0.475 is the population value; with a sample size of, for example, 100, it is very easy to get a sample correlation of 0.43.
Then we have the correlation between x2 and y modeled the same way, and we have five combinations of correlations. These correlations vary a little: the correlation between x1 and y varies a little, and the correlation between x2 and y varies a little. But because x1 and x2 are so highly correlated, when we calculate the regression model using these correlations, the regression estimates vary widely. In one model the regression coefficient is -0.2 and in another it is +0.7; even the sign flips.
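These numbers can be reproduced from the correlations alone. Below is a minimal Python sketch using the correlation values from this example; the standardized coefficients are computed as the inverse of the predictor correlation matrix times the predictor-outcome correlations:

```python
import numpy as np

# Correlation between the two independent variables, as in the example
r12 = 0.9
R_xx = np.array([[1.0, r12],
                 [r12, 1.0]])

# Two of the correlation combinations with y discussed above
scenarios = {
    "equal effects": (0.475, 0.475),
    "sign flip":     (0.43,  0.52),
}

for name, (r1y, r2y) in scenarios.items():
    # Standardized regression coefficients: beta = R_xx^{-1} r_xy
    beta = np.linalg.solve(R_xx, np.array([r1y, r2y]))
    print(f"{name}: beta1 = {beta[0]:.2f}, beta2 = {beta[1]:.2f}")

# equal effects: beta1 = 0.25, beta2 = 0.25
# sign flip:     beta1 = -0.20, beta2 = 0.70
```

Even though the correlations with y change only slightly between the two scenarios, the coefficients swing from 0.25 and 0.25 to -0.2 and 0.7, because x1 and x2 are correlated at 0.9.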
The multicollinearity problem relates to the fact that because x1 and x2 are so highly correlated, it is very difficult to get the unique effect of x1: changes in x1 are always accompanied by changes in x2, so we don't know which one is responsible. Consider a company's size measured in revenue and in personnel; those are highly correlated, maybe not at 0.9, but still highly correlated.
So it is difficult to say, by statistical means alone, whether for example investment decisions depend more on the number of people or on the revenues of the company. So what's the problem? The problem here is that if we want to say that the effect beta 1 is 0.25 and not 0, we have to be able to differentiate between these correlations. How large a sample would we need to say for sure that the correlation is 0.475 instead of 0.45 or 0.5?
We have to understand the sampling variation of a correlation. The standard error, that is, the standard deviation of a correlation of 0.475, is shown here for different sample sizes; with a sample size of 100 it is about 0.05. So if our sample size is 100, we can easily get something like 0.45 or 0.50, which are less than one standard deviation away from this mean. We can easily get these kinds of correlations with a sample of 100.
So when our sample size is 100 and x1 and x2 are correlated at 0.9, we really cannot say which one of these coefficient sets is the correct one, because our sample size does not give us enough precision to say which of these correlations are the true population correlations that determine the population regression coefficients we are interested in. The fact that these two variables are highly correlated amplifies the effect of the sampling variation of the correlations: the sampling variation of the correlations here is small, but because x1 and x2 are highly correlated, it has a large effect on the regression coefficients.
To be sure that model 3 is actually the correct one, the difference between these correlations would need to be about two standard deviations, and at this sample size two standard deviations of the correlation would not be enough to tell one model apart from the next. We would need a sample size of about 3000.
So when the independent variables are highly correlated, that is referred to as multicollinearity. It refers to the correlation between the independent variables; it has nothing to do with the dependent variable. It increases the sample size required for us to estimate effects precisely, and this inflation of the variance of the estimates is quantified by the variance inflation factor.
What the variance inflation factor quantifies is how much larger the variance of an estimate is compared to a hypothetical scenario where the variable is uncorrelated with every other independent variable. The variance inflation factor is defined as 1 divided by 1 - R square, where the R square comes from regressing the focal variable on all the other independent variables. It is this 1 - R square term that matters: when it goes to 0, the variance inflation factor goes to infinity, and when it is exactly 1, the variance inflation factor is 1, which means that multicollinearity is not present at all in the model.
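Written out in the same notation as the variance formula above, this definition is:

```latex
\mathrm{VIF}_j \;=\; \frac{1}{1 - R_j^2},
\qquad
\operatorname{Var}(\hat{\beta}_j) \;=\; \frac{\sigma^2}{\mathrm{SST}_j}\cdot \mathrm{VIF}_j
```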
There is a rule of thumb that many people use: the variance inflation factor should not exceed 10, and if it does, we have a problem; if it does not, we do not have a problem. On the previous slide I showed you that a 0.9 correlation between two variables makes it very hard to say which one of them is the actual effect, because they covary so strongly. So what is the variance inflation factor when the correlation of x1 and x2 is 0.9? With two independent variables, the R square is simply the square of their correlation, 0.9 to the second power. We plug that number in, do some math, and we get a variance inflation factor of 5.26.
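Plugging the numbers from the example into the formula:

```latex
\mathrm{VIF} \;=\; \frac{1}{1 - 0.9^2} \;=\; \frac{1}{0.19} \;\approx\; 5.26
```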
So in the previous example, we would have needed 3000 observations to say for sure that model 3 was the correct model and not model 2 or model 4. But the variance inflation factor, with a cut-off of 10, would not have detected the multicollinearity issue that we had. So what does that say about this rule? It is not a very useful rule.
Ketokivi and Guide make a good point about this rule, and about such rules in general, in a Journal of Operations Management editorial from 2015. When Ketokivi and Guide took over the Journal of Operations Management as editors-in-chief, they first published an editorial on the methodological standards for the journal. They identified some problems and also places for improvement: what you should not do and what you should do. They emphasized that you always have to contextualize all your statistics. For example, when you say that a regression coefficient is 0.2, whether it is a large effect or not depends on the scales of both variables, and it also depends on the context.
If you get a thousand euros more per year for each additional year of education, that is a big effect for one person and a small effect for another, depending on where the person lives and how much they earn. So the interpretation of all of these statistics requires context, and Ketokivi and Guide take aim at the variance inflation factor as well.
The variance inflation factor quantifies how much larger the variance is compared to a situation where there is no multicollinearity whatsoever between the independent variables. They say that if the standard errors from your analysis are small, then who cares that they could be even smaller if your independent variables were completely uncorrelated, which is an unrealistic scenario anyway.
If the standard errors indicate that the estimates are precise, then they are precise, and that is what we care about, so the variance inflation factor does not really tell us anything useful. On the other hand, they also say that in some scenarios the rule of thumb that the variance inflation factor must not exceed 10 is not strict enough. In the previous example we saw a 0.9 correlation, corresponding to a variance inflation factor of about 5.3, which made it a lot more difficult for us to identify which one of those models was correct. So we had a collinearity issue that was not detected by the variance inflation factor.
As Ketokivi and Guide say, stating that the variance inflation factor must not exceed a cut-off, without considering the context, is nonsense. That is what they say, and I agree with that statement fully. You always have to contextualize what a statistic means in your particular study.
Wooldridge also takes some shots at the variance inflation factor and multicollinearity. This is from the fourth edition of his introductory econometrics textbook; he did not address multicollinearity in the first three editions of the book because he thinks it is not a useful or important enough concept. Regression analysis does not make any assumptions about multicollinearity beyond requiring that each independent variable contributes some unique information.
So the variables can be highly correlated, just not perfectly correlated, and regression makes no assumptions beyond that. He decided to take up this issue because there is so much bad advice about multicollinearity. He says that the typical treatments of multicollinearity are wrongheaded: people explain that it is a problem and that if you have a variance inflation factor of more than 10 you have to drop variables, without really explaining what the problem is or what the consequences of dropping variables from your model are.
Let's now take a look at what it means to solve a multicollinearity problem. Multicollinearity is a problem in the same sense that a fever is a disease: it is not really a problem per se, it is a symptom, and you do not treat the symptom, you treat the disease. If you have a child who has a fever, cooling the child down by putting them outside in the cold is typically not the right treatment. You have to look at what the cause of the fever, or of the multicollinearity, is and fix the cause instead of trying to fix the symptom.
The typical solution for a multicollinearity problem, that is, for making x1 and x2 less correlated, is to just drop one of them from the model. So let's say we drop x2 from the model. In the previous example the correct model was that both effects were 0.25. If we drop x2, the estimate for x1 will reflect the influence of both x1 and x2. What happens is that we overestimate the regression coefficient beta 1 by 90%, while the standard errors get smaller, so we will have a false sense of accuracy around this severely biased estimate.
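The size of that overestimate can be checked with the standard omitted-variable bias formula; with standardized variables the slope of x2 on x1 equals their correlation, so using the numbers from the example:

```latex
\tilde{\beta}_1 \;\approx\; \beta_1 + \beta_2\, r_{x_1 x_2} \;=\; 0.25 + 0.25 \times 0.9 \;=\; 0.475
```

which is 90% larger than the true value of 0.25.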
Also, if you have control variables that are collinear with one another, that is irrelevant, because typically we just want to know how much of the variation of the dependent variable is explained jointly by those controls; we are not really interested in which one of the controls actually explains the dependent variable. Collinearity between the interesting variables and the controls is important, but collinearity among the controls themselves does not matter.
So treating collinearity as a problem is the same thing as treating a fever as a disease: it is not a smart thing to do. We have to understand the reasons why two variables are so highly correlated that we cannot really say which one is the cause of the dependent variable.
There are a couple of reasons why that could happen. Multicollinearity could occur because you have mindlessly added a lot of variables to the model, and you should not be adding variables mindlessly. All variables that go into your model must be based on theory, so just throwing a hundred variables into the model typically does not make sense. Your models are built to test theory, and they must therefore be driven by theory: whatever you think has a causal effect on the y variable must go into the model, and you must also be able to explain why, that is, the mechanism through which the independent variable causally influences the dependent variable.
So that is the first case: you have just been mindlessly data mining, and that is a problem. Multicollinearity is not the problem here; the problem is that you are making stupid modeling decisions. The second case is that you have distinct constructs but their measures are highly correlated, and here the primary problem is not multicollinearity but discriminant validity. If two measures of things that are supposed to be distinct are highly correlated, it is a problem of measurement validity. I will address that in a later video.
The third case is that you have two measures of the same construct in the model. For example, if you are studying the effect of a company's size, you might have both revenue and personnel in the model as measures of firm size. It is not a good idea to have two measures of the same thing in the model. To take an extreme example, assume that we want to study the effect of a person's height on their weight and we have two measures of height: centimetres and inches. It does not make any sense to try to get the effect of inches independent of the effect of centimetres; in fact, that cannot even be estimated.
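Here is a minimal sketch of why that is, using hypothetical data; the point is simply that one column of the design matrix is an exact linear function of another:

```python
import numpy as np

rng = np.random.default_rng(1)
height_cm = rng.normal(175, 10, size=50)   # hypothetical heights in centimetres
height_in = height_cm / 2.54               # exactly the same information, in inches

# Design matrix with an intercept and both height measures
X = np.column_stack([np.ones(50), height_cm, height_in])

# The columns are linearly dependent, so X has rank 2 instead of 3 ...
print(np.linalg.matrix_rank(X))            # prints 2
# ... which means the normal equations have no unique solution: the separate
# effects of centimetres and inches on weight simply cannot be estimated.
```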
So if you have multiple measures of the same thing, you should typically first combine them into a single composite measure. I will cover that later on.
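As a rough illustration of the idea, one common approach is to standardize each measure and average the z-scores; the sketch below uses hypothetical revenue and personnel data and is only one way of forming a composite, not the approach covered later in the course:

```python
import numpy as np

def composite(*measures):
    """Standardize each measure to z-scores and average them into one composite."""
    z = [(m - m.mean()) / m.std(ddof=1) for m in measures]
    return np.mean(z, axis=0)

# Hypothetical firm size measured both as revenue and as personnel headcount
rng = np.random.default_rng(2)
personnel = rng.lognormal(mean=5, sigma=1, size=200)
revenue = personnel * rng.lognormal(mean=2, sigma=0.3, size=200)

# One firm-size variable to use in the regression instead of both raw measures
size = composite(np.log(revenue), np.log(personnel))
```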
The final case is that you are genuinely interested in two closely related constructs and their distinct effects. For example, you want to know whether a physician's age or tenure influences the customer satisfaction scores that patients give, as in Hekman's study. Then you really cannot drop either one of them. You cannot say that, because tenure and age are highly correlated, we are just going to omit tenure and assume that all the correlation between age and customer satisfaction is due to age alone and tenure has no effect. That is not the right choice. Instead, you simply have to increase the sample size so that you can answer your complex research question in a precise manner.