TU-L0022 - Statistical Research Methods D, Lecture, 2.11.2021-6.4.2022
Multicollinearity (19:37)
This video explains multicollinearity, another misunderstood feature of regression analysis.
Multicollinearity is another commonly misunderstood feature of regression analysis.
Multicollinearity
refers to a scenario where the independent variables are highly
correlated. It is quite common to see studies run diagnostics to detect multicollinearity and then drop variables from the model based on statistics that indicate multicollinearity could be a problem. There are quite a few difficulties with that approach.
Let's take a look at Hekman's paper. They identified that customer race and customer gender were highly correlated with physician race and physician gender, and therefore they decided to drop customer gender and customer race from the analysis, because these variables, correlated at more than 0.9, created a multicollinearity situation.
So what is this issue about, and why would one want to drop variables? Multicollinearity relates to the sampling variance of the OLS estimates, or more generally of any estimator of a linear model.
To understand multicollinearity, let's take a look at the variance of the OLS estimates. The variance of the OLS estimates is given by the equation shown on the slide, and the same equation is used for estimating the standard errors. It tells us that the variance of an estimate depends on how well the other independent variables explain the focal independent variable, the one whose coefficient's variance we are interested in.
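The slide's exact notation is not reproduced here, but in the standard textbook form (following, for example, Wooldridge's treatment) the sampling variance of the coefficient of the j-th independent variable is

$$\operatorname{Var}(\hat{\beta}_j) \;=\; \frac{\sigma^2}{\mathrm{SST}_j\,(1 - R_j^2)},$$

where $\sigma^2$ is the error variance, $\mathrm{SST}_j$ is the total sample variation in $x_j$, and $R_j^2$ is the R squared from regressing $x_j$ on all the other independent variables.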
So when this R squared goes up, the variance of the regression coefficient increases. The reason is that the denominator multiplies the variation in the focal variable by 1 minus this R squared: as the R squared approaches 1, 1 minus R squared approaches 0, so the denominator approaches 0, and dividing by something that approaches zero produces a large number. In other words, when this R squared increases, when the focal variable is increasingly redundant in the model and provides the same information as the other variables, the variance of its estimate will increase.
So when our independent variables are more correlated, the estimates will be less efficient and less precise, and the standard error will be larger, because the standard error estimates the precision of the estimates. Is that a problem? Well, that depends.
Let's take an example of what happens when we have two highly correlated independent variables and what that means for the regression results. When two variables are highly correlated, we should expect the regression results to be very imprecise: if we repeated the study many times, the dispersion of the estimates over the repeated samples would be large.
Here we have the correlation between x1 and x2 at 0.9, modeled after Hekman's paper. Let's assume that the correlation between x1 and y varies between 0.43 and 0.52; this kind of dispersion could easily result from a small sample. If 0.475 is the population value, then with a sample size of, for example, 100, it is very easy to get a sample correlation of 0.43.
Then we have the correlation between x2 and y modeled the same way, and we have five combinations of correlations. These correlations vary only a little: the correlation between x1 and y varies a little and the correlation between x2 and y varies a little. But because x1 and x2 are so highly correlated, when we calculate the regression model from these correlations, the regression estimates vary widely. In one model the regression coefficient is -0.2 and in another it is +0.7; even the sign flips.
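As a minimal sketch of this instability (not the exact simulation behind the slides; the 0.9 correlation, the sample size of 100, the true coefficients of 0.25, and the implied correlation of about 0.475 with y come from the example, everything else is assumed), the following Python snippet repeatedly draws small samples and fits the regression:

```python
import numpy as np

rng = np.random.default_rng(2021)
n, reps = 100, 5                 # sample size and number of repeated samples
r12 = 0.9                        # correlation between x1 and x2 (from the example)
b1, b2 = 0.25, 0.25              # assumed true coefficients (from the example)
# error variance chosen so that Var(y) = 1, which gives cor(x1, y) of about 0.475
sigma_e = np.sqrt(1 - (b1**2 + b2**2 + 2 * b1 * b2 * r12))

for i in range(reps):
    X = rng.multivariate_normal([0.0, 0.0], [[1.0, r12], [r12, 1.0]], size=n)
    y = b1 * X[:, 0] + b2 * X[:, 1] + rng.normal(scale=sigma_e, size=n)
    Xd = np.column_stack([np.ones(n), X])        # design matrix with intercept
    est = np.linalg.lstsq(Xd, y, rcond=None)[0]  # OLS estimates
    print(f"sample {i + 1}: b1_hat = {est[1]:+.2f}, b2_hat = {est[2]:+.2f}")
```

Because x1 and x2 move almost in lockstep, the fit can attribute their joint effect to either one, so the estimates swing widely around 0.25 from sample to sample, occasionally even flipping sign as in the five models on the slide.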
Now, the multicollinearity problem relates to the fact that because x1 and x2 are so highly correlated, it is very difficult to get the unique effect of x1: changes in x1 are always accompanied by changes in x2, so we don't know which one drives the outcome. Consider company size, for example: size in revenue and size in personnel are highly correlated, perhaps not at 0.9, but still highly correlated.
So it is difficult to say by statistical means alone whether, for example, investment decisions depend more on the number of people or on the revenues of the company. So what's the problem? The problem here is that if we want to say that the effect of beta 1 is 0.25 and not 0, we have to be able to differentiate between these correlations. How large a sample would we need to say for sure that the correlation is 0.475 instead of 0.45 or 0.5?
To answer that, we have to understand the sampling variation of a correlation. The standard deviation of a sample correlation around a population value of 0.475 depends on the sample size; with a sample size of 100 it is about 0.05. So if our sample size is 100, we can easily get something like 0.45 or 0.5, which are less than one standard deviation from this mean. We can easily get those kinds of correlations with a sample of 100.
So when our sample size is 100 and x1 and x2 are correlated at 0.9, we really cannot say which of these sets of coefficients is the correct one, because our sample does not give us enough precision to say which of these correlations are the true population correlations that determine the population regression coefficients we are interested in. The fact that the two independent variables are highly correlated amplifies the effect of the sampling variation of these correlations: the sampling variation of the correlations is small, but because x1 and x2 are so highly correlated, it has a large effect on the regression coefficients.
To be sure that model 3 is actually the correct one, we would need the differences between these correlations to span at least two standard deviations, so that two standard deviations of sampling variation in a correlation could not take us from one model to the next. For that, we would need a sample size of 3000.
So when independent variables are highly correlated, that is referred to as multicollinearity. It refers to correlation between the independent variables only; it has nothing to do with the dependent variable. Multicollinearity increases the sample size required to estimate effects precisely, and this inflation of the variance of the estimates is quantified by the variance inflation factor.
What the variance inflation factor quantifies is how much larger the variance of an estimate is compared to a hypothetical scenario where the focal variable is uncorrelated with every other independent variable. The variance inflation factor is defined as 1 divided by 1 minus the R squared from regressing the focal variable on all the other independent variables, the same 1 minus R squared term as in the variance equation. When 1 minus R squared goes to 0, the variance inflation factor goes to infinity; when it is exactly 1, the variance inflation factor is 1, which means that multicollinearity is not present in the model at all.
There's a rule of thumb that many people use: the variance inflation factor should not exceed 10, and if it does, we have a problem; if it doesn't, we don't. In the previous slides I showed you that a 0.9 correlation between two variables makes it very hard to say which one has the actual effect, because they covary so strongly. So what is the variance inflation factor when the correlation of x1 and x2 is 0.9? With two independent variables, the R squared is simply the square of their correlation, 0.9 to the second power, or 0.81. We plug that number into the formula, do the math, and get a variance inflation factor of 5.26.
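As a quick check of that arithmetic, here is a small Python sketch (the statsmodels helper is an assumption about tooling, not something used in the lecture) that computes the variance inflation factor both from the formula and from simulated data:

```python
import numpy as np
from statsmodels.stats.outliers_influence import variance_inflation_factor

r12 = 0.9
# With two predictors, the R squared of one on the other is just r12 squared
print(f"VIF from 1 / (1 - r^2): {1 / (1 - r12**2):.2f}")   # 5.26

# The same quantity estimated from simulated data
rng = np.random.default_rng(0)
X = rng.multivariate_normal([0, 0], [[1, r12], [r12, 1]], size=100_000)
Xd = np.column_stack([np.ones(len(X)), X])                  # constant, x1, x2
print(f"VIF of x1 from data:    {variance_inflation_factor(Xd, 1):.2f}")
```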
So in the previous example we would have needed 3000 observations to say for sure that model 3, and not model 2 or model 4, was the correct model. Yet a variance inflation factor of 5.26, well below 10, would not flag the multicollinearity issue we had. So what does that say about this rule? It is not a very useful rule.
Ketokivi and Guide make a good point about this rule, and about rules of thumb in general, in a Journal of Operations Management editorial. This is from 2015, when Ketokivi and Guide took over the Journal of Operations Management as editors-in-chief and first published an editorial on the methodological standards for the journal. They identified some problems as well as places for improvement: what you should not do and what you should do. They emphasized that you always have to contextualize all your statistics. For example, when you say that a regression coefficient is 0.2, whether that is a large effect or not depends on the scales of both variables, and it also depends on the context.
If you earn a thousand more in salary per year for each additional year of education, that is a big effect for one person and a small effect for another, depending on where the person lives and how much they already make. The interpretation of all these statistics requires context, and Ketokivi and Guide take aim at the variance inflation factor as well.
The variance inflation factor quantifies how much larger the variance is compared to a scenario with no multicollinearity whatsoever between the independent variables. Ketokivi and Guide's point is that if the standard errors from your analysis are small, who cares that they could be even smaller if your independent variables were completely uncorrelated, which is an unrealistic scenario anyway.
If the standard errors indicate that the estimates are precise, then the estimates are precise, and that is what we care about; the variance inflation factor does not really tell us anything useful on top of that. On the other hand, they also note that in some scenarios the rule of thumb that the variance inflation factor must not exceed 10 is not strict enough. In the previous example we saw that a 0.9 correlation, corresponding to a variance inflation factor of only 5.26, made it a lot more difficult for us to identify which of those models was correct. So we had a collinearity issue that was not detected by the variance inflation factor.
As Ketokivi and Guide say, stating that the variance inflation factor must not exceed a cut-off, without considering the context, is nonsense. I agree with that statement fully: you always have to contextualize what a statistic means in your particular study.
Wooldridge also takes some shots at the variance inflation factor and multicollinearity. This is from the fourth edition of his introductory econometrics textbook; he did not address multicollinearity in the first three editions because he does not consider it a useful or important enough concept. Regression analysis does not make any assumptions about multicollinearity beyond requiring that each independent variable contributes some unique information.
So the variables can be highly correlated, as long as they are not perfectly correlated, and regression makes no assumptions beyond that. Wooldridge decided to take up the issue because there is so much bad advice about multicollinearity. He says that explanations of multicollinearity are typically wrongheaded: people explain that it is a problem and that if the variance inflation factor is more than 10 you have to drop variables, without really explaining what the problem is or what the consequences of dropping variables from your model are.
Let's now take a look at what it means to solve a multicollinearity problem. Multicollinearity is a problem in the same sense that a fever is a disease: it is not really a problem per se, it is a symptom, and you don't treat the symptom, you treat the disease. If you have a child with a fever, cooling the child down by putting them outside in the cold is typically not the right treatment. You have to look at what causes the fever, or the multicollinearity, and fix the cause instead of trying to fix the symptom.
The typical "solution" to a multicollinearity problem, how do we make x1 and x2 less correlated, is simply to drop one of them from the model. Let's say we drop x2. In the previous example the correct model had both effects at 0.25. If we drop x2, the estimate for x1 will reflect the influence of both x1 and x2, so we will overestimate the regression coefficient beta 1 by about 90%, and the standard error will be smaller. We will have a false sense of accuracy about a severely biased estimate.
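Here is a minimal sketch of that omitted-variable effect, using the same illustrative numbers as before (true coefficients of 0.25 and a 0.9 correlation between x1 and x2; the large simulated sample is only there to make the bias visible without sampling noise). With standardized predictors, dropping x2 pushes roughly cor(x1, x2) times beta 2 of x2's effect into beta 1, which is where the roughly 90% overestimate comes from:

```python
import numpy as np

rng = np.random.default_rng(7)
n, r12, b1, b2 = 100_000, 0.9, 0.25, 0.25      # large n so sampling noise is negligible
X = rng.multivariate_normal([0, 0], [[1, r12], [r12, 1]], size=n)
y = b1 * X[:, 0] + b2 * X[:, 1] + rng.normal(size=n)

full = np.linalg.lstsq(np.column_stack([np.ones(n), X]), y, rcond=None)[0]
short = np.linalg.lstsq(np.column_stack([np.ones(n), X[:, 0]]), y, rcond=None)[0]
print(f"b1 with x2 in the model: {full[1]:.3f}")   # close to 0.25
print(f"b1 with x2 dropped:      {short[1]:.3f}")  # close to 0.25 + 0.9 * 0.25 = 0.475
```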
Also, if you have control variables that are collinear with one another, that is irrelevant, because typically we just want to know how much of the variation of the dependent variable the controls explain jointly; we are not really interested in which individual control explains the dependent variable.
Collinearity between the interesting variables and the controls is important, but collinearity among the controls alone does not matter.
Okay, so treating collinearity as a problem is the same thing as treating a fever as a disease; it is not a smart thing to do. We have to understand the reasons why two variables are so highly correlated that we cannot really say which one is the cause of the dependent variable.
There are a couple of reasons why that could happen. Multicollinearity could occur because you have mindlessly added a lot of variables to the model, and you should not be mindlessly adding variables. All variables that go into your model must be based on theory; just throwing a hundred variables into a model typically does not make sense. Your models are built to test theory, so they must be driven by theory: whatever you think has a causal effect on the y variable goes into the model, and you must also be able to explain why, that is, the mechanism through which each independent variable causally influences the dependent variable.
So that is the first reason: you have just been mindlessly data mining, and that is a problem, but multicollinearity is not the problem; the problem is that you are making poor modeling decisions. The second reason is that you have distinct constructs whose measures are highly correlated, and here the primary problem is not multicollinearity but discriminant validity. If two measures of things that are supposed to be distinct are highly correlated, that is a problem of measurement validity. I will address that in a later video.
The third reason is that you have two measures of the same construct in the model. For example, if you are studying the effect of a company's size, you might have both revenue and personnel as measures of firm size in the model. It is not a good idea to have two measures of the same thing in the same model.
Let's take an extreme example: assume we want to study the effect of a person's height on their weight, and we have two measures of height, centimetres and inches. It does not make any sense to try to estimate the effect of inches independent of the effect of centimetres; in fact, that cannot even be estimated.
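To see why it cannot be estimated, here is a small illustrative sketch with made-up data: inches are an exact linear function of centimetres, so the design matrix is rank deficient and the two effects are not separately identified.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 50
height_cm = rng.normal(175, 10, size=n)
height_in = height_cm / 2.54                   # exact linear function of height_cm
weight = -80 + 0.9 * height_cm + rng.normal(scale=5, size=n)  # made-up relationship

X = np.column_stack([np.ones(n), height_cm, height_in])
print("rank of design matrix:", np.linalg.matrix_rank(X))    # 2, not 3
print("condition number of X'X:", np.linalg.cond(X.T @ X))   # astronomically large
# Because the design matrix is rank deficient, the separate coefficients of
# centimetres and inches are not identified; statistical software will typically
# drop one of the columns or refuse to estimate the model.
```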
So if you have multiple measures of the same thing, you should typically first combine those multiple measures into a single composite measure. I will cover that later on.
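As a minimal sketch of one common way to build such a composite (standardize each measure and average them; the firm-size numbers below are purely hypothetical):

```python
import numpy as np

# Two hypothetical indicators of firm size
revenue = np.array([1.2e6, 3.4e6, 0.8e6, 5.1e6, 2.2e6])
personnel = np.array([15, 40, 9, 66, 30])

def standardize(x):
    """Center to mean 0 and scale to standard deviation 1."""
    return (x - x.mean()) / x.std()

# Composite size measure: mean of the standardized indicators
firm_size = (standardize(revenue) + standardize(personnel)) / 2
print(firm_size)
```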
The final case is that you are genuinely interested in two closely related constructs and their distinct effects. For example, you want to know whether a person's age or a person's tenure influences the customer satisfaction scores that patients give to their doctors, as in Hekman's study. Then you really cannot drop either one. You cannot say that because tenure and age are highly correlated, we are just going to omit tenure and assume that all correlation between age and customer satisfaction is due to age only and that tenure has no effect. That is not the right choice. Instead, you have to increase the sample size so that you can answer your complex research question with precision.