TU-L0022_aalto-CUR-141790-3063741: Centering and collinearity of interactions (9:33)



This video covers centering, a common method used to address multicollinearity. Centering is often done before forming the interaction term between independent variables. It changes the intercept and the first-order coefficients but not the interaction coefficient, so it doesn't solve potential problems with the interaction term, and it can lead to incorrect interpretations and predictions.


It's fairly common that authors center their variables before they form the interaction term. In this video I will take a look at whether that is actually necessary and whether you should do it or not. In Heckman's paper, the authors stated that they centered the variables to reduce multicollinearity. The idea behind centering and multicollinearity is that if you have X and M and you form the product of X and M, the product will be correlated with both X and M, because those two variables form the interaction, and by centering we can reduce those correlations.

So let's take a look at some data. We have two random variables here, x1 and x2. Here x1 and x2 have means of two, and here we have centered x1 and x2 to have means of 0. The idea of centering is that you take the original variable and subtract its mean, which makes the mean of the variable 0, and we then say that the variable is centered. The bar over the x denotes the mean of that variable, so the centered variable is x minus x-bar.
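A minimal sketch of centering in Python with NumPy. The data are simulated here; the seed, sample size, and variable names are illustrative assumptions and not the data shown in the lecture.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, chosen only for reproducibility
x1 = rng.normal(loc=2.0, scale=1.0, size=1000)  # simulated variable with mean near 2

# Centering: subtract the sample mean from every observation.
x1_centered = x1 - x1.mean()

print("mean before centering:", round(x1.mean(), 3))
print("mean after centering is numerically zero:", abs(x1_centered.mean()) < 1e-12)
print("spread is unchanged:", np.isclose(x1.std(), x1_centered.std()))
```

Only the location of the variable changes; its spread and its ordering of observations stay the same.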

Standardization is centering followed by dividing by the standard deviation. We can see here that x1 and x2 are not very strongly correlated; the scatterplot shows no particular pattern. But when we multiply x1 and x2 together, the product is highly correlated with both x1 and x2, so there's a strong statistical relationship. When we center the variables, the bivariate relationship between x1 and x2 stays the same.

But we can see that the relationship between x1 and x2 and their product is quite different. There's still a strong statistical relationship: when x1 or x2 is close to 0, there's almost no variation in the product, and the product spreads out as x1 and x2 move away from 0. So there's still a strong statistical association, but it is no longer a linear association.
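The pattern described above can be checked with a small simulation. This is a sketch under assumptions: two independent normal variables with mean 2 stand in for the lecture's data, and the correlation between absolute values is just one rough way (my choice, not the lecture's) to show that a non-linear association remains.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed
n = 5000
x1 = rng.normal(2.0, 1.0, n)  # simulated, independent of x2
x2 = rng.normal(2.0, 1.0, n)

def corr(a, b):
    """Pearson correlation between two 1-D arrays."""
    return np.corrcoef(a, b)[0, 1]

# Raw variables: the product is strongly linearly related to x1.
r_raw = corr(x1, x1 * x2)

# Centered variables: the linear correlation with the product largely disappears.
x1c = x1 - x1.mean()
x2c = x2 - x2.mean()
r_centered = corr(x1c, x1c * x2c)

# The association is still there, just not linear: the product spreads out
# as the centered variable moves away from 0.
r_spread = corr(np.abs(x1c), np.abs(x1c * x2c))

print(f"corr(x1, x1*x2), raw:          {r_raw:.2f}")
print(f"corr(x1, x1*x2), centered:     {r_centered:.2f}")
print(f"corr(|x1|, |x1*x2|), centered: {r_spread:.2f}")
```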

So what's the implication of centering for regression analysis? On the left-hand side we have the regression analysis for the data that is not centered, and on the right-hand side we have the regression analysis for the centered data. We can see that what centering does for the regression of Y on x1 and x2 is that it just changes the intercept.

So only the intercept is different, and the first-order effects of x1 and x2 are the same. That is quite natural, because when you center you're simply subtracting something from x1 and something from x2, and because you subtract the same number for every observation, that only alters the intercept. It doesn't affect the covariance of x1 with x2 or the covariances of those two variables with Y. Those are unaffected by centering. So centering only affects means, and in normal regression analysis it only affects the intercept.
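A short numerical check of this point, assuming simulated data and a hypothetical helper `fit_ols` built on NumPy's least-squares solver (neither comes from the lecture):

```python
import numpy as np

def fit_ols(y, *predictors):
    """Hypothetical helper: least-squares coefficients [intercept, b1, b2, ...]."""
    X = np.column_stack([np.ones(len(y)), *predictors])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef

rng = np.random.default_rng(0)  # arbitrary seed
n = 500
x1 = rng.normal(2.0, 1.0, n)
x2 = rng.normal(2.0, 1.0, n)
# Assumed data-generating model without an interaction.
y = 1.0 + 0.5 * x1 + 0.3 * x2 + rng.normal(0.0, 1.0, n)

b_raw = fit_ols(y, x1, x2)
b_cen = fit_ols(y, x1 - x1.mean(), x2 - x2.mean())

print("raw      [intercept, x1, x2]:", np.round(b_raw, 3))
print("centered [intercept, x1, x2]:", np.round(b_cen, 3))
print("slopes identical:", np.allclose(b_raw[1:], b_cen[1:]))
print("centered intercept equals mean of y:", np.isclose(b_cen[0], y.mean()))
```

With centered predictors the intercept becomes the mean of Y, because the intercept is the predicted value when all predictors are 0, and 0 is now the mean of each predictor.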

The downside of centering shows up once we calculate predictions. The predictions from the model for the non-centered data are on the original metric, so we get predictions on whatever scale Y is on. If we calculate predictions using the model for the centered data, the predictions will be off by the amount that we centered. For example, if we're predicting salary and the first model gives 10,000 euros per year, the centered model could give minus 2,000 euros, which doesn't make sense unless we convert the result back to the non-centered variables.
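The prediction problem can be illustrated as follows. This is a sketch with simulated data; `fit_ols` is a hypothetical helper and the new observation (x1 = 3, x2 = 1) is an arbitrary example value.

```python
import numpy as np

def fit_ols(y, *predictors):
    """Hypothetical helper: least-squares coefficients [intercept, b1, b2, ...]."""
    X = np.column_stack([np.ones(len(y)), *predictors])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef

rng = np.random.default_rng(0)  # arbitrary seed
n = 500
x1 = rng.normal(2.0, 1.0, n)
x2 = rng.normal(2.0, 1.0, n)
y = 1.0 + 0.5 * x1 + 0.3 * x2 + rng.normal(0.0, 1.0, n)

m1, m2 = x1.mean(), x2.mean()
b_raw = fit_ols(y, x1, x2)
b_cen = fit_ols(y, x1 - m1, x2 - m2)

new_x1, new_x2 = 3.0, 1.0  # arbitrary new observation on the original metric

pred_raw = b_raw @ np.array([1.0, new_x1, new_x2])
# Mistake: plugging original-metric values into the centered model.
pred_wrong = b_cen @ np.array([1.0, new_x1, new_x2])
# Fix: apply the same centering (training means) to the new values first.
pred_fixed = b_cen @ np.array([1.0, new_x1 - m1, new_x2 - m2])

print("non-centered model:           ", round(pred_raw, 3))
print("centered model, raw inputs:   ", round(pred_wrong, 3))
print("centered model, inputs shifted:", round(pred_fixed, 3))
```

The wrong prediction differs from the correct one by exactly `b1 * m1 + b2 * m2`, which is the amount introduced by centering.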

So centering makes calculating predictions, and drawing plots that rely on predictions, more difficult. That's important for interactions, for reasons that I'll explain in the last slide. When we add an interaction term, we can see that there are now some more differences. Importantly, the differences are only in the first three coefficients.

The intercept is again different, which is expected, but now the x1 and x2 coefficients are also different, while the coefficient of the interaction of x1 and x2 is exactly the same number. So centering doesn't influence the interaction term at all. It influences only the intercept and the first-order coefficients. So is that something that you want to do or not? To answer that question we have to consider what exactly centering means, and what exactly it means that we have this interaction term here.
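The same comparison with an interaction term, as a sketch on simulated data (the coefficients of the data-generating model and the helper `fit_ols` are assumptions made for illustration):

```python
import numpy as np

def fit_ols(y, *predictors):
    """Hypothetical helper: least-squares coefficients [intercept, b1, b2, ...]."""
    X = np.column_stack([np.ones(len(y)), *predictors])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef

rng = np.random.default_rng(0)  # arbitrary seed
n = 500
x1 = rng.normal(2.0, 1.0, n)
x2 = rng.normal(2.0, 1.0, n)
# Assumed data-generating model with an interaction.
y = 1.0 + 0.5 * x1 + 0.3 * x2 + 0.4 * x1 * x2 + rng.normal(0.0, 1.0, n)

m1, m2 = x1.mean(), x2.mean()
x1c, x2c = x1 - m1, x2 - m2

b_raw = fit_ols(y, x1, x2, x1 * x2)      # [intercept, x1, x2, x1*x2]
b_cen = fit_ols(y, x1c, x2c, x1c * x2c)  # same model on centered variables

print("raw:     ", np.round(b_raw, 3))
print("centered:", np.round(b_cen, 3))
print("interaction coefficient identical:", np.isclose(b_raw[3], b_cen[3]))
# The centered first-order coefficient is the raw one shifted by
# (interaction coefficient) * (mean of the other variable).
print("x1 coefficient shifted by b3 * mean(x2):",
      np.isclose(b_cen[1], b_raw[1] + b_raw[3] * m2))
print("x2 coefficient shifted by b3 * mean(x1):",
      np.isclose(b_cen[2], b_raw[2] + b_raw[3] * m1))
```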

Let's take a look at a graph. Here, without centering, the x1 and x2 effects are the effects when the other variable is 0, and here, with centering, the x1 and x2 effects are the effects at the mean. So the x1 effect is the effect of x1 when x2 is at its mean, and the x2 effect is the effect of x2 when x1 is at its mean. What that means can be understood by looking at this graphically. So we have here a space, and there is a plane in the space. We have x1 on this axis, x2 on this axis, and Y here.

So when we have two variables in a regression analysis as independent variables, the regression is a plane in three-dimensional space, and we can see the plane here. Because of the interaction, the strength of the effect of x1 on Y is contingent on the value of x2. So here, when x2 is at 0, Y increases only a little with x1, so the effect is not that great.

When x2 is at 5, the effect is a lot greater, so we see a much steeper slope here. So the idea is that the regression slope of x1 changes as a function of x2. The intercept also changes, so this line goes down here. So what does centering do? A regression with an interaction gives you the effect of x1, the effect of x2, and the effect of their product.

When we don't center our data, the effect of x1 is this blue line here, so it's the effect of x1 when x2 is 0. Similarly, the effect of x2 is the effect of x2 when x1 is 0. When we center, instead of taking the effect of x1 when x2 is 0, we take the effect of x1 when x2 is at its mean. So we take this green line in the middle. So centering just influences which of these possible lines we take the effect from: here, here, or perhaps all the way at the other end of the data. It just changes which part of the regression plane we are looking at.
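The relationship between the two sets of coefficients can be written out explicitly. The notation is an assumption introduced here, not taken from the slides: β denotes the coefficients of the non-centered model and γ those of the centered model.

```latex
\begin{align*}
y &= \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_1 x_2 + \varepsilon \\
y &= \gamma_0 + \gamma_1 (x_1 - \bar{x}_1) + \gamma_2 (x_2 - \bar{x}_2)
     + \gamma_3 (x_1 - \bar{x}_1)(x_2 - \bar{x}_2) + \varepsilon \\[4pt]
\frac{\partial y}{\partial x_1} &= \beta_1 + \beta_3 x_2
  \quad \text{(slope of } x_1 \text{ at a given value of } x_2\text{)} \\[4pt]
\gamma_3 &= \beta_3 \\
\gamma_1 &= \beta_1 + \beta_3 \bar{x}_2 \\
\gamma_2 &= \beta_2 + \beta_3 \bar{x}_1 \\
\gamma_0 &= \beta_0 + \beta_1 \bar{x}_1 + \beta_2 \bar{x}_2 + \beta_3 \bar{x}_1 \bar{x}_2
\end{align*}
```

Expanding the centered model and matching terms gives the last four lines. So β1 is the slope of x1 at x2 = 0, and γ1 is the same slope function evaluated at the mean of x2; both describe the same fitted surface.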

But the problem is that you have to look at multiple places. So you can't summarize this plane by saying that the effect of x1 is this line. You have to show multiple lines. So it doesn't really matter which of these lines you show in your regression table. And that's the problem.

So you have to do these interaction plots. You show multiple lines to show that the slope of x1 depends on the value of x2. Which of these lines we show in the regression table is arbitrary, and it doesn't really matter, because we have to present these kinds of plots anyway.
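The slopes that go into such an interaction plot can be computed from either model, and they agree. This is a sketch on simulated data; `fit_ols` is a hypothetical helper, and the x2 values 0, the mean, and 5 are chosen to mirror the lines discussed in the lecture.

```python
import numpy as np

def fit_ols(y, *predictors):
    """Hypothetical helper: least-squares coefficients [intercept, b1, b2, ...]."""
    X = np.column_stack([np.ones(len(y)), *predictors])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef

rng = np.random.default_rng(0)  # arbitrary seed
n = 500
x1 = rng.normal(2.0, 1.0, n)
x2 = rng.normal(2.0, 1.0, n)
y = 1.0 + 0.5 * x1 + 0.3 * x2 + 0.4 * x1 * x2 + rng.normal(0.0, 1.0, n)

m1, m2 = x1.mean(), x2.mean()
x1c, x2c = x1 - m1, x2 - m2
b_raw = fit_ols(y, x1, x2, x1 * x2)
b_cen = fit_ols(y, x1c, x2c, x1c * x2c)

# Values of x2 (original metric) at which to report the slope of x1.
x2_values = np.array([0.0, m2, 5.0])

# Simple slope of x1 at a given x2: b1 + b3 * x2.
slopes_raw = b_raw[1] + b_raw[3] * x2_values
# For the centered model, x2 must first be shifted by its mean.
slopes_cen = b_cen[1] + b_cen[3] * (x2_values - m2)

for v, s_raw, s_cen in zip(x2_values, slopes_raw, slopes_cen):
    print(f"x2 = {v:.2f}: slope of x1 = {s_raw:.3f} (raw), {s_cen:.3f} (centered)")
```

The regression table of the non-centered model reports the first of these slopes and the centered model reports the second, but the full set of lines is the same either way.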

So whether the effect of x1 that we show here is the blue, green, or red line doesn't really make a difference. We have to show all the lines anyway. The problem with centering is that once we center our variables, the predicted values of Y in the interaction plot will be incorrect by the amount that we centered the data.

So we can no longer use the predictions directly; we have to convert them back to the non-centered metric for them to make sense. So centering is not useful, because it doesn't do anything for the interpretation. You will have to interpret the results with this kind of plot anyway, and centering is harmful for this plot, because you have to convert your variables back to the original metric to get the predictions correct, which makes forming the plot more difficult. Because of these considerations, my recommendation is to never center your data. It's not useful and it is harmful.


TU-L0022 - Statistical Research Methods D, Lecture, 2.11.2021-6.4.2022
