TU-L0022 - Statistical Research Methods D, Lecture, 2.11.2021-6.4.2022
Centering and collinearity of interactions (9:33)
This video covers a common method to address multicollinearity. Centering is often done before forming the interaction term between independent variables and affects the intercept. However, centering doesn't solve potential interaction term issues and it can lead to incorrect interpretations and predictions.
It's fairly common that authors center their variables
before they form the interaction term. In this video I will take a look
at whether that is actually necessary and whether you should do that or
not. In Heckman's paper, the authors argued that they did center the
variables to reduce the multicollinearity. The idea of centering and
multicollinearity is that if you have X and M, and then you form a
product of X and M, then the product will be correlated with both X and M
because those two variables form the interaction, and by centering we
can reduce those correlations. So
let's take a look at some data, and we have two random numbers here x1
and x2. Here the x1 and x2 have means of two, and here we have centered
the variables x1 and x2 to have means of 0. So the idea of centering is
that you take the original variable and then you subtract the mean, and
that will make the mean of the variable to be 0, and we say that the
variable is centered. The X with the bar symbol over it is
the mean of that variable. And
standardization is centering and then dividing by the standard deviation. We can
see here that x1 and x2 are not very strongly correlated, so
the scatterplot shows no particular pattern. But when we multiply
x1 and x2 together, then that product is highly correlated with x1 and
x2. So there's a strong statistical relationship. When we center the
variables, the bivariate relationship here stays the same. But
we can see that the relationship between x1 and x2 and their product is
quite different. There's still a strong statistical relationship, so
when x1 or x2 is close to 0, there is almost no variation in the product, and
the product spreads out as x1 and x2 move away from 0. So there's still a strong
statistical association but it is no longer a linear association. So
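The pattern described above can be checked with a small simulation. This is only a sketch under assumptions that are not in the lecture: x1 and x2 are drawn as independent normal variables with mean 2 and standard deviation 1, the seed and sample size are arbitrary, and the helper names `pearson` and `center` are made up for the illustration.

```python
import random


def pearson(a, b):
    """Pearson correlation of two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((u - mean_a) * (v - mean_b) for u, v in zip(a, b))
    var_a = sum((u - mean_a) ** 2 for u in a)
    var_b = sum((v - mean_b) ** 2 for v in b)
    return cov / (var_a * var_b) ** 0.5


def center(values):
    """Subtract the sample mean so that the result has mean 0."""
    mean = sum(values) / len(values)
    return [v - mean for v in values]


rng = random.Random(1)  # arbitrary seed, only for reproducibility
n = 5000                # assumed sample size

# Assumption: independent normal variables with mean 2 and SD 1.
x1 = [rng.gauss(2, 1) for _ in range(n)]
x2 = [rng.gauss(2, 1) for _ in range(n)]
raw_product = [a * b for a, b in zip(x1, x2)]

c1, c2 = center(x1), center(x2)
centered_product = [a * b for a, b in zip(c1, c2)]

# Linear correlation between x1 and the product, before and after centering.
r_raw = pearson(x1, raw_product)
r_centered = pearson(c1, centered_product)

# The dependence does not vanish: the size of the centered product still
# grows with the distance of x1 from 0.
r_spread = pearson([abs(v) for v in c1], [abs(v) for v in centered_product])

print(f"raw x1 vs product:            {r_raw:.2f}")
print(f"centered x1 vs product:       {r_centered:.2f}")
print(f"|centered x1| vs |product|:   {r_spread:.2f}")
```

Under these assumptions the first correlation is clearly positive, the second is close to zero, and the third is clearly positive again, which is the "strong but no longer linear" association.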
what's the implication for regression analysis with this centering
stuff? On the left-hand side, we have the regression
analysis for the data that is not centered, and on the right-hand side,
we have the regression analysis for the centered data. And we can see the
difference: what centering does for the regression of Y on x1 and x2 is
that it just changes the intercept. So
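Why only the intercept moves can be written out in one line of algebra. The symbols b_0, b_1 and b_2 for the coefficients of the uncentered model are notation assumed here, not taken from the slides.

```latex
\begin{align*}
y &= b_0 + b_1 x_1 + b_2 x_2 \\
  &= b_0 + b_1\bigl[(x_1 - \bar{x}_1) + \bar{x}_1\bigr]
         + b_2\bigl[(x_2 - \bar{x}_2) + \bar{x}_2\bigr] \\
  &= \underbrace{\bigl(b_0 + b_1\bar{x}_1 + b_2\bar{x}_2\bigr)}_{\text{intercept of the centered model}}
     + b_1\,(x_1 - \bar{x}_1) + b_2\,(x_2 - \bar{x}_2)
\end{align*}
```

The slopes b_1 and b_2 appear unchanged in the last line; the means are absorbed into the constant.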
only the intercept is different and the first-order effects of x1 and
x2 are the same. Which is quite natural because when you center, you're
simply subtracting something from x1 and something from x2, and because
you subtract the same number for every observation, that will only alter
the intercept. It doesn't affect the covariance of x1 with x2 and the
covariance between those two variables and Y. Those are unaffected by
centering. So centering will only affect means, and in normal regression
analysis, it only affects the intercept. The
downside of centering is that once we calculate predictions, here
the predictions for this model are on the original metric so we will get
predictions on whatever the Y is, and if we calculate predictions using
this model, then the predictions will be off by the amount that we
centered. So for example if we're predicting salary, and let's say this
model would give 10,000 euros per year, then this model could give minus
2000 euros. Which doesn't make sense unless we back convert or back
translate that effect to the non-centered variables. So
centering makes predictions, and plots that rely on predictions, more
difficult. And that's important for interactions for reasons that I'll
explain in the last slide. When we take an interaction term we can see
now that there are some more differences. Importantly the differences
are only in the first three coefficients. So
the intercept again is different, which is expected, and now the x1 and x2
coefficients are different too, but the coefficient of the interaction of x1 and x2 is the
exact same number. So the centering actually doesn't influence the
interaction term at all. It influences only the first-order
coefficients. So is that something that you want to do or not? To answer
that question we have to consider what exactly the centering means, and
what exactly it means that we have this interaction term here. Let's
take a look at a graph. So here, without centering, the x1 and x2 effects are the
effects when the other variable is 0, and here, with centering, the x1 and x2 effects
are the effects at the mean. So when x1
and x2 are at their means, then that's what the x1 and x2 effects are.
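The same point can be derived by substituting the centered variables into the interaction model. The notation is assumed for this sketch: b_0 to b_3 are the coefficients of the uncentered model, and a superscript c marks a centered variable, so x_1 = x_1^c + \bar{x}_1 and x_2 = x_2^c + \bar{x}_2.

```latex
\begin{align*}
y &= b_0 + b_1 x_1 + b_2 x_2 + b_3 x_1 x_2 \\
  &= b_0 + b_1\,(x_1^c + \bar{x}_1) + b_2\,(x_2^c + \bar{x}_2)
         + b_3\,(x_1^c + \bar{x}_1)(x_2^c + \bar{x}_2) \\
  &= \bigl(b_0 + b_1\bar{x}_1 + b_2\bar{x}_2 + b_3\bar{x}_1\bar{x}_2\bigr)
     + \bigl(b_1 + b_3\bar{x}_2\bigr)\,x_1^c
     + \bigl(b_2 + b_3\bar{x}_1\bigr)\,x_2^c
     + b_3\,x_1^c x_2^c
\end{align*}
```

The interaction coefficient b_3 is untouched. The coefficient of the centered x1 is b_1 + b_3 \bar{x}_2, which is the slope of x1 evaluated at the mean of x2 instead of at x2 = 0, and likewise for x2.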
What that means can be understood by looking at this graphically. So we
have here a space and there is a plane in the space. Here we have x1 on
this axis, we have x2 on this axis and then we have Y here. So
when we have two coefficients, or two variables, in a regression
analysis as two independent variables, then the regression is a plane in
three-dimensional space. And we can see the plane here. And because of
the interaction, the strength of the effect of x1 on Y is
contingent on the value of x2. So here when x2 is at 0, then Y increases
only a little with x1, so the effect is not that great. When
x2 is at 5, the effect is a lot greater, so we see a
lot steeper slope here. So the idea is that the regression slope of
x1 changes as a function of x2. Also the intercept changes so this line
goes down here. So what does centering do? Normally, the regression with an
interaction gives you the effect of x1, the effect of x2 and the effect of their
product. When we don't center our
data, the effect of x1 is this blue line here, so it's the effect of x1
when x2 is 0. Similarly, the effect of x2 is the effect of x2 when x1 is
0. When we center, instead of taking the effect of x1 when x2 is 0, we
take the effect of x1 when x2 is at its mean. So we take this green line
in the middle. So the centering just influences which of these possible
lines we take: the one from here, from here or perhaps all the way from
the other end of the data. So it just changes which part of the
regression plane we are looking at. But
the problem is that you have to look at multiple places. So you can't
summarize this plane by saying that the effect of x1 is this line. You
have to show multiple lines. So it doesn't really matter which of these
lines you show in your regression table. And that's the problem. So
you have to do these interaction plots. So you have to show multiple
plots, you show that the slope of x1 depends on the value of x2. And
which of these lines we show in the regression table is
arbitrary, so it doesn't really matter because we have to present these
kinds of plots anyway. So what we
show here, whether we have the effect of x1 here to be the blue, green
or red line, doesn't really make a difference. We have to show all the
lines anyway. The problem with centering is that once we center our
variables, then in the interaction plot, the predicted
values of Y will be incorrect by the amount that we centered the data. So
we can no longer do predictions usefully; we have to convert the
predictions back to the non-centered metric for them to make sense. So
centering is not useful because it doesn't do anything for the
interpretation. You will have to interpret the results with this kind of
plot anyway, and centering will be harmful for this plot because it
makes forming these plots more difficult because you have to back
convert your variables to the original metric to get the predictions
correct. So because of these considerations, my recommendation is: never
center your data. It's not useful and it is harmful.
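The argument can be replayed on simulated data. What follows is a sketch, not the analysis shown in the video: the data-generating coefficients, the seed, the sample size and the helper names `ols` and `predict` are all assumptions made for the illustration, and least squares is solved by hand through the normal equations only to keep the example free of external libraries.

```python
import random


def ols(predictors, y):
    """Least-squares coefficients [intercept, b1, ...] via the normal equations."""
    rows = [[1.0, *vals] for vals in zip(*predictors)]
    k = len(rows[0])
    # Augmented matrix [X'X | X'y].
    m = [[sum(r[i] * r[j] for r in rows) for j in range(k)]
         + [sum(r[i] * yi for r, yi in zip(rows, y))] for i in range(k)]
    for col in range(k):  # Gauss-Jordan elimination with partial pivoting
        pivot = max(range(col, k), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(k):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [a - factor * b for a, b in zip(m[r], m[col])]
    return [m[i][k] / m[i][i] for i in range(k)]


def predict(coef, a, b):
    """Prediction from an interaction model with coefficients [b0, b1, b2, b3]."""
    return coef[0] + coef[1] * a + coef[2] * b + coef[3] * a * b


rng = random.Random(7)  # arbitrary seed
n = 400                 # assumed sample size
x1 = [rng.gauss(2, 1) for _ in range(n)]
x2 = [rng.gauss(2, 1) for _ in range(n)]
# Assumed data-generating model, chosen only for the illustration.
y = [1 + 0.5 * a + 0.3 * b + 0.8 * a * b + rng.gauss(0, 1)
     for a, b in zip(x1, x2)]

m1, m2 = sum(x1) / n, sum(x2) / n
c1 = [a - m1 for a in x1]
c2 = [b - m2 for b in x2]

raw = ols([x1, x2, [a * b for a, b in zip(x1, x2)]], y)
cen = ols([c1, c2, [a * b for a, b in zip(c1, c2)]], y)

# Prediction for a new case given on the original metric.
new_x1, new_x2 = 3.0, 1.0
pred_raw = predict(raw, new_x1, new_x2)
pred_cen_converted = predict(cen, new_x1 - m1, new_x2 - m2)  # inputs converted
pred_cen_naive = predict(cen, new_x1, new_x2)                # inputs not converted

# Simple slopes of x1 at a few values of x2 (the lines of an interaction plot).
x2_values = [0.0, m2, 4.0]
slopes_raw = [raw[1] + raw[3] * v for v in x2_values]
slopes_cen = [cen[1] + cen[3] * (v - m2) for v in x2_values]

print("interaction coefficient:", raw[3], cen[3])
print("x1 coefficient:         ", raw[1], cen[1])
print("predictions:            ", pred_raw, pred_cen_converted, pred_cen_naive)
print("simple slopes of x1:    ", slopes_raw, slopes_cen)
```

In this sketch both models describe the same regression plane: the interaction coefficient and the simple slopes agree, and the predictions agree once the new values are converted with the same means. Plugging original-metric values straight into the centered model gives a different number, which is the extra bookkeeping that centering introduces.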