TU-L0022 - Statistical Research Methods D, Lecture, 25.10.2022-29.3.2023
According to the course settings, the course ended on 29.03.2023.
Suppression in regression (7:10)
This video explains the suppression effect in regression. It explains why the regression coefficient between two variables can have the opposite sign of their correlation coefficient.
In this video I will explain the suppression effect in regression analysis. The suppression effect is a term used for a feature of regression analysis. You don't actually have to understand what the term suppression means, but you have to understand why certain results sometimes occur in regression analysis. You will basically need the term suppression only if, for example, a reviewer argues that you should explain suppression. I don't think there is any valid reason to discuss suppression in an empirical paper unless a reviewer asks you to do so.
Let's take a look at Hekman's paper, because they mention suppression. They explain that physician age has a different sign in their correlation table and in their regression table. In the correlation table the correlation between physician age and patient satisfaction is positive; in the regression results the coefficient is negative. And that is the suppression effect.
The technical definition is unimportant here. Then they explain that these variables may somehow be suppressing the variance of the dependent variable that is irrelevant to its prediction. I don't understand what that means, so to me it doesn't really have any literal meaning. Then they cite a textbook on statistical analysis that presumably explains what they mean. Unfortunately that is a big book and they don't give a page number, so we can't really meaningfully check what that book says about suppression.
So whenever you explain something and then give the reader a book to read, at least give the reader some indication of which chapter or which page of that book explains the fact that you're referring to. Otherwise you are making your hundreds or thousands of readers browse through the book and waste their time looking for a fact whose location you already know, because you wouldn't be citing the book unless you had read it.
Then they explain that the correlation is not statistically significant, that they tried different models and the results were unchanged, and they conclude that suppression is not a problem. I agree with the conclusion that suppression is not a problem, but not for the reasons that they give. The suppression effect is not something that is problematic when it occurs. It is a feature of regression analysis.
So, let us take a look at their actual statistics. What are the numbers that they refer to? They identified that the correlation between physician age and patient satisfaction is positive and the corresponding regression coefficient is negative.
So, why could that be the case? We have to remember that a correlation and a regression coefficient quantify different things. A regression coefficient ideally quantifies a causal relationship, under certain assumptions. A correlation coefficient quantifies a linear association, which could be causal or could be spurious. It's very simple to see here why physician age is correlated positively with satisfaction while its regression coefficient is negative.
We just need to look at the correlation table. So let's take a look at it. We first look at which variables are highly correlated with age. Well, it's tenure: tenure is very highly correlated with age. Then we look at the regression coefficient of tenure here, and it's strongly positive. So the more experience you have, the more satisfied your patients are. Experience also correlates with age, which is quite natural, because if you are, say, a 25-year-old newly graduated medical doctor, you can't have much experience. If you are someone with 30 years of work experience as a doctor, you must be more than 50, because normally you are more than 20 when you graduate from medical school. So age and tenure, that is, age and work experience, naturally correlate very highly.
So what's going on? Remember that the linear model implies a correlation matrix. So, what is the implied correlation between age and patient satisfaction, based on the correlation between age and tenure and the effects of tenure and age? We go from age to patient satisfaction: we take the direct path once, -.13, and we take the correlation path, .69 times .34. Doing the math, -.13 plus .69 times .34 is about .10, so the implied correlation based on this part of the model only is 0.1, which is very close to the 0.09 in the correlation table, a positive correlation.
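The path-tracing arithmetic above can be checked with a few lines of Python. This is only a sketch of the calculation read out in the lecture. It assumes the coefficients are standardized, and it assumes that .69 is the age-tenure correlation and .34 is the coefficient of tenure (the product is the same either way). The variable names are made up for illustration.

```python
# Values as read out in the lecture; the role assigned to each number
# is an assumption (see the note above).
beta_age = -0.13      # regression coefficient of physician age (direct path)
beta_tenure = 0.34    # regression coefficient of tenure
r_age_tenure = 0.69   # correlation between age and tenure

# Direct path: age -> satisfaction
direct = beta_age

# Indirect (spurious) path: age <-> tenure -> satisfaction
spurious = r_age_tenure * beta_tenure

# Implied correlation between age and satisfaction from this part of the model
implied_r = direct + spurious

print(f"direct path:         {direct:+.4f}")
print(f"spurious path:       {spurious:+.4f}")
print(f"implied correlation: {implied_r:+.4f}")
```

The direct path is negative, but the spurious path through tenure is positive and larger in magnitude, so the sum is positive.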
So why is there a different sign? It is pretty straightforward and we have a natural explanation. When a regression coefficient quantifies the effect of one variable, the other variables are held constant. So the regression coefficient tells us that when you have two physicians with equal work experience, people tend to prefer the younger one. That is natural. But people also tend to prefer doctors with more experience, older doctors tend to have more experience, and experience is the variable that matters more than age. So the correlation here, 0.09, reflects the effect of age itself, which is negative based on this model, and a spurious effect that arises because doctors who are more experienced are also older and receive better scores. So this correlation is the sum of the spurious effect and the direct effect, and in this case the spurious effect, due to the correlation with tenure and the strong effect of tenure, is a lot stronger than the direct effect of age. Therefore we get a positive correlation.
So that is how regression analysis works: it takes a correlation and tries to identify how much of that correlation is spurious and how much of it corresponds to a causal relationship. Sometimes the spurious part is a lot larger than the actual causal effect, and that can cause the regression coefficient to have a different sign than the correlation coefficient. It is not a problem; it is how regression analysis works.
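The sign flip can also be reproduced on simulated data. The following is a minimal sketch, not an analysis of the paper's data. It assumes standardized age and tenure with correlation .69, the coefficients -.13 and .34 from the lecture, and a noise standard deviation of 1, which is an arbitrary choice. The `pearson` helper and all variable names are hypothetical.

```python
import random
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equally long lists of numbers."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


random.seed(0)
n = 50_000
rho = 0.69  # assumed correlation between age and tenure

# Simulated standardized predictors: age is built to correlate with tenure
tenure = [random.gauss(0, 1) for _ in range(n)]
age = [rho * t + sqrt(1 - rho**2) * random.gauss(0, 1) for t in tenure]

# Assumed data-generating model: negative effect of age, positive effect
# of tenure, plus noise with an arbitrary standard deviation of 1
satisfaction = [
    -0.13 * a + 0.34 * t + random.gauss(0, 1) for a, t in zip(age, tenure)
]

r_as = pearson(age, satisfaction)     # age vs. satisfaction
r_ts = pearson(tenure, satisfaction)  # tenure vs. satisfaction
r_at = pearson(age, tenure)           # age vs. tenure

# Standardized regression coefficient of age in a two-predictor regression,
# computed from the three correlations
beta_age = (r_as - r_at * r_ts) / (1 - r_at**2)

print(f"correlation(age, satisfaction):       {r_as:+.3f}")
print(f"standardized regression coef. of age: {beta_age:+.3f}")
```

Under these assumptions the correlation between age and satisfaction comes out positive, while the regression coefficient of age, which holds tenure constant, comes out negative.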