TU-L0022 - Statistical Research Methods D, Lecture, 2.11.2021-6.4.2022
Cautionary example of regression analysis (8:23)
The video explains a stepwise regression, which is a data-mining
technique, and why this technique is typically a bad idea for
researchers.
Transcript
Let's take a look at a regression analysis that I got from my former
colleague Pasi Kuusela. Pasi is a smart guy and an ambitious researcher,
and he wanted to address the research question of what really determines
company performance. He was studying return on assets as his dependent
variable. His sample size was fairly small, only a hundred observations,
but to compensate he had a very rich data set of 50 variables covering
all these important concepts. He decided to use regression analysis,
which is what we normally do for this kind of data, and he ran a
regression explaining return on assets with these 50 variables. Pasi's
initial results were quite impressive: he got an R-squared of more than
50%. It was like: "Great, we are explaining more than half of the
variation in company performance. If we can do that with just 50
variables, that is probably worth writing a paper about."
But there is a small problem with Pasi's analysis: with 50 variables,
the model is too complicated. You can't really sell that as a coherent
theory to a major journal. So what Pasi did then was to think about how
to make the model more parsimonious: what if we trimmed it down to only
those variables whose p-values are less than 0.05, which are highlighted
here? That would reduce the size of the model and make it more
publishable. Pasi chose those variables and redid his analysis. The
results were still impressive: an R-squared of more than 40% with just
a dozen or so variables.
But still, if you say that performance depends on a dozen different
factors, that's not as cool, and not as impressive, as saying it depends
on, for example, only five or six factors. So he decided to focus only
on those variables that were statistically significant at the p < 0.10
level, which some people call the threshold for marginal significance.
That allowed him to drop five more variables from the model, leaving
about ten. He reran the analysis with the smaller set of variables and
got an R-squared slightly below 40%. Still impressive: if you can
explain almost half of performance with about ten variables, that is
great. And then Pasi proceeded to write this up for publication.
So what's the problem here? Pasi didn't actually have any empirical
data, and he didn't really write a paper about this. Instead, this was a
demonstration for his class of how regression analysis can fool you if
you do data mining. His data were actually just random noise, generated
with this specific Stata code. He had 5100 independent random draws from
a normal distribution (100 observations times 51 variables), which means
that his data were purely noise. There were no statistical relationships
in the population and no causal effects whatsoever underlying the data.
He simply generated random data that are uncorrelated in the population,
gave the variables fancy names, ran a regression analysis, and got a 40%
R-squared. No underlying structure whatsoever.
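The original demonstration used Stata, and the exact Stata code is not reproduced here; the same idea can be sketched in Python with NumPy (all names below are my own, chosen for the illustration): generate 100 observations of 51 independent standard-normal variables, treat one as the outcome and the other 50 as predictors, and fit an ordinary least squares regression to pure noise.

```python
# Pure-noise regression: 100 observations, 50 predictors, no true effects.
# Despite that, the in-sample R-squared comes out around 0.5.
import numpy as np

rng = np.random.default_rng(42)
n, k = 100, 50
X = rng.standard_normal((n, k))   # 50 "predictors": pure noise
y = rng.standard_normal(n)        # "return on assets": also pure noise

# Ordinary least squares with an intercept
Xc = np.column_stack([np.ones(n), X])
beta, *_ = np.linalg.lstsq(Xc, y, rcond=None)
resid = y - Xc @ beta
r2 = 1 - (resid @ resid) / ((y - y.mean()) @ (y - y.mean()))
print(f"R-squared on pure noise: {r2:.2f}")  # typically around 0.5
```

With k predictors and n observations of independent noise, the expected in-sample R-squared is roughly k / (n - 1), which is about 0.5 here, matching the lecture's result.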
This analysis illustrates two problems with regression analysis. One is
that if your sample size is small compared to the number of variables
(Pasi had just two observations for each independent variable), then the
R-squared measure will be inflated. On the first slide, if you go back,
you can see that the R-squared is actually not statistically
significant, and the adjusted R-squared is pretty close to zero. That
indicates that the variables don't really explain the dependent
variable. But because R-squared is positively biased, it appears as if
you were explaining the dependent variable, which you are in the sample,
but it doesn't generalize to the population. That is the first problem.
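The adjusted R-squared corrects for exactly this bias by penalizing the number of predictors. A quick check with the numbers from the demonstration (n = 100 observations, k = 50 predictors, in-sample R-squared of about 0.50):

```python
# Adjusted R-squared penalizes model size: with 50 predictors and only
# 100 observations, an in-sample R-squared of 0.50 adjusts to roughly zero.
def adjusted_r2(r2: float, n: int, k: int) -> float:
    """n = observations, k = predictors (excluding the intercept)."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

print(adjusted_r2(0.50, 100, 50))  # about -0.01: no real explanatory power
```

A slightly negative adjusted R-squared is the formula's way of saying the model explains nothing beyond what 50 noise variables would explain by chance.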
The second problem is that if you choose your variables based on what
the results are, then all your statistical tests become invalid.
Remember that the p-value tells you the expected false positive rate
when there is no effect in the population. Here, among the initial 50
variables, a handful (5% of 50, so two or three on average) are expected
to be false positives. If you then pick exactly those variables, you
will always have statistically significant results even though there is
nothing going on in the population. So your tests will be biased.
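The expected false positive count follows directly from the fact that, under the null hypothesis, p-values are uniformly distributed on (0, 1). A small Monte Carlo check of that arithmetic (simulating the p-values directly rather than running 10,000 regressions):

```python
# Under the null, p-values are uniform on (0, 1), so out of 50 true-null
# tests we expect 50 * 0.05 = 2.5 to fall below the 0.05 threshold.
import random

random.seed(1)
trials = 10_000
false_positives = [
    sum(random.random() < 0.05 for _ in range(50))  # 50 null tests
    for _ in range(trials)
]
avg = sum(false_positives) / trials
print(f"average false positives per run: {avg:.2f}")  # close to 2.5
```

Selecting only those two or three variables and re-reporting their p-values is what invalidates the tests: conditional on being selected, they are significant by construction.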
This technique that Pasi applied is called stepwise regression analysis.
There are different strategies for how to do it, but basically you
either run a big model first and then let the computer trim it using
some kind of decision rule, or you start with an empty model with no
independent variables and a large pool of potential variables and let
the computer choose which variables go into the model. Either way, the
objective is to explain the variation of the dependent variable.
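The first strategy, backward elimination, can be sketched as follows. This is an illustration of the general idea, not the specific procedure any particular software package uses; it drops the predictor with the smallest absolute t-statistic and refits until every survivor has |t| above roughly 2 (about p < 0.05), run here on the same kind of pure-noise data as in the demonstration:

```python
# Backward elimination on pure-noise data: start from the full model and
# repeatedly drop the predictor with the smallest |t|-statistic until every
# remaining |t| exceeds ~2. This is the automated trimming warned against.
import numpy as np

def t_stats(X, y):
    """OLS t-statistics for each column of X (intercept added internally)."""
    n = len(y)
    Xc = np.column_stack([np.ones(n), X])
    beta, *_ = np.linalg.lstsq(Xc, y, rcond=None)
    resid = y - Xc @ beta
    s2 = resid @ resid / (n - Xc.shape[1])      # residual variance
    cov = s2 * np.linalg.inv(Xc.T @ Xc)         # coefficient covariance
    se = np.sqrt(np.diag(cov))
    return (beta / se)[1:]                      # skip the intercept

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 50))   # 50 noise predictors
y = rng.standard_normal(100)         # noise outcome

kept = list(range(50))
while kept:
    t = t_stats(X[:, kept], y)
    weakest = int(np.argmin(np.abs(t)))
    if abs(t[weakest]) >= 2.0:       # everything left looks "significant"
        break
    kept.pop(weakest)                # drop the weakest predictor and refit

print(f"{len(kept)} noise variables survive with |t| >= 2")
```

By construction, every variable the loop keeps looks statistically significant, even though all 50 candidates are pure noise, which is exactly why the resulting p-values cannot be taken at face value.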
Wooldridge explains that the significance tests will be invalid if you
apply this technique, but he doesn't really caution you against it.
There are others, including me, who take a much stronger stance against
stepwise regression analysis. Allison, in his regression book, refers to
this as automated variable selection and takes a negative stance: this
is not something that you want to do. He doesn't really explain why.
Then Kline's book on structural equation modeling provides the most
pointed comments on this technique. He says that stepwise regression is
something that a computer does for you, and a computer is not very good
at generating models. So instead of looking at what the computer does
for you, you should be using the best research computer in the world,
your own brain, to choose what goes into your model. He concludes with
"death to stepwise regression, think for yourself." And I think that's a
pretty good recommendation, because the problem with choosing variables
automatically is that you will be capitalizing on chance and including
variables that don't have a theoretical grounding. If you just build
your model from the data, you will find a large number of false
positives, chance explanations, or at least inflated effects. And
theorizing afterwards and presenting the result as if it were the model
you had initially planned is very unethical, because it doesn't really
tell the reader what you wanted to do.
If you try to publish something with a stepwise regression analysis in a
good journal, you're likely to be desk rejected, because using stepwise
regression is such a bad idea. And if you do a stepwise regression but
present the results as if you had chosen the variables yourself, without
telling anyone that you used stepwise regression analysis, that is
lying, and that's unethical.