TU-L0022 - Statistical Research Methods D, Lecture, 2.11.2021-6.4.2022
Cautionary example of regression analysis (8:23)
The video explains a stepwise regression, which is a data-mining
technique, and why this technique is typically a bad idea for
researchers.
Transcript
Let's take a look at a regression analysis that I got from my former
colleague Pasi Kuusela. Pasi is a smart guy and an ambitious researcher,
and he wanted to address the research question of what really determines
company performance. He was studying return on assets as his dependent
variable. His sample size was fairly small, only a hundred observations,
but to compensate he had a very rich data set of 50 variables covering
all these important concepts. He decided to use regression analysis,
which is what we normally do for this kind of data, and he ran a
regression explaining return on assets with these 50 variables. Pasi's
initial results were quite impressive: he got an R-squared of more than
50%. It was like: "Great, we are explaining more than half of the
variation in company performance. If we can do that with just 50
variables, that is probably worth writing a paper about."
But there is a small problem with Pasi's analysis: with 50 variables,
the model is too complicated. You can't really sell that as a coherent
theory to a major journal. So what Pasi did then was to think about how
to make the model more parsimonious: what if we trimmed it down to only
those variables whose p-values are less than 0.05, which are highlighted
here? That would reduce the size of the model and make it more
publishable. Pasi chose those variables and redid his analysis. The
results were still impressive: an R-squared of more than 40% with just
a dozen or so variables.
But still, if you say that performance depends on a dozen different
factors, that's not as cool, and not as impressive, as saying it depends
on, for example, only five or six factors. So he decided to focus only
on those variables that were statistically significant at the p < 0.10
level, which some people call the threshold for marginal significance.
That allowed him to drop five more variables from the model, leaving
about ten. He reran the analysis with the smaller set of variables and
got an R-squared slightly below 40%. Still impressive: if you can
explain almost half of performance with about ten variables, that is
great. And then Pasi proceeded to write this up for publication.
So what's the problem here? Pasi didn't actually have any empirical
data, and he didn't really write a paper about this. Instead, this was a
demonstration for his class of how regression analysis can fool you if
you do data mining. His data were actually just random noise, generated
with this specific Stata code. He had 5100 independent random draws from
a normal distribution (100 observations times 51 variables), which means
that his data were purely noise. There were no statistical relationships
in the population and no causal effects whatsoever underlying the data.
He simply generated random data that are uncorrelated in the population,
gave the variables fancy names, ran a regression analysis, and got a 40%
R-squared. No underlying structure whatsoever.
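The original demonstration used Stata, and the exact Stata code is not reproduced here; the same idea can be sketched in Python with NumPy (all names below are my own, chosen for the illustration): generate 100 observations of 51 independent standard-normal variables, treat one as the outcome and the other 50 as predictors, and fit an ordinary least squares regression to pure noise.

```python
# Pure-noise regression: 100 observations, 50 predictors, no true effects.
# Despite that, the in-sample R-squared comes out around 0.5.
import numpy as np

rng = np.random.default_rng(42)
n, k = 100, 50
X = rng.standard_normal((n, k))   # 50 "predictors": pure noise
y = rng.standard_normal(n)        # "return on assets": also pure noise

# Ordinary least squares with an intercept
Xc = np.column_stack([np.ones(n), X])
beta, *_ = np.linalg.lstsq(Xc, y, rcond=None)
resid = y - Xc @ beta
r2 = 1 - (resid @ resid) / ((y - y.mean()) @ (y - y.mean()))
print(f"R-squared on pure noise: {r2:.2f}")  # typically around 0.5
```

With k predictors and n observations of independent noise, the expected in-sample R-squared is roughly k / (n - 1), which is about 0.5 here, matching the lecture's result.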
This analysis illustrates two problems with regression analysis. One is
that if your sample size is small compared to the number of variables
(Pasi had just two observations for each independent variable), then the
R-squared measure will be inflated. On the first slide, if you go back,
you can see that the R-squared is actually not statistically
significant, and the adjusted R-squared is pretty close to zero. That
indicates that the variables don't really explain the dependent
variable. But because R-squared is positively biased, it appears as if
you were explaining the dependent variable, which you are in the sample,
but it doesn't generalize to the population. That is the first problem.
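The adjusted R-squared corrects for exactly this bias by penalizing the number of predictors. A quick check with the numbers from the demonstration (n = 100 observations, k = 50 predictors, in-sample R-squared of about 0.50):

```python
# Adjusted R-squared penalizes model size: with 50 predictors and only
# 100 observations, an in-sample R-squared of 0.50 adjusts to roughly zero.
def adjusted_r2(r2: float, n: int, k: int) -> float:
    """n = observations, k = predictors (excluding the intercept)."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

print(adjusted_r2(0.50, 100, 50))  # about -0.01: no real explanatory power
```

A slightly negative adjusted R-squared is the formula's way of saying the model explains nothing beyond what 50 noise variables would explain by chance.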
The second problem is that if you choose your variables based on what
the results are, then all your statistical tests become invalid.
Remember that the p-value tells you the expected false positive rate
when there is no effect in the population. Here, among the initial 50
variables, a handful (5% of 50, so two or three on average) are expected
to be false positives. If you then pick exactly those variables, you
will always have statistically significant results even though there is
nothing going on in the population. So your tests will be biased.
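The expected false positive count follows directly from the fact that, under the null hypothesis, p-values are uniformly distributed on (0, 1). A small Monte Carlo check of that arithmetic (simulating the p-values directly rather than running 10,000 regressions):

```python
# Under the null, p-values are uniform on (0, 1), so out of 50 true-null
# tests we expect 50 * 0.05 = 2.5 to fall below the 0.05 threshold.
import random

random.seed(1)
trials = 10_000
false_positives = [
    sum(random.random() < 0.05 for _ in range(50))  # 50 null tests
    for _ in range(trials)
]
avg = sum(false_positives) / trials
print(f"average false positives per run: {avg:.2f}")  # close to 2.5
```

Selecting only those two or three variables and re-reporting their p-values is what invalidates the tests: conditional on being selected, they are significant by construction.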
This technique that Pasi applied is called stepwise regression analysis.
There are different strategies for how to do it, but basically you
either run a big model first and then let the computer trim it using
some kind of decision rule, or you start with an empty model with no
independent variables and a large pool of potential variables and let
the computer choose which variables go into the model. Either way, the
objective is to explain the variation of the dependent variable.
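The first strategy, backward elimination, can be sketched as follows. This is an illustration of the general idea, not the specific procedure any particular software package uses; it drops the predictor with the smallest absolute t-statistic and refits until every survivor has |t| above roughly 2 (about p < 0.05), run here on the same kind of pure-noise data as in the demonstration:

```python
# Backward elimination on pure-noise data: start from the full model and
# repeatedly drop the predictor with the smallest |t|-statistic until every
# remaining |t| exceeds ~2. This is the automated trimming warned against.
import numpy as np

def t_stats(X, y):
    """OLS t-statistics for each column of X (intercept added internally)."""
    n = len(y)
    Xc = np.column_stack([np.ones(n), X])
    beta, *_ = np.linalg.lstsq(Xc, y, rcond=None)
    resid = y - Xc @ beta
    s2 = resid @ resid / (n - Xc.shape[1])      # residual variance
    cov = s2 * np.linalg.inv(Xc.T @ Xc)         # coefficient covariance
    se = np.sqrt(np.diag(cov))
    return (beta / se)[1:]                      # skip the intercept

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 50))   # 50 noise predictors
y = rng.standard_normal(100)         # noise outcome

kept = list(range(50))
while kept:
    t = t_stats(X[:, kept], y)
    weakest = int(np.argmin(np.abs(t)))
    if abs(t[weakest]) >= 2.0:       # everything left looks "significant"
        break
    kept.pop(weakest)                # drop the weakest predictor and refit

print(f"{len(kept)} noise variables survive with |t| >= 2")
```

By construction, every variable the loop keeps looks statistically significant, even though all 50 candidates are pure noise, which is exactly why the resulting p-values cannot be taken at face value.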
Wooldridge explains that the significance tests will be invalid if you
apply this technique, but he doesn't really caution you against it.
There are others, including me, who take a much stronger stance against
stepwise regression analysis. Allison, in his regression book, refers to
this as automated variable selection and takes a negative stance: this
is not something that you want to do. He doesn't really explain why.
Then Kline's book on structural equation modeling provides the most
pointed comments on this technique. He says that stepwise regression is
something that a computer does for you, and a computer is not very good
at generating models. So instead of looking at what the computer does
for you, you should be using the best research computer in the world,
your own brain, to choose what goes into your model. He concludes with
"death to stepwise regression, think for yourself." And I think that's a
pretty good recommendation, because the problem with choosing variables
automatically is that you will be capitalizing on chance and including
variables that don't have a theoretical grounding. If you just build
your model from the data, you will find a large number of false
positives, chance explanations, or at least inflated effects. And
theorizing afterwards and presenting the result as if it were the model
you had initially planned is very unethical, because it doesn't really
tell the reader what you wanted to do.
If you try to publish something with a stepwise regression analysis in a
good journal, you're likely to be desk rejected, because using stepwise
regression is such a bad idea. And if you do a stepwise regression but
present the results as if you had chosen the variables yourself, without
telling anyone that you used stepwise regression analysis, that is
lying, and that's unethical.