TU-L0022 - Statistical Research Methods D, Lecture, 2.11.2021-6.4.2022
Focus on the basics (12:30)
This video explains the importance of understanding the basics of statistical research.
Before we proceed with the course, there's one thing that we need to get out of the way. Quite often when I start a course, the students ask me: "Mikko, I want to learn structural equation modeling. Can you teach me that?" And my answer to that question is something that people are typically not happy about. I say: "Yes. But let's get started with basic research design, regression analysis and factor analysis." Sometimes
the student responds by telling me that he or she does not want to
spend time on the basics but on learning structural equation modeling.
That is because they have heard that it is the best technique available
and they want to apply it in their research. To
understand this discussion, let's switch to a different context. Let's
assume that the student says that he or she wants to learn how to run a
marathon. How to run a marathon? You can't do that right away. You need
to start with jogging. Then the student responds: "I don't want to go jogging, teach me how to run a marathon." Anyone
who has done any running understands that you can't just go and run a
marathon. You need to build the foundation first, the basic
fitness to run shorter distances. If you can't run, you start by
jogging. And then after jogging, you move on to running and so on. You
need to build a foundation before you go to advanced things. This is
easy to understand, because you can simply try running a marathon without
going jogging first or doing the basic exercises to build the
foundation. It simply does not work. But with statistical analysis,
understanding that you're not actually doing something correctly, or not
actually doing something useful, is much more difficult. Let's
look at a structural equation model analysis, or SEM. If I have a high
school student or an early bachelor's student, I can teach them how to do
an SEM analysis in less than a day. I could even write them instructions
and point them to YouTube videos like this one that demonstrate, by point
and click, how you specify a structural equation model using
Stata and click a button to estimate it. Then I can also teach the student that if something called the CFI
is more than 0.9, they should declare that their model is good, and if
it's less than 0.9, that the model is bad. Then, for any interesting arrow here,
if the thing that's called a p-value is less than 0.05, they
declare that they have a finding, and if it's more than 0.05, they declare
that they don't have a finding. With this kind of algorithm, you can be
running SEM analyses in less than a day easily.
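To make the point concrete, here is a minimal sketch of the mechanical cutoff rule described above, written in Python. The function name, variable names and example numbers are hypothetical, not from the lecture; only the thresholds (CFI above 0.9, p-value below 0.05) restate the lecture's example. It shows how little judgment the rule actually requires.

    # Hypothetical sketch of the mechanical cutoff rule described above.
    # Only the fit index and p-values are checked; nothing else about the
    # analysis, the theory, or the model specification is considered.
    def naive_sem_verdict(cfi, p_values):
        verdict = {"model": "good" if cfi > 0.9 else "bad"}
        for path, p in p_values.items():
            verdict[path] = "supported" if p < 0.05 else "not supported"
        return verdict

    # Example with made-up numbers: the rule happily calls the model "good"
    # and one path "supported" without asking whether the model makes sense.
    print(naive_sem_verdict(0.93, {"A -> B": 0.03, "B -> C": 0.21}))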
But does that produce good research? Simply being able to run an analysis
does not mean that you can be a productive researcher with that
analysis. How you specify the model, that is, what the theory is behind these
boxes and arrows, how you justify the boxes and arrows, and how
you interpret the results, is much more difficult. And it's not like you
can just follow rules of the form: some statistic is over a
threshold, everything is okay; below the threshold, everything is bad. So
instead of focusing on how we run an analysis, we need to focus on a
couple of other things. For example, how do you know when a structural
equation model produces valid causal evidence? We can of course specify
models as complex as we want using these kinds of graphical tools,
click a button, and the software produces some numbers for us. How do
we know that we can trust those numbers? I
argue that most researchers don't know that well. And if you look at
published research using SEM, even methodological articles about
SEM, you can see lots of errors in the analysis part. And how do you
justify that what you do is correct? How do you justify that you're
using the correct estimator and options, for example? Most
people simply go with the software defaults. They don't even understand
that alternatives exist, let alone how to choose between those
alternatives. How do you know that the model is correct? Do you simply
trust that the model is correct, or are you able to run diagnostics
to understand why the model might be incorrect, in which way it could
be incorrect and how it could be fixed? All this requires lots of
expertise and lots of background and a basic understanding of how these
techniques work. The
interpretation is a problem, a big problem in management research
more generally. People simply look at the p-values, that is, at
whether there is or is not an effect, instead of looking at how large an
effect the cause has on the outcome. We should be looking at the magnitude
instead of the existence, and that is beyond the capabilities of most
people who apply SEM. So simply being able to run something like this,
compare your results against a list of cutoffs and then declare that
your model is good or not good, or that a hypothesis is supported or not
supported based on those cutoffs, that's easy. But that does not produce
good research. Another
thing that we can look at is which of these is better. If I teach you
regression analysis for two months, then you are pretty much guaranteed to
understand regression analysis well and be able to apply it
productively. So, it's a basic technique. It's not ideal for all cases,
but we can pretty much guarantee that you understand the limitations and
you know how to apply it correctly. Then you could have an optimal
technique. Let's say structural equation modeling would be optimal, but
it's applied incorrectly. Which one is better? I
would argue that the basic technique is better, because if you have a
slightly suboptimal technique, the results are still quite often
approximately right, or at least in the right direction. But if you apply
a technique incorrectly, then all bets are off. So even if the complex
analysis technique would be better if it were applied correctly, it can
be misleading if you don't know how to use it appropriately. Simply
being able to point and click a model together, click run, and get some numbers is
not the same as being able to use the technique productively. And this
is one of the challenges with modern, easy-to-use software. The
software makes it so easy to apply something that researchers forget to
think about their model and whether it makes sense in the first place. Another thing is that a solid understanding of the basics allows you to expand to more complex
techniques later. If you learn an incorrect application to start
with, for example, if you come to the class with the understanding that
the R-squared measures how good your model is, or that the chi-square test
can be fully ignored, or that a CFI greater than 0.9 is always good, these
bad practices are difficult to unlearn. I've done lots of unlearning
myself. It's better to learn the basics well and then build the complex
things on top of the basics, rather than to learn something that allows
you to publish and then unlearn it to be able to do good research.
Getting published and doing good research correlate, but they're
not the same thing. Learning
and applying a complex technique takes time that would be better invested
in looking at the research design. Quite often, how good your data are
and how good the research question is, that is what determines the
significance or importance of your finding, not how complicated or
sophisticated an analysis you had. If I do a paper, it probably applies a
complicated or complex analysis, because I know how to apply
those things. But if you don't know and you must learn, then that
learning time is better spent on thinking about the basics, on getting the
fundamentals right: having data that can inform policy and research.
And then complex techniques are often unnecessary. There's a nice paper by Dan McNeish in Psychological Methods where he looks at applications of
multilevel modeling, and he concludes that in most cases where multilevel
modeling is applied, it's completely unnecessary. Researchers could
have easily just applied regression analysis. It takes months less reading
to learn regression analysis than to really learn multilevel
modeling. It's a lot more effective, if it fits the problem, to use regression
instead of multilevel modeling, and it's also less likely to be
misused. And
then there's the thing that no one has ever rejected a paper for using
regression analysis. I don't know if anyone beyond me has ever said
that, but it seems to be true. If there is a paper that has a great
data set and a great research question, but the analysis is not the correct
one, then you are asked to revise. The analysis is super easy to fix. If you
don't have the right variables, or if your observations are incorrect, like
using student data to study boardroom behavior and that kind of
thing, those can't be fixed; you need to redo the study. But with the data
analysis, if a structural equation model or a multilevel model
really is needed, then you can just switch the analysis, rerun it, and resubmit
the revision of your paper. This
is also echoed in a nice paper by Aguinis and Vandenberg. They
state, based on their substantial experience, that the problems with
articles that are submitted to good journals are typically not in
the analysis part. Articles are not rejected because of bad analysis.
They're rejected because of data issues or research questions that are
not worth asking. Data analysis issues don't lead to rejections, because they can
always be fixed later. And then they point out a thing that I've
seen myself: some researchers tend to think that going for a
complex technique makes a paper better, and they focus on the
technique more than on the actual research design. Quite
often, when your research design is bad, you can't really fix it with a
complex technique. And if I look at articles that I've evaluated in the
past, I think most of the articles that I have rejected are applications
of structural equation modeling. Most of the articles testing causal
claims that go forward in the review process apply regression analysis,
and the authors have spent their time thinking about the
research design instead of thinking about how to make the analysis as
complex as possible. If
one of you insists that you want to run a marathon, or want to learn
structural equation modeling, that can be done. But we start with the basics,
and then we have a long list of things to go through, after which you will
really understand how to apply structural equation modeling
correctly, how to do diagnostics for the model, how to ensure that
the results are trustworthy, what to do when the software gives you
an error, and so on. But that requires a considerable amount of expertise.
It's better to start with the basics. And then, after you have the
basics down and you have written a paper with the basics, you go for
the advanced stuff.