TU-L0022 - Statistical Research Methods D, Lecture, 2.11.2021-6.4.2022
Focus on the basics (12:30)
This video explains the importance of understanding the basics of statistical research

Before we proceed on the course, there's one thing that we need to get out of the way. Quite often when I start a course, the students ask me: "Mikko, I want to learn structural equation modeling. Can you teach me that?" And my answer to that question is something that people are typically not happy about. I say: "Yes. But let's get started with basic research design, regression analysis and factor analysis." Sometimes the student responds by telling me that he or she does not want to spend time on the basics but on learning structural equation modeling. That is because they have heard that it's the best technique available and they want to apply it in their research.

To understand this discussion, let's switch to a different context. Let's assume that the student says that he or she wants to learn how to run a marathon. How do you run a marathon? You can't do that right away; you need to start with jogging. Then the student responds: "I don't want to go jogging, teach me how to run a marathon."

Anyone who has done any running understands that you can't just go and run a marathon. You need to build the foundation first, the basic fitness to run shorter distances. If you can't run, you start by jogging, and then after jogging you move on to running, and so on. You need to build a foundation before you go on to advanced things. This is easy to understand, because you can simply try running a marathon without jogging first or doing the basic exercises to build the foundation. It simply does not work. But with statistical analysis, understanding that you are not actually doing things correctly, or not doing anything useful, is much more difficult.

Let's look at structural equation modeling, or SEM. If I have a high school student or an early bachelor's student, I can teach them how to do an SEM analysis in less than a day. I could even write them instructions and point them to YouTube videos like this one that demonstrate, by point and click, how to specify a structural equation model in Stata and click a button to estimate it. Then I can also teach the student that if something called the CFI is more than 0.9, they declare that their model is good, and if it's less than 0.9, the model is bad. Then for any interesting arrow, if a thing called a p-value is less than 0.05, they declare that they have a finding, and if it's more than 0.05, they declare that they don't have a finding. With this kind of algorithm, you can easily be running SEM analyses in less than a day.
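
To make concrete just how mechanical that checklist is, here is a minimal sketch of it in Python. This is my own illustration, not part of the lecture: it does not estimate anything, and the fit index and p-values are placeholders assumed to come from whatever SEM software was used.

```python
# A naive "cutoff checklist" for SEM results, as described above.
# The numbers (CFI, p-values) are placeholders that would come from SEM software.

def naive_sem_verdict(cfi, path_p_values):
    """Apply the simplistic rules: CFI >= 0.9 means 'good model',
    p < 0.05 on a path means 'finding'. Real interpretation needs far more."""
    verdict = {"model": "good" if cfi >= 0.9 else "bad"}
    for path, p in path_p_values.items():
        verdict[path] = "finding" if p < 0.05 else "no finding"
    return verdict

# Hypothetical output from some SEM software run
print(naive_sem_verdict(cfi=0.93, path_p_values={"A -> B": 0.03, "B -> C": 0.21}))
# {'model': 'good', 'A -> B': 'finding', 'B -> C': 'no finding'}
```

Anyone can apply rules like these in an afternoon; the hard part, discussed next, is knowing whether the model behind the numbers can be trusted at all.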

But does that produce good research? Simply being able to run an analysis does not mean that you can be a productive researcher with that analysis. How you specify the model, what the theory is behind the boxes and arrows, how you justify those boxes and arrows, and how you interpret the results: that is much more difficult. And it's not as if you could just follow rules of the kind "if some statistic is over a threshold, everything is okay; if it's below the threshold, everything is bad." So instead of focusing on how to run an analysis, we need to focus on a couple of other things. For example, how do you know when a structural equation model produces valid causal evidence? We can of course specify arbitrarily complex models using these kinds of graphical tools, click a button, and the software produces some numbers. How do we know that we can trust those numbers?

I argue that most researchers don't know that very well. If you look at published research using SEM, even methodological articles about SEM, you can see lots of errors in the analysis part. And how do you justify that what you do is correct? How do you justify, for example, that you're using the correct estimator and options? Most people simply go with the software defaults. They don't even understand that alternatives exist, let alone how to choose between them. How do you know that the model is correct? Do you simply trust that the model is correct, or are you able to run diagnostics to understand why the model might be incorrect, in which ways it could be incorrect, and how it could be fixed? All of this requires lots of expertise, background, and a basic understanding of how these techniques work.

The interpretation is also a problem, and a big problem in management research more generally: people simply look at the p-values, that is, at whether there is or is not an effect, instead of looking at how large an effect a cause has on the outcome. We should be looking at the magnitude of an effect instead of its existence, and that is beyond the capabilities of most people who apply SEM. So simply being able to run something like this, compare the results against a list of cutoffs, and then declare that the model is good or not good and that a hypothesis is supported or not supported based on those cutoffs: that's easy, but it does not produce good research.
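
As a minimal illustration of magnitude versus mere existence (again my own example, not from the lecture): with a large enough sample, even a trivially small effect will come out with p < 0.05, so the p-value alone says very little about whether the effect matters.

```python
# Simulated example: a tiny true effect becomes "significant" in a large sample.
# The p-value only says the effect is unlikely to be exactly zero; the estimated
# magnitude (the slope) is what tells us whether it is worth caring about.
import numpy as np
from scipy.stats import linregress

rng = np.random.default_rng(0)
n = 100_000                          # large sample
x = rng.normal(size=n)
y = 0.02 * x + rng.normal(size=n)    # true effect is only 0.02 standard deviations

result = linregress(x, y)
print(f"slope = {result.slope:.3f}, p-value = {result.pvalue:.1e}")
# Typically prints a slope near 0.02 with p far below 0.05:
# the effect "exists", but its magnitude is negligible.
```

The same logic applies to the paths in a structural equation model: the estimated effect sizes, not just the significance stars, are what need to be reported and interpreted.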

Another thing that we can look at is: which of these is better? If I teach you regression analysis for two months, then you are pretty much guaranteed to understand regression analysis well and to be able to apply it productively. It's a basic technique, and it's not ideal for all cases, but we can pretty much guarantee that you understand its limitations and know how to apply it correctly. Alternatively, you could have an optimal technique, let's say structural equation modeling would be optimal here, but it is applied incorrectly. Which one is better?

I would argue that the basic technique is better, because with a slightly suboptimal technique the results are still quite often approximately right, or at least in the right direction. But if you apply a technique incorrectly, then all bets are off. So even if the complex analysis technique would be better when applied correctly, it can be misleading if you don't know how to use it appropriately. Simply being able to point and click a model together, click run, and get some numbers is not the same as being able to use the technique productively. And this is one of the challenges with modern, easy-to-use software: it makes it so easy to apply something that researchers forget to think about their model and whether it makes sense in the first place.

Another thing is that a solid understanding of the basics allows you to expand to more complex techniques later. If you learn an incorrect application to start with (people come to the class with the understanding that, for example, R-squared measures how good your model is, or that the chi-square test can be fully ignored, or that a CFI greater than 0.9 is always good), those bad practices are difficult to unlearn. I've done lots of unlearning myself. It's better to learn the basics well and then build the complex things on top of the basics, rather than learn something that allows you to publish and then have to unlearn it to be able to do good research. Getting published and doing good research correlate, but they're not the same thing.

Learning and applying a complex technique takes time that would be better invested in the research design. Quite often, how good your data are and how good your research question is are what determine the significance or importance of your finding, not how complicated or sophisticated your analysis is. If I write a paper, it probably applies a complicated or complex analysis, because I already know how to apply those things. But if you don't know and you have to learn, then that learning time is better spent on getting the basics and the fundamentals right: on having data that can inform policy and research. And then complex techniques are often unnecessary.

There's a nice paper by Dan McNeish in Psychological Methods where he looks at applications of multilevel modeling and concludes that in most cases where multilevel modeling is applied, it is completely unnecessary; the researchers could just as easily have applied regression analysis. It's months less reading to learn regression analysis than to really learn multilevel modeling. It's a lot more effective, if it fits your problem, to use regression instead of multilevel modeling, and it's also less likely to be misused.

And then there's the fact that no one has ever rejected a paper for using regression analysis. I don't know if anyone besides me has ever said that, but it seems to be true. If there is a paper that has a great data set and a great research question, but the analysis is not the correct one, then you are asked to revise, and the analysis is super easy to fix. If you don't have the right variables, or your observations are inappropriate, for example you use student data to study boardroom behavior and that kind of thing, those problems cannot be fixed: you need to redo the study. But with data analysis, if a structural equation model or a multilevel model really is needed, then you can just switch the analysis, rerun it, and resubmit the revision of your paper.

This is also echoed in a nice paper by Aguinis and Vandenberg. They state, based on their substantial experience, that the problems with articles submitted to good journals are typically not in the analysis part. Articles are not rejected because of bad analysis; they're rejected because of data issues or because of research questions that are not worth asking. Data analysis issues don't lead to rejections, because they can always be fixed later. They also point out something that I've seen myself: some researchers tend to think that going for a complex technique makes a paper better, and then focus on the technique more than on the actual research design.

Quite often when your research design is bad, you can't really fix it with a complex technique. If I look at the articles that I've evaluated in the past, I think most of the articles that I have rejected were applications of structural equation modeling. Most of the articles testing causal claims that go forward in the review process apply regression analysis, and the authors have spent their time thinking about the research design instead of thinking about how to make the analysis as complex as possible.

If one of you insists that you want to run a marathon, or want to learn structural equation modeling, that can be done. But we start with the basics, and then we have a long list of things to go through, after which you will really understand how to apply structural equation modeling correctly, how to do diagnostics for the model, how to ensure that the results are trustworthy, what to do when the software gives you an error, and so on. That requires a considerable amount of expertise. It's better to start with the basics, and then, after you have the basics down and have written a paper with the basics, you go for the advanced stuff.