TU-L0022_aalto-CUR-141790-3063741: Statistical issues in formative models (8:29)

According to the course settings, the course ended on 06.04.2022.

This video explains the statistical issues in specifying and identifying formative models.

The biggest problem in formative measurement is the idea that the indicators cause the construct. There are also statistical issues in how these models are specified and, in particular, in how they are identified. I will explain a couple of these issues in this video. There are a couple more, but they are not as important as these two.

The root of the problem is that a formative model, where we specify this latent variable as a function of these observed variables (three in this example) and this unobserved error term, is not identified by itself. It's basically like a regression analysis without the dependent variable. It's not identified because the correlations among the three indicators are free, and they consume all the degrees of freedom, so we have no information left for estimating these paths or the variance of the error term. So the degrees of freedom are negative.

There are a couple of ways around this problem. The most commonly recommended way is to add two normal indicators. The literature on formative measurement calls these reflective indicators. So we specify that this latent variable here is actually a common factor for these two measurements, and these measurements are added to identify the model. This leads to an interesting problem: this latent variable is now defined by these two normal indicators instead of these three formative or causal indicators. So these indicators, measure one and measure two, actually give this latent variable its identity and meaning.
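The counting argument in the two paragraphs above can be sketched in a few lines. This is only the usual counting rule (unique variances and covariances minus free parameters), and it assumes that one reflective loading is fixed to 1 to set the scale of the latent variable. The function name `count_degrees_of_freedom` is made up for this illustration.

```python
def count_degrees_of_freedom(n_formative: int, n_reflective: int) -> int:
    """Counting-rule degrees of freedom for one latent variable that has
    n_formative causal indicators and n_reflective reflective indicators."""
    n_observed = n_formative + n_reflective
    # Unique variances and covariances among the observed variables.
    n_moments = n_observed * (n_observed + 1) // 2

    # Free variances and covariances among the formative indicators.
    n_exogenous = n_formative * (n_formative + 1) // 2
    # One weight per formative indicator plus the variance of the error term.
    n_structural = n_formative + 1
    # Assumption: one loading is fixed to 1 for scaling, so there are
    # n_reflective - 1 free loadings plus one error variance per indicator.
    n_measurement = max(n_reflective - 1, 0) + n_reflective

    return n_moments - (n_exogenous + n_structural + n_measurement)


# Three formative indicators alone: 6 moments, 10 free parameters.
formative_only = count_degrees_of_freedom(3, 0)
# Adding two reflective indicators: 15 moments, 13 free parameters.
with_two_reflective = count_degrees_of_freedom(3, 2)
print(formative_only, with_two_reflective)
```

A non-negative count is a necessary condition for identification, not a sufficient one, so this only illustrates why the formative part cannot be estimated by itself.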

I've written a couple of papers about this topic, but the problem essentially is that if these causal or formative indicators are not valid measures of this latent variable, but these two indicators are, then these weights or regression coefficients here will simply be estimated as 0. So we have a normal latent variable measured with two indicators, and then we have three unrelated indicators that don't really have any relationship with the latent variable defined by these two indicators. So that's one problem. Another way of thinking about this is that if we have these two indicators here that measure the latent variable, then you don't need these three indicators at the bottom. You can just define the model and measure the latent variable normally with these two indicators, and there are no problems with that. And that of course doesn't go well with the idea that some concepts must be measured with these formative indicators.

So that's one problem. The cause of this phenomenon, that the meaning of this latent variable comes from these two measures instead of these three measures, is the error term here. Because the error term is unrelated to these three indicators, it guarantees that, whatever these three indicators represent, the latent variable will be a common factor of these two indicators. So if these three indicators are conceptually unrelated to whatever these two indicators represent, then the error term here will compensate for that, and we are basically just modeling the error term instead of whatever we think these causal indicators cause.
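A small simulation can illustrate the point about the weights going to zero. It is a sketch under stated assumptions, not a full latent variable estimation: the data are simulated, the loading of measure one is set to 1, and the weights are estimated by regressing measure one on the three causal indicators, which is consistent for the weights in this setup because the remaining error is unrelated to the indicators. All variable names and numeric values are invented for the illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
n_obs = 5000

# Assumed data-generating process: the latent variable is a common factor
# of two reflective measures and is unrelated to the three causal indicators.
latent = rng.standard_normal(n_obs)
measure_one = latent + 0.5 * rng.standard_normal(n_obs)  # loading set to 1
measure_two = 0.8 * latent + 0.5 * rng.standard_normal(n_obs)
causal = rng.standard_normal((n_obs, 3))

# Regress measure one on the causal indicators to estimate the weights.
centered = causal - causal.mean(axis=0)
weights, *_ = np.linalg.lstsq(
    centered, measure_one - measure_one.mean(), rcond=None
)

# Share of the latent variance carried by the weighted indicators.
# Whatever is left is carried by the error term.
explained_share = float(np.var(centered @ weights) / np.var(latent))

print("estimated weights:", np.round(weights, 3))
print("share explained by causal indicators:", round(explained_share, 4))
print("share left to the error term:", round(1 - explained_share, 4))
```

In this setup the estimated weights are close to zero and nearly all of the latent variance is left to the error term, while the two reflective measures still share a common factor.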

So that's one problem, and how do we deal with it? We can of course eliminate the problem by eliminating the error term from the model, but that leads to another problem. So let's consider this kind of model. Here this is not a latent variable anymore, because this formative latent variable is actually just a weighted sum of these indicators. There's no error term, so this is like a regression analysis without an error term.

Then how do we set these different weights? We create an index based on three different indicators, so we have to set these weights. The normal way of specifying this kind of model is that we have this latent variable here without the error term, then we have another latent variable that we want to explain with it, and we have a regression relationship between the two.

Specifying a model like that defines these weights so that they maximize this path. Is that problematic or not? Well, it is problematic, because if we want to test, for example, whether this beta here is zero or not, that is, whether this formative LV has an effect on this other latent variable, then setting the weights so that the beta is as large as possible is probably the worst possible way to create an index. If you want to test whether something exists, then exploiting any correlations in your data to make your estimate as large as possible is not a good estimation principle.
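The positive bias can be illustrated with a simulation in which the true effect is exactly zero. This sketch makes two simplifying assumptions: the other latent variable is replaced by a single observed outcome, and the "weights that maximize the path" are obtained with ordinary least squares, which maximizes the in-sample correlation between the weighted sum and the outcome. The sample size and number of replications are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)
n_obs, n_indicators, n_replications = 100, 3, 500

optimized_correlations = []
equal_weight_correlations = []
for _ in range(n_replications):
    # The indicators and the outcome are independent: the true effect is zero.
    indicators = rng.standard_normal((n_obs, n_indicators))
    outcome = rng.standard_normal(n_obs)

    # Weights chosen to make the relationship as strong as possible (OLS).
    centered = indicators - indicators.mean(axis=0)
    weights, *_ = np.linalg.lstsq(
        centered, outcome - outcome.mean(), rcond=None
    )
    optimized_index = centered @ weights

    # Weights fixed in advance: a plain mean of the indicators.
    fixed_index = indicators.mean(axis=1)

    optimized_correlations.append(
        abs(np.corrcoef(optimized_index, outcome)[0, 1])
    )
    equal_weight_correlations.append(
        abs(np.corrcoef(fixed_index, outcome)[0, 1])
    )

mean_optimized = float(np.mean(optimized_correlations))
mean_equal = float(np.mean(equal_weight_correlations))
print("mean |r| with optimized weights:", round(mean_optimized, 3))
print("mean |r| with equal weights:", round(mean_equal, 3))
```

In every replication the optimized index correlates at least as strongly with the outcome as the equal-weight index, even though no relationship exists in the population.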

So there's a possible positive bias. Another problem is that if we set these weights so that this beta is as large as possible, then the weights actually depend on whatever this other latent variable is, and this leads to a problem that this literature calls interpretational confounding. So the meaning of this latent variable here, which is supposed to be caused by these three formative indicators, actually depends on what the other latent variable is and on what other variables we have in the model. And that's undesirable.
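Interpretational confounding can be sketched the same way. Here the same three indicators are used to explain two different outcomes, and the empirically estimated weights come out different each time. The outcomes are observed variables standing in for the other latent variable, the weights are estimated with ordinary least squares, and the coefficients in the data-generating process are invented for the illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
n_obs = 5000
indicators = rng.standard_normal((n_obs, 3))

# Two hypothetical outcomes that depend on different indicators.
outcome_a = (
    0.6 * indicators[:, 0] + 0.1 * indicators[:, 1]
    + rng.standard_normal(n_obs)
)
outcome_b = (
    0.1 * indicators[:, 1] + 0.6 * indicators[:, 2]
    + rng.standard_normal(n_obs)
)


def estimated_weights(x: np.ndarray, y: np.ndarray) -> np.ndarray:
    """OLS weights, rescaled so that the absolute weights sum to one."""
    centered = x - x.mean(axis=0)
    coefficients, *_ = np.linalg.lstsq(centered, y - y.mean(), rcond=None)
    return coefficients / np.abs(coefficients).sum()


weights_a = estimated_weights(indicators, outcome_a)
weights_b = estimated_weights(indicators, outcome_b)
print("weights when explaining outcome A:", np.round(weights_a, 2))
print("weights when explaining outcome B:", np.round(weights_b, 2))
```

The index is dominated by the first indicator in one model and by the third indicator in the other, so the "same" formative variable means different things depending on what it is used to explain.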

So think about a stock index. Would it make sense for the stock index to be different depending on who is using it? I don't think so. It should be the same. So the meaning of the index should be the same across studies, which means that these weights also must stay the same. Then there's also the assumption that if these indicators here have any effect on this other latent variable, then that effect must be fully mediated by this formative latent variable.

So let's consider socioeconomic status as our formative latent variable. One of the indicators is your education, and we want to explain a child's education with the parents' socioeconomic status. Is it reasonable to assume that the parents' education has no causal effect on the child's education other than through full mediation by socioeconomic status? That is clearly unreasonable. So the full mediation assumption here is also unreasonable.

So what's the alternative? The solution is to define these weights based on theory. You set the weights based on your understanding of the phenomenon instead of trying to estimate them empirically, and that leads to index construction. So instead of specifying this complicated latent variable model that possibly has an error term, we just take the indicators and take a mean, a sum, or a weighted sum before our estimation, and we define the weights for the index construction based on existing understanding of the phenomenon and/or the theory.
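Index construction with fixed weights is simple enough to show directly. In this sketch the helper `build_index`, the tiny data set, and the "theory-based" weights are all hypothetical; in practice the weights would come from your understanding of the phenomenon, and the resulting index would then be used as an ordinary observed variable in the analysis.

```python
import numpy as np


def build_index(indicators: np.ndarray, weights) -> np.ndarray:
    """Weighted sum of indicator columns, with weights fixed in advance."""
    weights = np.asarray(weights, dtype=float)
    if indicators.shape[1] != weights.shape[0]:
        raise ValueError("need exactly one weight per indicator")
    return indicators @ weights


# Two observations of three indicators (made-up values).
indicators = np.array([
    [1.0, 2.0, 3.0],
    [4.0, 5.0, 6.0],
])

# Equal weights give the plain mean of the indicators.
mean_index = build_index(indicators, np.full(3, 1 / 3))

# Hypothetical theory-based weights, decided before any estimation.
theory_weights = np.array([0.5, 0.3, 0.2])
weighted_index = build_index(indicators, theory_weights)

print(mean_index)
print(weighted_index)
```

Because the weights never depend on the outcome, the index keeps the same meaning in every model that uses it.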

And I have another video on how you can actually do that and how you justify index construction. So that's clearly a good approach, a lot better than trying to specify these formative latent variable models.

TU-L0022 - Statistical Research Methods D, Lecture, 2.11.2021-6.4.2022
