TU-L0022 - Statistical Research Methods D, Lecture, 2.11.2021-6.4.2022
Statistical issues in formative models (8:29)
This video explains the statistical issues in specifying and identifying the formative models.
The biggest problem in formative measurement is the idea that the
indicators cause the construct. There are also statistical issues in how
these models are specified and how particularly the models are
identified. I will explain a couple of these issues in this video. There
are a couple more but they are not as important as these two issues.
The root of the problem is that a formative model - where we specify
this latent variable as a function of these observed variables, three in
this example, and this unobserved error term - is not identified in
itself. It's basically like a regression analysis without the dependent
variable. It's not identified because the correlations among the three
indicators are free, which consumes all the degrees of freedom, so we
have no information left for estimating these paths or the variance of
the error term. So the degrees of freedom are negative.
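
The counting argument can be sketched in a few lines of Python. This is
only an illustration: the function name and the parameter tally are
assumptions made here (three indicator variances, three indicator
covariances, three weights, one error variance, and no scaling
constraint), not something taken from the video.

```python
def degrees_of_freedom(n_observed: int, n_free_parameters: int) -> int:
    """Unique observed variances and covariances minus free parameters."""
    n_moments = n_observed * (n_observed + 1) // 2
    return n_moments - n_free_parameters


# Formative model on its own: only the three indicators are observed.
n_indicators = 3

# The indicator variances and covariances are all free parameters.
indicator_parameters = n_indicators * (n_indicators + 1) // 2  # 3 + 3 = 6
weight_parameters = 3          # one path from each indicator (assumed all free)
error_variance_parameters = 1  # variance of the unobserved error term

df_formative = degrees_of_freedom(
    n_indicators,
    indicator_parameters + weight_parameters + error_variance_parameters,
)
print(df_formative)  # 6 - 10 = -4
```

Fixing one weight to set the scale of the latent variable would remove
one parameter, but the count would still be negative (6 - 9 = -3).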
There are a couple of ways around this problem. The most commonly
recommended way is that we add two normal indicators. The literature on
formative measurement calls these reflective indicators. So we specify
that this latent variable here actually is a common factor for these two
measurements, and these measurements are added to identify the model.
This leads to an interesting problem: this latent variable here is now
defined by these two normal indicators instead of these three formative
or causal indicators. So these indicators, measure one and measure two,
actually give this latent variable its identity and meaning.
I've written a couple of papers about this topic, but the problem
essentially is that if these causal formative indicators are not valid
measures of this latent variable - but these two reflective indicators
are - then these weights or regression coefficients here will simply be
estimated as 0. So we have a normal latent variable measured with two
indicators, and then we have three unrelated indicators that don't
really have any relationship with the latent variable defined by these
two variables here. So that's one problem. Another way of thinking about
this is that if we have these two indicators here that measure the
latent variable, then these three indicators here at the bottom are not
needed. You can just define the model and measure it normally with these
two indicators, and there are no problems with that. And that of course
doesn't go well with the idea that some concepts must be measured with
these formative indicators.
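
A small simulation can illustrate the point about weights being
estimated as 0. This is a sketch under stated assumptions: all variable
names, loadings, and sample sizes are made up for illustration, and
instead of fitting the full latent variable model it uses the fact that,
with the loading of the first reflective indicator fixed at 1, an
ordinary regression of that indicator on the three causal indicators
gives a consistent estimate of the weights.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# The latent variable eta drives the two reflective indicators y1 and y2.
eta = rng.normal(size=n)
y1 = eta + rng.normal(scale=0.5, size=n)        # loading fixed at 1
y2 = 0.8 * eta + rng.normal(scale=0.5, size=n)  # hypothetical loading 0.8

# Three "formative" indicators that correlate with each other
# but are, by construction, unrelated to eta.
common = rng.normal(size=n)
x = np.column_stack([common + rng.normal(size=n) for _ in range(3)])

# With the y1 loading fixed at 1, cov(x, y1) = cov(x, x) @ weights,
# so least squares of y1 on x estimates the weights consistently.
x_centered = x - x.mean(axis=0)
weights, *_ = np.linalg.lstsq(x_centered, y1 - y1.mean(), rcond=None)

print(np.round(weights, 3))                       # all close to zero
print(round(float(np.corrcoef(y1, y2)[0, 1]), 2))  # y1 and y2 still share eta
```

The two reflective indicators remain strongly correlated through the
latent variable, while the three causal indicators contribute nothing.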
So that's one problem. The cause of this phenomenon - that the meaning
of this latent variable comes from these two measures instead of these
three measures - is this error term here. Because the error term is
unrelated to these three indicators, it guarantees that, whatever these
indicators represent, the latent variable will be a common factor of
these two indicators. So if these three indicators are conceptually
unrelated to whatever these two indicators represent, then the error
term here will compensate for that, and we are basically just modeling
the error term with these three indicators instead of whatever we think
that these causal indicators here cause.
So how do we deal with that problem? We can of course eliminate it by
eliminating the error term from the model. But that leads to another
problem. So let's consider this kind of model. Here this is not a latent
variable anymore, because this formative latent variable is actually
just a weighted sum of these indicators. There's no error term, and this
is like a regression analysis without an error term.
Then how do we set these different weights? We create an index based on
three different indicators, and we set these weights. The normal way of
specifying this kind of model is that we have this latent variable here
without the error term, and then we have another latent variable that we
want to explain with this latent variable, and we have a regression
relationship.
Specifying a model like that defines these weights so that they maximize
this path. Is that problematic or not? Well, it is problematic, because
if we want to test, for example, whether this beta here is zero or not -
whether this formative LV has an effect on this other latent variable -
then setting these weights so that the beta is as large as possible is
probably the worst possible way to create an index. If you want to test
whether something exists, then exploiting any correlations in your data
to make your estimate as large as possible is not a good estimation
principle.
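
The effect of choosing weights to maximize the path can be sketched with
a simulation. The setup below is hypothetical (sample size, number of
replications, and equal weights as the comparison are all assumptions),
and it uses observed variables rather than latent ones: the outcome is
generated independently of the indicators, so the true effect is zero,
and the correlation between the index and the outcome stands in for a
standardized beta.

```python
import numpy as np

rng = np.random.default_rng(1)
n, n_reps = 100, 2000

r_fixed = np.empty(n_reps)
r_max = np.empty(n_reps)
for i in range(n_reps):
    x = rng.normal(size=(n, 3))  # three indicators
    y = rng.normal(size=n)       # outcome independent of x: true effect is 0

    # Equal weights chosen before looking at the data.
    r_fixed[i] = np.corrcoef(x.mean(axis=1), y)[0, 1]

    # Weights chosen to make the index-outcome relationship as strong as
    # possible; the least-squares fit of y on x gives those weights.
    xc = x - x.mean(axis=0)
    w, *_ = np.linalg.lstsq(xc, y - y.mean(), rcond=None)
    r_max[i] = np.corrcoef(xc @ w, y)[0, 1]

# Average index-outcome correlation under each weighting scheme.
print(round(float(r_fixed.mean()), 3), round(float(r_max.mean()), 3))
```

With fixed weights the correlations scatter around zero, while the
maximized correlation is never negative and is at least as large in
absolute value as the fixed-weight one in every replication.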
So there's a possible positive bias. Another problem is that if we set
these weights so that this beta is as large as possible, then the
weights actually depend on whatever this other latent variable is, and
this leads to a problem called interpretational confounding in this
literature. So the meaning of this latent variable here - that is
supposed to be caused by these three formative indicators - actually
depends on what the other latent variable, or other variables, in the
model are. And that's undesirable.
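
Interpretational confounding can be illustrated the same way. In this
sketch the two outcomes, their coefficients, and the helper
estimated_weights are hypothetical, and observed variables again stand
in for latent ones: the same three indicators receive very different
estimated weights depending on which outcome the index is asked to
explain.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 50_000
x = rng.normal(size=(n, 3))  # the same three indicators in both "studies"

# Two hypothetical outcomes that depend on different indicators.
outcome_a = 1.0 * x[:, 0] + 0.1 * x[:, 1] + rng.normal(size=n)
outcome_b = 0.1 * x[:, 1] + 1.0 * x[:, 2] + rng.normal(size=n)


def estimated_weights(x, y):
    """Correlation-maximizing weights, rescaled so absolute values sum to 1."""
    xc = x - x.mean(axis=0)
    w, *_ = np.linalg.lstsq(xc, y - y.mean(), rcond=None)
    return w / np.abs(w).sum()


w_a = estimated_weights(x, outcome_a)
w_b = estimated_weights(x, outcome_b)
print(np.round(w_a, 2))  # dominated by the first indicator
print(np.round(w_b, 2))  # dominated by the third indicator
```

The "index" built for the first outcome is mostly the first indicator,
and the one built for the second outcome is mostly the third, so the two
studies would not be talking about the same variable.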
So think about a stock index. Would it make sense that the stock index
would be different depending on who is using the index? I don't think
so. It should be the same. So the meaning of the index should be the
same across studies, which means that the weights of these indicators
must also stay the same. Then there's also the assumption that if these
indicators here have any effect on this other latent variable, then that
effect must be fully mediated by this formative latent variable.
So let's consider socioeconomic status. That's our formative latent
variable. One of the indicators is the parent's education, and we want
to explain the child's education with the parents' socioeconomic status.
Is it reasonable to assume that the parents' education has no other
causal effect on the child's education than through full mediation by
socioeconomic status? That is clearly unreasonable. So that full
mediation assumption here is also unreasonable.
So what's the alternative? The solution is to define these weights based
on theory. You set the weights based on your understanding of the
phenomenon instead of trying to estimate the weights empirically, and
that leads to index construction. So instead of doing this complicated
latent variable model that possibly has an error term, we just take the
indicators and we take a mean, or a sum, or a weighted sum, and we do
that before our estimation. We define the weights for the index
construction based on existing understanding of the phenomenon and/or
theory.
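
Index construction with weights fixed in advance can be sketched as
follows. The data, the equal weights, the name build_index, and the
choice to standardize the indicators before summing are all assumptions
made for illustration; the video only says to take a mean, a sum, or a
weighted sum before estimation.

```python
import numpy as np


def build_index(indicators, weights):
    """Weighted sum of standardized indicators, with weights fixed in advance."""
    indicators = np.asarray(indicators, dtype=float)
    weights = np.asarray(weights, dtype=float)
    # Standardize each column so the indicators are on a comparable scale.
    z = (indicators - indicators.mean(axis=0)) / indicators.std(axis=0, ddof=1)
    return z @ weights


# Hypothetical data: rows are respondents, columns are three indicators.
data = np.array([
    [12.0, 30.0, 2.0],
    [16.0, 55.0, 4.0],
    [18.0, 80.0, 5.0],
    [10.0, 25.0, 1.0],
])

# Weights set from prior understanding, not estimated from the outcome.
theory_weights = np.array([1 / 3, 1 / 3, 1 / 3])

ses_index = build_index(data, theory_weights)
print(np.round(ses_index, 2))
```

Because the weights never see the outcome variable, the index means the
same thing in every model in which it is later used.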
And I have another video on how you can actually do that and how you
justify index construction. So that's clearly a good approach - a lot
better approach than trying to specify these formative latent variable
models.