TU-L0022 - Statistical Research Methods D, Lecture, 2.11.2021-6.4.2022
Simultaneous equations approach to mediation (15:58)
This video provides an alternative approach to calculating a mediation
model using a covariance matrix, or in this case a correlation matrix.
Parallels are drawn between path analysis and regression analysis to
explain how the model is fitted to the data.
The simultaneous
equation approach is another way of calculating a mediation model. How
this approach works is that we take the mediation model as one large
model and instead of estimating the regressions of y and m separately,
we derive the model implied covariance matrix. I will be using the
correlation metric here for simplification but in practice we work with
covariances nearly always. So we
look at, for example, the correlation between X and Y. We find that we
can go from X to Y using two different paths. We go from X through m to Y
and we go from X to Y directly. So that gives us two ways, or two
elements. We have the mediation effect, beta m1 times beta y2, plus the
direct effect, beta y1. Similarly, we can calculate the correlation
between m and Y: it is the direct path plus the spurious correlation due to X, which is a common cause of both. And that gives us the correlation between m and Y here.
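Written out with the coefficient names used here (beta m1 for the effect of X on m, beta y1 for the direct effect of X on Y, and beta y2 for the effect of m on Y), the model implied correlations under the correlation metric are:

    cor(X, m) = beta_m1
    cor(X, Y) = beta_y1 + beta_m1 * beta_y2
    cor(m, Y) = beta_y2 + beta_m1 * beta_y1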
How we estimate this model is that we find the betas so that the data correlation matrix and the model implied correlation matrix are as close to one another as possible. To understand the calculation, we have to
first take a look at the degrees of freedom because that's important
for this particular problem. The degrees of freedom for this model is
calculated based on these correlations. Because we only use information from the correlations, we don't look at the individual observations. Our units of information, the data, are the unique elements of the correlation matrix that depend on the model parameters. Importantly, the variance of X doesn't count, because it doesn't depend on any of the model parameters. So we have these five elements: the variance of m, the variance of Y, and all the correlations that depend on the model. So we have five units of data.
Then we have five things that we estimate, five free parameters: we have three regression coefficients here, and then we have these two variances, the variance of the error term of m and the variance of the error term of Y. So we estimate five different things, and the degrees of freedom for this model is then zero, because it is the difference between these two.
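In other words, the degrees of freedom here is the difference between the units of data and the number of free parameters:

    degrees of freedom = 5 units of data - 5 free parameters = 0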
We say that this is a 'just identified' model. Just identified means that we can estimate the model, but we are using all the information from the data to estimate the model and could not add anything more to it. It
also means that the model will fit perfectly and what that means I will
explain a bit later in the video. So
we have a just identified model, it means that we can find the values of these variances and the betas so that the model implied correlation matrix matches exactly the data correlation matrix. We can do that, for example, using the lavaan package in R; you can do the same with the sem command in Stata. lavaan gives us output that contains, importantly, two different sections.
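As a rough sketch of what this could look like with lavaan in R (the data frame mydata and the variable names x, m, and y are hypothetical placeholders; the labels follow the beta notation used in the video):

    library(lavaan)

    # Partial mediation model with labelled coefficients
    model <- '
      m ~ bm1*x            # beta m1
      y ~ by1*x + by2*m    # beta y1 (direct effect) and beta y2

      indirect := bm1*by2        # mediation (indirect) effect
      direct   := by1            # direct effect
      total    := by1 + bm1*by2  # total effect
    '

    # Estimate from raw data; sample.cov = and sample.nobs = could be used
    # instead to work directly from a covariance or correlation matrix.
    fit <- sem(model, data = mydata)
    summary(fit)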
So we have estimation information, and this is not particularly useful.
Because our degrees of freedom is zero, we can't do model testing. If we had positive degrees of freedom, we could test the model, and I'll
talk more about that later in the video. And then we have these
coefficients. So we have
regressions, we have regressions of Y on m and X, so that's beta y1 and
beta y2. Then we have regression of m on X. That's beta m1, and then we
have these estimated error variances, the error variance of m and the error variance of Y. So we get the estimates here, we get the standard errors here, and we get these Z values; they are not T values, because this is based on large sample theory. And then we get p-values for these estimates. Then
we can also calculate, using this package, the mediation effect; we define that in the model, and the software will calculate it automatically, together with its standard error, Z value, and p-value. And then we have the total effect, which is the effect of X on Y that goes directly plus the effect that goes through m. So that's the total effect, the influence of X on Y regardless of whether it goes directly or through m. And then the direct effect is just beta y1. So that gives us the estimates, and that is how it works if we want to estimate a partial mediation model. Importantly, these estimates will be exactly the same as the ones you get from regression analysis: if you estimate this model separately using regressions, you will get the exact same results.
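To illustrate that equivalence, a minimal sketch using the same hypothetical data frame:

    # Two separate regressions reproduce the coefficients of the
    # just-identified path model above
    summary(lm(m ~ x, data = mydata))      # beta m1
    summary(lm(y ~ x + m, data = mydata))  # beta y1 and beta y2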
There will be differences once we start to estimate models that are over-identified. For example, if we estimate directly a full mediation model.
So we're saying that there is no path from X to Y. We estimate the model where we assume that the effect of X on Y goes entirely through m. And when we apply the tracing rules again, we can see that the equations are a bit simpler here, because we only go from X to Y using this one path, beta m1 times beta y2. So there's no direct path anymore from X to Y, it's only this product. And this model has
positive degrees of freedom. So the data are the same, we have 5 units of data, but we now only have 4 parameters that we estimate: two regression coefficients and two error variances. Then the degrees of freedom is the difference, so we have one degree of freedom, and we call this an over-identified model.
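As a sketch, the corresponding full mediation model could be specified in lavaan like this (same hypothetical data frame and variable names as before):

    library(lavaan)

    # Full mediation: the direct path from x to y is omitted
    model_full <- '
      m ~ bm1*x            # beta m1
      y ~ by2*m            # beta y2

      indirect := bm1*by2  # the only effect of x on y
    '

    # Degrees of freedom = 5 units of data - 4 free parameters = 1,
    # so a chi-square test of model fit is reported.
    fit_full <- sem(model_full, data = mydata)
    summary(fit_full, fit.measures = TRUE)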
The problem, or feature, whichever you want to call it, of these over-identified models is that generally we cannot make the model implied correlation matrix exactly equal the data correlation matrix. Instead
of making those the same and solving, we have to make the model implied
correlation matrix as close as possible to the data correlation matrix. So
to make that model implied correlation matrix as close as possible to the data correlation matrix, we have to define what we mean by close. So we have to define how we quantify the distance, that is, how different
the model implied correlation matrix is from the data correlation
matrix. This problem of quantifying the difference between these two
matrices is comparable to the situation in regression analysis. So
in regression analysis we use the discrepancy function. So we calculate
the difference between a regression line and the actual observations.
And to do that we calculate the residual, so the difference between a
line and the observations. We take the squares of the residuals. The idea of taking squares is that we want to avoid having large prediction errors. So we are OK with small prediction errors, but
we want to avoid having large prediction errors. Then
we take a sum of these squares and that gives us the ordinary least
squares estimator: we minimize that, and it gives us the regression coefficients. In path analysis we calculate the difference between each
unique cell in the observed correlation, or covariance, matrix and the
model implied correlation, or covariance, matrix. We
raise those differences to the second power. The idea again is that we
want to avoid having models that explain some parts of the data really
badly and we are kind of OK with models that are slightly off compared
to the data. Then we sum these squared differences, and that provides the unweighted least squares estimator.
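Written side by side, the two discrepancy functions compared here are, roughly:

    Regression (OLS):     sum over observations of (observed y - predicted y)^2
    Path analysis (ULS):  sum over unique matrix elements of (observed correlation - implied correlation)^2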
There is another parallel between path analysis and regression analysis. Besides minimizing the discrepancy function, which gives us estimates that are in some way ideal, the discrepancy can also be used to
quantify the goodness of fit of the model. So the R-square, or one definition of R-square in regression analysis, is based on this sum of squares. So we calculate the regression sum of squares and then we
compare that to the total sum of squares and that gives us R-square. Then
here we have the sum of squares of these covariance errors and that can
be used to quantify the model fit as well. Let's take a look. Here is the estimation information again. We have one degree of freedom for this full mediation model, and we have a p-value that is non-significant. I'll go through the p-value shortly. So the
idea of the p-value is that it quantifies how different the actual
observed correlation matrix is from the implied correlation matrix. So
the difference between this observed correlation matrix and this model
implied correlation matrix is called the residual correlation matrix. So
again, there's a parallel to residuals in regression analysis. When we work with raw observations, as in regression analysis, the residual is the difference between the actual observation and the predicted value. Here
when we work with correlations, the residual is the difference between a
predicted correlation and observed correlation. So this residual
correlation matrix here is basically the observed correlations minus the
implied correlations. You can verify that this is actually the case here.
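Continuing the hypothetical lavaan sketch above, the residual correlation matrix of the fitted full mediation model can be inspected, for example, with:

    # Residual correlations: observed correlations minus model implied correlations
    residuals(fit_full, type = "cor")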
So the question that the p-value here answers is whether this small residual correlation can be due to chance only. So is it possible that the
sampling error in the observed correlation matrix produces that kind of
discrepancy? The residual is close to zero, so we can say that it's probably due to chance, but if it were far from zero, then we would know that this model doesn't adequately explain the correlation between X and Y, and we would probably conclude that X also has a direct effect on Y. So it
would be a partial mediation instead of a full mediation model that is
specified here. So that's the test
here, the p-value of about 0.7 indicates that getting this kind of
effect by chance only is plausible. This is called an over-identification test, because we have one degree of freedom and we are testing whether that one degree of freedom is consistent with what we have in the model. Here, unusually, we want to accept the null hypothesis of the chi-square test. Normally, in regression analysis, we are interested in showing that the null hypothesis that a coefficient is zero is not supported, because we usually want to say that there is an effect. Now
we want to say that there is no difference between the model implied matrix and the actual matrix. So we are saying that the model implied
matrix fits well to the data and therefore we can conclude that the
model implied matrix is in some sense correct and the model is in some
sense correct. So we want to accept the null hypothesis. If we reject
the null hypothesis then we conclude that this model is inadequate for
the data and we shouldn't make many inferences based on the model
estimates. Instead we should be looking at why the model doesn't explain
the data well and perhaps adjust the model, for example, add the direct
path from X to Y. Now, here we have just one statistic, so we could just be comparing this
statistic against an appropriately chosen normal distribution. We don't
do that, instead we use the chi-square test. The reason is that for more
complicated models there is typically more than one element of this residual correlation matrix that is nonzero. So when we ask the question of whether this small difference can be by chance only, we can take a look at the normal distribution and how far from zero the estimate is. And that gives us the p-value; the Z value is the estimate divided by its standard error. If
we have two cells here that are different from zero, then we have to do
a test that both of these are zero at the same time. So we are looking at the plane: instead of looking at one variable, we look at two variables and how far they are from zero. And you may remember, from an earlier video or from your math classes in high school, that this distance is calculated by taking the square of one coordinate and the square of the other coordinate, taking the sum, and then taking a square root. In practice, we
don't take the square root because we can just use a reference
distribution that takes the square root into account. So we have the
square of one estimate and the square of the other estimate. We take a sum, and that gives us the chi-square statistic. So the chi-square distribution is the distribution of the sum of squares of two normally distributed variables when both have a mean of zero. So the null
hypothesis is that both of these are zero, then the distribution is
chi-square. So we take one random variable normally distributed,
centered at zero. We square that, we take another one, we square that,
we take a sum, and that gives us the reference distribution. So basically there's a parallel again: in regression we minimize the sum of squared residuals, and here we want to minimize the sum of squares of these differences, and we quantify these differences by looking at the actual sum of squares. So we take the square of the estimate and divide by the square of the standard error, which is the variance, and that gives us the chi-square statistic. So
the logic is that instead of comparing just one statistic against a normal distribution, we compare the sum of squares of the two differences against the sum of squares of two normally distributed variables. If it's plausible that a random process of two normally distributed variables would have produced the same distance, then we conclude that this could be by chance only.
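As a toy numerical illustration of this logic (the numbers are made up, and this is not how lavaan actually computes its test statistic, which is based on the discrepancy function):

    # Two hypothetical standardized discrepancies (estimate / standard error)
    z1 <- 0.4
    z2 <- -0.3

    # Sum of squares compared against a chi-square distribution with 2 degrees of freedom
    stat <- z1^2 + z2^2
    pchisq(stat, df = 2, lower.tail = FALSE)  # p-value: is this distance plausible by chance?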