TU-L0022 - Statistical Research Methods D, Lecture, 2.11.2021-6.4.2022
Confirmatory factor analysis example (13:43)
This video provides an analysis of an actual example of CFA. The video
shows how the factor loadings and variances are arrived at, and how the
model fit is tested.
Let's take a look at an empirical example of confirmatory factor analysis. Our data set for the example comes from Mesquita and Lazzarini. This is a nice paper because they present a correlation matrix of all the data at the indicator level. So we can use their table one - shown here - to calculate all the confirmatory factor analysis and structural regression models that the article presents, and we will also get, for the most part, the exact same results.
So let's check how the confirmatory factor analysis is estimated in R and what the results look like. Specifying the factor analysis model requires a bit of work. I'll explain the details of this syntax a bit later, but generally what we do first is specify the model. So we have to specify the indicators, and for every indicator we specify one factor, in this particular case. Then we estimate using the covariance matrix, and finally we plot the results as a path diagram. So that's the plotting command, and I have added some options to make the plot look a bit nicer.
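As a rough sketch, this workflow in R with the lavaan and semPlot packages might look like the following - the indicator names, the object cor_matrix, and the sample size are placeholders, not the actual values from the article:

  library(lavaan)
  library(semPlot)

  # Model specification: each line names a factor and lists its indicators.
  # Only two of the factors are shown here; the full model lists all of them.
  model <- "
    horizontal =~ hor1 + hor2 + hor3   # three-indicator factor
    innovation =~ inn1 + inn2          # two-indicator factor
  "

  # Estimate from the published correlation matrix instead of raw data:
  # cor_matrix stands for the indicator-level matrix from table one.
  fit <- cfa(model, sample.cov = cor_matrix, sample.nobs = 200)

  # Plot the result as a path diagram, with estimates printed on the paths.
  semPaths(fit, whatLabels = "est")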
Let's take a look at the model specification in more detail. I have color coded this: blue is for factors and green is for indicators. So we specify that we have about eight factors, and then we specify how each indicator loads on its factor. We have the factor "horizontal" measured with three indicators, we have the factor "innovation" measured with two indicators, and then we have the factor "competition" measured with a single indicator. So we have three-indicator factors, two-indicator factors, and single-indicator factors, which are the three scenarios that I explained in the video about factor scale setting and identification.
So what parameters do we need to estimate? We need to estimate the factor loadings. We are going to be scaling each latent variable using the first indicator fixing technique, and we will estimate factor variances, factor covariances, and indicator error variances. The model is identified using the following approach. We need to set the scale of each latent variable; we use first indicator fixing, so we fix the first indicator's loading at one. That's the default setting, so we don't have to specify it here in any way. Then we need to consider how the three, two, and one indicator rules are applied. We have three-indicator factors; they are always identified. We have two-indicator factors; they are identified because they are embedded in a larger system of factors, so we can use information from the other factors to identify those loadings and we don't have to do anything special. Then for one-indicator factors we fix the error variances to be zero. So we say that these single indicators, or single-indicator factors, are perfectly reliable - the error variances are zero for indicators that are sole indicators of their factors.
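In lavaan model syntax, these identification choices look roughly like the lines below, which would go inside the model string (the indicator names are made up). The first loading of each factor is fixed at one by default, so only the single-indicator factor needs an explicit constraint:

  horizontal  =~ hor1 + hor2 + hor3   # three indicators: identified on its own
  innovation  =~ inn1 + inn2          # two indicators: identified through the rest of the model
  competition =~ comp1                # a single indicator
  comp1 ~~ 0*comp1                    # fix its error variance to zero (assumed perfectly reliable)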
As a path diagram, the result looks like this. We have the factor covariances here - these curves between factors. We have the factor variances - these curves that start from a factor and come back to the same factor. We have the factor loadings - these arrows from the factors to the indicators - and then we have the indicator error variances, these curved arrows here. The dashed arrows are parameters that have been fixed: that one is constrained to be one, and that one is constrained to be zero. So that is a single-indicator factor whose error variance is constrained to be zero.
So that's what we have, and there are some funny things here. We can see that some error variances are negative. This is a Heywood case, and I have another video explaining what a Heywood case is and why it occurs. So we have negative variances, but they are close to zero, so we can conclude that maybe these indicators are just highly reliable, the error variance is actually close to zero, and because of sampling error we get small negative values. Since these are small negative values, we don't really worry about them: we assume the indicators are highly reliable instead of taking this as a symptom of model misspecification.
As I said, these results mostly match what is reported in the paper. There is a small mismatch in the factor loadings, but otherwise the factor loadings here match exactly what the article reports.
In text form, the output contains a couple of things for us. We have the estimation information first - the degrees of freedom and the chi-square, which I'll explain in the next video - then we have the actual estimates, and in the estimates list we have the estimate, standard error, z-value, and p-value. This goes on; it's a very long printout. And then we have some warnings.
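For reference, this kind of printout comes from the model summary; assuming the lavaan fit object sketched above, it could be requested like this:

  summary(fit)             # test statistic, degrees of freedom, and the parameter estimates
  parameterEstimates(fit)  # the same estimates as a data frame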
The warnings here are about the Heywood case - both of them relate to it. Let's take a look at the estimation information part next. This is the same kind of information that any structural regression modeling software gives you, so it's not exclusive to R: you will get this estimation information and the actual estimates.
Let's take a look at the estimation information and the degrees of freedom first. The degrees of freedom is 147, and that's the same as reported in the article. So where does that 147 come from? It is a good exercise to calculate the degrees of freedom by hand, because then you will understand what was estimated. There is a nice paper by Cortina and colleagues where they calculate the degrees of freedom from published articles and check whether they actually match the reported degrees of freedom - they don't always match, and that's an indication that there is something funny going on in the analysis. Let's do the degrees of freedom calculation. We start with 231 unique elements of information: the correlation matrix of all 21 indicators has 231 unique elements, so that's the amount of information. Then we start to subtract the things that we estimate. We estimate 10 factor variances - we have 10 factors, and each factor has an estimated variance. Then we estimate 45 factor covariances, because 10 variables have 45 unique correlations. Then we subtract 11 factor loadings: remember that we always fix the first loading to be 1 to identify each factor, so of the 21 indicators, 10 loadings are used for scaling the factors and we estimate the remaining 11. Then we have 18 indicator error variances: we had 21 indicators, but three are sole indicators of their factors, so their error variances are fixed to zero. That gives 147. So that's the degrees of freedom. We can check that our analysis actually matches what was done in the paper by comparing the degrees of freedom and also by comparing the chi-square.
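As a quick check of this arithmetic, using the counts just listed:

  p <- 21                         # number of indicators
  moments <- p * (p + 1) / 2      # 231 unique elements of information in the data
  estimated <- 10 + 45 + 11 + 18  # factor variances + covariances + free loadings + error variances
  moments - estimated             # 231 - 84 = 147 degrees of freedom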
The 147 degrees of freedom tell us that we have excess information: we could estimate 147 more parameters if we wanted to. After 147 more parameters we would have used all the information and couldn't estimate anything more. We can also use this excess information to check whether it matches the predictions from our model, and that is the idea of model testing. We can use the redundant information to test the model: we have more information than we need for model estimation, so we can ask whether the additional information is consistent with our estimates. If it is, then we conclude that the model fits the data well.
So the idea of model testing is that we have the data correlation matrix here - that's for the first six indicators - then we have the implied correlation matrix here, and then we have the residual correlation matrix here. Again, the estimation criterion was to make this residual correlation matrix as close to all zeros as possible by adjusting the model parameters that produce the implied correlation matrix. These residuals are pretty close to zero, and if our model fits the data perfectly, it means that it reproduces the data perfectly, or the residuals are all zero.
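With lavaan, these matrices can be pulled from the fitted model roughly like this (again assuming the fit object sketched above):

  fitted(fit)$cov               # model-implied covariance/correlation matrix
  residuals(fit, type = "cor")  # residual correlations: data minus model-implied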
We want to know whether the model is correct for the population. So the question that we ask now is whether this model would have produced the population correlation matrix, if we had access to that actual population correlation matrix. In small samples the sample correlations are slightly off - they are not exactly at the population values - and therefore the residuals are not exactly at zero. So we ask: are these differences from zero small enough that we can attribute them to chance? Is it plausible to say that the model is correct but does not reproduce the data exactly, because of small sample fluctuations in the correlations?
This question - can these residual correlations be due to chance only - is what the chi-square statistic quantifies. We have the chi-square statistic here. It is a function of these residuals and it doesn't really have an interpretation by itself, but it is distributed as chi-square with 147 degrees of freedom, and we can calculate a p-value for it. The p-value here is 0.25. So we say that if the residuals were all 0 in the population, then we would get a result like this or greater, by chance only, 25% of the time. We therefore cannot reject the null hypothesis. The null hypothesis is that these residuals are due to chance only; we cannot reject it, and therefore we say that the model fits the data well. This is the logic of chi-square testing in confirmatory factor analysis and structural regression models.
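These numbers are part of the standard output; with lavaan they could be extracted for example like this, using the 147 degrees of freedom from above:

  fitMeasures(fit, c("chisq", "df", "pvalue"))  # chi-square test of exact fit
  # The p-value is the upper tail of the chi-square distribution, e.g.
  # pchisq(chisq_value, df = 147, lower.tail = FALSE)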
So we want to say that these differences are small enough that we can attribute them to chance only, and we accept the null - or actually, we fail to reject the null. We then conclude that this evidence does not allow us to conclude that the model is misspecified. So we want a p-value here that is non-significant, because it indicates that our model is a plausible representation of the data, and we conclude that the model fits.
Let's take a look at the estimation information again. The estimation information gives us the p-value, the degrees of freedom, and the chi-square statistic; then we get the estimates, and then we get these warnings. Every time you get warnings, you need to actually look at what the warnings mean. Here R tells us that we should run inspect(fit, 'theta'). The theta matrix is the residual, or indicator error term, covariance matrix estimated from the data, and we should investigate it. Recall that we have the Heywood case: three negative error variances. When we inspect the theta matrix - it contains the estimated indicator error term variances, and all the covariances between the error terms are constrained to be 0 because we didn't estimate them in this model - we can see these three negative values here.
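The inspection that the warning asks for would look something like this, again with the fit object sketched above:

  inspect(fit, "theta")  # estimated indicator error (co)variances; the negative
                         # diagonal entries are the Heywood cases discussed above
  # lavInspect(fit, "theta") is the equivalent longer function name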
So what do we do with that? We conclude that these values are so close to zero that it is plausible they are actually small positive numbers, and that this is just an outcome of small sampling fluctuations.