TU-L0022 - Statistical Research Methods D, Lecture, 2.11.2021-6.4.2022
Confirmatory factor analysis example (13:43)
This video provides an analysis of an actual example of CFA. The video
shows how the factor loadings and variances are arrived at, and how the
model fit is tested.
Let's take a look at an empirical example of confirmatory factor analysis. Our data set for the example comes from Mesquita and Lazzarini. This is a nice paper because they present a correlation matrix of all the data at the indicator level. So we can use their table one - shown here - to calculate all the confirmatory factor analysis and structural regression models that the article presents, and we will also get, for the most part, the exact same results.
So let's check how the confirmatory factor analysis is estimated in R and what the results look like. Specifying the factor analysis model requires a bit of work. I'll explain the details of this syntax a bit later, but generally what we do first is specify the model. So we have to specify the indicators, and for every indicator we specify one factor, in this particular case. Then we estimate using the covariance matrix, and finally we plot the results as a path diagram. So that's the plotting command, and I have added some options to make the plot look a bit nicer.
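As a rough sketch, this workflow in R with the lavaan and semPlot packages might look like the following - the indicator names, the object cor_matrix, and the sample size are placeholders, not the actual values from the article:

  library(lavaan)
  library(semPlot)

  # Model specification: each line names a factor and lists its indicators.
  # Only two of the factors are shown here; the full model lists all of them.
  model <- "
    horizontal =~ hor1 + hor2 + hor3   # three-indicator factor
    innovation =~ inn1 + inn2          # two-indicator factor
  "

  # Estimate from the published correlation matrix instead of raw data:
  # cor_matrix stands for the indicator-level matrix from table one.
  fit <- cfa(model, sample.cov = cor_matrix, sample.nobs = 200)

  # Plot the result as a path diagram, with estimates printed on the paths.
  semPaths(fit, whatLabels = "est")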
Let's take a look at the model specification in more detail. I have color coded this: blue is for factors and green is for indicators. So we specify that we have about eight factors, and then we specify how each indicator loads on its factor. We have the factor "horizontal" measured with three indicators, we have the factor "innovation" measured with two indicators, and then we have the factor "competition" measured with a single indicator. So we have three-indicator factors, two-indicator factors, and single-indicator factors, which are the three scenarios that I explained in the video about factor scale setting and identification.
So what parameters do we need to estimate? We need to estimate the factor loadings. We are going to be scaling each latent variable using the first indicator fixing technique, and we will estimate factor variances, factor covariances, and indicator error variances. The model is identified using the following approach. We need to set the scale of each latent variable; we use first indicator fixing, so we fix the first indicator's loading at one. That's the default setting, so we don't have to specify it here in any way. Then we need to consider how the three, two, and one indicator rules are applied. We have three-indicator factors; they are always identified. We have two-indicator factors; they are identified because they are embedded in a larger system of factors, so we can use information from the other factors to identify those loadings and we don't have to do anything special. Then for one-indicator factors we fix the error variances to be zero. So we say that these single indicators, or single-indicator factors, are perfectly reliable - the error variances are zero for indicators that are sole indicators of their factors.
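In lavaan model syntax, these identification choices look roughly like the lines below, which would go inside the model string (the indicator names are made up). The first loading of each factor is fixed at one by default, so only the single-indicator factor needs an explicit constraint:

  horizontal  =~ hor1 + hor2 + hor3   # three indicators: identified on its own
  innovation  =~ inn1 + inn2          # two indicators: identified through the rest of the model
  competition =~ comp1                # a single indicator
  comp1 ~~ 0*comp1                    # fix its error variance to zero (assumed perfectly reliable)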
As a path diagram, the result looks like this. We have the factor covariances here - these curves between factors. We have the factor variances - these curves that start from a factor and come back to the same factor. We have the factor loadings - these arrows from the factors to the indicators - and then we have the indicator error variances, these curved arrows here. The dashed arrows are parameters that have been fixed: that one is constrained to be one, and that one is constrained to be zero. So that is a single-indicator factor whose error variance is constrained to be zero.
So that's what we have, and there are some funny things here. We can see that some error variances are negative. This is a Heywood case, and I have another video explaining what a Heywood case is and why it occurs. So we have negative variances, but they are close to zero, so we can conclude that maybe these indicators are just highly reliable, the error variance is actually close to zero, and because of sampling error we get small negative values. Since these are small negative values, we don't really worry about them: we assume the indicators are highly reliable instead of taking this as a symptom of model misspecification.
As I said, these results mostly match what is reported in the paper. There is a small mismatch in the factor loadings, but otherwise the factor loadings here match exactly what the article reports.
In text form, the output contains a couple of things for us. We have the estimation information first - the degrees of freedom and the chi-square, which I'll explain in the next video - then we have the actual estimates, and in the estimates list we have the estimate, standard error, z-value, and p-value. This goes on; it's a very long printout. And then we have some warnings.
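For reference, this kind of printout comes from the model summary; assuming the lavaan fit object sketched above, it could be requested like this:

  summary(fit)             # test statistic, degrees of freedom, and the parameter estimates
  parameterEstimates(fit)  # the same estimates as a data frame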
The warnings here are about the Heywood case - both of them relate to it. Let's take a look at the estimation information part next. This is the same kind of information that any structural regression modeling software gives you, so it's not exclusive to R: you will get this estimation information and the actual estimates.
Let's take a look at the estimation information and the degrees of freedom first. The degrees of freedom is 147, and that's the same as reported in the article. So where does that 147 come from? It is a good exercise to calculate the degrees of freedom by hand, because then you will understand what was estimated. There is a nice paper by Cortina and colleagues where they calculate the degrees of freedom from published articles and check whether they actually match the reported degrees of freedom - they don't always match, and that's an indication that there is something funny going on in the analysis. Let's do the degrees of freedom calculation. We start with 231 unique elements of information: the correlation matrix of all 21 indicators has 231 unique elements, so that's the amount of information. Then we start to subtract the things that we estimate. We estimate 10 factor variances - we have 10 factors, and each factor has an estimated variance. Then we estimate 45 factor covariances, because 10 variables have 45 unique correlations. Then we subtract 11 factor loadings: remember that we always fix the first loading to be 1 to identify each factor, so of the 21 indicators, 10 loadings are used for scaling the factors and we estimate the remaining 11. Then we have 18 indicator error variances: we had 21 indicators, but three are sole indicators of their factors, so their error variances are fixed to zero. That gives 147. So that's the degrees of freedom. We can check that our analysis actually matches what was done in the paper by comparing the degrees of freedom and also by comparing the chi-square.
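As a quick check of this arithmetic, using the counts just listed:

  p <- 21                         # number of indicators
  moments <- p * (p + 1) / 2      # 231 unique elements of information in the data
  estimated <- 10 + 45 + 11 + 18  # factor variances + covariances + free loadings + error variances
  moments - estimated             # 231 - 84 = 147 degrees of freedom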
The 147 degrees of freedom tell us that we have excess information: we could estimate 147 more parameters if we wanted to. After 147 more parameters we would have used all the information and couldn't estimate anything more. We can also use this excess information to check whether it matches the predictions from our model, and that is the idea of model testing. We can use the redundant information to test the model: we have more information than we need for model estimation, so we can ask whether the additional information is consistent with our estimates. If it is, then we conclude that the model fits the data well.
So the idea of model testing is that we have the data correlation matrix here - that's for the first six indicators - then we have the implied correlation matrix here, and then we have the residual correlation matrix here. Again, the estimation criterion was to make this residual correlation matrix as close to all zeros as possible by adjusting the model parameters that produce the implied correlation matrix. These residuals are pretty close to zero, and if our model fits the data perfectly, it means that it reproduces the data perfectly, or the residuals are all zero.
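With lavaan, these matrices can be pulled from the fitted model roughly like this (again assuming the fit object sketched above):

  fitted(fit)$cov               # model-implied covariance/correlation matrix
  residuals(fit, type = "cor")  # residual correlations: data minus model-implied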
We want to know whether the model is correct for the population. So the question that we ask now is whether this model would have produced the population correlation matrix, if we had access to that actual population correlation matrix. In small samples the sample correlations are slightly off - they are not exactly at the population values - and therefore the residuals are not exactly at zero. So we ask: are these differences from zero small enough that we can attribute them to chance? Is it plausible to say that the model is correct but does not reproduce the data exactly, because of small sample fluctuations in the correlations?
This question - can these residual correlations be due to chance only - is what the chi-square statistic quantifies. We have the chi-square statistic here. It is a function of these residuals and it doesn't really have an interpretation by itself, but it is distributed as chi-square with 147 degrees of freedom, and we can calculate a p-value for it. The p-value here is 0.25. So we say that if the residuals were all 0 in the population, then we would get a result like this or greater, by chance only, 25% of the time. We therefore cannot reject the null hypothesis. The null hypothesis is that these residuals are due to chance only; we cannot reject it, and therefore we say that the model fits the data well. This is the logic of chi-square testing in confirmatory factor analysis and structural regression models.
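These numbers are part of the standard output; with lavaan they could be extracted for example like this, using the 147 degrees of freedom from above:

  fitMeasures(fit, c("chisq", "df", "pvalue"))  # chi-square test of exact fit
  # The p-value is the upper tail of the chi-square distribution, e.g.
  # pchisq(chisq_value, df = 147, lower.tail = FALSE)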
So we want to say that these differences are small enough that we can attribute them to chance only, and we accept the null - or actually, we fail to reject the null. We then conclude that this evidence does not allow us to conclude that the model is misspecified. So we want a p-value here that is non-significant, because it indicates that our model is a plausible representation of the data, and we conclude that the model fits.
Let's take a look at the estimation information again. The estimation information gives us the p-value, the degrees of freedom, and the chi-square statistic; then we get the estimates, and then we get these warnings. Every time you get warnings, you need to actually look at what the warnings mean. Here R tells us that we should run inspect(fit, 'theta'). The theta matrix is the residual, or indicator error term, covariance matrix estimated from the data, and we should investigate it. Recall that we have the Heywood case: three negative error variances. When we inspect the theta matrix - it contains the estimated indicator error term variances, and all the covariances between the error terms are constrained to be 0 because we didn't estimate them in this model - we can see these three negative values here.
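The inspection that the warning asks for would look something like this, again with the fit object sketched above:

  inspect(fit, "theta")  # estimated indicator error (co)variances; the negative
                         # diagonal entries are the Heywood cases discussed above
  # lavInspect(fit, "theta") is the equivalent longer function name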
So what do we do with that? We conclude that these values are so close to zero that it is plausible they are actually small positive numbers, and that this is just an outcome of small sampling fluctuations.