TU-L0022 - Statistical Research Methods D, Lecture, 2.11.2021-6.4.2022
Scale setting and identification (17:15)
In this video the minimum requirements for estimating confirmatory factor analysis models are explained, in particular scale setting and identification.
Transcript
There are two things that we need to consider before we can even start estimating a confirmatory factor analysis model: scale setting and identification. Scale setting means that every variable must have a metric, so we have to be able to estimate the variance, and sometimes the mean, of every variable. Identification means that the data provide enough information to estimate the model that we want to estimate. The confirmatory factor analysis framework is very flexible, and it is possible to define models that are mathematically impossible to estimate uniquely. So in this video we will go through the requirements you have to consider before you can even estimate the model meaningfully.
Let's take a look at this model with just two indicators. We have indicators a1 and a2, and we want to estimate factor A. We have two error variances and two factor loadings, so we have four things that we want to estimate: four free parameters. Then we start estimating and calculate the model-implied correlations. From the data we have two variances, the variance of a1 and the variance of a2, and one correlation, so we have three unique elements of information that we model using these four parameters. The problem is that we now have three units of information and four things that we want to estimate, so the degrees of freedom is minus one, and the model can't be estimated meaningfully. The intuitive understanding is that you cannot estimate four things from three things: you have to have at least as much information as what you want to estimate. So this model is not identified because the degrees of freedom is negative. We can either simplify the model so that something can be estimated, or add more indicators to make it identified. Factor analysis without additional constraints always requires at least three indicators. A factor analysis of only two indicators is not a very meaningful analysis anyway, because while you can make it identified, for example by constraining the two factor loadings to be equal, the estimation wouldn't give you any meaningful information.
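The counting argument above can be sketched as a small calculation. This is only an illustration of the degrees-of-freedom arithmetic, not tied to any particular software:

```python
# Degrees of freedom for a one-factor model with two indicators.
# A p x p correlation matrix has p*(p+1)/2 unique elements.
p = 2                                # indicators a1 and a2
unique_elements = p * (p + 1) // 2   # 2 variances + 1 correlation = 3

# Free parameters: 2 factor loadings + 2 error variances = 4
free_parameters = 2 + 2

df = unique_elements - free_parameters
print(unique_elements, free_parameters, df)  # 3 4 -1 -> not identified
```

A negative df means there are more free parameters than data moments, so no unique solution exists.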
Let's work more with this example. Assume that in our correlation matrix for this model with two factors, each with one indicator, the indicators a1 and b1 are correlated at 0.1. We have three parameters that we want to estimate, and one correlation that depends on all three of them (the indicator variances also depend on the model, but we don't really care about those in this video). So why is the correlation between a1 and b1 so low? There are basically three different options. It's possible that a1 and b1 are both highly reliable indicators of the factors A and B, and A and B are just weakly correlated. It's also possible that A and B are highly correlated but a1 is unreliable, and therefore we observe only a small correlation. Or it's possible that A and B are highly correlated but b1 is unreliable. The problem is that we cannot know which of these three options is correct, because they all have the same empirical implication: that this correlation is quite small. That's another example of a non-identification problem. Here we are estimating five things - two error variances, two factor loadings, and one factor correlation - from just three elements of information. We can't do that. The model is not identified, and we cannot know empirically which of the three explanations is correct. Of course, we can then use theory to rule out some of these alternative explanations, but that goes beyond factor analysis estimation and identification. So this model is not identified and cannot be estimated meaningfully.
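The three competing explanations can be made concrete with a small numeric sketch. The loadings below are made up for illustration; the point is only that all three parameter sets imply the same observed correlation of 0.1:

```python
# With standardized indicators and factors, the implied correlation is
# corr(a1, b1) = lambda_a * phi * lambda_b
def implied_corr(lambda_a, phi, lambda_b):
    return lambda_a * phi * lambda_b

# 1) Both indicators reliable, factors weakly correlated
c1 = implied_corr(0.9, 0.1 / 0.81, 0.9)
# 2) Factors perfectly correlated, a1 unreliable
c2 = implied_corr(0.1 / 0.9, 1.0, 0.9)
# 3) Factors perfectly correlated, b1 unreliable
c3 = implied_corr(0.9, 1.0, 0.1 / 0.9)

print(round(c1, 10), round(c2, 10), round(c3, 10))  # each is 0.1
```

Because all three sets reproduce the data equally well, the data alone cannot decide between them.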
Let's take a look at scale setting now. Identification basically means that you have at least as much information as what you estimate: the number of unique elements in the correlation matrix of the indicators must equal or exceed the number of free parameters in the model. Normally, in exploratory factor analysis we have standardized factors: all the factors have variances of 1 and means of 0, and that defines the scale of these variables. Every variable must have a variance; in exploratory analysis the factors are scaled to have unit variance, so they're standardized, and all the factor loadings are therefore standardized regression coefficients. Then, what if we don't standardize the factors? Instead of saying that each factor's variance is 1, we estimate the factor variances. So we add these two factor variances, giving us 15 free parameters. We still have 21 units of information from which we estimate, but we estimate 15 different things, so the degrees of freedom is 6, which means that this model is overidentified. The degrees of freedom is positive, so in principle it is possible to estimate this model meaningfully. Let's assume that that's our observed correlation matrix and that's our implied correlation matrix. Then we can find values for the phis and the lambdas so that the implied matrix reproduces the observed correlation matrix perfectly. In this case that's possible because the correlations all have the same values; in real samples you will never completely reproduce the data, but in this example you do, just to simplify things. So we can estimate, and that's one set of estimates that gives an exact fit between the observed correlation matrix and the implied correlation matrix. So we're fine, right?
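The counts in this example can be verified the same way as before. Assuming the slide's model is two factors with three indicators each (which matches the 21 and 15 mentioned above):

```python
# Two correlated factors, three indicators each.
p = 6                                # indicators in total
unique_elements = p * (p + 1) // 2   # 21 unique elements in the matrix

loadings = 6
error_variances = 6
factor_variances = 2                 # now freely estimated, not fixed to 1
factor_covariance = 1
free_parameters = loadings + error_variances + factor_variances + factor_covariance

df = unique_elements - free_parameters
print(unique_elements, free_parameters, df)  # 21 15 6 -> df is positive
```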
It turns out we have a small problem, because there's another set of estimates that also reproduces the correlation matrix perfectly through the implied correlation matrix. You can plug these values into the equations and see that they produce the exact same implied correlations. Here one solution has a factor variance of 1 and the other has a factor variance of 2, and they produce the same fit. So what do we do? We can come up with infinitely many examples: if the factor variance is 0.5, we get yet another set of factor loadings, but the empirical correlation matrix is still reproduced perfectly by the model-implied correlation matrix. This is the problem of scale setting of latent variables in confirmatory factor analysis models. We need to set the metric of the factors ourselves. Because we don't observe the factors, they are just abstract entities: we don't know whether they vary from 0 to 1, or from 0 to 1 million, or from minus 5 to plus 10. We don't know their range, their variances, or their means. We have to specify the scale of each factor ourselves.
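The indeterminacy has a simple algebraic form: multiplying the factor variance by a constant and dividing the loadings by its square root leaves the implied covariance matrix unchanged. A minimal sketch with made-up numbers:

```python
import numpy as np

# One-factor model: implied covariance is Sigma = phi * (lam lam^T) + diag(theta).
def implied_cov(lam, phi, theta):
    lam = np.asarray(lam, dtype=float)
    return phi * np.outer(lam, lam) + np.diag(theta)

lam = [0.8, 0.7, 0.6]        # loadings (made up)
theta = [0.36, 0.51, 0.64]   # error variances (made up)

# Scaling 1: factor variance fixed to 1
sigma1 = implied_cov(lam, 1.0, theta)
# Scaling 2: factor variance 4, loadings halved -> same implied covariance
sigma2 = implied_cov([l / 2 for l in lam], 4.0, theta)

print(np.allclose(sigma1, sigma2))  # True: the data cannot tell them apart
```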
In exploratory analysis we typically don't model means, and we fix the variances of the factors to be ones. In confirmatory analysis there are reasons why we don't fix the variances to ones, which I'll explain a bit later. The problem, generally, is that we must define whether we are talking about centimetres or inches, or about Celsius or Fahrenheit. They quantify the exact same thing, and from a statistical perspective they are equally good measures of length or temperature; we just have to agree on the scale that we're using. A regression coefficient gives us the effect of a one-unit change in the independent variable on the dependent variable, so considering regression coefficients only makes sense after we have defined the unit. What is the unit of A, and what is the unit of B? We have to set them manually, so we have to decide on a scale-setting approach. In exploratory analysis, as I said, we typically say that factor A and factor B - all factors - have variances of one. That produces standardized factor loadings, which are standardized regression coefficients of the indicators on the factors or, in the case of uncorrelated factors, equal correlations. We use that in exploratory factor analysis, but we cannot use it in a structural regression model. A structural regression model is an extension of a factor analysis model where we allow regression relationships between the factors. The reason why we can't use this approach there is that the variance of an endogenous variable - a variable that depends on other variables - is determined by those other variables. We can't say that a variable's variance is 1 if that variance depends on other things in the model. But that's beyond this video.
Another very common approach is to fix the first indicator's loading to be one. This is the default scale-setting approach in most structural regression modelling or confirmatory factor analysis software. The reason is that it can be used pretty much always, regardless of what kind of variables A and B are and what kind of relationship is specified between them. The idea is that if we assume that classical test theory holds - so all the errors are just random noise - then the variance of A is the variance of the true score of a1. That is also appealing: if the only source of error is random noise, then the variance of factor A is the variance that a1 would have if it wasn't contaminated with random noise. It allows us to consider the scale of the indicators without error variance, assuming classical test theory holds for the data. This is such a common approach that there's a rule of thumb that I present: always use the first indicator to fix the scale.
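The true-score interpretation can be sketched numerically. With the first loading fixed to one, the observed variance of a1 decomposes into factor variance plus error variance; the numbers below are made up:

```python
# Marker-indicator scaling: lambda_1 fixed to 1, so var(a1) = phi + theta_1.
var_a1 = 2.5    # observed variance of the first indicator (made up)
theta_1 = 0.5   # its error variance (made up)

# The factor variance equals the variance a1 would have without random noise.
phi = var_a1 - theta_1
print(phi)  # 2.0
```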
We can see that the papers that we have used as examples in these videos use this approach. In Mesquita and Lazzarini, you can see that all loadings of first indicators are ones. So they set the scale of each latent variable by fixing this loading to one, and then they have the Z-statistic here. You can see that the first indicators don't have a Z-statistic. The reason is that they are not estimated from the data - instead, the researcher says that these loadings are ones. If something is not estimated, it doesn't vary from sample to sample, so it doesn't have a standard error, and we can't calculate a Z-statistic for it. We can see the same in Yli-Renko's paper: the first loading is not one, but it doesn't have a standard error or a Z-statistic, which is an indication that they actually fixed the first loading to set the scale of the latent variables. If you want to have standardized factor loadings - loadings expressed in the scale of the exploratory analysis, where the factor variances are ones - then you can re-scale the confirmatory factor analysis results afterwards. Your software will produce that for you if you check the standardized estimates option. These are standardized estimates, but the scaling has been done after estimation: you first estimate an unstandardized confirmatory factor analysis where each factor is scaled by fixing the first indicator, then you scale the resulting solution. That's the same approach that you use for standardized regression coefficients: you first estimate the regression, then you scale the parameter estimates. To summarize identification of confirmatory factor analysis models: a model is identified if every latent variable has a scale, the degrees of freedom is non-negative, and every part of the model is identified.
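The after-estimation re-scaling mentioned above is a simple transformation. A sketch with made-up estimates, using the usual formula standardized loading = unstandardized loading × factor SD / indicator SD:

```python
import math

# Unstandardized solution with marker scaling (made-up numbers):
lam_unstd = 1.0   # first loading, fixed to 1
phi = 2.0         # estimated factor variance
theta = 0.5       # estimated error variance of the indicator

var_indicator = lam_unstd**2 * phi + theta  # model-implied indicator variance
lam_std = lam_unstd * math.sqrt(phi) / math.sqrt(var_indicator)

print(round(lam_std, 3))  # 0.894, i.e. sqrt(2 / 2.5)
```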
In confirmatory factor analysis, after we have established that every latent variable - every factor - has a scale, all factors with three indicators are always identified. So if you have three indicators, you can always run a factor analysis, no matter what. If you have only two indicators, then we can either say that both are equally reliable - so we fix both factor loadings to be one - or we can embed the factor in a larger system. Just two indicators alone can't identify a factor model unless we fix the factor loadings to be the same. If we embed the two-indicator factor into a larger factor analysis, then we can estimate it, because we can use information from the other indicators to estimate these factor loadings. Then there is the single-indicator rule: if we have a factor with just a single indicator, we cannot estimate the reliability of the indicator, because you cannot estimate reliability based on just one measure. We have to assume what the error variance is, and typically we do that by constraining the error variance to be zero. So we say that the factor or construct A is measured without any error, if we can't estimate it. Of course, we could constrain the error variance to be something else: if the indicator has typically been shown to be eighty percent reliable, then we can fix the error variance to be 20 percent of the observed variance of the indicator, but that's rarely done.
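The single-indicator constraint can be written as a one-line rule. Since reliability is the share of observed variance that is true-score variance, an assumed reliability pins down the error variance:

```python
# Single-indicator factor: reliability cannot be estimated, so it is assumed.
# For an assumed reliability rho, fix the error variance to (1 - rho) * var(x).
def fixed_error_variance(observed_variance, reliability):
    return (1.0 - reliability) * observed_variance

print(fixed_error_variance(4.0, 1.0))           # 0.0 -> "measured without error"
print(round(fixed_error_variance(4.0, 0.8), 3)) # 0.8 -> 80% reliable indicator
```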
So identification is a requirement for estimation: if our model is not identified, it cannot be meaningfully estimated. Identification basically means that you have enough information to estimate the model. If we have one correlation, we can't estimate two different things from it. You need at least one unit of information for everything that you estimate; ideally you have more information, so you have redundancy. We need to have a scale for all latent variables, and the degrees of freedom must be non-negative - ideally positive, and the more positive it is, the better our model tests are.