TU-L0022 - Statistical Research Methods D, Lecture, 2.11.2021-6.4.2022
Exploratory factor analysis example (12:01)
Example of exploratory factor analysis workflow based on the article by
Mesquita and Lazzarini, 2008. Explains factor rotation, factor loadings,
uniqueness, and Harman single-factor test.
Let's
take a look at an empirical example of exploratory factor analysis. To
do that we need some data, and our data comes from the research paper by
Mesquita and Lazzarini from 2008. This is an interesting paper because
the authors present the full correlation matrix of all the indicators in
the paper. That means that we can replicate everything the authors do using the correlation matrix, and we also get the same results for all the analyses. So this is a completely transparent paper that we can
replicate ourselves.
This article uses confirmatory factor
analysis and structural regression models but we can equally well do an
exploratory factor analysis to see if we get the same result as the
authors did.
So this is the data set that we have. It is Table 1, descriptive statistics and correlations, except that instead of being at the scale level, it is at the indicator level. We will be using all
questions that are measured on the one-to-seven scale to eliminate any
scaling issues from the data. So we have five scales, these five here.
And the indicators are: three indicators for horizontal governance,
three indicators of vertical governance, three indicators of collective
sourcing, two indicators for export orientation, and three indicators
for investment. Whether these indicators measure what the authors claim
they do measure is a question that we will not address in this video.
We'll just take a look at whether, for example, these export orientation indicators can be argued to measure something together that is distinct
from the other indicators. So we have 14 variables and we want to assess
whether they measure five distinct things.
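Since the full correlation matrix is published, the whole analysis can be set up in R directly from that table. Here is a minimal setup sketch; the object name R_mat, the indicator labels, and n_obs are my own placeholders, and the actual values would be copied from Table 1 of the article.

    # Setup sketch: enter the published indicator correlation matrix by hand.
    # The indicator labels and object names are assumptions; the correlation
    # values themselves would come from Table 1 of the article.
    items <- c("hgov1", "hgov2", "hgov3",   # horizontal governance
               "vgov1", "vgov2", "vgov3",   # vertical governance
               "csrc1", "csrc2", "csrc3",   # collective sourcing
               "expo1", "expo2",            # export orientation
               "inv1",  "inv2",  "inv3")    # investment
    # R_mat <- matrix(c( ... values from Table 1 ... ), nrow = 14, byrow = TRUE,
    #                 dimnames = list(items, items))
    # n_obs <- ...   # sample size reported in the article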
In an exploratory factor analysis, when we start the analysis we have to define how many factors we extract. One way to make that decision is to use a tool called a scree plot. The idea of a scree plot is that we extract components from the data and then plot the eigenvalues, which quantify how much of the variance each component explains.
There are some rules of thumb for choosing the number of factors. One is to choose the number of factors based on a clear pivot point: the point where the curve starts to go flat indicates the number of factors that we should extract, which here would be 5 factors. Another rule of thumb is to keep extracting factors as long as the eigenvalues are greater than 1, which would give 4 factors. But here we know that this set of indicators is supposed to measure 5 distinct things, so we can use the best rule of thumb, which is our theory, and theory states that we take 5 factors because we have 5 different things that we want to measure.
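As a sketch of how the scree plot could be produced in R with the psych package (assuming the correlation matrix R_mat from the setup sketch above):

    # Scree plot: eigenvalues for factors and principal components computed
    # from the assumed correlation matrix R_mat.
    library(psych)
    scree(R_mat, factors = TRUE, pc = TRUE)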
So we apply factor analysis. We request 5
factors using these 14 indicators. We get the result printout from R.
So what does the printout tell us? There are three different sections. The first section is the factor loadings. These statistics tell how strongly the indicators are related to each factor and how much uniqueness there is in the indicators that the factors don't explain. The second section is the variance explained, that is, how much of the variation each factor explains, and finally in the bottom section we have different model quality indices. I
don't typically myself interpret these model quality indices because if I
want to really know if the model fits the data well or not, I will do
it with confirmatory factor analysis-based techniques, which have a lot more diagnostic options available. So in practice we interpret the factor loading pattern: how strong the individual loadings are and how much variation the factors explain. If you want to do more diagnostics,
then it's better to move into the confirmatory factor analysis family
of techniques.
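The exact R command is not shown in the video, but one plausible way to request the five-factor solution with the psych package is sketched below, again assuming the R_mat and n_obs placeholders from above.

    # Unrotated five-factor solution (a sketch, not necessarily the exact call
    # used in the lecture); fm = "ml" requests maximum likelihood estimation.
    library(psych)
    efa_unrotated <- fa(R_mat, nfactors = 5, n.obs = n_obs,
                        rotate = "none", fm = "ml")
    print(efa_unrotated)   # factor loadings, h2/u2, variance explained, fit indices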
So the factor loadings provide us information about how strongly each indicator is related to each factor. The factor loadings are regressions of items on factors, so each loading is a directional regression path. Because this is a standardized factor analysis solution and the factors are uncorrelated in this solution, which they are by default, the loadings are also equal to correlations. So this last item correlates at 0.75 with the second factor.
Then we also have the communality, h-squared, which tells how much of the variance of the indicator all the factors explain together, and the uniqueness, which tells how much of the variance of the indicator remains unexplained. Sometimes the uniqueness is interpreted as evidence of, or a measure of, unreliability. So if the uniqueness is 30%, we say that the indicator's error variance is 30% and 70% is the reliable variance. The
problem with that approach is that the uniqueness also captures other sources of unique variation that are not random noise. For example, there's probably something unique in the total quality management item that is not related to the other investment items and that would be reproduced reliably if we asked the same question again. So factor analysis puts the unreliable variance (the random error) and the reliable unique variance into one and the same number, and there is really no way of taking them apart. That's one weakness of factor analysis.
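To make the relationship concrete, with uncorrelated factors the communality is just the sum of squared loadings for each indicator, and the uniqueness is the remainder. A small sketch using the assumed efa_unrotated object from above:

    # Communality (h2) = sum of squared loadings per indicator when the
    # factors are uncorrelated; uniqueness (u2) = 1 - h2.
    L <- unclass(efa_unrotated$loadings)   # 14 x 5 loading matrix
    h2 <- rowSums(L^2)
    u2 <- 1 - h2
    round(cbind(h2, u2), 2)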
The variance explained
here shows that the first factor explains most of the variation but this
is an unrotated solution, so we don't really pay much attention to
this, except for one thing. We can do Harman's single-factor test, which you sometimes see reported in papers. Harman's test involves checking whether the first factor explains the majority of the variance in the data and whether it dominates over the other factors. We can see here that the first factor explains 25 percent and the second factor 16 percent. We can't say that the first factor explains most of the variance, or that it dominates over the other factors, because 25 and 16 percent are still in the same ballpark.
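A rough way to code this kind of single-factor check (my own sketch; the video simply reads the percentages off the unrotated printout) is to fit a one-factor model and see how much of the total variance that single factor explains:

    # Harman-style single-factor check: fit one factor and inspect its share
    # of the variance (the "Proportion Var" line of the printout).
    harman <- fa(R_mat, nfactors = 1, n.obs = n_obs, rotate = "none", fm = "ml")
    print(harman)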
Harman's single-factor test is a bit misleadingly named because it's not really a statistical test, and it's not even a very good diagnostic because it will probably detect only very severe method variance problems. Nevertheless, it's something that you can easily check from the results of an exploratory factor analysis. If you want to do more rigorous tests of method variance, then you can apply confirmatory factor analysis-based techniques that give you much more freedom in what you can do.
Let's take a look at the factor loadings. The idea of factor loadings is that they should show a pattern. We should see that the first three indicators, which are supposed to measure one thing, load on one factor and one factor only, and that the measures of the other constructs do not load on that factor. That is not the case here, and the reason is that this is an unrotated factor solution. Typically in a factor analysis, when we extract the factors, we take first the factor that explains the most variance in the data, and if the constructs that cause the data are correlated, then that first factor contains a little bit of every construct. So all indicators load on it highly, and we can't really interpret it.
So we do a factor rotation, and factor rotation simplifies the factor analysis results. It also has another nice feature: factor rotation can relax the constraint that all the factors are uncorrelated when we do the factor analysis. The zero-correlation constraint exists for a technical reason, and it doesn't make any theoretical sense if we are studying constructs that we think are related. If we think that the constructs are related, causally or otherwise, we cannot assume that they are uncorrelated. Therefore, imposing a constraint that two factors that are supposed to represent those constructs are uncorrelated doesn't make any sense. That's another reason why we rotate the factors, which relaxes that constraint.
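As a sketch, an oblique rotation in R (oblimin here, which requires the GPArotation package; the video does not say which rotation was used) both simplifies the pattern and estimates the factor correlations:

    # Oblique rotation: simplifies the loading pattern and relaxes the
    # zero-correlation constraint between factors.
    efa_rotated <- fa(R_mat, nfactors = 5, n.obs = n_obs,
                      rotate = "oblimin", fm = "ml")
    efa_rotated$Phi   # estimated factor correlations after oblique rotation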
The factor rotation simplifies the result, and after rotation we can see that the first three indicators go to one factor and the second three to another factor. So we have a nice pattern in which each group of indicators loads on one factor only and there are no cross-loadings. This would be evidence that these indicators, for example these three indicators, measure the same thing together, and that it is distinct from what the other indicators may measure. You want to have this kind of pattern, and it is an indication of validity. Of course, it doesn't guarantee validity, because it doesn't tell us what these indicators have in common, but it is some kind of indirect evidence that there could be one construct driving the correlations between these indicators.
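One convenient way to inspect this pattern in R is to suppress small loadings and sort the indicators by factor (the 0.3 cutoff below is only for display purposes):

    # Display the rotated loading pattern: hide loadings below 0.3 and sort
    # indicators by the factor they load on.
    print(efa_rotated$loadings, cutoff = 0.3, sort = TRUE)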
Another thing that we look at from these factor
loadings is their magnitude. So that's what we do when we assess the
results. And this is an example from Yli-Renko's article. They have a
table of factor loadings. They have the measurement items. They have
labeled the factors. So usually you label the factors with the
constructs' names and then you look at the loadings. The factor loadings here are interpreted as evidence of reliability: the square of a factor loading is an estimate of the reliability of the indicator. Then we also have a z-statistic that is used for testing the significance of each loading, that is, whether the loading is zero or not. I don't think the null hypothesis that a loading is zero is very relevant. You really want to know whether the indicators are reliable enough, not whether their reliability differs from zero. So this is not a very useful test, but people still sometimes present it. The first indicator
here is not tested. The reason is that this is from a confirmatory factor analysis, and there's a technical reason why the first indicator is not tested here. I'll explain that in another video.
Then the authors say that the standardized loadings are all above 0.57 and that the cutoff is 0.4. The commonly used cutoff is 0.7, but you can probably find somebody who has presented a lower cutoff if you do that kind of cherry-picking. Normally we want the loadings to be at least 0.7, but reliability, again, is a matter of degree, not a matter of yes or no, and you then have to assess what the unreliability means for your study results.
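To see what these cutoffs imply, squaring the standardized loading gives the rough reliability estimate mentioned above (the numbers below are just the cutoff values discussed, not values from the article):

    # Squared standardized loadings as rough indicator reliability estimates.
    loading <- c(0.4, 0.57, 0.7)
    round(loading^2, 2)   # 0.16, 0.32, 0.49: a 0.7 loading implies roughly
                          # half of the indicator variance is reliable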