TU-L0022 - Statistical Research Methods D, Lecture, 2.11.2021-6.4.2022
Introduction to factor analysis (17:20)
This video explains factor analysis as a very useful tool for validating measurement.
Factor analysis is a very useful tool for validating measurement. The
idea of factor analysis is that it takes in multiple indicators and
answers the question: what do the indicators have in common? So it
tries to extract or identify underlying dimensions from your data. The
reason why we use factor analysis for measurement is that before we
apply any reliability statistics we have to study whether the
indicators are unidimensional. If so, then we use a unidimensional
reliability index; if not, then we calculate the reliability statistic
based on the factor analysis.
Factor analysis can also be used to assess the hypothesis that the
indicators are consequences of a common cause. In that way we can try
to use factor analysis to justify causal claims where we say that the
construct causes multiple items. There are two main variants of factor
analysis techniques: exploratory factor analysis and confirmatory
factor analysis.
Exploratory factor analysis is an exploratory process where you give
the computer your dataset and ask it to give you two factors, three
factors, or however many factors you want to have from the data, and
then the computer will identify the factors. In confirmatory factor
analysis you specify the factor structure yourself. So you say, for
example, that the first three indicators measure one thing, which is
one factor, the second three measure another thing, which is a second
factor, and the remaining four indicators measure a third thing, which
makes up the third factor. Then the computer will estimate the model
for you and tell you whether that model is plausible for the data.
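The difference between the two variants can be sketched as a pattern of which loadings the computer is allowed to estimate. This is only an illustration: the indicator names x1 to x10 and the factor names F1 to F3 are hypothetical, and the sketch assumes the usual convention that a confirmatory model fixes every loading you did not specify to zero.

```python
# Hypothetical indicator names; the lecture only speaks of ten indicators.
indicators = [f"x{i}" for i in range(1, 11)]

# Confirmatory: the researcher assigns each indicator to a factor in advance.
cfa_spec = {
    "F1": indicators[0:3],   # first three indicators
    "F2": indicators[3:6],   # second three indicators
    "F3": indicators[6:10],  # remaining four indicators
}

def loading_pattern(spec, indicators):
    """Return one row per indicator: 1 = loading estimated, 0 = fixed to zero."""
    factors = list(spec)
    return [[1 if name in spec[f] else 0 for f in factors] for name in indicators]

cfa_pattern = loading_pattern(cfa_spec, indicators)

# Exploratory: only the number of factors is given, so every loading is estimated.
n_factors = 3
efa_pattern = [[1] * n_factors for _ in indicators]

for name, row in zip(indicators, cfa_pattern):
    print(name, row)
```

In the confirmatory pattern each indicator has exactly one estimated loading, while the exploratory pattern leaves all thirty loadings free.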
Exploratory factor analysis is easy to apply because you don't have to
specify the structure yourself: you just specify the number of factors
and which variables you use. For that reason many people get started
with exploratory factor analysis, and if you do data exploration or
some initial analysis, then exploratory factor analysis is quicker to
do for your data. So exploratory analysis is the one that is typically
covered first, followed by confirmatory factor analysis.
I will now demonstrate factor analysis using the exploratory approach.
To do that we need some data, and our data are from the Olympic
decathlon. So we have the ten sports that the athletes do: 100 meter
run, long jump, shot put, high jump, 400 meter run, 110 meter hurdles,
discus throw, pole vault, javelin throw and 1500 meter run. So there
are ten different sports that you do in this competition, and you are
scored based on your performance in all of them. The overall ranking
is determined by this score, so you have to be a very good overall
athlete to be able to do decathlon. The data look like this; these are
the first 15 observations. The 100 meters is in seconds, long jump in
meters, shot put in meters, high jump in meters, 400 meter run in
seconds, 110 meter hurdles in seconds, discus throw is how far in
meters you threw it, pole vault is how high in meters, javelin is how
many meters you threw the javelin, and then how many seconds the one
and a half kilometer run took.
So what kind of dimensions do these data have? That's what factor
analysis will tell us. We'll first do a factor analysis and request
two factors, just to get started with something. So that's the
two-factor solution, and before I explain the factors it's important
to understand what these numbers tell us. Let's start with uniqueness
and communality.
So uniqueness and communality sum to 100 percent, or 1. Communality
tells how much of the variation of this particular indicator the two
factors explain. So for example for shot put the factors explain
ninety-four and a half percent of the variation and only the small
remainder is unexplained. The uniqueness is how much of the indicator
remains unexplained by the factors.
Ideally, if the factor model is correctly specified, so that the
factors perfectly match your theoretical constructs and there are no
systematic measurement errors in the indicators, then this uniqueness
quantifies the amount of random noise in the indicators. That's the
ideal case. Whether that applies in any real case is another question.
So the communality is a kind of measure of reliability and the
uniqueness is an estimate of unreliability. So that's one way to read
these numbers. Then we have two factors.
We have MR1 and MR2. The MR simply comes from the fact that we
estimated the model with the minres technique; you don't have to care
about what that means. So we have a first factor and a second factor.
These numbers are called factor loadings, and they are in correlation
metric here. So the idea here is that the first indicator correlates
at minus 0.71 with the first factor and minus 0.22 with the second
factor. So the first variable is very strongly associated with the
first factor and more weakly associated with the second factor.
Let's take a look at the first factor now. We first notice that some
of the indicators have negative factor loadings, and we have to
understand why that is the case. If we look at the items that have
negative loadings, we have the 100 meter run, the 400 meter run, the
110 meter hurdles and the 1500 meter run. So all these are running
sports, and what they have in common is that more time means that
you're worse and less time means you're better. With all the others
you are throwing something or you're jumping, and more is better. So
in the running sports less time is better; in the others more distance
or more height is better.
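The loadings and the communality discussed above are connected, and the link can be sketched with the two loadings read out for the first indicator. The sketch assumes a standardized indicator and uncorrelated factors, which the lecture does not state explicitly; under that assumption the communality is the sum of the squared loadings, so the sign of a loading does not affect it.

```python
# Loadings of the first indicator (100 meter run) on the two factors,
# as read out in the lecture.
loadings_100m = [-0.71, -0.22]

# Assumption: standardized indicator and uncorrelated factors, so the
# explained share of variance is the sum of squared loadings.
communality = sum(l ** 2 for l in loadings_100m)
uniqueness = 1.0 - communality

print(f"communality = {communality:.4f}")
print(f"uniqueness  = {uniqueness:.4f}")
```

Under these assumptions the two factors would explain a little over half of the variation in the 100 meter times, and the rest would be uniqueness.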
To make the results a bit more understandable I will therefore now
reverse score the times, so that for all variables more of the
variable indicates that the athlete performs better. So I will reverse
the signs of all these running sports, and then we have this kind of
factor analysis result. We can see that every indicator here loads
positively on the first factor, and the magnitudes of the factor
loadings differ. So how would we interpret the first factor? All
indicators are positively associated with something. What's the thing?
We have to interpret what the underlying dimension is that influences
these indicators according to these results. If everything correlates
positively with the first factor, then the first factor basically is
how good the guy is, so how good an athlete the person is. If you are
a good athlete then you perform better in all of these sports. Good
athletes are expected to perform better than bad athletes, and
therefore all the items are positively correlated.
On the second factor we can see that shot put and javelin are
positively associated. The 1500 meters is negatively associated, as
are all the other running sports. So the second factor quantifies
whether the person is better at sports that require strength or at
sports that require running speed. There is a trade-off: if you are a
very bulky guy, you're good in these strength sports, but you have
more mass and therefore you're not that great in the running sports.
The second factor quantifies that trade-off. So we have a factor for
how good the guy is and a factor for whether the guy is better at
running or strength sports.
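The reverse scoring used above, flipping the sign of the running times, can be sketched on a toy example. The five results below are hypothetical and are not the lecture's data; the point is only that negating a variable flips the sign of its correlations and leaves their magnitude unchanged.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation of two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

# Hypothetical results for five athletes (not the lecture's data).
time_100m = [10.6, 10.9, 11.0, 11.2, 11.5]  # seconds, lower is better
long_jump = [7.8, 7.5, 7.6, 7.2, 7.0]       # meters, higher is better

# Reverse score the time by flipping its sign, so that higher is better.
speed_100m = [-t for t in time_100m]

r_raw = pearson(time_100m, long_jump)
r_reversed = pearson(speed_100m, long_jump)
print(f"raw time vs jump:      {r_raw:.2f}")
print(f"reversed time vs jump: {r_reversed:.2f}")
```

Because factor analysis works from the correlations, this sign flip changes the signs of the loadings of the reversed variables but not how much of their variation the factors explain.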
We would ideally like to think that there are two dimensions to this
data: how good the guy is in running and how good the guy is in the
sports that require strength. But this factor analysis solution
doesn't answer that question. To answer that question we do something
called factor rotation. Factor rotation is a technique that reorients
the factor solution so that it's simpler to interpret.
Typically when you apply a factor analysis and you have two correlated
dimensions, the first factor will capture a little bit of both
dimensions. Here we have running speed and strength both captured by
the factor for how good the guy is, and the second factor then
captures whether the guy is better at running or at strength sports.
When we reorient the factor analysis using factor rotation, the
factors will typically correspond better to actual dimensions in the
data. So here after rotation we have the first factor strongly
associated with all the running sports: we have 0.84 here, 0.7, 0.6
and so on. And the second factor is strongly associated with sports
that require strength, like the discus and the shot put. We can see
that a bit better by reordering these indicators. So we reorder based
on the first factor and we can see that the running sports have the
five largest loadings. Then we have the pole vault, and then we have
the strength sports here: the shot put, javelin and discus throw.
The first factor now has a clear interpretation. It is related to
running, so that's the running skills, or how good a runner you are.
The second factor also has a clear interpretation: it's related to
these strength sports, so it's upper body strength. The pole vault
requires both, so it loads on both. This is called a cross-loading
because it loads on two factors. First you have to run, then you put
the pole into the hole, and then you have to use the upper body to use
the pole and get as high as possible. So pole vault requires both
skills. We can also see here that high jump has a high uniqueness.
It's not really related to upper body strength at all, and it's not
really related to running speed, because you don't have to run fast;
you just pace your run and then you jump up. So jumping up is
different from running fast. In long jump, the better you are at
running, the faster you can get yourself going and the further you
will fly when you jump. So that requires running. In this way we can
give meaning to these factors.
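What rotation does to a two-factor solution can be sketched as turning the pair of loadings for every indicator by the same angle. The loadings below are hypothetical and only mimic the pattern described above (a general factor plus a running versus strength contrast). The lecture does not say which rotation criterion was used; this sketch assumes an orthogonal rotation and picks the angle with a crude grid search over the raw varimax criterion, which is a simplification of what statistical software does.

```python
from math import cos, sin, radians

# Hypothetical unrotated loadings (not the lecture's numbers): every event
# loads positively on factor 1, factor 2 contrasts running with strength.
unrotated = {
    "100m":     (0.60, -0.55),
    "400m":     (0.55, -0.50),
    "hurdles":  (0.60, -0.45),
    "shot put": (0.60, 0.60),
    "discus":   (0.55, 0.55),
    "javelin":  (0.45, 0.40),
}

def rotate(loadings, degrees):
    """Orthogonally rotate two-factor loadings by the given angle."""
    c, s = cos(radians(degrees)), sin(radians(degrees))
    return {k: (a * c - b * s, a * s + b * c) for k, (a, b) in loadings.items()}

def varimax_criterion(loadings):
    """Sum over factors of the variance of the squared loadings (raw varimax)."""
    total = 0.0
    for j in range(2):
        sq = [pair[j] ** 2 for pair in loadings.values()]
        mean = sum(sq) / len(sq)
        total += sum((v - mean) ** 2 for v in sq) / len(sq)
    return total

# Crude grid search over whole degrees instead of an iterative algorithm.
best_angle = max(range(0, 90), key=lambda d: varimax_criterion(rotate(unrotated, d)))
rotated = rotate(unrotated, best_angle)

for name, (f1, f2) in rotated.items():
    print(f"{name:9s} {f1:6.2f} {f2:6.2f}")
```

After the rotation the running events load mainly on the first factor and the throwing events mainly on the second, while each indicator's communality (the sum of its squared loadings) is the same as before.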
So that was a two-factor solution. We can of course get more than two
factors. There's quite a lot of unexplained variation here; for high
jump, 90 percent of the variation is unexplained by these two factors.
So we can try extracting more factors. Whether it makes sense to do so
depends more on your theoretical expectation and on whether you can
actually interpret the factors than on the statistical question of
whether we can explain more variation between the indicators.
There are statistical techniques to decide the number of factors, but
it is a theoretical concern, and it's about whether you can still
interpret the result. Let's try three factors and see what happens. So
that's the rotated solution, and I have ordered the variables again
according to the first factor loading and then the second factor
loading. So we have three factors now. The first factor is the same
running speed, and the second factor is the same upper body strength,
so we have the strength sports here. Then we have a third factor that
has the 1500 meter run, the 400 meter run and the long jump, and not
much else. So it's not about running speed as much as it's about
running stamina, which is slightly different. The first factor is
whether you're good at running short distances: that's explosive
running speed, how fast you accelerate, things like that. The third is
whether you can keep up the running. And the upper body strength is
the same. So we can divide running further into two subdimensions.
Whether it makes sense to do so is another question. In this case
probably not. Probably it's better to just say that some people are
better at strength sports and some people are better at running
sports.
We can also get four factors. We get the same factors, running speed,
upper body strength and running stamina, and then the final factor is
simply high jump. So that receives its own factor and nothing else
loads on the high jump factor. So when we start extracting factors,
typically we can go on and get as many factors as we have indicators,
and eventually we will get these factors that just explain a single
indicator and nothing more.
So the idea of a factor is to find an underlying dimension in the
data, and once we start to get these factors that just tell how good
the guy is in high jump, then it's not really a factor anymore in the
sense of an underlying dimension. So with this data, three factors
could be a good solution if we're really interested in the difference
between running stamina and running speed, or we could just take the
two-factor solution, which measures the running skills and the
strength of the athlete. So it's something you have to argue for. The
choice of factors depends on what your research question is and what
kind of abstraction you want to have for your data.
In practice, when we apply factor analysis to measurement scales, for
example surveys, and we want to measure five different things with the
survey, then we set the number of factors to five, because we want to
get five things from the data. Ideally the factor analysis
demonstrates that the indicators correspond to the theoretical
constructs that they're supposed to measure.
Factor analysis is based on correlations, so it's useful to understand
the relation between the correlation matrix and factor analysis. For
the model-implied correlations the same principle applies here as in a
regression model; I'll cover that a bit later. But here we can see
that factor analysis groups the indicators based on the correlations.
So we have here first the running speed factor. All the running sports
are highly correlated, so they are reflections of one underlying
running speed factor. Then we have the others. We have the upper body
strength: those sports that require upper body strength are highly
correlated. Then we have the running stamina factor. Some of the
running sports require both endurance and speed, and the 1500 meter
run requires endurance more than speed. And then we have high jump,
which is not loading on any factor because it is nearly uncorrelated
with any other sport. High jump is a unique sport in that it doesn't
really require strength and it doesn't require speed. It requires the
capability to just jump very high.
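The link between loadings and correlations mentioned above can be sketched as follows. The loadings are hypothetical, not the lecture's estimates, and the sketch assumes uncorrelated factors and standardized indicators; under those assumptions the correlation the model implies between two indicators is the sum, over factors, of the products of their loadings.

```python
# Hypothetical rotated loadings on (running speed, upper body strength);
# illustrative values only, not the lecture's estimates.
loadings = {
    "100m":      (0.85, 0.10),
    "hurdles":   (0.70, 0.20),
    "shot put":  (0.10, 0.90),
    "discus":    (0.05, 0.80),
    "high jump": (0.15, 0.10),
}

def implied_correlation(a, b):
    """Model-implied correlation, assuming uncorrelated factors."""
    if a == b:
        return 1.0  # communality plus uniqueness of a standardized indicator
    return sum(x * y for x, y in zip(loadings[a], loadings[b]))

names = list(loadings)
for a in names:
    print(f"{a:10s}", " ".join(f"{implied_correlation(a, b):5.2f}" for b in names))
```

Indicators that load on the same factor get a high implied correlation, indicators on different factors get a low one, and an indicator with small loadings everywhere, like high jump here, is implied to be nearly uncorrelated with the rest.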