TU-L0022 - Statistical Research Methods D, Lecture, 2.11.2021-6.4.2022
Measurement validity and validation (20:12)
This video introduces some of the complexities of measurement validity and validation
Reliability and validity are two important characteristics of good measurement. Reliability is fairly straightforward to define and evaluate, because it is simply whether you get the same result over and over when you repeat the same measurement. You can then use the consistency of repeated measurements to calculate an estimate of reliability. So that is fairly straightforward. The issue of validity is much more complicated.
Validity refers to whether your indicators measure what they are supposed to measure. The problem is that because we cannot observe the thing being measured directly, we cannot really statistically assess whether the indicators correspond to the attribute, the trait, or the construct that we want to measure.
So validity and validation are complicated topics, and in this video I will introduce you to some of that complexity. One thing that makes the validity literature difficult for a person who is just starting to read it is that there are so many different terms. Measurement validity is whether an indicator measures what it is supposed to measure. That is fairly straightforward to define; what exactly it means is where the complications start. But then there are all these other terms: face validity, content validity, convergent validity, discriminant validity, nomological validity, and so on.
So there are many different terms. Do you have to understand all of them? Are they facets of validity that all have to apply? Are they different definitions? Are they contradictory? One way to start making sense of this literature is to understand that there is a difference between validity and validation. Validity refers to whether the indicator measures what it is supposed to measure. Validation refers to the different ways that we can argue or assess validity, and most of these terms are about validation. Denny Borsboom's article in Psychological Review notes that these terms originate from different ways of arguing validity: asking people whether they think that the measurement is valid is a way of validation that led to the term face validity, and checking whether the measure can predict something useful led to predictive validity. So the terms are about validation more than about validity, and these are two different things.
How we argue validity and how we define validity are two different things. If you just look at the definition of validity, things are much simpler, because you don't have to understand most of this terminology. But there are important terms that you need to understand because they are commonly used, and I will now explain three of them. They originate from the psychometric literature of the 1960s, or at least a 1960s psychometrics textbook is commonly cited as the source that made these terms popular.
So content validity, predictive validity, and construct validity: are they actually about validity or about validation? Are they competing concepts or complementary concepts? Do you have to demonstrate all of them in your study, or do you have to focus on one? Let's take a look at what these concepts actually mean, because they are different things. The idea of content validity is that the indicators in your scale cover all the different aspects or dimensions of the phenomenon. A typical example is a math exam: the exam has to cover all the content of the course. An elementary school math exam might need to cover subtraction, multiplication, division, and addition, so there are four different things; if your exam only covers subtraction, then it lacks content validity. So content validity is whether the indicators summarize the dimension or domain that the test or exam is supposed to summarize. It is mostly a concern in educational measurement, or wherever you have to summarize people's capabilities or skills in a certain domain with a single score.
Predictive validity is about prediction or forecasting. Forecasting means whether you can, based on your data, say something about the future. That is not measurement: prediction and measurement are two different things. A typical example is college entry exams. They are not designed to measure who is good at school or who is smart; they are designed to predict who is going to do well in college and who is going to graduate, because the college is not so much interested in admitting people who are smart or hard-working as in admitting people who are going to graduate.
Then we have construct validity. This is about construct measurement, but it is a special kind of validation technique. Construct validity is not the definition of measurement validity; instead, it is a validation technique, and why that is the case becomes clear on the next slide. The idea of construct validity is that there is a nomological network: a network of constructs and their theoretical relationships. For example, the example given by Borsboom and colleagues is that we have intelligence as our focal construct, general knowledge as another construct, and criminal behavior as a third construct. We have a strong hypothesis that intelligence is negatively associated with criminal behavior and positively associated with general knowledge.
The idea of construct validation is that we assess or measure intelligence, say with an IQ score, and we check whether the IQ score correlates positively with a general knowledge examination score and negatively with the length of a criminal record. So we have the theoretical world, the nomological network, and the empirical world, our measured correlations, and we check whether the measured correlations from our data match the theoretical expectations. Whatever our measure is, it is construct valid if the relationships between the measured scores correspond to the relationships that we theorize.
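This sign-matching check can be sketched as a small simulation. All effect sizes and variable names here are made up for illustration; the point is only the logic of comparing observed correlation signs against the nomological network.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000

# Latent construct: intelligence (standardized). Effect sizes are made up.
intelligence = rng.normal(size=n)

# Theoretical expectations: positive link to general knowledge,
# negative link to criminal record.
general_knowledge = 0.6 * intelligence + rng.normal(scale=0.8, size=n)
criminal_record = -0.3 * intelligence + rng.normal(scale=0.95, size=n)

# Observed measure: IQ score = construct + measurement error.
iq_score = intelligence + rng.normal(scale=0.5, size=n)

r_knowledge = np.corrcoef(iq_score, general_knowledge)[0, 1]
r_criminal = np.corrcoef(iq_score, criminal_record)[0, 1]

# Construct validation step: do the observed signs match the theory?
print(r_knowledge > 0, r_criminal < 0)
```

If either sign were wrong, we would have a reason to doubt either the measure or the theory, which is exactly the ambiguity discussed below.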
This is a somewhat useful way of assessing validity: if your scores don't behave as expected, that is a reason to either doubt the validity of the scores or doubt the correctness of your theory. But it is also very limited. Consider a very green field of study, where you are studying something that has not been theorized much before: where exactly would you get the nomological network? If you are the first person to introduce a new construct to your field, how exactly are you going to argue that the construct has established relationships with other constructs, when there is no existing research on that construct?
So basically, the idea of construct validity is whether the empirical correlations are good representations, or proxies, of the theoretical relationships. One important thing that construct validity and the other two commonly used validity terms fail to address is the relationship between your data and your theoretical concept. Content validity just addresses whether the data cover the content of the thing that you are studying: does your math test cover all the things that were taught during the course? Predictive validity asks whether the score predicts something. So those two are not about theoretical concepts at all; there are no theoretical concepts in their definitions.
Construct validity has the term construct in the name, and it does concern the theoretical concept. But it does not address whether the data correspond to the theoretical concept; it only addresses whether the relationships between the variables correspond to the relationships between the theoretical concepts. That is interesting, but it does not really address how the theoretical concepts are related to the data. That question is beyond these terms.
So how do we define validity? One good candidate definition is that a test is valid if the attribute being measured exists and variation in the attribute causes variation in the observed data. We assume that the construct exists independently of measurement; that is the realist perspective on measurement. Then we claim that the variation in the observed data is due to variation in the construct. Take the construct intelligence: some people are more intelligent than others, and there is variation in IQ scores. We say that the IQ score is a valid measure of intelligence if variation in intelligence causes variation in the scores. In other words, some people perform better in IQ tests because they are more intelligent, and some people perform worse because they are less intelligent. That is the idea: variation in the construct causes variation in the observed data, and the observed data is of course a function of the construct and some measurement error.
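The causal idea above can be sketched with a minimal simulation, assuming (purely for illustration) a linear model with made-up variances: the observed score X is the construct score T plus measurement error E, so variation in T produces variation in X.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 10_000

# Construct score T: some people are more intelligent than others.
T = rng.normal(loc=100, scale=15, size=n)

# Observed score: a function of the construct plus measurement error E.
E = rng.normal(scale=5, size=n)
X = T + E

# Because variation in T causes variation in X, the two correlate strongly.
r = np.corrcoef(T, X)[0, 1]
print(round(r, 2))
```

Note that the simulation only demonstrates the model's implication; in real validation, whether T really causes X is the substantive argument that has to be made.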
That is an easy definition. What is difficult is to argue that your scores actually are valid; validation is the hard part. Defining validity this way is very simple, but how exactly do you validate, and what do you have to write in your paper to convince your readers that your measures are valid? To understand that, let's compare the latent variable model of validity with construct validity.
The construct validity perspective is more about epistemology: what can we learn from the correlations in our data? Can we use the correlations in our data to learn something about the constructs? That is a useful way of validation, but it does not really address whether the test is valid. The latent variable theory presented on the previous slide is about ontology: does the attribute exist, and does variation in that attribute produce variation in a test score? So these perspectives differ; the focus is slightly different.
In construct validity, the conceptual focus is on the correlations: what do the correlations mean, and can we generalize from an observed correlation to a theoretical correlation? In the latent variable model, the focus is on reference: do the indicators, the variables, actually refer to a real entity? We have to argue that. The empirical focus of construct validity is on correlations: we check the correlations in our data, and if those correlations match the theoretical expectations, we conclude that the test is valid.
In latent variable theory, we have to argue causation. Validation here is not a methodological problem but a substantive problem: we have to really argue why we think that our IQ test or innovation score varies because the construct being measured varies. Ideally we have to explain the mechanism of variation: how exactly does a person's intelligence, for example, influence how they do in an IQ test? This is, of course, a much more challenging task, and it places more emphasis on validation studies and on the theoretical part of the validation study, whereas construct validation is simply about calculating correlations and seeing whether they match the theoretical expectations. Both are useful, because if your measures don't behave as expected, that is a reason to suspect that the measures may not be valid. But ultimately that is not sufficient to claim validity; you have to look at the causal process.
We can also take a look at how latent variable theory differs from classical test theory, which gives us the definition of reliability. Classical test theory is a psychometric model, not a measurement theory, so its scope is much narrower: it is a model that describes how people respond to surveys or to different psychological tests. Latent variable theory is a measurement theory, and it takes the realist ontology. Classical test theory does not really say anything about ontology, so it does not say whether the scores measure anything; it only gives us reliability and a true score. Latent variable theory, in contrast, is focused on validity and construct measurement. The equations for these two models can look similar. Classical test theory is explicitly defined as an equation: the observed score is a deterministic linear combination of the true score plus some random noise.
In latent variable theory, the claim is more general: it just says that variation in the construct scores causes variation in the observed scores. There is therefore some kind of statistical association between the construct and the measure, but it is not necessarily linear. So we can model other kinds of relationships, and this view takes the statistical model simply as an approximation of the causal relationship.
Then there is the question of how the true score or construct influences the different indicators. In classical test theory, we take it as an assumption that the true score influences all indicators equally: if we eliminated all random noise in the data, all the indicators would be exactly the same, because they share the same true score. This is called the tau-equivalence assumption; tau is the Greek letter used for the true score.
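Tau-equivalence is also the assumption behind Cronbach's alpha, a common reliability estimate. As a sketch with made-up variances: three indicators each equal the same true score plus independent unit-variance noise, which gives a population alpha of 0.75 for these particular numbers.

```python
import numpy as np

rng = np.random.default_rng(2)
n, k = 20_000, 3

# Tau-equivalence: every indicator depends on the true score equally.
tau = rng.normal(size=n)
items = np.column_stack(
    [tau + rng.normal(scale=1.0, size=n) for _ in range(k)]
)

# Cronbach's alpha: k/(k-1) * (1 - sum of item variances / total variance).
item_vars = items.var(axis=0, ddof=1).sum()
total_var = items.sum(axis=1).var(ddof=1)
alpha = k / (k - 1) * (1 - item_vars / total_var)

print(round(alpha, 2))
```

Note what alpha does not tell us: it quantifies consistency across the items, but says nothing about whether tau is the construct we intended to measure.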
In latent variable theory, we just say that the various indicators depend on the variation of the construct, but we do not make any explicit claims about how that dependency manifests statistically. Different indicators might depend differently on the construct; some may be more sensitive to certain levels of the construct than others. This allows all kinds of statistical models; in particular, IRT or item response theory models are based on this kind of thinking.
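As an illustration of indicators depending differently on the construct, here is a sketch of the standard two-parameter logistic (2PL) IRT item response function; the two items and their parameters are hypothetical.

```python
import numpy as np

def p_correct(theta, a, b):
    """Two-parameter logistic IRT model: probability of a correct
    response given ability theta, discrimination a, and difficulty b."""
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))

theta = np.linspace(-3, 3, 7)  # a grid of ability levels

# Two hypothetical items: an easy, weakly discriminating item and a
# hard, strongly discriminating one (parameters are made up).
easy = p_correct(theta, a=0.8, b=-1.0)
hard = p_correct(theta, a=2.0, b=1.0)

# Both items depend on the construct, but with different sensitivity
# at different ability levels.
print(np.round(easy, 2))
print(np.round(hard, 2))
```

Each item's response probability rises with ability, but the "hard" item is nearly uninformative at low ability and very sensitive near theta = 1, which classical test theory's equal-influence assumption cannot express.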
Measurement error also differs between these models. In classical test theory, measurement error is simply random noise in individual items. In latent variable theory, we can have all kinds of sources of measurement error, but the key thing we have to argue is that the construct actually is a cause of the indicators, or that the variance of the construct is a cause of the variance of the indicators. That is much more challenging to do than simply assessing reliability.
Here is one very simple way that we can use the latent variable model to assess reliability and validity. If we take the assumption that linear statistical associations are useful for assessing causal relationships, then we could say that the observed score is a function of the construct score (we use T here) plus some systematic measurement error plus some random noise. So there are different causal influences on the observed score, and we estimate the construct score with that kind of model.
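A small sketch of that decomposition, with made-up effect sizes: two indicators share both the construct T and a systematic error S (for example, a common method effect). The shared systematic error inflates the correlation between the indicators, so they look reliable, even though each indicator's correlation with the construct itself is weaker.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 20_000

T = rng.normal(size=n)  # construct score
S = rng.normal(size=n)  # systematic error shared by both indicators

# Observed score = construct + systematic error + random noise.
x1 = T + 0.7 * S + rng.normal(scale=0.5, size=n)
x2 = T + 0.7 * S + rng.normal(scale=0.5, size=n)

r_items = np.corrcoef(x1, x2)[0, 1]  # inter-item correlation: reliability
r_valid = np.corrcoef(x1, T)[0, 1]   # correlation with the construct

# High inter-item correlation, yet weaker correspondence with T.
print(round(r_items, 2), round(r_valid, 2))
```

This is the point made next: from the data alone, consistency among indicators cannot separate a shared construct from a shared systematic error.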
So random noise relates to reliability, and systematic error relates to validity. The problem, of course, is that if an indicator has unique variance, it may be difficult to know whether that variance is random noise, a reliability problem, or unique systematic error, a validity problem. So oftentimes you cannot really say which one it is.

Then a summary of all this. We do not really have any proof of measurement validity, so validation is more of a substantive argument than a statistical argument. Nevertheless, we can say that if two or more indicators are highly correlated, they may be measuring the same thing; we just do not know what that thing is, and we have to argue, based on theory, that the construct actually causes a certain kind of behavior in people. That is how we argue validity. It is also possible that the indicators correlate for some other reason. And if a measure behaves as expected with respect to other measures, it may be valid; that is the construct validity way of validating things, and it is a useful technique, but you should not rely on it as your only technique. Typically, with latent variable theory, you work with models in which you specify one latent variable as the source of variation of multiple indicators. This is called the common factor model; it is a factor analysis model, and it is commonly used with this kind of validity framework.
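The common factor model can be sketched in a few lines: with standardized indicators and made-up loadings, the model implies that the correlation between any two indicators equals the product of their loadings, which a simulation reproduces.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 100_000

# One latent variable as the common source of variation in four indicators.
factor = rng.normal(size=n)
loadings = np.array([0.9, 0.8, 0.7, 0.6])  # made-up standardized loadings
noise = rng.normal(size=(n, 4)) * np.sqrt(1 - loadings**2)
X = factor[:, None] * loadings + noise      # common factor model

# Model-implied structure: corr(x_i, x_j) = loading_i * loading_j (i != j).
R = np.corrcoef(X, rowvar=False)
implied = np.outer(loadings, loadings)

print(round(R[0, 1], 2), round(implied[0, 1], 2))
```

Fitting a factor model checks this implied correlation structure against the data; as argued above, it cannot by itself tell us what the common factor is.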
This is a very complicated topic. If you want to study validity further, I can recommend some good books. I like the writings of Denny Borsboom. He has written a book called Measuring the Mind, which is an introductory-level book; you can read it after reading, for example, DeVellis's Scale Development, which gives you an overview. Once you have read that, you can look at more challenging texts such as Frontiers of Test Validity Theory by Keith Markus and Denny Borsboom, which summarizes a broad range of the validity literature in fairly condensed form. It is probably not the best first book, but it is a really great overview of test validity theory.