TU-L0022_aalto-CUR-141790-3063741: Measurement validity and validation (20:12)



Measurement validity and validation (20:12)


This video introduces some of the complexities of measurement validity and validation


Reliability and validity are two important characteristics of good measurement. Reliability is fairly straightforward to evaluate and fairly straightforward to define because it is simply whether you get the same result over and over if you repeat the same measurement. Then you can use that consistency with repeated measures to calculate an estimate of reliability. So that is fairly straightforward. The issue of validity is much more complicated.
Validity refers to whether your indicators measure what they're supposed to measure. The problem is that because we cannot observe the thing being measured directly, we cannot really assess statistically whether the indicators correspond to the attribute, trait, or construct that we want to measure.
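The reliability idea described above (repeat the same measurement and see how consistent the results are) can be sketched with a small simulation. This is only an illustration under assumed values: the sample size, the standard deviations, and the function names are hypothetical and do not come from the lecture.

```python
import random
from math import sqrt

def pearson(x, y):
    """Pearson correlation of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def simulate_test_retest(n=5000, true_sd=1.0, error_sd=0.5, seed=1):
    """Measure the same people twice and correlate the two measurements."""
    rng = random.Random(seed)
    true_scores = [rng.gauss(0, true_sd) for _ in range(n)]
    # Each measurement occasion adds its own independent random noise.
    first = [t + rng.gauss(0, error_sd) for t in true_scores]
    second = [t + rng.gauss(0, error_sd) for t in true_scores]
    return pearson(first, second)

reliability_estimate = simulate_test_retest()
print(round(reliability_estimate, 2))
```

With these assumed values the estimate should land close to 1 / (1 + 0.25) = 0.8, the share of observed variance that is not random noise. Nothing comparable can be computed for validity, because the true scores are never observed in real data.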

So validity and validation are complicated topics, and in this video I will introduce you to some of that complexity. One thing that makes the validity literature difficult for a person who is just starting to read it is that there are so many different terms. Measurement validity is whether an indicator measures what it is supposed to measure. That is fairly straightforward to define; what exactly it means gets into some complications. But then there is all this terminology: face validity, content validity, convergent validity, discriminant validity, nomological validity, and so on.

So there are many different terms. Do you have to understand all of them? Are they facets of validity that all have to apply? Are they different definitions? Are they contradictory? One way to start to understand this literature is to recognize that there is a difference between validity and validation. Validity refers to whether the indicator measures what it is supposed to measure. Validation refers to the different ways that we can argue for or assess validity. And these terms are mostly about validation. Denny Borsboom's article in Psychological Review notes that these terms originate from validation questions. Asking people whether they think that the measurement is valid is a way of validation that led to the term face validity; asking whether the measure can predict something useful led to the term predictive validity. So these terms are about validation more than about validity, and these are two different things.

So how we argue validity and how we define validity are two different things. If you just look at the definition of validity, things are much simpler because you don't have to understand most of these terms. But there are important terms that you need to understand because they are commonly used. I will now explain three of them. These originated from psychometric texts from the 1960s, or at least Nunnally's book from the 1960s is commonly cited as a source for these terms, and that made them popular.

So content validity, predictive validity and construct validity: are they actually about validity or validation? Are they competing concepts or complementary concepts? Do you have to demonstrate all of these in your study or do you have to focus on one? Let's take a look at what these concepts actually mean. These are different things. The idea of content validity is that the indicators in your scale measure all the different aspects or dimensions of the phenomenon. A typical example is a math exam. If you do a math exam, it has to cover all the content of the course. In an elementary school math exam there are subtractions, multiplications, divisions and sums that you have to calculate. So you have four different things. If you only cover subtractions, then you lack content validity. So it's about whether the indicators summarize the dimension or domain that the test or exam is supposed to summarize. Content validity is mostly relevant for educational measurement, or anything where you have to summarize people's capabilities or skills in a certain domain with a single score.

Predictive validity is about prediction or forecasting. Forecasting means asking whether you can, based on your data, say something about the future. It's not measurement. Prediction and measurement are two different things. A typical example is college entry exams. They are not designed to measure who is good at school, who is smart, or something else. They are designed to predict who is going to do well in college and who is going to graduate. The college is not as interested in getting people who are smart or hard-working; it's interested in getting people who are going to graduate.

Then we have construct validity. This is about construct measurement, but it is a special kind of validation technique. Construct validity is not the definition of measurement validity; instead, it is a validation technique, and why that's the case becomes clear on the next slide. The idea of construct validity is that there is a nomological network. The nomological network is a network of constructs and their theoretical relationships. In the example given by Borsboom and colleagues, we have intelligence as our focal construct, then general knowledge as another construct and criminal behavior as another construct. We have a strong hypothesis that intelligence is negatively associated with criminal behavior and positively associated with general knowledge.

The idea of construct validity or construct validation is that we assess or measure intelligence. Let's say we use an IQ score, and we check whether the IQ score correlates positively with the general knowledge examination score and negatively with the length of the criminal record. So we have the theoretical world here - the nomological network - and the empirical world here - our measured correlations - and we check whether the measured correlations from our data match the theoretical expectations. Whatever our measure is, it is valid - construct valid - if the relationships between the measured scores correspond to the relationships that we theorize.
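The check described above, comparing measured correlations against the signs expected from the nomological network, can be sketched as follows. The data are simulated, and the effect sizes (0.6 and -0.4), variable names, and the helper `check_nomological_network` are assumptions made for illustration only.

```python
import random
from math import sqrt

def pearson(x, y):
    """Pearson correlation of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

# Theoretical world: expected sign of each relationship with intelligence.
expected_signs = {"knowledge_score": +1, "criminal_record_length": -1}

# Empirical world: simulated scores (hypothetical effect sizes).
rng = random.Random(42)
n = 2000
intelligence = [rng.gauss(0, 1) for _ in range(n)]
data = {
    "iq_score": [t + rng.gauss(0, 0.5) for t in intelligence],
    "knowledge_score": [0.6 * t + rng.gauss(0, 1) for t in intelligence],
    "criminal_record_length": [-0.4 * t + rng.gauss(0, 1) for t in intelligence],
}

def check_nomological_network(data, focal, expected_signs):
    """Return each observed correlation and whether its sign matches theory."""
    results = {}
    for name, sign in expected_signs.items():
        r = pearson(data[focal], data[name])
        results[name] = (r, (r > 0) == (sign > 0))
    return results

network_check = check_nomological_network(data, "iq_score", expected_signs)
for name, (r, matches) in network_check.items():
    print(name, round(r, 2), "matches theory" if matches else "contradicts theory")
```

Note that the check only compares relationships between scores with relationships between constructs. It never looks at whether `iq_score` itself depends on intelligence; in the simulation that link is simply built in.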

This is a somewhat useful way of assessing validity. If your scores don't behave as expected, that's one reason to doubt either the validity of the scores or the correctness of your theory. So that's useful. But it is also very limited. Consider a very green field of study, where you're studying something that hasn't been theorized much before. Where exactly would you get this nomological network? If you're the first person to introduce a new construct to your field, how exactly are you going to argue that the construct has an established relationship with other constructs when there is no existing research on it?

So basically, the idea of construct validity is whether the empirical correlations are good representations or proxies of the theoretical relationships. One important thing that construct validity and the other two commonly used validity terms don't address is the relationship between your data and your theoretical concept. Content validity basically just addresses whether the data cover the content of the thing that you're studying: does your math test cover all the things that were taught during the course? Predictive validity asks whether the score predicts something. So those two are not about theoretical concepts at all. There are no theoretical concepts in the definitions of predictive validity and content validity.

Construct validity has the term construct in the name and it also concerns the theoretical concept. But it doesn't address whether the data corresponds to the theoretical concept. It only addresses whether the relationships between the variables correspond to the relationships between the theoretical concepts. That is interesting but it doesn't really address how the theoretical concepts are related to the data. So that is beyond these terms.

So how do we define validity? One good candidate definition is that a test is valid if the attribute being tested or measured exists and variation in that attribute causes variation in the test scores. So we assume that the construct exists independently of measurement; that is the realist perspective on measurement. Then we claim that the variation in the observed data is due to the variation in the construct. Let's say there is the construct intelligence. Some people are more intelligent than others, and there is variation in IQ scores. We say that the IQ score is a valid measure of intelligence if the variation in intelligence causes variation in the scores. In other words, some people perform better in IQ tests because they are more intelligent, and some people perform worse because they are less intelligent. So that's the idea: variation in the construct causes variation in the observed data. The observed data are, of course, a function of the construct and some measurement error.

That's an easy definition. What is difficult is to argue that your scores are actually valid. So validation is the hard part; defining validity this way is very simple. How exactly do you validate, and what do you have to write in your paper to convince your readers that your measures are valid? To understand that, let's compare this latent variable model of validity with construct validity.

The construct validity perspective is more about epistemology: what can we learn from the correlations in our data? Can we use the correlations in our data to learn something about the constructs? That is a useful way of validation, but it doesn't really address whether the test is valid. The latent variable theory presented on the previous slide is about ontology: does the attribute exist, and does variation in that attribute produce variation in the test score? So these are different. The focus is slightly different.

The conceptual focus in construct validity is on the meaning of the correlations: can we generalize from an observed correlation to a theoretical correlation? In the latent variable model, the focus is on reference: do the indicators - the variables - actually refer to any real entity? We have to argue that. Then there is the empirical focus. In construct validity, it is correlations: we check the correlations in our data, and if those correlations match the theoretical expectations, we conclude that the test is valid.

In latent variable theory, we have to argue causation. So validation here is not a methodological problem but a substantive problem. We have to really argue why we think that our IQ test or innovation score actually varies because the construct being measured varies. Ideally, we have to explain the mechanism of variation: how exactly does a person's intelligence, for example, influence how they do in an IQ test? This is, of course, a much more challenging task, and it places more emphasis on validation studies and on the theoretical part of the validation study, whereas construct validation is simply about calculating correlations and seeing whether they match the theoretical expectations. Both are useful, because if your measures don't behave as expected, that's a reason to suspect that the measures may not be valid. But ultimately, matching correlations are not sufficient to claim validity. You have to look at the causal process.

We can also take a look at how the latent variable theory differs from classical test theory, which gives us the definition of reliability. Classical test theory is a psychometric model. It's not a measurement theory, so its scope is much narrower. It's a model that describes how people respond to surveys or to different psychological tests. Latent variable theory is a measurement theory, and it takes the realist ontology. Classical test theory doesn't really say anything about ontology, so it doesn't say whether the scores measure anything. It only gives us reliability and the true score. Latent variable theory, in turn, is focused on validity and construct measurement. The equations for these two models can look similar. Classical test theory is explicitly defined as an equation: the observed scores are a deterministic linear combination of the true score plus some random noise.
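Written out, the classical test theory equation mentioned above is usually given in the following form. The symbols X (observed score), T (true score) and E (random error) are the conventional ones and are assumed here rather than taken from the slides; the reliability ratio is the standard classical test theory definition.

```latex
X = T + E, \qquad \operatorname{Cov}(T, E) = 0, \qquad
\text{reliability} = \frac{\operatorname{Var}(T)}{\operatorname{Var}(X)}
                   = \frac{\operatorname{Var}(T)}{\operatorname{Var}(T) + \operatorname{Var}(E)}
```

Nothing in this equation says what T refers to, which is why the model can define reliability without saying anything about validity.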

In latent variable theory this is more general. It just says that variation in the construct scores causes variation in the observed scores. There is therefore some kind of statistical association between the construct and the measure, but it is not necessarily linear. So we can model other kinds of relationships, and this takes the statistical model simply as an approximation of the causal relationships.

Then there is the question of how the true score or construct influences the different indicators. In classical test theory, we take it as an assumption that the true score influences all indicators equally. So if we eliminated all random noise from the data, all the indicators would be exactly the same because they share the same true score. This is called the tau-equivalence assumption; tau is the Greek letter used for the true score.
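A minimal sketch of what tau-equivalence implies, using simulated data: every indicator is the same true score plus its own random noise, so removing the noise makes the indicators identical. The function name, the number of items, and the noise level are hypothetical choices for illustration.

```python
import random

def tau_equivalent_items(true_scores, n_items, error_sd, rng):
    """Each indicator is the same true score plus its own random noise."""
    return [
        [t + rng.gauss(0, error_sd) for t in true_scores]
        for _ in range(n_items)
    ]

rng = random.Random(3)
true_scores = [rng.gauss(0, 1) for _ in range(5)]

# With noise, the three indicators differ from each other and from tau.
noisy = tau_equivalent_items(true_scores, n_items=3, error_sd=0.5, rng=rng)

# With the noise switched off, every indicator equals the true score.
noise_free = tau_equivalent_items(true_scores, n_items=3, error_sd=0.0, rng=rng)

all_identical = all(item == true_scores for item in noise_free)
print(all_identical)
```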

In latent variable theory, we just say that the variation in the indicators depends on the variation in the construct, but we don't make any explicit claims about how the dependency manifests statistically. So different indicators might depend differently on the construct. Some may be more sensitive to certain levels of the construct than others, and this allows us to use all kinds of statistical models. In particular, IRT or item response theory models are based on this kind of thinking.
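As a hedged illustration of indicators that are sensitive at different levels of the construct, here is a sketch using the two-parameter logistic (2PL) item response function, one common IRT model. The lecture does not name a specific IRT model, so the choice of 2PL and the two item parameter sets below are assumptions.

```python
from math import exp

def two_pl(theta, discrimination, difficulty):
    """Probability of a correct response under a 2PL IRT model."""
    return 1.0 / (1.0 + exp(-discrimination * (theta - difficulty)))

# Two hypothetical items: an easy one and a hard one.
easy_item = {"discrimination": 1.5, "difficulty": -1.0}
hard_item = {"discrimination": 1.5, "difficulty": 2.0}

def sensitivity(theta, item, step=0.01):
    """Numerical slope of the response curve at a given construct level."""
    upper = two_pl(theta + step, **item)
    lower = two_pl(theta - step, **item)
    return (upper - lower) / (2 * step)

# Compare how strongly each item reacts to small changes in the construct.
for theta in (-1.0, 0.0, 2.0):
    print(
        theta,
        round(sensitivity(theta, easy_item), 3),
        round(sensitivity(theta, hard_item), 3),
    )
```

Each item is most sensitive near its own difficulty: the easy item separates people at low construct levels, the hard item at high levels. The relationship between construct and indicator is nonlinear, which the classical test theory equation does not allow.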

Then there is measurement error in these models. In classical test theory, it is simply random noise in individual items. In latent variable theory, we can have all kinds of sources of measurement error, but the key thing that we have to argue is that the construct actually is a cause of the indicators, or that the variance of the construct is a cause of the variance of the indicators. And that is much more challenging than simply assessing reliability.

Here's one very simple way that we can use this latent variable model to assess reliability and validity. If we take the assumption that linear statistical associations are useful for assessing causal relationships, then we could say that the observed score is a function of the construct score - we use T here - plus some systematic measurement error plus some random noise. So there are different causal influences on the observed score, in addition to the true score or construct score, that we are estimating with this kind of model.
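The decomposition just described (observed score = construct score T + systematic error + random noise) can be illustrated with a simulation in which, unlike in real data, T is known. The variable names and all standard deviations below are assumptions chosen for illustration, and the systematic error is assumed to be stable across repeated measurements of the same person.

```python
import random
from math import sqrt

def pearson(x, y):
    """Pearson correlation of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

rng = random.Random(7)
n = 5000
construct = [rng.gauss(0, 1.0) for _ in range(n)]   # T: what we want to measure
systematic = [rng.gauss(0, 0.7) for _ in range(n)]  # S: stable, but not the construct

# Two measurement occasions share T and S; only the random noise differs.
first = [t + s + rng.gauss(0, 0.5) for t, s in zip(construct, systematic)]
second = [t + s + rng.gauss(0, 0.5) for t, s in zip(construct, systematic)]

retest_correlation = pearson(first, second)           # consistency of the scores
construct_share = pearson(first, construct) ** 2      # variance share due to T

print(round(retest_correlation, 2), round(construct_share, 2))
```

Under these assumed variances the retest correlation is about (1 + 0.49) / 1.74, while only about 1 / 1.74 of the score variance comes from the construct. The scores look consistent because the systematic error repeats itself, which is why high reliability alone does not establish validity.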

So we have random error, which affects reliability, and we also have systematic error, which affects validity. The problem, of course, is that if you have unique random noise and an indicator that is itself unique, it may be difficult to know whether that indicator's measurement error is actually validity error or reliability error. So oftentimes you can't really say which one it is.

Then a summary of all this. We don't really have any proofs of measurement validity. Validation is more of a substantive argument than a statistical argument. Nevertheless, we can say that if two or more indicators are highly correlated, they may be measuring the same thing. We just don't know what the thing is, and it's possible that the indicators correlate for some other reason. So we have to argue, based on theory, that the construct actually causes a certain kind of behavior in people, and that's how we argue validity. Also, if a measure behaves as expected with respect to other measures, it may be valid. That's the construct validity way of validating things; it's a useful technique, but you shouldn't rely on it as your only technique. Typically, with latent variable theory, you work with this kind of model: you specify one latent variable as the source of variation in multiple indicators. This is called the common factor model. It's a factor analysis model, and it is commonly used with this kind of validity framework.
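To make the common factor model concrete, here is a minimal sketch with one latent variable behind three standardized indicators. The loadings are recovered from the three pairwise correlations, which works for exactly three indicators of a single factor. This shortcut is not a full factor analysis estimator such as maximum likelihood, and the loadings, sample size, and function names are assumptions for illustration.

```python
import random
from math import sqrt

def pearson(x, y):
    """Pearson correlation of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def simulate_one_factor(loadings, n, seed):
    """Standardized indicators: x_j = loading_j * factor + unique error."""
    rng = random.Random(seed)
    factor = [rng.gauss(0, 1) for _ in range(n)]
    return [
        [lam * f + rng.gauss(0, sqrt(1 - lam ** 2)) for f in factor]
        for lam in loadings
    ]

def triad_loadings(x1, x2, x3):
    """Solve r_jk = loading_j * loading_k for the three loadings."""
    r12, r13, r23 = pearson(x1, x2), pearson(x1, x3), pearson(x2, x3)
    return (
        sqrt(r12 * r13 / r23),
        sqrt(r12 * r23 / r13),
        sqrt(r13 * r23 / r12),
    )

true_loadings = (0.8, 0.7, 0.6)
x1, x2, x3 = simulate_one_factor(true_loadings, n=20000, seed=11)
estimated_loadings = triad_loadings(x1, x2, x3)
print([round(lam, 2) for lam in estimated_loadings])
```

The estimates tell us how strongly each indicator depends on whatever the indicators have in common. They do not tell us what that common source is; that remains the substantive argument.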

This is a very complicated topic. If you want to study more about validity, I can recommend two good books. I like the writings of Denny Borsboom. He has written a book called Measuring the Mind, which is an introductory-level book. You can read that after reading, for example, DeVellis's Scale Development, which gives you an overview. Once you have read that book, you can look at a more challenging text such as Frontiers of Test Validity Theory by Keith Markus and Denny Borsboom, which summarizes a broad range of the validity literature and is fairly condensed. So it's probably not the best first book, but it's a really great overview of test validity theory.


TU-L0022 - Statistical Research Methods D, Lecture, 2.11.2021-6.4.2022

