TU-L0022 - Statistical Research Methods D, Lecture, 2.11.2021-6.4.2022
Classical test theory and reliability (8:34)
This video uses a simple example of measuring a person's weight to explain the assumptions of classical test theory, how it can be used and its limitations.
The most common way of assessing reliability in social science research originates from psychology and, more specifically, it's based on something called "classical test theory". Understanding the basics of this theory is important because it makes some assumptions that must hold for your reliability indices to be valid. Now we're going to look at what classical test theory says about reliability and what kind of assumptions it makes.
The idea of reliability is that if we repeat the same measurement many times we will get the same result. So it's about the consistency of measurement and the lack of random noise. Here's an example of three people who have each been measured three times using the same bathroom scale. This bathroom scale is completely valid and free from any systematic error. So it gives the correct reading on average, but the actual scale reading varies from one measurement to another, and there is random measurement error around the real weight of the person. So for this person, who is a bit more than 60 kilos, we have three measurements: 61, 61 and 59.
The reliability here is quite high. The variance of the real weights is 8.19 and the variance of the measurements is 8.39, which is only slightly larger than the variance of the real weights. That's also something you can check by comparing the values. These are about eighty every time, these are about seventy every time, and these are about sixty every time. So it is very easy to check which of these persons is the heaviest based on this scale. So it's sufficiently reliable for this sample.
When we have this kind of problem, we want to quantify reliability. We can say that something is 0% reliable or that it's 100% reliable. 0% reliable means that it doesn't provide us any information about the phenomenon or about the variation of the phenomenon. 100% reliable means that it gives us exactly the same reading every time the same person is measured. That is different from the reading being correct, because reliability doesn't address any systematic measurement errors.
The reliability concept comes from classical test theory, and classical test theory is basically this equation here: X = T + e. The idea is that the measured score X is the sum of the true score T and some random noise e. The only measurement error here is random noise, so there's no systematic error. So it's a very simple theory and it gives us the definition of reliability. Reliability is defined as the squared correlation between X and T, or the R-squared of a regression of X on T, or, if we have standardized estimates, 1 minus the variance of e. All of these are the share of the variance of the measured scores X that can be attributed to the true score T.
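As a rough illustration of these definitions, the following sketch simulates X = T + e and computes reliability two ways: as the ratio of true score variance to observed variance, and as the squared correlation between X and T. The parameters (true scores with mean 70 and standard deviation 3, noise with standard deviation 1, a sample of 100,000) are assumptions chosen for the illustration and do not come from the lecture.

```python
import random

def variance(values):
    """Population variance (divides by n)."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)

def correlation(xs, ys):
    """Pearson correlation computed from population moments."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / len(xs)
    return cov / (variance(xs) ** 0.5 * variance(ys) ** 0.5)

rng = random.Random(0)  # fixed seed so the run is repeatable
n = 100_000

# Assumed parameters for illustration: true score SD 3, noise SD 1.
true_scores = [rng.gauss(70, 3) for _ in range(n)]
errors = [rng.gauss(0, 1) for _ in range(n)]
observed = [t + e for t, e in zip(true_scores, errors)]  # X = T + e

variance_ratio = variance(true_scores) / variance(observed)
squared_corr = correlation(observed, true_scores) ** 2

print(f"var(T) / var(X): {variance_ratio:.3f}")
print(f"corr(X, T)^2:    {squared_corr:.3f}")
```

With these assumed values the theoretical reliability is 9 / (9 + 1) = 0.9, so both printed numbers should land close to 0.9. They differ slightly from each other only because the simulated noise is not perfectly uncorrelated with the true scores in a finite sample.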
So basically it tells us how large a share of our observed variance, the variance of X, is due to the true score and how much is due to random noise. So it's like a signal-to-noise ratio. Reliability is simply defined that way, as the ratio of the true score variance to the total variance. So what is validity in this theory? It turns out that classical test theory is not really a measurement theory, because it doesn't address validity. Instead it is a theory that allows us to define reliability.
So classical test theory doesn't really tell us what T is. We cannot claim, based on classical test theory, that T would be a score of any particular construct. T is simply whatever the long-run measurement result would be if we repeated the measurement over and over using the same measurement instrument and the same subjects. This is very clear from the original work by Lord and Novick, who formalized this theory. They explicitly say that the true score T in classical test theory does not necessarily agree with any construct score, and it may not even be a valid or useful measure of any particular construct. So this theory basically gives you just a definition of reliability and that's it.
There are some problems that people see with this theory. The first problem is that it assumes that errors are pure random noise. So there is no room for systematic measurement error here, and oftentimes we could have systematic error. For example, we could have a bathroom scale that always shows 10 percent too much or 2 kilos too much. That is beyond the scope of this theory. Also, it doesn't address validity at all. But this is a useful theory because it gives us reliability. It's a theory for reliability and you should not ask it for more than what it provides. But there's still one final question, even if you only think about reliability. If T, the true score, is simply the part of X that is reliable, what exactly is T? That, the theory doesn't answer. And that's a validity question, not a question about reliability.
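To see why a biased bathroom scale is outside the scope of a reliability index, here is a small simulation sketch. It compares an unbiased scale with one that always shows 2 kilos too much and one that always shows 10 percent too much. The simulated weights and noise level are assumptions made for the illustration, and "10 percent too much" is modelled here as multiplying the whole reading by 1.10.

```python
import random

def squared_correlation(xs, ys):
    """Squared Pearson correlation between two equally long lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    ss_y = sum((y - mean_y) ** 2 for y in ys)
    return cov ** 2 / (ss_x * ss_y)

rng = random.Random(1)  # fixed seed so the run is repeatable
n = 50_000

# Assumed values for illustration: real weights SD 3, random noise SD 1.
true_weights = [rng.gauss(70, 3) for _ in range(n)]
noise = [rng.gauss(0, 1) for _ in range(n)]

unbiased = [t + e for t, e in zip(true_weights, noise)]
plus_two_kilos = [x + 2.0 for x in unbiased]      # always 2 kilos too much
ten_percent_over = [1.10 * x for x in unbiased]   # always 10 percent too much

r_unbiased = squared_correlation(unbiased, true_weights)
r_plus_two = squared_correlation(plus_two_kilos, true_weights)
r_ten_percent = squared_correlation(ten_percent_over, true_weights)

# Average amount by which the "+2 kilos" scale overstates the real weight.
mean_bias = sum(x - t for x, t in zip(plus_two_kilos, true_weights)) / n

print(f"unbiased scale:      {r_unbiased:.4f}")
print(f"+2 kilos scale:      {r_plus_two:.4f}")
print(f"+10 percent scale:   {r_ten_percent:.4f}")
print(f"mean bias, +2 scale: {mean_bias:.2f} kg")
```

All three squared correlations come out the same, even though two of the scales are clearly wrong on average. The index only captures consistency, so the systematic error is invisible to it.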
There is also one interesting feature about reliability. Let's take a look at the bathroom scale. The idea was that reliability is the true score variance here, 8.19, divided by the observed score variance, 8.39. So this is ninety-eight percent reliable. What will happen if everybody is the same weight? If the data are like this, the variance of the real weights is zero, and our bathroom scale readings vary with a variance of 0.67. It turns out that reliability is zero, because it's zero divided by 0.67. So if there is no variation in the population or no variation in the sample, depending on whether you're interested in the population reliability or the sample reliability, then the reliability will be zero, because it is the ratio of the true score variance to the total variance of the data. One way to understand this is that any variation in the readings here is purely due to measurement error, because there's no variation in the real weights. So this variation doesn't tell us anything about how these people vary in their weights, because they don't vary.
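The two calculations above can be written out as a short sketch. The function name `reliability` is just a label chosen for this example. The variances are the ones quoted in the lecture.

```python
def reliability(true_variance, observed_variance):
    """Share of observed variance that is due to the true score."""
    if observed_variance <= 0:
        raise ValueError("observed variance must be positive")
    return true_variance / observed_variance

# First sample: people with clearly different weights.
varied_sample = reliability(8.19, 8.39)

# Second sample: everybody has the same real weight.
same_weight_sample = reliability(0.0, 0.67)

print(f"varied sample:      {varied_sample:.3f}")       # 0.976
print(f"same-weight sample: {same_weight_sample:.3f}")  # 0.000
```

The first value rounds to the ninety-eight percent mentioned above. The second is exactly zero no matter how small the measurement error is, as long as the readings vary at all.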
Another way to understand this issue is that reliability is an index of precision compared to the required precision. Here, the scale is slightly imprecise, because we have for example this person whose measured weights vary by plus or minus one kilo, but there is a sufficient degree of precision to say that this person is the lightest person, because there is so much variance in the real weights. Here, with people who are the same weight to the first decimal, we would need a lot more precision to say which one of these people is the heaviest and which one is the lightest. So the reliability of this scale for making inferences about who is the lightest or the heaviest person is very low, and it's actually zero. So reliability is about the precision of the instrument relative to the actual variance of the thing that you are measuring.
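A minimal sketch of this idea is to hold the instrument's noise fixed and change only how much the people being measured differ from each other. The error variance of 0.20 is an assumption: it is the gap between 8.39 and 8.19 in the first example, which equals the error variance only if the errors are uncorrelated with the real weights. The list of true score variances is likewise made up for the illustration.

```python
def reliability_for(true_variance, error_variance):
    """Reliability when observed variance = true variance + error variance."""
    total = true_variance + error_variance
    if total == 0:
        raise ValueError("no variance at all, reliability is undefined")
    return true_variance / total

ERROR_VARIANCE = 0.20  # assumed instrument noise, held fixed throughout

# Hypothetical groups, from identical weights to widely varying weights.
for true_variance in [0.0, 0.05, 0.2, 1.0, 8.19]:
    r = reliability_for(true_variance, ERROR_VARIANCE)
    print(f"true score variance {true_variance:5.2f} -> reliability {r:.3f}")
```

The same scale goes from useless to highly reliable purely because the measured group becomes more varied, which is why a reliability figure describes an instrument used on a particular sample or population rather than the instrument alone.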