TU-L0022 - Statistical Research Methods D, Lecture, 2.11.2021-6.4.2022
Assessing reliability (10:59)
In this video, two ways of assessing reliability, test-retest and distinct tests, and their underlying assumptions are covered. The usefulness of rules of thumb for assessing reliability is also discussed.
Transcript
It is important that our study results are sufficiently reliable, and it is important to be able to argue reliability based on empirical data. So how exactly do we assess reliability, and how exactly do we argue that our results are reliable?

Before we go into that, there is one thing that I want to address: rules of thumb. Particularly with respect to reliability and measurement validity, there is a tendency for researchers to think that if a statistic exceeds a particular threshold, then everything is okay, and if the same statistic falls just below the threshold, then the study is worthless. This kind of yes-or-no thinking is not ideal, and you cannot really justify it based on any good methodological resource. Many authors cite Nunnally's book on psychometric theory for the rule of thumb of 0.7 for Cronbach's alpha (coefficient alpha, a reliability statistic). The problem is that he does not make that kind of claim in his book. Instead, reliability is something that you have to take into consideration: if your measures are 80 percent reliable, sometimes that is enough, and sometimes it is not. You have to explain to your reader what it means. What kind of bias do you expect if you have 70 percent reliability? What kind of bias do you expect if you have 95 percent reliability? Is that a problem or not? It is not a matter of exceeding a certain cutoff; it is a matter of understanding what reliability means for your results and then explaining that to your readers.
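As a concrete illustration of the statistic being discussed, here is a minimal sketch of how coefficient alpha can be computed from item-level data. The function and the response matrix are hypothetical, invented for this example; they are not from the lecture.

```python
import numpy as np

def cronbach_alpha(items: np.ndarray) -> float:
    """Coefficient alpha for an (n_respondents, n_items) score matrix:
    alpha = k/(k-1) * (1 - sum(item variances) / variance(total score))."""
    k = items.shape[1]
    item_vars = items.var(axis=0, ddof=1)       # sample variance of each item
    total_var = items.sum(axis=1).var(ddof=1)   # sample variance of the sum score
    return k / (k - 1) * (1 - item_vars.sum() / total_var)

# Invented 5-point Likert responses: six respondents, three items.
scores = np.array([
    [4, 5, 4],
    [2, 3, 2],
    [5, 5, 4],
    [3, 3, 3],
    [1, 2, 2],
    [4, 4, 5],
])
print(f"alpha = {cronbach_alpha(scores):.2f}")
```

Whatever number this returns, the lecture's point stands: the interesting question is not whether it clears 0.7 but what it implies for your results. Under classical test theory, unreliability attenuates observed correlations roughly as observed r = true r times the square root of the product of the two variables' reliabilities, so 70 percent reliability in both variables shrinks a true correlation by about 30 percent. That is the kind of consequence you should explain to your readers.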
Before we talk about the actual statistics, it is important to understand what kinds of assumptions the reliability statistics are based on and what the principle of assessing reliability is. With the bathroom scale example that
I used in the previous video, it is very simple: you measure the same person again with the same scale, and if you get the same result, your measure is reliable. When we measure people or organizations by surveying people, things are a bit more complicated. The reason is that if we ask a person whether they like, for example, the United Nations, and then we ask them again whether they like the United Nations, the second answer is influenced by the previous answer. If we ask a person the same question over and over, they will give us the same answer because that is how they answered the last time. So whereas a bathroom scale does not remember what the previous measurement was, people do, and that is a problem.
Classical test theory has the concept of parallel tests. The idea of a parallel test is a hypothetical scenario where we would measure the same person again without that person having any recollection of the previous measurement. For example, if we ask Mr. Brown whether he likes the United Nations, then to ask him the same question again as a truly independent test of the same attribute, we would have to brainwash Mr. Brown between the two questions. This is of course a counterfactual argument, because we cannot brainwash our subjects. Our subjects will know what they answered the last time, so how a person answers the next survey question will be influenced by how they answered the first one. We simply cannot ask the same question over and over.
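To state the parallel-test idea a bit more formally, the following is the standard classical test theory formulation; this is a summary of textbook material, not equations shown in the video.

```latex
% Observed score = true score + measurement error, with the error
% uncorrelated with the true score:
X = T + e, \qquad \mathrm{Cov}(T, e) = 0
% Two tests X_1 = T + e_1 and X_2 = T + e_2 are parallel when they measure
% the same true score with equal error variances and independent errors:
\mathrm{Var}(e_1) = \mathrm{Var}(e_2), \qquad \mathrm{Cov}(e_1, e_2) = 0
% Under these assumptions, the correlation between the two tests equals
% the reliability of either test:
\mathrm{Corr}(X_1, X_2) = \frac{\mathrm{Var}(T)}{\mathrm{Var}(X)}
```

The brainwashing thought experiment is exactly about the Cov(e1, e2) = 0 assumption: a respondent who remembers the earlier answer makes the two errors correlated, which inflates the apparent reliability.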
There are two workarounds for the problem that we cannot really do these parallel tests, that is, test the same attribute of the same person on the same occasion without the person having any recollection of being tested before. The first is to do actual replications and assume that they are parallel. That can work if we have a time delay: for example, if we ask a person now whether they like the United Nations, and we ask them the same question a week later, they may no longer remember what their original answer was, in which case we could argue that the repeated measures mimic the parallel-test scenario. The other way is to use two distinct measures: we measure the same attribute in a different way and assume that the two different ways of measuring the same thing are parallel. So instead of asking the person whether they like the United Nations, we could ask them whether they think that the United Nations is the best thing that has ever happened to mankind. We measure the same thing again but slightly differently, and we could argue that the second measurement is not influenced by the first measurement as much as it would be if we just repeated the same question over and over.
The first approach, repeating the exact measurement with a time delay, is called test-retest reliability. The idea is that if the attribute we are measuring is relatively stable over time, then when a person tests differently on a different occasion, the only possible reason for the difference between the two tests is unreliability, because the trait itself is stable. We also have to assume that the errors are independent, which is justified by the time delay: you do not remember what you answered the last time because there is a time delay. For example, if we weigh a wiggling child and the measurements are done within a matter of seconds, the true weight does not change between the measurements. We cannot argue test-retest reliability with, say, a one-year delay: we cannot weigh a child at five years and again at six years and claim that the difference between the two measurements is evidence of unreliability, because we cannot assume the trait is stable over such a long period. So you have to consider how quickly the trait being measured changes over time, and how quickly people reset by forgetting that they were tested or how exactly they answered the question in the first place. That is test-retest reliability.
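In practice, the test-retest estimate is simply the correlation between the two measurement occasions. Here is a minimal sketch with invented data, assuming the stability and independent-errors conditions just described hold:

```python
import numpy as np

# Hypothetical scores for the same eight people on two occasions,
# separated by a delay long enough for them to forget their answers
# but short enough that the trait itself does not change.
time1 = np.array([4, 2, 5, 3, 1, 4, 2, 5])
time2 = np.array([4, 3, 5, 3, 2, 4, 2, 4])

# Under the parallel-test assumptions, this Pearson correlation
# estimates the reliability of the measure.
reliability = np.corrcoef(time1, time2)[0, 1]
print(f"test-retest reliability estimate: {reliability:.2f}")
```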
Let's take a look at an example of test-retest reliability from Yli-Renko's paper. They asked a slightly different question about the key construct again with a two-year delay. The study was about small companies, and for the two-year delay to be valid, you would have to assume that nothing changes within small companies in two years' time. That is of course not a valid assumption, so we cannot assume here that the trait does not change, and this would not be a valid test-retest estimate. It could be valid if you surveyed a business organization with, say, a two-week or one-month delay; then you could reasonably assume that there are no major changes. But with a two-year delay, like in this paper, it is not a very good test-retest reliability estimate. So in test-retest, you measure the same thing again with a time delay that is appropriate for your measure and the trait being measured: long enough to allow people to reset between measurements, but short enough that the trait does not change substantially between them. This approach is not so commonly used because doing two rounds of a survey study is of course more expensive than doing just one round.
More commonly, we use another approach: distinct tests. The reason for having multiple survey questions that look alike, or look like they measure the same thing, is that we actually think of them as distinct tests; that is the most common reason for using multiple survey questions to measure the same thing. For example, we could ask a person to rate whether their company is innovative, whether they are the technological leaders in their industry, and whether they are the first ones to bring new product concepts to markets. We could argue that these are distinct questions, so you do not answer the second question the same way as the first just because of how you answered the first, but that they still measure the same trait. That is the argument we have to make. The idea of distinct tests is that we generate tests that are not the same, that are sufficiently different, but that we can still argue all measure the same thing.

How we use the data from these multiple distinct tests produces different ways of assessing reliability: the internal consistency method, the alternative forms method, and the split-half method. Understanding exactly what all of these do is not important. What is important is to understand the principle, and then a couple of statistics that you can calculate from the data and their interpretations.
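As one concrete instance of these methods, here is a minimal sketch of the split-half method with the Spearman-Brown step-up, again with an invented response matrix; the odd/even split is just one possible choice.

```python
import numpy as np

# Hypothetical (n_respondents, n_items) matrix of four items.
items = np.array([
    [4, 5, 4, 5],
    [2, 3, 2, 2],
    [5, 5, 4, 5],
    [3, 3, 3, 4],
    [1, 2, 2, 1],
    [4, 4, 5, 4],
])

# Split the items into two halves (odd vs. even) and sum each half.
half1 = items[:, 0::2].sum(axis=1)
half2 = items[:, 1::2].sum(axis=1)

# Correlation between the half scores...
r = np.corrcoef(half1, half2)[0, 1]

# ...stepped up with the Spearman-Brown formula, because each half is
# only half as long as the full scale.
split_half_reliability = 2 * r / (1 + r)
print(f"split-half reliability: {split_half_reliability:.2f}")
```

These methods are closely related: coefficient alpha, computed earlier, can roughly be interpreted as an average over all possible split-half estimates.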
The really important part here is that the tests really have to be distinct. If you are just asking the same question over and over with slightly different wording, for example "our firm is very innovative", "our company is very innovative", and "our business organization is very innovative", these are not distinct tests; it is just the same question repeated with slightly different wording. This is something that you see very commonly as a reviewer: authors write questions that are essentially the same without paying much attention to their distinctiveness, and that is a big problem that I see in management research.