TU-L0022 - Statistical Research Methods D, Lecture, 2.11.2021-6.4.2022
Sampling and sample selection (21:05)
This video introduces, explains, and provides examples for the concepts of population, sampling frame, sample, selection effect, and selection bias. It also gives an overview of the following sampling strategies: random sample, cluster sample, stratified random sample, and convenience sample.
Transcript
Normally
in a research study, we cannot study full populations because of
practical issues. Therefore, we must rely on a sample. When we take that
sample, there are multiple things that we need to consider, and several
things that can go wrong, that can either produce biased results, or
results that are inefficient. Let's look at some issues related to
sampling. First, we have a population; that population is the thing that we want to study, so we want to say something about the population.
Let's say we are studying that population using a survey that we mail to companies. To send out the invitations to participate, we must have an address for every company, so we must have an operational definition of our population. We call this operational population a sampling frame. The
sampling frame is an actual list of companies, people, or whatever
things we're studying. The population is the conceptual definition of
the thing that we're studying. Then related to the sampling frame, if we
are studying individuals in Finland for example, then the sampling
frame could come from the population register, and it could contain
millions of people. Then from the sampling frame, we take the sample.
Typically, we choose people randomly, so the random sample is the
simplest way of taking a sample, and it is often the most desirable way
as well. Then we send out our survey; some companies choose to participate, others do not, and we get an actual dataset that we can work with. Now several things can go wrong, and we must take
those into consideration. Let's
take an example of what this framework means. Let's say, we're studying
the population of young Finnish high technology companies. That is a conceptual definition; then we need to define empirically, or have an operational definition of, what it means to be a young company and what it means to be a technology firm. No one is maintaining a list of
technology companies, so we must operationalize that concept in a way
that we can get data. The sampling frame would be, for example, business IDs. Registered corporations are not the same thing as companies, because one organization can have multiple business IDs. But we must have an operational definition that we can actually get data for, and we can get data on the business IDs, or legal entities, behind these companies.
Let's define young technology companies as companies that are zero to
three years old and are in certain industry codes, for example 62 or 72, which correspond to information technology industries. So
that is our operational definition, and that allows us to get a list of
actual companies. Then we get a sample. Let's say that we get a
thousand firms randomly selected from a list of maybe ten thousand companies, or five thousand companies, whatever the sampling frame is.
The reason for taking a sample here is the cost. Whenever we email or
mail, the address acquisition costs some money or some effort, and if we
mail physical letters, then there are printing costs. Then we get the actual data; for example, ten percent of the informants that were invited to participate decide to respond to the survey. What
can go wrong with this kind of thing? There are multiple things. The
relevant question with the sampling frame is, does our operational
definition of the population match the conceptual one? Does the frame
match the population? Then the second question is, how large is the
sample size? Basically, if this is randomly chosen, then the only thing
that we can decide is, how many observations we get. For example, is a
thousand enough? When we plan for sample size, we must take the expected
response rate into consideration. If we expect a ten percent response
rate, and we need five hundred full responses for our analysis, then we should send out the invitation to five thousand companies, so we would have five thousand randomly chosen firms instead of one thousand.
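As a quick sketch of that calculation in Python (using just the figures from this example):

```python
import math

# Figures from the example above
target_responses = 500         # complete responses needed for the analysis
expected_response_rate = 0.10  # anticipated share of invitees who respond

# Invitations to send so that the expected number of responses reaches the target
invitations = math.ceil(target_responses / expected_response_rate)
print(invitations)  # 5000
```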
Then the most problematic part is that the people or companies who decide to respond may not be randomly chosen. If, out of the thousand companies that were invited to participate, a random 10% respond, that only means that we lose some efficiency. Increasing
the sample size would make our results or estimates more precise, but
that's it. A more problematic condition occurs if that 10 percent is chosen systematically, which leads to biased results. For example, if
our survey was about innovativeness, and those
companies that are more innovative are more likely to participate, then
any regression analysis involving innovation as a dependent variable
would produce biased results. Let's look at why that happens. This is a classic example from Berk's 1983 paper, which demonstrates a relationship between education and income: when education increases, income goes up as well, so it is a linear relationship. Now suppose that people with low incomes either don't provide data, or that people who don't have much education simply decide not to work. Let's set a cutoff, so that no one below this income level provides us data. If we eliminate those observations, what will
happen to our regression estimates? Two things will happen. First, the regression results will be biased, because we are now fitting the regression only to the remaining data, and we are cutting out exactly the observations that produce negative residuals; they lie mostly below the regression line. Because negative residuals are excluded while positive residuals are included, the regression line is pulled up, so the regression estimates will be biased. Second, we have no idea what the effect looks like in the excluded group of low-income, low-education people; we only learn about the people for whom we have data, and for them the results are biased. If our sample is selected systematically, based on the variable that we study, then our results will be biased, and the magnitude of the bias can be great in some instances.
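To see the mechanism concretely, here is a minimal simulation sketch in Python. It is not Berk's data or model; the coefficients, the noise level, and the income cutoff are made up purely to illustrate how dropping low-income observations distorts the estimated slope:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 10_000

# Hypothetical data-generating process: income rises linearly with education
education = rng.uniform(8, 20, size=n)             # years of schooling
income = 10 + 2 * education + rng.normal(0, 5, n)  # true slope = 2

# Slope estimated from the full (random) sample
full_slope = np.polyfit(education, income, deg=1)[0]

# Selection on the dependent variable: low-income observations drop out
observed = income > 35                             # arbitrary cutoff
trunc_slope = np.polyfit(education[observed], income[observed], deg=1)[0]

print(f"true slope 2.0, full sample {full_slope:.2f}, truncated sample {trunc_slope:.2f}")
# The truncated estimate is clearly below 2, because the observations with
# negative residuals are missing at low education levels, which pulls the
# low end of the fitted line up and flattens the estimated relationship.
```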
This is not just an academic concern, and I will next demonstrate a couple of examples. There is a widely known business book called Good to Great by Jim Collins. It has sold millions
of copies and provided inspiration for lots of managers; it also received great attention in Finland when it was first translated into
Finnish. Many people think this is a valuable book. How was the book
written? Well, there is a slight problem. It's presented as an academic study, and it kind of is, but there are methodological problems with this book. In the book, the research team basically chose a set of good companies and followed their performance, based on some accounting measures, over 40 years, and they found 11 companies that were initially good and then became great, according to the definitions that the authors use. Jim Collins is the first author, but he had a team of researchers helping him write the book. They chose 11 companies that had performed extremely well and then studied what made those companies perform that well. They then asked why these companies performed better than others, and wrote a book about it. The problem with that is twofold. First,
if you choose companies that happen to be great in the past, then
you're sampling on the dependent variable. If a company happens to be good for a chance reason, it will get selected, or at least some of these companies could get selected because of chance. Indeed, when other researchers looked at these companies later, over the next 15-year period, only one out of the eleven remained great, so we can largely attribute the choice of these eleven companies to chance. Also, what happens is that, when a company is performing
well, then people start to attribute that performance to something that
the company did. That's called the halo effect. When you identify companies that are doing well, and then ask people to evaluate why these companies are doing well, people answer, "well, they're doing well because of something that they did in the past". It's also possible that these companies just happened to be lucky, and the fact that only one out of eleven stayed great over the 15 years under study underlines the point that that's the likely explanation. These companies happened to be great for reasons unknown, and then people attributed that greatness to something the companies did. This kind of design does not provide evidence of causality. Let's
take another example, this one from Morgan and Winship's book on causal inference. They have a hypothetical college, where entry to the college depends on the SAT exam, the standardized test used in American college admissions, and a motivation score that is measured somehow. The motivation score and the SAT score are weakly and positively dependent on each other, and college entry depends on both. Here is the data, with the SAT score on one axis and the motivation score on the other. The points shown here were not accepted to the college, and the circled points were admitted. The sum of the SAT score and the motivation score determines who gets to go to this hypothetical college. There's a weak positive
relationship; we can't really see it here, but it's a correlation of around 0.1, not visible to the naked eye. What happens if we measure the correlation only from those people who got admitted to the college? When we observe only the students who got into the college, there is a strong negative correlation. We get a strong negative correlation
because we only studied those people who got admitted. If you were the principal of this college, a smart principal would ask whether that result also replicates among the students who were not accepted, and it does: you get the same negative correlation there. This negative result has very little to do with the actual relationship between motivation and SAT score; instead, it's a function of how we selected the sample. If we choose the sample so that the sum of
the motivation score and the SAT score must be above a threshold, or below a threshold, then you will get this kind of negative correlation simply because of the selection. This is called the selection effect, and the outcome is selection bias. Whenever you take a sample, unless you are careful that your sample is a random sample of the population under study, you risk having selection bias in your analysis, and the bias can be great.
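Here is a minimal simulation sketch of this kind of selection effect in Python. The numbers are invented and are not taken from Morgan and Winship; the point is only that selecting on the sum of two weakly correlated scores induces a negative correlation within the selected group:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 50_000

# Two standardized scores with a weak positive correlation of about 0.1
sat = rng.normal(0, 1, n)
motivation = 0.1 * sat + np.sqrt(1 - 0.1**2) * rng.normal(0, 1, n)

# Admission depends on the sum of the two scores exceeding a threshold
admitted = (sat + motivation) > 1.0

print("all applicants:", np.corrcoef(sat, motivation)[0, 1].round(2))
print("admitted only: ", np.corrcoef(sat[admitted], motivation[admitted])[0, 1].round(2))
print("rejected only: ", np.corrcoef(sat[~admitted], motivation[~admitted])[0, 1].round(2))
# The full sample shows the weak positive correlation, while both the admitted
# and the rejected groups show a clearly negative one, purely because of how
# the sample was split.
```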
Let's take another really practical example. I went to the building fair in
Vantaa a couple of years ago, and there was this construction company
presenting an idea called a container home. It's a small home, the size
of a shipping container, and these can be built as condominiums. The
idea is that you can increase the density of housing by having these
very small apartments. Then they wanted to get feedback on the idea. How
was the feedback collected? They had a polling station, where you could
indicate whether you agree or disagree with the idea that this
container home is a good idea. The way it was set up, you walked along the road, and the container home was on the side of the road, so you could choose to just walk by, or you could choose to go in. If you went in, you walked through the apartment to the balcony, and that's where the polling station was. What is the problem? Why could
that produce a selection effect? Of course, people who are not
interested at all, who think this is a stupid idea, will just walk past
the container home, and they will never see the polling station, which
is behind the container home. You have to show enough interest to go
through the container home, walk all the way through to the back, and only after you have seen the home do you present an opinion. The counterargument to this selection-bias concern is that you only want responses from people who have actually seen what the home looks like inside. But that is less important than the fact that people who think it's a stupid idea in the first place will just walk by
without providing any data. This is an introduction to issues related to
sampling, and there are multiple different techniques that you can
apply. These selection effects can be modeled, and you can do sampling
in many different ways to increase your efficiency and avoid the risk of
selection bias. There
are other sampling techniques. If you are a Stata user, Stata has a
separate user manual for survey data that discusses different sampling
designs, and here are some references that you may be interested in. The
typical sample in a statistics book is a random sample, and that is also what I will be covering in this course, because assuming that the sample is random simplifies things a lot.
The second kind of sample that is very common is a cluster sample. A cluster sample refers to a sample where the observations are no longer equally likely to be selected. (A random sample is defined as a sample where each observation in the population is equally likely to be selected.) A cluster sample, on the other hand, refers to a scenario where you, for example, must interview people at their homes. If you do that and take a random sample of, let's say, all Finnish people, a random sample from all Finnish households, then you will have to travel all over Finland to get your data. In practice, we choose a couple of cities, from those cities a couple of streets, and we then sample people from those streets, or just interview everyone on those streets. We take samples from clusters. If your neighbors are interviewed, then it's more likely that you are interviewed as well. The probability of being selected is clustered: if you live close to people who are more likely to be selected, you're a lot more likely to be selected as well. Cluster sampling causes some problems, and we'll talk about those later.
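A rough sketch in Python of this kind of two-stage cluster sampling (the frame, the number of cities, and the sample sizes are all hypothetical):

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical sampling frame: 10 000 households spread over 50 cities
n_households = 10_000
city = rng.integers(0, 50, size=n_households)   # city id for each household
household_id = np.arange(n_households)

# Stage 1: randomly pick a handful of cities (the clusters)
chosen_cities = rng.choice(np.unique(city), size=5, replace=False)

# Stage 2: interview a random subset of households within each chosen city
sample = []
for c in chosen_cities:
    in_city = household_id[city == c]
    sample.extend(rng.choice(in_city, size=40, replace=False))

print(len(sample), "households sampled from cities", sorted(chosen_cities))
# Households outside the chosen cities had zero chance of selection, and
# households in the same chosen city tend to be selected together -- the
# clustering that has to be accounted for later in the analysis.
```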
One way that we can deal with cluster sampling is called a stratified sample, or a stratified random sample. A stratified random sample concerns
situations where you have, for example, an uneven distribution of people, or where you have the cluster sampling issue. Let's say we have a school with 300 students, out of which 30 are minority students. In that kind of scenario, taking a random sample of 50 students is likely to produce a very small number of minority students. It makes sense to sample separately from the minority students and separately from the other students, so that you get a sample that is better for your study. Stratification refers to first dividing the sampling frame into different strata, or different sets, and then taking a random sample from each set. Stratification improves the distribution of your variables and produces random samples that can be better in some instances, and it's a very commonly used sampling design.
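A minimal sketch in Python of the school example (300 students, 30 of them in the minority stratum; the 10 + 40 allocation is just one possible choice, made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(4)

# Hypothetical sampling frame: 300 students, the first 30 ids form the minority stratum
students = np.arange(300)
minority = students < 30

# Simple random sample of 50: the number of minority students varies by chance
srs = rng.choice(students, size=50, replace=False)
print("minority students in simple random sample:", minority[srs].sum())

# Stratified random sample: draw from each stratum separately, here 10 + 40
strata_draws = [
    rng.choice(students[minority], size=10, replace=False),
    rng.choice(students[~minority], size=40, replace=False),
]
stratified = np.concatenate(strata_draws)
print("minority students in stratified sample:", minority[stratified].sum())  # always 10
```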
These are the three most used sampling designs. In a random sample, everybody is equally likely to be selected. A cluster sample means you choose people from certain areas, which you choose in advance, so people in other areas have a zero chance of being selected; that's cluster sampling. Stratified random sampling means that you divide your sampling frame into different strata based on criteria, for example race or education level, and then take a random sample from each of those strata separately, which provides you with some statistical benefits. Then we have the fourth
type of commonly used sample, called the convenience sample. A convenience sample is none of these; it is simply something that we happen to get. In most cases, if you do a survey study and send out invitations, the people or organizations that you choose to invite may be a random sample, but in the end, those that you get data for are not a random sample of those who got the invitation; rather, it's a convenience sample, just the companies that we happen to get. Convenience samples are debated to some extent. Some people argue that they should be avoided; others argue that convenience samples are useful because they allow us to use designs that wouldn't be possible with random samples, for example. But you must
understand these different concepts to understand issues related to
sampling, which I'll cover in later videos.