TU-L0022_aalto-CUR-166088-3086795: Sampling and sample selection (21:05)





This video introduces, explains, and provides examples for the concepts of population, sampling frame, sample, selection effect, and selection bias. It also gives an overview of the following sampling strategies: random sample, cluster sample, stratified random sample, and convenience sample.


Normally in a research study, we cannot study full populations because of practical issues. Therefore, we must rely on a sample. When we take that sample, there are multiple things that we need to consider, and several things can go wrong and produce results that are either biased or inefficient. Let's look at some issues related to sampling. First, we have a population. The population is the thing that we want to study, so we want to say something about the population. Let's say we are studying that population using a survey that we mail to companies. To send out the invitations to participate, we must have an address for every company, so we must have an operational definition of our population. We call this operational population a sampling frame.

The sampling frame is an actual list of companies, people, or whatever things we're studying. The population is the conceptual definition of the thing that we're studying. If we are studying individuals in Finland, for example, then the sampling frame could come from the population register, and it could contain millions of people. Then from the sampling frame, we take the sample. Typically, we choose people randomly: the random sample is the simplest way of taking a sample, and it is often the most desirable way as well. Then we send out our survey. Some companies choose to participate, others choose not to, and we get an actual dataset that we can work with. Several things can go wrong here, and we must take those into consideration.

Let's take an example of what this framework means. Let's say we're studying the population of young Finnish high technology companies. That is a conceptual definition. We then need an empirical, or operational, definition of what it means to be a young company and what it means to be a technology firm. No one is maintaining a list of technology companies, so we must operationalize the concept in a way that lets us get data. The sampling frame could be, for example, business IDs. A registered corporation is not the same thing as a company, because one organization can have multiple business IDs. But we must have an operational definition that we can actually get data for, and we can get data on business IDs, that is, on the legal entities behind these companies. Let's define young technology companies as companies that are zero to three years old and are in certain industry codes, for example, 62 or 72. Those correspond to information technology industries.

So that is our operational definition, and it allows us to get a list of actual companies. Then we take a sample. Let's say that we get a thousand firms randomly selected from a list of maybe ten thousand or five thousand companies, whatever the size of the sampling frame is. The reason for taking a sample here is cost. Whenever we email or mail, acquiring the addresses costs some money or some effort, and if we mail physical letters, then there are printing costs. Then we get the actual data: for example, ten percent of the informants that were invited to participate decide to respond to the survey.

What can go wrong with this kind of thing? There are multiple things. The relevant question with the sampling frame is: does our operational definition of the population match the conceptual one? Does the frame match the population? The second question is: how large is the sample? If the sample is randomly chosen, then the only thing that we can decide is how many observations we get. For example, is a thousand enough? When we plan for sample size, we must take the expected response rate into consideration. If we expect a ten percent response rate and we need five hundred full responses for our analysis, then we should send out the invitation to five thousand companies, so we would have five thousand randomly chosen firms instead of one thousand. Then the most problematic part is that the people or companies who decide to respond may not be randomly chosen. If, out of the thousand companies that were invited to participate, a random 10% respond, that only means that we have inefficiency.
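The sample size arithmetic above can be written as a small calculation. This is only a sketch: the function name `invitations_needed` is made up for illustration, and it assumes that the response rate is known in advance and is the same for every invited company.

```python
import math


def invitations_needed(target_responses: int, expected_response_rate: float) -> int:
    """Invitations to send so that the expected number of responses reaches the target."""
    if not 0 < expected_response_rate <= 1:
        raise ValueError("expected_response_rate must be in (0, 1]")
    # Round before taking the ceiling so floating-point noise
    # (for example 3 / 0.1 == 30.000000000000004) does not add an extra invitation.
    return math.ceil(round(target_responses / expected_response_rate, 9))


# The example from the lecture: 500 full responses at a 10% response rate.
print(invitations_needed(500, 0.10))  # 5000
```

In practice the response rate is itself an estimate, so it may be sensible to plan with a somewhat lower rate than the one you hope for.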

Increasing the sample size would make our results or estimates more precise, but that's it. A more problematic condition occurs if the 10 percent who respond are selected systematically, because that leads to biased results. For example, if our survey was about innovativeness, and those companies that are more innovative are more likely to participate, then any regression analysis involving innovation as a dependent variable would produce biased results. Let's look at why that happens. This is a classic example from Berk's 1983 paper. He demonstrates a relationship between education and income such that when education increases, income goes up as well, so it's a linear relationship here.

What happens if people who get low income don't provide data, or if people who don't have much education simply decide not to work? Let's set a barrier here, so that no one below this point provides us data. If we eliminate the data below the barrier, what will happen to our regression estimates? Two things will happen. First, all the regression results will be biased, because now we are fitting the regression to the data that remains, and we are cutting out observations that would produce negative residuals. They are negative residuals because they are mostly below the regression line. When the observations with negative residuals are cut and the ones with positive residuals are included, that pulls the regression line up, so the regression results will be biased. Second, we have no idea what the effect is like in the group of low-income, low-education people, and for those people for whom we do have the data, we have biased results. If our sample is selected systematically, based on the variable that we study, then our results will be biased, and the magnitude of the bias can be great in some instances. This is not just an academic concern.
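The effect of cutting the data at an income barrier can be checked with a small simulation. This is a sketch with made-up numbers, not Berk's data: the true slope, the noise level, and the barrier are all assumptions chosen for illustration, and `ols_slope` is a hypothetical helper written here with only the standard library.

```python
import random


def ols_slope(xs, ys):
    """Slope of the least-squares line of ys on xs."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx


rng = random.Random(1)  # fixed seed so the run is repeatable

# Hypothetical population: income rises linearly with education, plus noise.
TRUE_SLOPE = 2.0
education = [rng.uniform(8, 20) for _ in range(20_000)]
income = [10 + TRUE_SLOPE * e + rng.gauss(0, 8) for e in education]

# Selection on the dependent variable: people below the income barrier drop out.
INCOME_BARRIER = 35.0
kept = [(e, y) for e, y in zip(education, income) if y > INCOME_BARRIER]
kept_education = [e for e, _ in kept]
kept_income = [y for _, y in kept]

slope_full = ols_slope(education, income)
slope_truncated = ols_slope(kept_education, kept_income)

print(f"slope, full data:      {slope_full:.2f}")
print(f"slope, truncated data: {slope_truncated:.2f}")
```

Under these assumptions, the slope estimated from the truncated data comes out clearly flatter than the slope estimated from the full data, because the low-education people who remain are the ones with unusually high incomes.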

I will next demonstrate a couple of examples. There is a widely known business book called Good to Great by Jim Collins. It has sold millions of copies and provided inspiration for lots of managers. It also received great attention in Finland when it was first translated into Finnish. Many people think this is a valuable book. How was the book written? Well, there is a slight problem. It's presented as an academic study, and it kind of is, but there are methodological problems with this book. In the book, the research team basically took many good companies and followed their performance over 40 years using some accounting measures. They found 11 companies that were initially good and then became great, according to the definitions that the authors use. Jim Collins is the first author, but he had a team of researchers helping him write the book. They chose 11 companies that performed extremely well, and then they studied what made those companies perform that well. They asked why these companies performed better than others, and then they wrote a book about it. There are two problems with that.

First, if you choose companies that happened to be great in the past, then you're sampling on the dependent variable. If a company happens to be good for a chance reason, it will get selected, or at least some of these companies could have been selected for chance reasons. When some other researchers looked at these companies later, during the next 15-year period, only one out of the eleven remained great. So we could attribute the choice of these eleven companies simply to chance. Second, when a company is performing well, people start to attribute that performance to something that the company did. That's called the halo effect. When you identify companies that are doing well, and then you ask people to evaluate why these companies are doing well, people answer, "well, they're doing well because of something that they did in the past". It's also possible that these companies just happened to be lucky, and the fact that only one out of eleven stayed great over the following 15 years underlines the point that luck is the likely explanation. These companies happened to be great for reasons unknown, and then people attributed that greatness to something that the companies did. This design does not provide evidence of causality.
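To see what sampling on the dependent variable looks like when performance is nothing but luck, here is a small simulation. It is a sketch under a deliberately extreme assumption: performance in each period is pure noise and independent across periods. The number of candidate companies (1,000) is hypothetical and not taken from the book, and the result says nothing about the actual companies.

```python
import random

rng = random.Random(7)  # fixed seed so the run is repeatable

N_COMPANIES = 1_000   # hypothetical number of candidate companies
N_GREAT = 11          # how many top performers get labelled "great"
N_REPLICATIONS = 200


def top_performers(scores, k):
    """Indices of the k highest scores."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return set(ranked[:k])


def still_great_after_selection():
    # Assumption: performance in each period is pure luck, independent across periods.
    period_1 = [rng.gauss(0, 1) for _ in range(N_COMPANIES)]
    period_2 = [rng.gauss(0, 1) for _ in range(N_COMPANIES)]
    great_then = top_performers(period_1, N_GREAT)
    great_later = top_performers(period_2, N_GREAT)
    return len(great_then & great_later)


counts = [still_great_after_selection() for _ in range(N_REPLICATIONS)]
mean_still_great = sum(counts) / N_REPLICATIONS
print(f"average number of the {N_GREAT} that stay great: {mean_still_great:.2f}")
```

Even in this world with no real differences between companies, there are always eleven "great" companies to write about in the first period, and almost none of them stay at the top afterwards.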

Let's take another example. This is from Morgan and Winship's book on causal inference. They have a hypothetical college, where entry depends on the SAT, a standardized test that American students take for college admission, and on a motivation score that is measured somehow. The motivation score and the SAT score are weakly and positively dependent on each other, and college entry depends on both. Here is the data: here's the SAT score, and here is the motivation score. These guys here were not accepted to the college, and these guys marked with circles were admitted. The sum of the SAT score and the motivation score determines who gets to go to this hypothetical college. There's a weak positive relationship, a correlation of around 0.1, but it's not visible to the naked eye. What happens if we measure the correlation only from those people who got admitted to the college? If we only observe students who got into the college, there is a strong negative correlation. We get a strong negative correlation because we only studied those people who got admitted. If you were the principal of this college, a smart principal would ask whether that result also replicates among those students who didn't get accepted, and they would find that yes, it does, so you get the same negative result. This negative result has very little to do with the actual relationship between motivation and SAT score. Instead, it's a function of how we selected the sample. If we choose the sample so that the sum of the motivation score and the SAT score must be more than a threshold, or less than a threshold, then we will get this kind of negative correlation just because of the selection. This is called the selection effect, and the outcome is selection bias. Whenever you take a sample, unless you are careful that your sample is a random sample of the population under study, you risk having selection bias in your analysis, and the bias can be great.
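The college example can be reproduced with simulated scores. This is a sketch, not Morgan and Winship's data: the scores are standardized normal draws with a population correlation of 0.1, the admission threshold of 1.0 on the summed score is an assumption, and `correlation` is a hypothetical helper written with only the standard library.

```python
import math
import random


def correlation(xs, ys):
    """Pearson correlation coefficient."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


rng = random.Random(3)  # fixed seed so the run is repeatable
RHO = 0.1          # weak positive population correlation, as in the lecture
THRESHOLD = 1.0    # hypothetical admission cutoff on the summed scores

# Standardized SAT and motivation scores with correlation RHO.
sat = [rng.gauss(0, 1) for _ in range(20_000)]
motivation = [RHO * s + math.sqrt(1 - RHO**2) * rng.gauss(0, 1) for s in sat]

# Admission depends on the sum of the two scores.
admitted = [(s, m) for s, m in zip(sat, motivation) if s + m > THRESHOLD]
rejected = [(s, m) for s, m in zip(sat, motivation) if s + m <= THRESHOLD]

r_all = correlation(sat, motivation)
r_admitted = correlation(*zip(*admitted))
r_rejected = correlation(*zip(*rejected))

print(f"everyone: {r_all:+.2f}")
print(f"admitted: {r_admitted:+.2f}")
print(f"rejected: {r_rejected:+.2f}")
```

With these assumptions the correlation is weakly positive for everyone but negative within both the admitted and the rejected group, which is the pattern described above: the sign inside each group comes from the selection rule, not from the relationship between the two scores.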

Let's take another really practical example. I went to the building fair in Vantaa a couple of years ago, and there was a construction company presenting an idea called a container home. It's a small home, the size of a shipping container, and these can be built as condominiums. The idea is that you can increase the density of housing by having these very small apartments. They wanted to get feedback on the idea. How was the feedback collected? They had a polling station, where you could indicate whether you agree or disagree that the container home is a good idea. The way it was set up, you walk along the road here, and the container home was on the side of the road, so you could choose to just walk by, or you could choose to go in. If you went in, you walked through the apartment to the balcony, and that's where the polling station was. What is the problem? Why could that produce a selection effect? Of course, people who are not interested at all, who think this is a stupid idea, will just walk past the container home, and they will never see the polling station, which is behind the container home. You have to show enough interest to go in and walk all the way through to the back, and only after you have seen the home do you give an opinion. The counterargument is that you only want opinions from people who have actually seen what it looks like inside. But that's not as important as the fact that people who think it's a stupid idea in the first place will just walk by without providing any data. This is an introduction to issues related to sampling, and there are multiple different techniques that you can apply. These selection effects can be modeled, and you can do sampling in many different ways to increase your efficiency and avoid the risk of selection bias.

There are other sampling techniques. If you are a Stata user, Stata has a separate user manual for survey data that discusses different sampling designs, and here are some references that you may be interested in. The typical sample in a statistics book is a random sample, and that is also what I will be covering in this course, because assuming that the sample is random simplifies things a lot. A random sample is defined as a sample where each observation in the population is equally likely to be selected. The second kind of sample that is very common is a cluster sample. A cluster sample refers to a sample where the observations are no longer equally likely to be selected. It arises in a scenario where you, for example, must interview people at their homes. If you do that and you take a random sample from all Finnish households, then you will have to travel all over Finland to get your data. In practice, we choose a couple of cities, and from those cities a couple of streets, and we then sample people from those streets, or just interview everyone on those streets. We take samples from clusters. If your neighbors are interviewed, then it's more likely that you are interviewed as well. The probability of being selected is clustered: if you live close to people who are selected, you're a lot more likely to be selected as well. The cluster sample causes some problems, and we'll talk about that later.
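The difference between a simple random sample and a cluster sample can be sketched in a few lines. The sampling frame below (50 streets with 20 households each) is entirely hypothetical, and the cluster design shown is the simplest variant described above: pick streets at random and interview everyone on them.

```python
import random

rng = random.Random(11)  # fixed seed so the run is repeatable

# Hypothetical sampling frame: 50 streets with 20 households each.
frame = [(f"street_{s:02d}", f"household_{s:02d}_{h:02d}")
         for s in range(50) for h in range(20)]
streets = sorted({street for street, _ in frame})

# Simple random sample: every household is equally likely to be chosen.
simple_random_sample = rng.sample(frame, 100)

# Cluster sample: pick 5 streets at random, then interview everyone on them.
chosen_streets = set(rng.sample(streets, 5))
cluster_sample = [unit for unit in frame if unit[0] in chosen_streets]

print("streets visited, simple random sample:",
      len({street for street, _ in simple_random_sample}))
print("streets visited, cluster sample:      ",
      len({street for street, _ in cluster_sample}))
```

Both samples contain 100 households, but the cluster sample needs visits to only five streets, which is the practical reason for using it. The price is that households on the same street are selected together.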

One way that we can deal with cluster sampling is called a stratified sample, a stratified random sample. A stratified random sample concerns situations where you have, for example, an uneven distribution of people, or you have the cluster sample issue. Let's say we have a school with 300 students, out of which 30 are minorities. In that kind of scenario, a random sample of 50 students is likely to contain a very small number of minority students, about five on average. It makes sense to sample separately from the minority students and separately from the other students, so that you get a sample that is better for your study. Stratification refers to first dividing the sampling frame into different strata, or different sets, and then taking a random sample from each set. Stratification improves the distribution of your variables, it produces random samples that can be better in some instances, and it's a very commonly used sampling design.
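The school example can be sketched as follows. The 300 students and 30 minority students come from the example above; the allocation of 15 minority and 35 other students in the stratified sample is an assumption made for illustration.

```python
import random

rng = random.Random(5)  # fixed seed so the run is repeatable

# The school from the lecture: 300 students, 30 of them in the minority group.
students = [("minority", i) for i in range(30)] + [("other", i) for i in range(270)]
SAMPLE_SIZE = 50

# Simple random sample: the minority count varies from draw to draw.
expected_minority = SAMPLE_SIZE * 30 / 300
srs = rng.sample(students, SAMPLE_SIZE)
srs_minority = sum(1 for group, _ in srs if group == "minority")

# Stratified random sample with a hypothetical allocation of 15 + 35.
allocation = {"minority": 15, "other": 35}
stratified = []
for group, n in allocation.items():
    stratum = [s for s in students if s[0] == group]
    stratified.extend(rng.sample(stratum, n))
stratified_minority = sum(1 for group, _ in stratified if group == "minority")

print("expected minority students in a simple random sample:", expected_minority)
print("minority students in this simple random sample:      ", srs_minority)
print("minority students in the stratified sample:          ", stratified_minority)
```

With this kind of allocation the minority students are overrepresented relative to their share of the school, so estimates for the whole school would need to weight each stratum by its share of the population.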

These are the three most used sampling designs. In a random sample, everybody is equally likely to be selected. A cluster sample means that you choose people from certain areas, which you choose in advance, so the people in other areas have a zero chance of being selected. That's cluster sampling. Stratified random sampling means that you divide your sampling frame into different strata based on criteria such as race or education level, and then you take a random sample from each of those strata separately, which provides you with some statistical benefits. Then we have the fourth type of commonly used sample, called the convenience sample. The convenience sample is none of these. A convenience sample is something that we just happen to get. In most cases, if you do a survey study and you send out invitations, the people or organizations that you choose to invite may be a random sample, but in the end, those that you get data from are not a random sample of those who got the invitation. Rather, it's a convenience sample, just the companies that we happened to get. Convenience samples are debated to some extent. Some people argue that they should be avoided; others argue that convenience samples are useful because they allow designs that wouldn't be possible with random samples. But you must understand these different concepts to understand issues related to sampling, which I'll cover in later videos.


TU-L0022 - Statistical Research Methods D, Lecture, 25.10.2022-29.3.2023
