TU-L0022 - Statistical Research Methods D, Lecture, 2.11.2021-6.4.2022
Rauser (2014): Statistics without the agonizing pain (11:48)
Rauser, J. (2014, October 15). Statistics Without the Agonizing Pain. Presented at the Big Data Conference - Strata + Hadoop World, New York, NY. Retrieved from http://strataconf.com/stratany2014/public/schedule/detail/37554
John Rauser, a data scientist at Pinterest, talks about how hard it is to understand the fundamental idea of a sampling distribution when it is presented in the formal mathematical way found in most statistics books. He then shows how the same idea can be explained easily with a computer simulation. His main argument is that, when it comes to learning statistics, being able to program a computer gives you superpowers. The video contains a really accessible explanation of the sampling distribution and statistical testing (i.e., the p-value), and does so using a cool experiment involving beer drinking and mosquitoes.
Transcript
I'm John Rauser and I'm a data scientist at Pinterest. The title of this talk is inspired by this paper. This paper has perhaps the best opening paragraph of any academic work I've ever read, which I will now adapt for my purposes. When I decided to learn statistics, I read several books, which I shall politely not identify. I understood none of them. Most of them simply described statistical procedures and then applied them, without any intuitive explanation or hint of how anyone might have invented these procedures in the first place. This talk was born of that frustration and my wish that future students of statistics will learn the deep and elegant ideas at the heart of statistics rather than a confusing grab bag of statistical procedures.
I wrote this talk because I suspect that many of the people in this audience are faking it when it comes to statistics. There are two kinds of technical people in this audience: the software engineers and the statisticians. At Strata, and in general, the engineers are by far the larger group. Actually, in this room maybe the picture looks more like this: there are probably few statisticians in this room who would not also self-identify as data scientists. And I think that many of the engineers look with some envy on the people who know statistics. When someone with a little bit of stats knowledge starts going on about power analysis or generalized linear models or two-tailed tests or whatever it is, you nod your head and you play along, but you really have no idea what they're talking about. And I know this feeling. I am a software engineer who taught himself statistics over a period of about a decade, and I remember struggling with what seemed like the most basic questions.
But it doesn't have to be this way. My thesis is that if you can program a computer you have direct access to the deepest and most fundamental ideas in statistics.
To convince you of this idea I want to walk you through a statistical argument, and in order to do that we need a problem to work on. So we're going to use statistics to figure out whether drinking beer makes you more attractive to mosquitoes. This is not a trivial problem, though it might seem like one. Malaria, after all, is transmitted via mosquitoes, and most models of malaria transmission have historically assumed that all individuals are at equal risk of mosquito bites. But there's good evidence that humans vary widely in their attractiveness to mosquitoes, and so if you can understand which people are at greatest risk of mosquito bites, then you can target your interventions much more accurately and do a better job of fighting malaria. This is such an important problem that if you do good research on it you can get published in PLoS ONE, an extremely reputable journal. Here I've redacted the keyword in the title so as to leave the outcome in doubt for a moment.
I don't have time to describe their method in the detail that it deserves, but they basically got a series of volunteers, randomly assigned them to drink either beer or water, and then used a device called a Y-olfactometer to let mosquitoes choose to fly either toward the human subject or toward open air. They then trapped and counted the mosquitoes.
Here's the data. They had 25 volunteers who drank beer and 18 who drank water, and these numbers are the numbers of mosquitoes that were collected in the traps for each of the volunteers. We can compute the average number of mosquitoes in each group and then subtract to find that the average person who drank beer attracted 4.4 more mosquitoes than the average water drinker. And now we have a statistical question: is a difference of 4.4 sufficient evidence to claim that drinking beer makes you more attractive to mosquitoes? We can frame this question as a debate between a skeptic and an advocate. The skeptic says that drinking beer probably has no effect and that the difference of 4.4 could have happened just by random chance. The advocate takes the other side and says that 4.4 is a large difference, especially when you compare it to the overall variation in the sample, and so the skeptic's position is very unlikely to be true. One of the main goals of statistics is to settle just this kind of debate.
So what I want to do is walk you through it twice. First I'll solve the problem using the painful analytical approach that you might remember from your statistics 101 class, and then I'll do it again using a simple computational method that should hopefully be much more understandable. So bring on the pain.

If you took a stats class or you tried to read a stats book, you might dimly recall something called a t-test that can be used to solve exactly this kind of problem. You head off to Wikipedia and you remember that the first thing you need to do is pick a test statistic. There are a whole bunch of possible choices, but after reading for a few minutes you settle on this one, which is Welch's t-test. You plug in your numbers and you get a value of 3.67 for your data. You don't really know what this t thing is, but Wikipedia tells you that if the skeptic is right, then t is distributed according to this formula. That funny L-looking thing, that's the gamma function, and that v-looking thing is the number of degrees of freedom. You read on Wikipedia about degrees of freedom for a little while and you learn that degrees of freedom is the number of dimensions of the domain of a random vector. You have no idea what that means, but the page about the t-test on Wikipedia says that you can estimate your degrees of freedom with this formula, and so you dutifully plug in your numbers and get that you have 39.1 degrees of freedom. How you figure out the next step I have no idea, but you manage to figure out that what you need to do is take that 39 and use it to look up the critical value for the t-test in a table. So you do that, and you get that the critical value at the 95 percent level for 40 degrees of freedom is 2.021, and that t statistic that you computed five slides ago, that thing, 3.67, is larger than the critical value 2.021. So sweet, you say to yourself, you can now confidently reject the skeptic's argument and say that a difference of 4.4 additional mosquitoes is statistically significant at the p < .05 level.

I'm willing to bet that only a few people in this audience have any idea how that argument really works, even if you are familiar with the statistical recipe that I just ran you through. This thing is the really deep idea at the heart of that argument. If the skeptic refuses to believe your assertion that this is the correct formula, your entire argument falls to the ground in ashes. In the general recipe for statistical inference this thing is called the sampling distribution of the test statistic under the null hypothesis, and the reason that stats 101 was so incredibly painful is that the idea of a sampling distribution is really hard to understand even in the best conditions. And when it's presented in pure mathematical formalism like this, as a mathematical object all slathered up in degrees-of-freedom mumbo-jumbo, it's just hopeless. There might be a handful of people in this audience who could sit down and derive this equation from first principles right now. I am certainly not among those people, and I am a working data scientist. So that was stats 101, the analytical method.
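As a rough illustration of the recipe described above (not part of the talk itself), the sketch below computes Welch's t statistic and the Welch-Satterthwaite degrees of freedom in Python. The mosquito counts are placeholders rather than the study's data, and the commented-out SciPy call is only a cross-check.

```python
import math
from statistics import mean, variance  # variance() is the sample variance (n - 1 denominator)

def welch_t(a, b):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom."""
    na, nb = len(a), len(b)
    va, vb = variance(a), variance(b)
    se2 = va / na + vb / nb                       # squared standard error of the mean difference
    t = (mean(a) - mean(b)) / math.sqrt(se2)
    df = se2 ** 2 / ((va / na) ** 2 / (na - 1) + (vb / nb) ** 2 / (nb - 1))
    return t, df

# Placeholder counts purely for illustration (NOT the study's data):
beer = [27, 20, 21, 26, 27, 31, 24, 21, 20, 19]
water = [21, 22, 15, 12, 21, 16, 19, 15, 22, 24]
t, df = welch_t(beer, water)
print(t, df)

# Cross-check with SciPy, which implements the same test:
# from scipy.stats import ttest_ind
# ttest_ind(beer, water, equal_var=False)
```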
What about this computational method that I promised? Well, remember that the thing we're trying to figure out is whether this 4.4 is a large or a small difference. So we'll just mark that 4.4 on a plot. Here's our original data, color-coded according to whether the subject drank beer or water. If the skeptic is right, these labels have absolutely no meaning, they're completely meaningless, they carry no information. So what I can do is randomly shuffle them around, and then rearrange them, and then tidy them up, and then compute some new means, subtract the means, and get a new difference of 3.3 mosquitoes. We'll add a little dot to our plot at 3.3. And now we can start that whole dance over again. We'll start with the original data, we randomly shuffle, we rearrange, we tidy up, compute some means, subtract the means, and get a difference of 0.1 this time. And now we'll add that 0.1 to our plot. And we can keep on repeating this process over and over and over again. Here's three repetitions. Here's four. Here's five. Here's six. And look ma, no hands.

What is happening here is that we are building up the sampling distribution of the statistic under the skeptic's argument. You are watching a statistical process unfold. Here is 20 repetitions. Here's 30. Here's 50. Here's 50,000 repetitions. So recall the skeptic's argument that there is no difference, that there's no effect, and that the labels were meaningless. This data was generated under that assumption, and it shows the range of possibilities. If the skeptic is right, a difference of 4.4 is fantastically rare: it happened just 14 times in 50,000 trials. And so it is as the advocate said. The skeptic's argument strains credibility and can be safely rejected. And that, of course, was the conclusion of the researchers who gathered this data in the first place.
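The shuffling procedure described above is a random permutation test, and it fits in a few lines. Here is a minimal Python sketch of the same idea (again with placeholder counts rather than the study's data); the fraction it returns plays the role of the one-sided p-value.

```python
import random
from statistics import mean

def permutation_test(beer, water, reps=50_000, seed=0):
    """Random permutation test: shuffle the beer/water labels, recompute the
    difference of means each time, and count how often the shuffled difference
    is at least as large as the observed one."""
    rng = random.Random(seed)
    observed = mean(beer) - mean(water)
    pooled = beer + water
    n_beer = len(beer)
    at_least_as_large = 0
    for _ in range(reps):
        rng.shuffle(pooled)                       # labels carry no information under the skeptic's claim
        diff = mean(pooled[:n_beer]) - mean(pooled[n_beer:])
        if diff >= observed:
            at_least_as_large += 1
    return observed, at_least_as_large / reps     # observed difference, one-sided p-value

# Placeholder counts purely for illustration (NOT the study's data):
beer = [27, 20, 21, 26, 27, 31, 24, 21, 20, 19]
water = [21, 22, 15, 12, 21, 16, 19, 15, 22, 24]
print(permutation_test(beer, water))
```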
To do the statistics that we just did, you needed three essential things: the ability to follow a straightforward logical argument, random number generation, and iteration. You were born with the first of these three things, and the second two are provided by any decent programming language with a good library. With these three things you have everything you need to understand this argument at a very deep, fundamental level. And that is in contrast with this, the details of which you really need years of study, I think, to come to grips with. Now, that simple computational method I just showed you is called a random permutation test, and it's just one of a whole class of methods known as resampling methods. The other resampling method that I would tell you about if I had more time is bootstrapping, which is fantastically useful. But I don't have more time, and so I'll just restate my thesis in a slightly different way. The message that I want to leave you with is this: if you can program a computer, you have superpowers when it comes to learning statistics, because being able to program allows you to tinker with the most fundamental ideas in statistics the way that you might have tinkered with electronics when you were a kid, or with mechanical things, or with music, or with sports. So I want you to go out and attack statistical problems with a feeling of joy, in the spirit of play, and not from a position of fear and self-doubt. That's all I have. Thanks for your time.
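Bootstrapping is only mentioned in passing in the talk. As a hedged sketch of the general idea (not taken from the talk's slides), resampling each group with replacement and recomputing the difference of means many times gives a rough percentile confidence interval for that difference; the function name and parameters below are illustrative assumptions.

```python
import random
from statistics import mean

def bootstrap_ci(beer, water, reps=10_000, seed=0):
    """Percentile bootstrap confidence interval for the difference of group means."""
    rng = random.Random(seed)
    diffs = []
    for _ in range(reps):
        b = rng.choices(beer, k=len(beer))        # resample each group with replacement
        w = rng.choices(water, k=len(water))
        diffs.append(mean(b) - mean(w))
    diffs.sort()
    return diffs[int(0.025 * reps)], diffs[int(0.975 * reps)]

# Usage would mirror the permutation sketch above, e.g. bootstrap_ci(beer, water)
# with the two groups of mosquito counts.
```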