TU-L0022 - Statistical Research Methods D, Lecture, 2.11.2021-6.4.2022
Exponential models for counts (27:55)
This video introduces exponential models for counts, using the Poisson regression analysis technique.
This video introduces exponential models for counts. Why the video is titled exponential models, instead of just count data, will become clear soon. What are counts? Counts are typically counts of events: how many times does something happen? If you go fishing, how many fish do you catch? If you are running a company, how many patents does the company file per year? Counts are discrete, whole numbers, and they are strictly non-negative. There is some confusion in the literature about how you model count variables. This article in Organizational Research Methods is one such
example. It's very commonly believed that if you have a variable that is
a count, then you must use some other model than normal regression
analysis, such as the Poisson regression analysis or negative binomial
regression analysis. And this article explains that applying normal regression analysis would be inappropriate for data where the dependent variable is a count, and that if you use normal regression analysis for a count, the results can be inefficient, inconsistent, and biased. That statement is simply not true in general. There are cases where normal regression analysis should not be used for counts, but the general statement that it is always wrong to use normal regression analysis for counts is simply incorrect. How this statement is
justified is by giving two references to econometrics books. The problem
is that these are big books, and there are no page numbers. We can't
really check whether these sources support the claim without reading the full book, which you cannot assume your reader will do. Whenever you see statements that cite books as evidence, you should really ask for the page number: where exactly in that book does it say that regression analysis will be biased, inconsistent, and inefficient if your dependent variable is a count? To understand why counts could or could not be a problem for regression analysis, let's review the regression analysis assumptions.
This is from Wooldridge's book. And regression analysis assumes four
different things for unbiasedness and consistency. We have a linear
model, we have random sampling, no perfect collinearity, and no
endogeneity. If these are true, regression analysis is consistent and unbiased. There is nothing here about the variable not being a count. There is in fact nothing about the distribution of the dependent variable at all. These assumptions only concern the expected value of the dependent variable, the mean given the observed independent variables. We start getting interested in the distribution of the dependent variable when we add the efficiency assumption: when you have homoscedastic errors, so that the variance of the error term does not change with the explanatory variables, then regression analysis is also efficient. But again, there is no assumption that the dependent variable must not be a count. Using regression analysis for counts is completely fine. There is no problem with that.
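For concreteness, these assumptions can be sketched roughly in the spirit of Wooldridge's notation (this is a paraphrase, not a quote from the book):

    y = \beta_0 + \beta_1 x_1 + \dots + \beta_k x_k + u            (linear model)
    E(u \mid x_1, \dots, x_k) = 0                                   (no endogeneity)
    \operatorname{Var}(u \mid x_1, \dots, x_k) = \sigma^2           (homoscedasticity, needed only for efficiency)

together with random sampling and no perfect collinearity. None of these says anything about the dependent variable being continuous, normal, or not a count.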
To demonstrate, let's look at an empirical example. We have some dice
here. We have 30 sets of die throws, and for each set we have the number of dice that were thrown and the number of sixes that we got. The number of dice thrown is the independent variable and the number of sixes is the dependent variable in a simple regression analysis. We draw a regression line, with the number of die throws as the explanatory variable and the number of sixes on the vertical axis. And it looks pretty good to me. The regression line seems to go through the data. There is in fact heteroscedasticity: the variance at the high end is greater than the variance at the low end. But other than that, regression analysis is fine. Just use robust standard errors, and this is going to be the best way to model these data. What if we use Poisson
regression analysis, which is commonly recommended for counts? We use the Poisson model here. The coefficient is 0.02, and for normal regression analysis the coefficient is 0.17. So 0.17 is about one in six, which is what we know we get for each additional die throw: the expected number of sixes increases by one sixth, because that is the probability of getting a six from a fair die. The 0.02 should be interpreted as a percentage increase: relative to the current level, the expected number of sixes increases by two percent for each additional throw. It doesn't really make any sense to think about die throws that way. And if we plot the Poisson regression line here, it is a curve, because Poisson is an exponential model.
We can see that this exponential model doesn't really explain the data at all: with one throw, for example, it predicts that we would get four sixes, which is impossible, and it implies that the number of sixes grows exponentially. It can't, because at some point you hit the limit of how many times you throw. Just the fact that our dependent variable is a count doesn't mean that we can't use regression analysis and that we should use Poisson regression analysis or some variant of that technique.
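As a rough check of this point, here is a minimal simulation sketch of the die-throw example in Python; the data are simulated rather than the actual throws from the video, and it assumes numpy and statsmodels are available.

    # Simulated version of the die-throw demonstration: 30 sets of throws,
    # outcome is the number of sixes. Compare OLS (with robust standard
    # errors) and Poisson regression coefficients.
    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(1)
    throws = rng.integers(1, 31, size=30)        # dice thrown in each set
    sixes = rng.binomial(throws, 1 / 6)          # sixes obtained in each set

    X = sm.add_constant(throws.astype(float))
    ols = sm.OLS(sixes, X).fit(cov_type="HC1")   # linear model, robust SEs
    poisson = sm.GLM(sixes, X, family=sm.families.Poisson()).fit()

    print(ols.params[1])       # should be roughly 1/6: each extra throw adds about 0.17 sixes
    print(poisson.params[1])   # log scale: each extra throw multiplies the expected count

The OLS slope is directly interpretable as extra sixes per extra throw, while the Poisson coefficient is on the log scale and reads as a relative change.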
The important thing about Poisson regression analysis is that it is an
exponential model. We're modeling the expected value of Y as an
exponential function. And this is the important part: when you have an exponential function, then least squares is no longer an ideal technique. If you think that your count depends linearly and additively on your independent variables, then using normal regression analysis is not problematic at all; in fact, it is an ideal technique for that kind of analysis. In Poisson regression analysis, we are using an exponential function. And that is the reason why this video is not called regression analysis for counts but exponential models for counts.
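Written out, the model behind Poisson regression and the other exponential count models discussed below is

    E(y \mid x) = \exp(\beta_0 + \beta_1 x_1 + \dots + \beta_k x_k),

so a one-unit increase in x_j multiplies the expected count by \exp(\beta_j), which for a small coefficient is approximately a 100 \cdot \beta_j percent increase. That is why the 0.02 in the dice example reads as a two percent increase per additional throw.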
What is the Poisson distribution? It is the distribution of the count of
independent events that occur at a constant rate. If you have a rate of,
let's say, 0.001 deaths per capita in a country, how many people die
each year? Something like that. And what does this Poisson distribution
look like? It is a discrete distribution, so we have discrete values. When the expected value is small, say 1, we typically get values like 0, 1, 2, or 3, and getting 20 is almost impossible. If we have a larger expected value, say 9, then the values we can plausibly get range from about 3 to about 20. What we can see here is that the dispersion increases with the expected value, and that is a feature of the Poisson distribution. When the expected value is 1, the variance is 1; when the expected value is 9, the variance is 9. The variance and the mean of the Poisson distribution are the same.
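For reference, the Poisson distribution with rate \lambda has the probability mass function

    P(Y = k) = \frac{e^{-\lambda} \lambda^k}{k!}, \qquad k = 0, 1, 2, \dots

with E(Y) = \operatorname{Var}(Y) = \lambda, which is exactly the equal mean and variance property described here.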
Now, coming back to our example of die throws. This distribution is an ideal
distribution for modeling die throws. But we don't need Poisson
regression analysis because that also includes the exponential function,
which we don't need. Using the least-squares estimation technique is good enough regardless of the distribution of the dependent variable. Using a linear function with a Poisson distribution would be unnecessary. Sometimes, if we are interested in the actual predictions from the distribution and how they are distributed, we could use it. But normally the Poisson distribution is only required when we do nonlinear models. When we go to larger expected values, say from 2 to 4, 8, and so on up to 512, powers of 2, we can see that the distribution approaches the normal distribution. With large expected values, the Poisson distribution approximates a normal distribution. Whichever you use, normal or Poisson, in many cases it doesn't make a difference, as long as the standard deviation of the normal distribution is a free parameter; they are roughly the same. The distribution makes the most difference when your expected value is small: those distributions are distinctly non-normal, while the ones with large expected values are much less so. You
apply the Poisson regression model when you think that the exponential
model is the right model for your data: you expect the effects to be relative to the current level and to combine multiplicatively. You interpret the results the same way as you would interpret results when your dependent variable is log-transformed. The quantity that you explain is the expected number of events. One thing that is very common in studies applying these techniques: say we study how many people die in each country, and we look at European countries. The European countries are quite different in size from one another. Finland has about five or six million people and Germany has over 80 million. We must take that into account somehow, because we can't really compare the number of deaths in Finland and the number of deaths in Germany unless we somehow standardize the data. Quite often we want to understand the rate at which something happens instead of the count, and to do that we use exposure and offsets. For
example, the number of deaths due to cancer per population, or the
number of citations per article in a journal. The population in the first case and the articles in the second are what we call the exposure. This is the total number of units at risk, the units at which the event could occur. One thing that we could try, if we don't think it through, is just to divide the number of deaths by the population. But that is highly problematic for reasons explained in this article: using the rate itself as the dependent variable is a bad idea. Poisson regression analysis and variants of that technique are very useful here because there is a nice trick that we can apply. When we want to model the rate instead of modeling the actual count of deaths or count of citations, we want to estimate this kind of model: we model the expected rate of events multiplied by the exposure. The rate of events multiplied by the size of our unit gives the actual count of events. We can apply a little bit of math and move the exposure inside the exponential function by taking a logarithm and adding it to the linear predictor.
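In symbols, the trick is

    E(y \mid x) = \text{exposure} \times \exp(x\beta) = \exp(x\beta + \ln(\text{exposure})),

so the logarithm of the exposure enters the linear predictor as an extra term whose coefficient is fixed at one.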
Taking a logarithm of a variable and including it in the regression model with its regression coefficient constrained to be one is called an offset. We are basically adding to the fitted value a constant that is calculated from each observation.
Using an offset is something that your statistical software will do for
you. You specify one variable as the offset, and the statistical software takes the logarithm of that variable and adds it to the regression function, but instead of estimating a regression coefficient it constrains the effect to be one. That allows you to interpret the effects as rates instead of as total counts, which is very useful. I have used it myself in one article that I am working on.
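For illustration, here is a minimal sketch of such a rate model in Python with statsmodels; the data and variable names are made up, and statsmodels takes the logarithm of the exposure internally and adds it to the linear predictor with a coefficient of one.

    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(0)
    population = rng.integers(1_000_000, 80_000_000, size=40)   # hypothetical country sizes
    x = rng.normal(size=40)                                     # hypothetical covariate
    deaths = rng.poisson(np.exp(-9.0 + 0.3 * x) * population)   # counts generated from a rate

    X = sm.add_constant(x)
    fit = sm.GLM(deaths, X, family=sm.families.Poisson(), exposure=population).fit()
    print(fit.params)   # coefficients now describe the death rate, not the raw count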
Then we have another variant of the Poisson regression model. The Poisson
regression model, through the Poisson distribution, assumes that the variance of the distribution of the dependent variable is the same as its expected value for a given observation: Poisson makes the variance assumption that the variance equals the mean. We can relax that assumption by saying that the variance equals alpha times the mean, and that gives us negative binomial regression analysis. If alpha is greater than 1, we are saying that our data are overdispersed, and that is when negative binomial regression analysis could be used. If alpha is less than 1, so the variance of the dependent variable is less than the mean, the data are underdispersed.
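In the formulation described here, the variance assumption is

    \operatorname{Var}(y \mid x) = \alpha \cdot E(y \mid x),

with \alpha = 1 giving the Poisson case, \alpha > 1 overdispersion, and \alpha < 1 underdispersion. (Different software packages parameterize the negative binomial variance in slightly different ways, so it is worth checking the manual of the package you use.)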
So here is an example. This is the Poisson distribution with added dispersion. The expectation is 1, 2, and 3, I think, or something like that, and alpha is 2, 2, 2, and 3. We can see that the expectation stays the same, but the variance increases. When the overdispersion is 3, the variance is 3 times the mean: the mean is about 3, and the variance is a lot greater.
Negative binomial regression analysis is commonly used for these
scenarios. But the choice between negative binomial and Poisson analysis
is not as straightforward as looking at the amount of dispersion. Which of these techniques should you use? The common way of choosing between them is to fit both and then check which one fits the data better using a likelihood ratio test. But there is more to that decision than just comparing which one fits the data better. Whether you use Poisson or negative binomial depends on a couple of things, and you must understand the consequences of that decision.
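Here is a minimal sketch of that conventional comparison, fitting both models to the same simulated, overdispersed data and comparing log-likelihoods; the data-generating step is only for illustration.

    import numpy as np
    import statsmodels.api as sm
    from scipy import stats

    rng = np.random.default_rng(2)
    x = rng.normal(size=500)
    mu = np.exp(0.5 + 0.3 * x)
    y = rng.negative_binomial(2, 2 / (2 + mu))       # overdispersed counts with mean mu

    X = sm.add_constant(x)
    pois = sm.Poisson(y, X).fit(disp=False)
    nb = sm.NegativeBinomial(y, X).fit(disp=False)

    lr = 2 * (nb.llf - pois.llf)                     # one extra parameter (alpha)
    print(lr, stats.chi2.sf(lr, df=1))               # large LR, small p: evidence of overdispersion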
Typically, when you choose an analysis technique over another, you have a
specific reason to do so. If we know that the distribution of the dependent variable is Poisson, the reason to use Poisson regression analysis over negative binomial regression analysis is that it is more efficient than negative binomial, which is consistent but inefficient in this scenario. When there is overdispersion, it goes the other way: Poisson is consistent but inefficient, and negative binomial is consistent and efficient. The standard errors can also be inconsistent for Poisson, depending on which of the available standard error formulas you apply, because there are multiple; you must consult your statistical software's user manual to know which one is applied. Most likely, at least in Stata, you are using the formula that is consistent even under overdispersion. Then we have underdispersion: Poisson regression is consistent but inefficient, and its standard errors may be inconsistent, while negative binomial is inconsistent, so its estimates will be incorrect even in large samples, and that is bad. Okay,
so this covers the three scenarios: the dependent variable can be distributed as a Poisson random variable, it can be overdispersed, or you can have a count that doesn't look like a Poisson distribution at all. In that last case, Poisson regression analysis is consistent, its standard errors are inconsistent, and negative binomial regression is inconsistent. What do we make of this? In some scenarios,
the negative binomial is more efficient than Poisson. In others, it's
less efficient than Poisson. But generally, we want our estimates to be
consistent. We may accept a bit of inefficiency, but trading that away for an efficient estimator that could be inconsistent is not a trade-off worth making. You want to have something robust, and if your sample size is large, efficiency differences don't make much difference. Using Poisson regression analysis is a safe choice if you don't know what you're doing. If you have a specific reason to believe that your dependent variable is distributed as a negative binomial, conditional on the fitted values, then you can use negative binomial.
But using Poisson is a safer option. This is not something that is
current practice, but that's what the methodological literature
suggests. We
have also some extensions to these models. Zero-inflated models are
one. The idea of zero-inflated models is that sometimes you have what we call structural zeros in the sample or in the population. Stata's user manual gives the example of people going fishing in a national park. The number of fish that they catch is not distributed as Poisson, because some people choose not to fish. People get zeros if they choose not to fish, and they get zeros if they choose to fish but don't catch any. The number of fish that you catch is probably a count of independent events and probably distributed very close to Poisson, depending on the weather and season, and maybe your fishing gear and skills; given the time and given the person, it is most likely very close to Poisson, except for those people who decide not to fish, who will always get zeros. This is called a zero-inflation scenario. We handle the zero inflation by estimating two models. We estimate an S-curve model, typically a logistic regression, for the structural zeros; this models whether a person decides to fish or not. And then we have an exponential count model, such as the Poisson model, for the number of fish. We
could have a linear regression model as well if we think that the
linear model is better for the data than the exponential model. We
estimate two models at the same time and these two models give us the
likelihood that we maximize. We must report both models and interpret
both models when we report results, because it can be interesting what determines the structural zeros and whether that is very different from what determines the zeros that arise from the actual count process, or the non-zero values.
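For illustration, here is a minimal sketch of a zero-inflated Poisson fit in Python with statsmodels; the fishing variables are made up, and the code only shows the mechanics of the two-part model.

    import numpy as np
    import statsmodels.api as sm
    from statsmodels.discrete.count_model import ZeroInflatedPoisson

    rng = np.random.default_rng(3)
    hours = rng.uniform(1, 8, size=400)                  # hours spent at the park
    goes_fishing = rng.random(400) < 0.7                 # False = structural zero (chose not to fish)
    fish = np.where(goes_fishing, rng.poisson(np.exp(-1.0 + 0.3 * hours)), 0)

    X = sm.add_constant(hours)
    zip_fit = ZeroInflatedPoisson(fish, X, exog_infl=X, inflation="logit").fit(disp=False)
    print(zip_fit.summary())   # reports both the inflation (logit) part and the count part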
Then we have another, a bit less commonly used but still sometimes used, variant of these models called the hurdle model, which
is like the zero-inflation model. But in this case, instead of looking
at the people who don't fish at all, we look at the difference between people who get none and people who get one or more. The example here, the
typical example, is going to see a doctor. How many times do you go and
see a doctor? The first time you go to a doctor depends on different
things than whether you go there the second, third and fourth time.
Whether you go to see a doctor the second time probably depends a lot on
what the doctor tells you. And whether you decide to go and see a
doctor in the first place can't depend on what the doctor tells you,
because you haven't seen the doctor. We model this kind of process using
the hurdle model. The idea is that we have two models. We have again
the S curve model for zero and non-zero, and then we have a truncated
version of the exponential count model for the actual count. We model
first, does the person go to a doctor? And then we model, given that the
person went to a doctor at least once, how many times does the person
go to the doctor? Again, you will get two sets of results for the two models, and you usually interpret and report both. Let's
look at an example. This is the same example from Blevins's paper.
They don't interpret the zero-inflation model, but they present Poisson
regression, negative binomial regression, zero-inflated Poisson, and
zero-inflated negative binomial. We're going to be looking at the
likelihoods and the degrees of freedom. This is not actually degrees of
freedom, but it's the number of parameters instead, which is incorrectly
reported as degrees of freedom. The degrees of freedom difference between the negative binomial model and the basic Poisson model is one. The reason the difference is one is that these estimate the same model: the regression coefficients are the same, but the negative binomial regression model additionally estimates the amount of overdispersion relative to the Poisson distribution that we fit to the data. When
we go from the basic Poisson model to the zero-inflated Poisson model,
we can see that the number of parameters is twice that of the Poisson model. The reason is that we have two models: one model, the S-curve model, explaining the structural zeros, and then we have
the normal Poisson regression model. The negative binomial results and
Poisson results are typically very close to one another because Poisson
is consistent under the negative binomial assumptions. If the sample
size is large, then they should be very similar. The zero-inflated results, on the other hand, are typically quite different from the plain Poisson and negative binomial results. And here we can see again the one degree of freedom difference. How do we choose between negative binomial and Poisson? The convention is that you do a likelihood ratio test: you compare the log-likelihood of the Poisson model against the log-likelihood of the negative binomial. We can see a difference of about 400 with one degree of freedom difference, which is highly statistically significant. The
negative binomial here is a much better fit for the data than the basic
Poisson model. The
reason why negative binomial almost always fits better than basic Poisson is that the Poisson model assumes that the independent variables in the model explain the mean perfectly: the only variation around the mean is the variation that comes from the Poisson distribution itself, which plays the role of the error term. In practice our models are
imperfect. There are always some variables that we could have observed
but did not, and that would explain the dependent variable. If those variables would have explained the dependent variable to a substantial degree, then that additional variation, which could have been explained but wasn't, goes into the error term. It's the same thing as in regression analysis.
If your R-squared is 20% then 80% of the variation is unexplained. If
you add more variables, R-squared increases to 50%, and then the error
variance decreases. The same thing happens here. The negative binomial
model, if it fits better than the basic Poisson model, indicates that our model does not explain the data completely. That is not a problem as long as the omitted causes are uncorrelated with the explanatory variables and do not lead to an endogeneity problem. That's something to be aware of. Finally,
quite often you see this kind of diagram on what to do. And this is
again a convention. How do we choose between the negative binomial and
Poisson model? There is no problem in using the Poisson model for overdispersed data if you adjust the standard errors accordingly. The current convention is that you fit both models and then do a likelihood ratio test between the Poisson model and the negative binomial model. If the negative binomial model fits significantly better, that is taken as evidence of overdispersion, and then you go for the negative binomial model. The article then suggests that you look at whether there are excess zeros. If there are excess zeros, you do one more test, and based on that you either choose a negative binomial model or a zero-inflated negative binomial model.