TU-L0022 - Statistical Research Methods D, Lecture, 2.11.2021-6.4.2022
Exponential models for counts (27:55)
This video introduces exponential models for counts, using the Poisson regression analysis technique.
This video introduces exponential models for counts. Why the video is titled exponential models, instead of just count data, will become clear soon. What are counts? Counts are typically counts of events: how many times does something happen? If you go fishing, how many fish do you catch? If you are running a company, how many patents does the company file per year? Counts are discrete, whole numbers, and they are strictly non-negative. There is some confusion in the literature about how you model count variables. This article in Organizational Research Methods is one such
example. It's very commonly believed that if you have a variable that is
a count, then you must use some other model than normal regression
analysis, such as the Poisson regression analysis or negative binomial
regression analysis. And this article explains that applying normal regression analysis would be inappropriate for data where the dependent variable is a count, and that if you use normal regression analysis for a count, the results can be inefficient, inconsistent, and biased. That statement is simply not true in general. There are cases where normal regression analysis should not be used for counts, but the general statement that it is always wrong to use normal regression analysis for counts is simply incorrect. How this statement is
justified is by giving two references to econometrics books. The problem
is that these are big books, and there are no page numbers. We can't
really check whether these sources support the claim without reading the full book, which you cannot assume your reader will do. Whenever you see statements that cite books as evidence, you should really ask for the page number: where exactly in that book does it say that regression analysis will be biased, inconsistent, and inefficient if your dependent variable is a count? To understand why counts could or could not be a problem for regression analysis, let's review the regression analysis assumptions.
This is from Wooldridge's book. And regression analysis assumes four
different things for unbiasedness and consistency. We have a linear
model, we have random sampling, no perfect collinearity, and no
endogeneity. If these are true, regression analysis is consistent and unbiased. There is nothing here about the variable not being a count. There is in fact nothing about the distribution of the dependent variable at all. These assumptions only concern the expected value of the dependent variable, the mean given the observed independent variables. We start getting interested in the distribution of the dependent variable when we add the efficiency assumption: when you have homoscedastic errors, so that the variance of the error term does not change with the explanatory variables, then regression analysis is also efficient. But again, there is no assumption that the dependent variable must not be a count. Using regression analysis for counts is completely fine. There is no problem with that.
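For concreteness, these assumptions can be sketched roughly in the spirit of Wooldridge's notation (this is a paraphrase, not a quote from the book):

    y = \beta_0 + \beta_1 x_1 + \dots + \beta_k x_k + u            (linear model)
    E(u \mid x_1, \dots, x_k) = 0                                   (no endogeneity)
    \operatorname{Var}(u \mid x_1, \dots, x_k) = \sigma^2           (homoscedasticity, needed only for efficiency)

together with random sampling and no perfect collinearity. None of these says anything about the dependent variable being continuous, normal, or not a count.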
To demonstrate, let's look at an empirical example. We have some dice
here. We have 30 sets of die throws, and for each set we have the number of dice that were thrown and the number of sixes that we got. The number of dice thrown is the independent variable and the number of sixes is the dependent variable in a simple regression analysis. We draw a regression line, with the number of die throws as the explanatory variable and the number of sixes on the vertical axis. And it looks pretty good to me. The regression line seems to go through the data. There is in fact heteroscedasticity: the variance at the high end is greater than the variance at the low end. But other than that, regression analysis is fine. Just use robust standard errors, and this is going to be the best way to model these data. What if we use Poisson
regression analysis, which is commonly recommended for counts? We use the Poisson model here. The coefficient is 0.02, and for normal regression analysis the coefficient is 0.17. So 0.17 is about one in six, which is what we know we get for each additional die throw: the expected number of sixes increases by one sixth, because that is the probability of getting a six from a fair die. The 0.02 should be interpreted as a percentage increase: relative to the current level, the expected number of sixes increases by two percent for each additional throw. It doesn't really make any sense to think about die throws that way. And if we plot the Poisson regression line here, it is a curve, because Poisson is an exponential model.
We can see that this exponential model doesn't really explain the data at all: with one throw, for example, it predicts that we would get four sixes, which is impossible, and it implies that the number of sixes grows exponentially. It can't, because at some point you hit the limit of how many times you throw. Just the fact that our dependent variable is a count doesn't mean that we can't use regression analysis and that we should use Poisson regression analysis or some variant of that technique.
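As a rough check of this point, here is a minimal simulation sketch of the die-throw example in Python; the data are simulated rather than the actual throws from the video, and it assumes numpy and statsmodels are available.

    # Simulated version of the die-throw demonstration: 30 sets of throws,
    # outcome is the number of sixes. Compare OLS (with robust standard
    # errors) and Poisson regression coefficients.
    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(1)
    throws = rng.integers(1, 31, size=30)        # dice thrown in each set
    sixes = rng.binomial(throws, 1 / 6)          # sixes obtained in each set

    X = sm.add_constant(throws.astype(float))
    ols = sm.OLS(sixes, X).fit(cov_type="HC1")   # linear model, robust SEs
    poisson = sm.GLM(sixes, X, family=sm.families.Poisson()).fit()

    print(ols.params[1])       # should be roughly 1/6: each extra throw adds about 0.17 sixes
    print(poisson.params[1])   # log scale: each extra throw multiplies the expected count

The OLS slope is directly interpretable as extra sixes per extra throw, while the Poisson coefficient is on the log scale and reads as a relative change.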
The important thing about Poisson regression analysis is that it is an
exponential model. We're modeling the expected value of Y as an
exponential function. And this is the important part: when you have an exponential function, then least squares is no longer an ideal technique. If you think that your count depends linearly and additively on your independent variables, then using normal regression analysis is not problematic at all; in fact, it is an ideal technique for that kind of analysis. In Poisson regression analysis, we are using an exponential function. And that is the reason why this video is not called regression analysis for counts but exponential models for counts.
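Written out, the model behind Poisson regression and the other exponential count models discussed below is

    E(y \mid x) = \exp(\beta_0 + \beta_1 x_1 + \dots + \beta_k x_k),

so a one-unit increase in x_j multiplies the expected count by \exp(\beta_j), which for a small coefficient is approximately a 100 \cdot \beta_j percent increase. That is why the 0.02 in the dice example reads as a two percent increase per additional throw.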
What is the Poisson distribution? It is the distribution of the count of
independent events that occur at a constant rate. If you have a rate of,
let's say, 0.001 deaths per capita in a country, how many people die
each year? Something like that. And what does this Poisson distribution
look like? It is a discrete distribution, so we have discrete values. When the expected value is small, say 1, we typically get values like 0, 1, 2, or 3, and getting 20 is almost impossible. If we have a larger expected value, say 9, then the values we can plausibly get range from about 3 to about 20. What we can see here is that the dispersion increases with the expected value, and that is a feature of the Poisson distribution. When the expected value is 1, the variance is 1; when the expected value is 9, the variance is 9. The variance and the mean of the Poisson distribution are the same.
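For reference, the Poisson distribution with rate \lambda has the probability mass function

    P(Y = k) = \frac{e^{-\lambda} \lambda^k}{k!}, \qquad k = 0, 1, 2, \dots

with E(Y) = \operatorname{Var}(Y) = \lambda, which is exactly the equal mean and variance property described here.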
Now, coming back to our example of die throws. This distribution is an ideal
distribution for modeling die throws. But we don't need Poisson
regression analysis because that also includes the exponential function,
which we don't need. Using the least-squares estimation technique is good enough regardless of the distribution of the dependent variable. Using a linear function with a Poisson distribution would be unnecessary. Sometimes, if we are interested in the actual predictions from the distribution and how they are distributed, we could use it. But normally the Poisson distribution is only required when we do nonlinear models. When we go to larger expected values, say from 2 to 4, 8, and so on up to 512, powers of 2, we can see that the distribution approaches the normal distribution. With large expected values, the Poisson distribution approximates a normal distribution. Whichever you use, normal or Poisson, in many cases it doesn't make a difference, as long as the standard deviation of the normal distribution is a free parameter; they are roughly the same. The distribution makes the most difference when your expected value is small: those distributions are distinctly non-normal, while the ones with large expected values are much less so. You
apply the Poisson regression model when you think that the exponential
model is the right model for your data: you expect the effects to be relative to the current level and to combine multiplicatively. You interpret the results the same way as you would interpret results when your dependent variable is log-transformed. The quantity that you explain is the expected number of events. One thing that is very common in studies applying these techniques: say we study how many people die in each country, and we look at European countries. The European countries are quite different in size from one another. Finland has about five or six million people and Germany has over 80 million. We must take that into account somehow, because we can't really compare the number of deaths in Finland and the number of deaths in Germany unless we somehow standardize the data. Quite often we want to understand the rate at which something happens instead of the count, and to do that we use exposure and offsets. For
example, the number of deaths due to cancer per population, or the
number of citations per article in a journal. The population in the first case and the articles in the second are what we call the exposure. This is the total number of units at risk, the units at which the event could occur. One thing that we could try, if we don't think it through, is just to divide the number of deaths by the population. But that is highly problematic for reasons explained in this article: using the rate itself as the dependent variable is a bad idea. Poisson regression analysis and variants of that technique are very useful here because there is a nice trick that we can apply. When we want to model the rate instead of modeling the actual count of deaths or count of citations, we want to estimate this kind of model: we model the expected rate of events multiplied by the exposure. The rate of events multiplied by the size of our unit gives the actual count of events. We can apply a little bit of math and move the exposure inside the exponential function by taking a logarithm and adding it to the linear predictor.
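In symbols, the trick is

    E(y \mid x) = \text{exposure} \times \exp(x\beta) = \exp(x\beta + \ln(\text{exposure})),

so the logarithm of the exposure enters the linear predictor as an extra term whose coefficient is fixed at one.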
Taking a logarithm of a variable and including it in the regression model with its regression coefficient constrained to be one is called an offset. We are basically adding to the fitted value a constant that is calculated from each observation.
Using an offset is something that your statistical software will do for
you. You specify one variable as the offset, and the statistical software takes the logarithm of that variable and adds it to the regression function, but instead of estimating a regression coefficient it constrains the effect to be one. That allows you to interpret the effects as rates instead of as total counts, which is very useful. I have used it myself in one article that I am working on.
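For illustration, here is a minimal sketch of such a rate model in Python with statsmodels; the data and variable names are made up, and statsmodels takes the logarithm of the exposure internally and adds it to the linear predictor with a coefficient of one.

    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(0)
    population = rng.integers(1_000_000, 80_000_000, size=40)   # hypothetical country sizes
    x = rng.normal(size=40)                                     # hypothetical covariate
    deaths = rng.poisson(np.exp(-9.0 + 0.3 * x) * population)   # counts generated from a rate

    X = sm.add_constant(x)
    fit = sm.GLM(deaths, X, family=sm.families.Poisson(), exposure=population).fit()
    print(fit.params)   # coefficients now describe the death rate, not the raw count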
Then we have another variant of the Poisson regression model. The Poisson
regression model, through the Poisson distribution, assumes that the variance of the distribution of the dependent variable is the same as its expected value for a given observation: Poisson makes the variance assumption that the variance equals the mean. We can relax that assumption by saying that the variance equals alpha times the mean, and that gives us negative binomial regression analysis. If alpha is greater than 1, we are saying that our data are overdispersed, and that is when negative binomial regression analysis could be used. If alpha is less than 1, so the variance of the dependent variable is less than the mean, the data are underdispersed.
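In the formulation described here, the variance assumption is

    \operatorname{Var}(y \mid x) = \alpha \cdot E(y \mid x),

with \alpha = 1 giving the Poisson case, \alpha > 1 overdispersion, and \alpha < 1 underdispersion. (Different software packages parameterize the negative binomial variance in slightly different ways, so it is worth checking the manual of the package you use.)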
So here is an example. This is the Poisson distribution with added dispersion. The expectation is 1, 2, and 3, I think, or something like that, and alpha is 2, 2, 2, and 3. We can see that the expectation stays the same, but the variance increases. When the overdispersion is 3, the variance is 3 times the mean: the mean is about 3, and the variance is a lot greater.
Negative binomial regression analysis is commonly used for these
scenarios. But the choice between negative binomial and Poisson analysis
is not as straightforward as looking at the amount of dispersion. Which of these techniques should you use? The common way of choosing between them is to fit both and then check which one fits the data better using a likelihood ratio test. But there is more to that decision than just comparing which one fits the data better. Whether you use Poisson or negative binomial depends on a couple of things, and you must understand the consequences of that decision.
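Here is a minimal sketch of that conventional comparison, fitting both models to the same simulated, overdispersed data and comparing log-likelihoods; the data-generating step is only for illustration.

    import numpy as np
    import statsmodels.api as sm
    from scipy import stats

    rng = np.random.default_rng(2)
    x = rng.normal(size=500)
    mu = np.exp(0.5 + 0.3 * x)
    y = rng.negative_binomial(2, 2 / (2 + mu))       # overdispersed counts with mean mu

    X = sm.add_constant(x)
    pois = sm.Poisson(y, X).fit(disp=False)
    nb = sm.NegativeBinomial(y, X).fit(disp=False)

    lr = 2 * (nb.llf - pois.llf)                     # one extra parameter (alpha)
    print(lr, stats.chi2.sf(lr, df=1))               # large LR, small p: evidence of overdispersion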
Typically, when you choose an analysis technique over another, you have a
specific reason to do so. If we know that the distribution of the dependent variable is Poisson, the reason to use Poisson regression analysis over negative binomial regression analysis is that it is more efficient than negative binomial, which is consistent but inefficient in this scenario. When there is overdispersion, it goes the other way: Poisson is consistent but inefficient, and negative binomial is consistent and efficient. The standard errors can also be inconsistent for Poisson, depending on which of the available standard error formulas you apply, because there are multiple; you must consult your statistical software's user manual to know which one is applied. Most likely, at least in Stata, you are using the formula that is consistent even under overdispersion. Then we have underdispersion: Poisson regression is consistent but inefficient, and its standard errors may be inconsistent, while negative binomial is inconsistent, so its estimates will be incorrect even in large samples, and that is bad. Okay,
so this covers the three scenarios: the dependent variable can be distributed as a Poisson random variable, it can be overdispersed, or you can have a count that doesn't look like a Poisson distribution at all. In that last case, Poisson regression analysis is consistent, its standard errors are inconsistent, and negative binomial regression is inconsistent. What do we make of this? In some scenarios,
the negative binomial is more efficient than Poisson. In others, it's
less efficient than Poisson. But generally, we want our estimates to be
consistent. We may accept a bit of inefficiency, but trading that away for an efficient estimator that could be inconsistent is not a trade-off worth making. You want to have something robust, and if your sample size is large, efficiency differences don't make much difference. Using Poisson regression analysis is a safe choice if you don't know what you're doing. If you have a specific reason to believe that your dependent variable is distributed as a negative binomial, conditional on the fitted values, then you can use negative binomial.
But using Poisson is a safer option. This is not something that is
current practice, but that's what the methodological literature
suggests. We
have also some extensions to these models. Zero-inflated models are
one. The idea of zero-inflated models is that sometimes you have what we call structural zeros in the sample or in the population. Stata's user manual gives the example of people going fishing in a national park. The number of fish that they catch is not distributed as Poisson, because some people choose not to fish. People get zeros if they choose not to fish, and they get zeros if they choose to fish but don't catch any. The number of fish that you catch is probably a count of independent events and probably distributed very close to Poisson, depending on the weather and season, and maybe your fishing gear and skills; given the time and given the person, it is most likely very close to Poisson, except for those people who decide not to fish, who will always get zeros. This is called a zero-inflation scenario. We handle the zero inflation by estimating two models. We estimate an S-curve model, typically a logistic regression, for the structural zeros; this models whether a person decides to fish or not. And then we have an exponential count model, such as the Poisson model, for the number of fish. We
could have a linear regression model as well if we think that the
linear model is better for the data than the exponential model. We
estimate two models at the same time and these two models give us the
likelihood that we maximize. We must report both models and interpret
both models when we report results, because it can be interesting what determines the structural zeros and whether that is very different from what determines the zeros that arise from the actual count process, or the non-zero values.
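For illustration, here is a minimal sketch of a zero-inflated Poisson fit in Python with statsmodels; the fishing variables are made up, and the code only shows the mechanics of the two-part model.

    import numpy as np
    import statsmodels.api as sm
    from statsmodels.discrete.count_model import ZeroInflatedPoisson

    rng = np.random.default_rng(3)
    hours = rng.uniform(1, 8, size=400)                  # hours spent at the park
    goes_fishing = rng.random(400) < 0.7                 # False = structural zero (chose not to fish)
    fish = np.where(goes_fishing, rng.poisson(np.exp(-1.0 + 0.3 * hours)), 0)

    X = sm.add_constant(hours)
    zip_fit = ZeroInflatedPoisson(fish, X, exog_infl=X, inflation="logit").fit(disp=False)
    print(zip_fit.summary())   # reports both the inflation (logit) part and the count part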
Then we have another, a bit less commonly used but still sometimes used, variant of these models called the hurdle model, which
is like the zero-inflation model. But in this case, instead of looking
at the people who don't fish at all, we look at the difference between people who get none and people who get one or more. The example here, the
typical example, is going to see a doctor. How many times do you go and
see a doctor? The first time you go to a doctor depends on different
things than whether you go there the second, third and fourth time.
Whether you go to see a doctor the second time probably depends a lot on
what the doctor tells you. And whether you decide to go and see a
doctor in the first place can't depend on what the doctor tells you,
because you haven't seen the doctor. We model this kind of process using
the hurdle model. The idea is that we have two models. We have again
the S curve model for zero and non-zero, and then we have a truncated
version of the exponential count model for the actual count. We model
first, does the person go to a doctor? And then we model, given that the
person went to a doctor at least once, how many times does the person
go to the doctor? Again, you will get two sets of results for the two models, and you usually interpret and report both. Let's
look at an example. This is the same example from Blevins's paper.
They don't interpret the zero-inflation model, but they present Poisson
regression, negative binomial regression, zero-inflated Poisson, and
zero-inflated negative binomial. We're going to be looking at the
likelihoods and the degrees of freedom. This is not actually degrees of
freedom, but it's the number of parameters instead, which is incorrectly
reported as degrees of freedom. The degrees of freedom difference between the negative binomial model and the basic Poisson model is one. The reason the difference is one is that these estimate the same model: the regression coefficients are the same, but the negative binomial regression model additionally estimates the amount of overdispersion relative to the Poisson distribution that we fit to the data. When
we go from the basic Poisson model to the zero-inflated Poisson model,
we can see that the number of parameters is twice that of the Poisson model. The reason is that we have two models: one model, the S-curve model, explaining the structural zeros, and then we have
the normal Poisson regression model. The negative binomial results and
Poisson results are typically very close to one another because Poisson
is consistent under the negative binomial assumptions. If the sample
size is large, then they should be very similar. The zero-inflated results, on the other hand, are typically quite different from the plain Poisson and negative binomial results. And here we can see again the one degree of freedom difference. How do we choose between negative binomial and Poisson? The convention is that you do a likelihood ratio test: you compare the log-likelihood of the Poisson model against the log-likelihood of the negative binomial. We can see a difference of about 400 with one degree of freedom difference, which is highly statistically significant. The
negative binomial here is a much better fit for the data than the basic
Poisson model. The
reason why negative binomial almost always fits better than basic Poisson is that the Poisson model assumes that the independent variables in the model explain the mean perfectly: the only variation around the mean is the variation that comes from the Poisson distribution itself, which plays the role of the error term. In practice our models are
imperfect. There are always some variables that we could have observed
but did not, and that would explain the dependent variable. If those variables would have explained the dependent variable to a substantial degree, then that additional variation, which could have been explained but wasn't, goes into the error term. It's the same thing as in regression analysis.
If your R-squared is 20% then 80% of the variation is unexplained. If
you add more variables, R-squared increases to 50%, and then the error
variance decreases. The same thing happens here. The negative binomial
model, if it fits better than the basic Poisson model, indicates that our model does not explain the data completely. That is not a problem as long as the omitted causes are uncorrelated with the explanatory variables and do not lead to an endogeneity problem. That's something to be aware of. Finally,
quite often you see this kind of diagram on what to do. And this is
again a convention. How do we choose between the negative binomial and
Poisson model? There is no problem in using the Poisson model for overdispersed data if you adjust the standard errors accordingly. The current convention is that you fit both models and then do a likelihood ratio test between the Poisson model and the negative binomial model. If the negative binomial model fits significantly better, that is taken as evidence of overdispersion, and then you go for the negative binomial model. The article then suggests that you look at whether there are excess zeros. If there are excess zeros, you do one more test, and based on that you either choose a negative binomial model or a zero-inflated negative binomial model.