TU-L0022 - Statistical Research Methods D, Lecture, 2.11.2021-6.4.2022
When to use GLM (12:48)
The video explains how to choose the most appropriate type of GLM and
under which circumstances these statistical models are preferable
After watching the GLM videos, you probably have the question of whether you should use these models, and if so, when and why. That is the question I will answer in this video.
The first of two questions about whether a GLM is required or useful is: is a transformation required? What do you think is the nature of the relationship between your independent variables and the dependent variable? Is it linear and additive, so that all the independent variables work separately and their effect on the dependent variable is their sum? Is it exponential and multiplicative, which means that you multiply the effects of the independent variables together to get the effect on the dependent variable? Or is it perhaps an S-curve, where the effect is first very small, then increases, and then becomes very small again because everybody is at 100% already? This is a question about theory and about what kind of relationships you expect. It is not a question of how the dependent variable is distributed, so this is primarily a modeling decision, not a data decision.
My practical recommendation is that you should always start with linear regression analysis and then do diagnostics. Do an added variable plot, do a residual versus fitted plot, and see if there is evidence of non-linearity. If there is, then consider these alternatives. Of course, you may have a strong theoretical reason to believe that an exponential model or an S-curve model is preferable, but doing the regression analysis is still very cheap: it does not cost you much time, and most of the time it will tell you something you did not know before. So starting with the regression analysis is a good idea.
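As a rough sketch of that starting point (not the lecturer's own code), here is how the baseline regression and the two diagnostic plots could be run in Python with statsmodels. It assumes the Prestige data set from R's carData package can be fetched through the Rdatasets service; the column names (income, education, women) follow that data set.

```python
import matplotlib.pyplot as plt
import statsmodels.api as sm
import statsmodels.formula.api as smf

prestige = sm.datasets.get_rdataset("Prestige", "carData").data

# Baseline linear, additive model estimated with OLS
ols_fit = smf.ols("income ~ education + women", data=prestige).fit()
print(ols_fit.summary())

# Residuals versus fitted values: curvature here is evidence of non-linearity
plt.scatter(ols_fit.fittedvalues, ols_fit.resid)
plt.axhline(0, linestyle="--")
plt.xlabel("Fitted values")
plt.ylabel("Residuals")
plt.show()

# Added variable (partial regression) plots for each independent variable
sm.graphics.plot_partregress_grid(ols_fit)
plt.show()
```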
Another consideration is that some textbooks and some articles say that you should transform the dependent variable to reduce heteroscedasticity, so that your standard errors will be correct. This decision has nothing to do with standard errors whatsoever. The decision of which transformation you apply is driven by theory, by what you think is the best explanation for the data; the consideration of standard errors is secondary to that. And you can always use robust standard errors to deal with any heteroscedasticity issue anyway. So this is driven by theory, not by standard error consistency or by the kind of data that you have.
The second question is this: once you have decided that you want to transform your dependent variable somehow (you can also transform your independent variables, but this question is mostly about the dependent variable), should you transform the dependent variable and then apply regression on the transformed values, or should you apply a generalized linear model, where you transform the fitted value instead of the dependent variable?
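As a sketch of the two alternatives, taking the natural log as the canonical transformation for an exponential relationship, the competing specifications can be written as

\[
\mathrm{E}[\ln y \mid x] = \beta_0 + \beta_1 x \qquad \text{(transform the dependent variable, then OLS)}
\]
\[
\ln \mathrm{E}[y \mid x] = \beta_0 + \beta_1 x \qquad \text{(GLM with a log link)}
\]

Because the expectation of a log is generally not the log of an expectation (Jensen's inequality), these two equations define different coefficients, and that difference is what the rest of this video is about.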
There are simple points for and against both decisions. The simple points for transforming the dependent variable are that it is simple to do, so there are no computational issues; regression analysis will always give you results; and you can use OLS diagnostics. Regression analysis diagnostics are very useful, they are more developed than diagnostics for GLMs, and you can find more resources on how to do them. Also, regression analysis is well understood. For example, the nature of multiplicative effects, as I explained in previous videos, is something that many researchers do not fully understand. So regression analysis is more commonly understood by readers and reviewers than the GLM.
There are also simple points against transforming. Transforming a variable with a few discrete values is problematic. If you have a count variable with the values one, two, and three, then trying to do some kind of inverse Poisson transformation on it would not make much sense, because it still has three discrete values. If you have ones and zeros, a binary dependent variable, transforming it will just give you another binary variable; it does not do anything. And then there is the issue that if you want to, for example, explain company size with an exponential function, some companies have zero revenues. How do you deal with those zeros? You cannot take the log of zero, so you need awkward workarounds where you add +1 to the dependent variable before taking the log. These are simple points against transforming.
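As an illustration of the zeros problem, here is a small sketch in Python with statsmodels and made-up firm data: the log(y + 1) workaround is needed for the transformed dependent variable, while a GLM with a log link accepts zero outcomes directly.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

# Made-up data: two firms with zero revenue, so log(revenue) is undefined
firms = pd.DataFrame({
    "revenue": [0.0, 0.0, 12.0, 150.0, 3200.0, 45000.0],
    "employees": [1, 2, 5, 20, 120, 800],
})

# Transformed dependent variable: requires the awkward +1 shift before the log
log_shift_fit = smf.ols("np.log(revenue + 1) ~ np.log(employees)", data=firms).fit()

# GLM with a log link models E[revenue | employees] = exp(Xb) and handles zeros as-is
glm_fit = smf.glm("revenue ~ np.log(employees)", data=firms,
                  family=sm.families.Poisson()).fit(cov_type="HC1")

print(log_shift_fit.params)
print(glm_fit.params)
```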
There is a more rigorous way of looking at this issue. Let's look at the GLM model and the transformed model. Typically, we are interested in explaining the mean, or the expected value, of the data given the independent variables, in which case we look at the nonlinear regression model. If we apply the transformed dependent variable model and treat its coefficients as if they were estimates for the original model of interest, then they are actually inconsistent. So the transformed equation is an inconsistent estimator for the original equation. Statistically speaking, you should never transform the dependent variable; you should always use the GLM, because the transformed-variable model is an inconsistent estimator of the GLM. That may not be enough to convince everyone, so let's take a look at an example.
I have the Prestige data set here, which I have used before. We have the distribution of income for professions that are more than half men and the distribution of income for professions that are more than half women. So we have men-dominated and women-dominated professions, and we are interested in knowing whether men-dominated professions make more money than women-dominated professions. This is something that we would typically want to answer as "men make 20% more" or "men make 50% more" rather than "men make four thousand Canadian dollars more", because percentages are how we typically think about this kind of comparison. So how do we do it? We look at percentages, and we run a regression analysis on the transformed dependent variable, which gives us some estimates. Then we can calculate predictions using those estimates. In the plot, we can see that the predicted means are less than the actual sample means. So the model predicts the sample means slightly incorrectly, predicting too low. The model also predicts the difference between the men- and women-dominated professions to be smaller than it actually is. So both the actual means and the difference between the means are predicted incorrectly. The difference is not great, but it is noticeable. Based on these considerations, the GLM approach should always be preferred over transforming the dependent variable. Of course, transforming the dependent variable, using OLS, and doing the diagnostics is a good starting point. But in the end, doing the GLM is more rigorous, and that is what the end product of your research should be.
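To give a sense of how this comparison can be made concrete, here is a sketch in Python using the Prestige data from R's carData package (the column women is the percentage of women in the profession); it reproduces the logic of the example, not the exact numbers from the slides.

```python
import numpy as np
import statsmodels.api as sm
import statsmodels.formula.api as smf

prestige = sm.datasets.get_rdataset("Prestige", "carData").data
prestige["man_dominated"] = (prestige["women"] < 50).astype(int)

# Transformed dependent variable: regress log(income) on the group dummy and
# back-transform the fitted values with exp()
log_fit = smf.ols("np.log(income) ~ man_dominated", data=prestige).fit()
naive_pred = np.exp(log_fit.fittedvalues)

# The back-transformed predictions are group geometric means, which fall
# below the actual (arithmetic) sample means, as described above
print(prestige.groupby("man_dominated")["income"].mean())
print(naive_pred.groupby(prestige["man_dominated"]).mean())
```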
There is a nice blog post about this from William Gould, the founder of Stata. He makes a strong case, with some nice references, that this is actually how you should do it: do not log-transform the dependent variable; use the Poisson GLM, or its quasi-maximum likelihood estimate, with robust standard errors instead. That gives you better estimates than a regression on the transformed dependent variable.
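A minimal sketch of that recommendation, again with statsmodels and the Prestige data (same assumptions as above): fit a Poisson GLM with a log link to income itself and request robust, sandwich-type standard errors, which gives the Poisson quasi-maximum likelihood estimator.

```python
import statsmodels.api as sm
import statsmodels.formula.api as smf

prestige = sm.datasets.get_rdataset("Prestige", "carData").data
prestige["man_dominated"] = (prestige["women"] < 50).astype(int)

# Income is not a count, but Poisson QML only requires the mean to be
# exp(Xb); the robust covariance takes care of the standard errors.
poisson_qml = smf.glm("income ~ man_dominated", data=prestige,
                      family=sm.families.Poisson()).fit(cov_type="HC1")

# exp(coefficient) - 1 is the percentage difference in expected income
# between men- and women-dominated professions.
print(poisson_qml.summary())
```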
So what are the practical recommendations? Once you have decided that you want to use one of these transformations, what is the modeling technique that you should apply? For a linear, additive model: least squares, always. There is no reason to use anything else. OLS is best; weighted least squares could be slightly more efficient in some scenarios, but it is not worth the trouble.
If you have an exponential model with multiplicative relationships, then if you know the distribution of the dependent variable given the fitted values, use maximum likelihood estimation of the generalized linear model with the correct distribution. So if you know that it is Poisson, or that it is negative binomial, or that it is something else, then apply the normal GLM. If you do not know what the distribution is, or you are uncertain about the distribution of the dependent variable, or you know that it does not follow any of the distributions that your statistical software supports, then apply Poisson quasi-maximum likelihood estimation with robust standard errors. That is a similarly safe choice as OLS is for the linear model.
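As a sketch of those two branches with made-up count data (the numbers and the fixed dispersion value below are illustrative only):

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

# Simulated count outcome with a multiplicative (log-link) mean structure
rng = np.random.default_rng(1)
x = rng.normal(size=200)
counts = pd.DataFrame({"x": x, "y": rng.poisson(np.exp(0.5 + 0.8 * x))})

# Known distribution: maximum likelihood GLM with that family, for example
# negative binomial (the dispersion parameter alpha is fixed here for brevity)
nb_ml = smf.glm("y ~ x", data=counts,
                family=sm.families.NegativeBinomial(alpha=0.5)).fit()

# Unknown distribution: Poisson quasi-maximum likelihood with robust errors
poisson_qml = smf.glm("y ~ x", data=counts,
                      family=sm.families.Poisson()).fit(cov_type="HC1")

print(nb_ml.params)
print(poisson_qml.params)
```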
For S-curve models, the same logic applies. If you know the distribution, for example if you have fractional response data and you know that the dependent variable is beta distributed given the predicted values, then use beta regression analysis, that is, a maximum likelihood GLM with the correct distribution. Otherwise, if you do not know the distribution of the dependent variable, use Bernoulli quasi-maximum likelihood with robust standard errors. In practice, if you have fractional response data, I would basically always recommend just using the normal logistic regression analysis for it, because it works. You would think that it does not, but it actually does, as long as this approach has been implemented in your statistical software.
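A sketch of the Bernoulli quasi-maximum likelihood ("fractional logit") approach, again with made-up data: the outcome is a share between 0 and 1, not a binary variable, but the logit GLM with robust standard errors still estimates the S-curve for the mean.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

# Simulated fractional response bounded between 0 and 1
rng = np.random.default_rng(2)
x = rng.normal(size=200)
share = 1 / (1 + np.exp(-(0.3 + 1.2 * x))) + rng.normal(scale=0.05, size=200)
fractions = pd.DataFrame({"x": x, "share": np.clip(share, 0, 1)})

# Logistic ("Bernoulli") quasi-maximum likelihood with robust standard errors;
# statsmodels warns about the non-binary outcome, but the QML estimates are valid
frac_logit = smf.glm("share ~ x", data=fractions,
                     family=sm.families.Binomial()).fit(cov_type="HC1")
print(frac_logit.summary())
```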
Now, this has nothing to do with transformations of the independent variables; this is about the dependent variable. Transforming independent variables is okay, and you can consider a log transformation, or sometimes even an exponential transformation, of the independent variables to get a model that you think explains your data well based on your theory, and then you estimate it with either OLS or GLM. The choice discussed here is about what you do with the dependent variable.
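For completeness, a tiny sketch of a transformed independent variable under both estimators (same Prestige data assumption as above):

```python
import numpy as np
import statsmodels.api as sm
import statsmodels.formula.api as smf

prestige = sm.datasets.get_rdataset("Prestige", "carData").data

# Log-transforming an independent variable inside the formula is fine
# whether the model is then estimated with OLS or with a GLM
ols_logx = smf.ols("income ~ np.log(education)", data=prestige).fit()
glm_logx = smf.glm("income ~ np.log(education)", data=prestige,
                   family=sm.families.Poisson()).fit(cov_type="HC1")
print(ols_logx.params)
print(glm_logx.params)
```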
The next question is whether this choice, the GLM that transforms the fitted value versus transforming the dependent variable, is a big deal. Let's do an empirical example. We have two models here, using the Prestige data, with years of education as the predictor, and we have the predictions from these two models, the transformed dependent variable and the GLM, for the effect on income. When we look at the regression coefficients, we can see that there is a 7.5% difference: one is 0.119 and the other is 0.128. That 7.5% difference is substantial. In many methodological papers we consider a 5% bias to be something that can be ignored, but a 7.5% difference is something that we should care about. Also, when we look at the predictions, we can see that the transformed dependent variable model systematically under-predicts how much the professions that require high education actually make, and the blue line, the GLM, is a much better fit to the data. So empirically it is not a huge difference, but it is something that I think we should be concerned about, because the fix is rather simple.
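A sketch of how this kind of comparison can be reproduced (the exact coefficients and the plot depend on the specification used in the video, so the numbers below will not match the slides exactly):

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.api as sm
import statsmodels.formula.api as smf

prestige = sm.datasets.get_rdataset("Prestige", "carData").data

# Red line: regression on the log-transformed dependent variable
log_ols = smf.ols("np.log(income) ~ education", data=prestige).fit()
# Blue line: Poisson QML GLM with a log link on income itself
poisson_glm = smf.glm("income ~ education", data=prestige,
                      family=sm.families.Poisson()).fit(cov_type="HC1")
print(log_ols.params["education"], poisson_glm.params["education"])

# Compare the predicted income curves against the observed data
grid = pd.DataFrame({"education": np.linspace(prestige["education"].min(),
                                              prestige["education"].max(), 100)})
plt.scatter(prestige["education"], prestige["income"], alpha=0.5)
plt.plot(grid["education"], np.exp(log_ols.predict(grid)), label="transformed DV")
plt.plot(grid["education"], poisson_glm.predict(grid), label="GLM")
plt.xlabel("Education (years)")
plt.ylabel("Income")
plt.legend()
plt.show()
```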
Now, the final question: if and when I get papers to review where the authors use a transformation on the dependent variable, do I recommend that those papers be rejected because they use the transformation of the dependent variable instead of the GLM approach or quasi-maximum likelihood estimation of Poisson? No. I am not saying that the red line is worthless. I am saying that the blue line is better, and I would probably recommend that the authors take a look at some of the articles cited here that explain why the blue line is better than the red line, and then tell them to make an informed decision.