When to use GLM (12:48)

The video explains how to choose the most appropriate type of GLM and under which circumstances these statistical models are preferable.

After watching the GLM videos, you may be asking whether you should use these models, and if so, when and why. That's the question that I will answer in this video.

The first of the two questions about whether a GLM is required or useful is: "Is a transformation required?" So what do you think is the nature of the relationship between your independent variables and the dependent variable? Is it linear and additive, so that all the independent variables work separately and the effect on the dependent variable is their sum? Or is it exponential and multiplicative, which means that you multiply the effects of the independent variables together to get the effect on the dependent variable? Or is it perhaps an S-curve, where the effect is first very small, then increases, and then is very small again because everybody is at 100% already? This is a question about theory and what kind of relationships you expect. It's not a question of how the dependent variable is distributed. So this is primarily a modeling decision, not a data decision.
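The three shapes described above can be written out as a sketch. The notation is an assumption made for illustration (two independent variables, with the logistic function as one common choice of S-curve) and is not taken from the video's slides:

```latex
\begin{align*}
\text{Linear, additive:} \quad
  & E[y \mid x] = \beta_0 + \beta_1 x_1 + \beta_2 x_2 \\
\text{Exponential, multiplicative:} \quad
  & E[y \mid x] = \exp(\beta_0 + \beta_1 x_1 + \beta_2 x_2)
    = e^{\beta_0} \, e^{\beta_1 x_1} \, e^{\beta_2 x_2} \\
\text{S-curve (logistic):} \quad
  & E[y \mid x] = \frac{1}{1 + \exp\bigl(-(\beta_0 + \beta_1 x_1 + \beta_2 x_2)\bigr)}
\end{align*}
```

In the exponential form the effects of the independent variables multiply, which is why a one-unit change in an independent variable is read as a percentage change in the dependent variable.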

My practical recommendation is that you should always start with linear regression analysis and then do diagnostics. Do an added variable plot, do a residual versus fitted plot, and see if there is evidence of non-linearity. If there is, then you consider these alternatives. Of course you may have a strong theoretical reason to believe that an exponential model or an S-curve model is preferable, but doing the regression analysis is still very cheap: it doesn't cost you much time, and most of the time it will tell you something that you didn't know before. So starting with the regression analysis is a good idea.
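What such a diagnostic looks like when the true relationship is exponential can be sketched in a few lines. This is a minimal illustration using the Python standard library only; the data are synthetic and hypothetical, and a residual-versus-fitted table stands in for the plot:

```python
import math

# Synthetic, hypothetical data: the true relationship is exponential.
x = list(range(10))
y = [math.exp(0.3 * xi) for xi in x]

# Ordinary least squares for a single regressor, computed by hand.
n = len(x)
mean_x = sum(x) / n
mean_y = sum(y) / n
sxx = sum((xi - mean_x) ** 2 for xi in x)
sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
slope = sxy / sxx
intercept = mean_y - slope * mean_x

fitted = [intercept + slope * xi for xi in x]
residuals = [yi - fi for yi, fi in zip(y, fitted)]

# Residual-versus-fitted table; a plot of these pairs shows the same shape.
for fi, ri in zip(fitted, residuals):
    print(f"fitted = {fi:7.3f}   residual = {ri:7.3f}")
```

The residuals are positive at both ends of the fitted range and negative in the middle. That U-shaped pattern is the kind of evidence of non-linearity that would lead you to consider the exponential alternative.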

Then the third consideration is that some textbooks and some articles say that you should transform the dependent variable to reduce heteroscedasticity, so that your standard errors will be correct. This decision has nothing to do with standard errors whatsoever. The decision of which transformation you apply is driven by theory, by what you think is the best explanation for the data, and the consideration of standard errors is secondary to that. And you can always use robust standard errors to deal with any heteroscedasticity issue anyway. So this is driven by theory, not by standard error consistency or by the kind of data that you have.

The next question comes up once you have decided that you want to transform your dependent variable somehow. You can also transform your independent variables, but this discussion is mostly focused on the dependent variable. Should you transform the dependent variable and then apply regression to the transformed values, or should you apply a generalized linear model, where you transform the fitted value instead of the dependent variable?

There are simple points for and against both decisions. The simple points for transforming the dependent variable are that it's simple to do, so there are no computational issues. Regression analysis will always give you results, and you can also use OLS diagnostics. Regression analysis diagnostics are very useful, they are more developed than diagnostics for GLM, and you can find more resources on how to do them. Also, regression analysis is well understood. For example, the nature of multiplicative effects, as I explained in previous videos, is something that many researchers don't fully understand. So regression analysis is more commonly understood by readers and reviewers than GLM.

There are also simple points against transforming. Transforming a variable with only a few discrete values is problematic. If you have a count variable with the values one, two and three, then trying to do some kind of inverse Poisson transformation on it wouldn't make much sense, because it still has three discrete values. If you have ones and zeroes, a binary dependent variable, transforming it will give you another binary variable. It doesn't do anything. And then you have the issue of zeros. Suppose, for example, that you want to explain company size with an exponential function. Some companies have zero revenues, so how do you deal with those zeros? You cannot take a log of zero, so you need awkward workarounds where you add 1 to the dependent variable before you take the log. So these are the simple points against transforming.

There's a more rigorous way of looking at this issue. Let's look at the GLM and the transformed model. Typically, we're interested in explaining the mean of the data, or the expected value of the data given the independent variables, in which case we look at the nonlinear regression model. If we apply the transformed dependent variable model and treat its coefficients as if they were estimates for the original model of interest, then they are actually inconsistent. So the transformed equation is an inconsistent estimator for the original equation. Statistically speaking, you should never transform the dependent variable; you should always use the GLM, because the regression on the transformed variable is an inconsistent estimator of the GLM. That may not be enough to convince everyone, so let's take a look at examples.
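Why the two approaches target different quantities can be sketched for the log case. The symbols below are assumed notation, not taken from the slides. The GLM with a log link models the log of the mean of y, whereas the regression on the transformed variable models the mean of log y:

```latex
\begin{align*}
\text{GLM (log link):} \quad & \log E[y \mid x] = x\beta \\
\text{Transformed dependent variable:} \quad & E[\log y \mid x] = x\gamma \\
\text{Jensen's inequality:} \quad & E[\log y \mid x] \le \log E[y \mid x]
\end{align*}
```

If we write y = exp(xβ)η with E[η | x] = 1, then E[log y | x] = xβ + E[log η | x]. The last term is at most zero, and unless it is constant in x, the coefficients γ of the transformed regression differ from β. Even when it is constant, the back-transformed prediction exp(xγ) falls below the conditional mean of y.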

So I have this dataset here; this is the prestige data set that I've used before. We have the distribution of income for professions that are more than half men and the distribution of income for professions that are more than half women. So we have men-dominated and women-dominated professions, and we are interested in knowing whether men-dominated professions make more money than women-dominated professions. This is something that we would typically want to answer with "men make 20% more or 50% more" instead of saying that "men make four thousand Canadian dollars more", because percentages are how we typically think in this kind of comparison.

So how do we do it? Because we are looking at percentages, we do a transformed dependent variable regression analysis. We get some estimates, and then we can calculate predictions using these estimates. In the plot, we can see that the predicted lines are below the actual sample means, so the model predicts the sample means too low. Also, the model predicts the difference between the men- and women-dominated professions to be smaller than it actually is. So both the actual means and the difference between the means are predicted incorrectly. The difference is not great, but it's noticeable.

So based on these considerations, the GLM approach should always be preferred over transforming the dependent variable. Of course, transforming the dependent variable, using OLS and doing the diagnostics is a good starting point. But in the end, doing the GLM is more rigorous, and that's what the end product of your research should be.
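The pattern in this two-group comparison can be reproduced without fitting software, because both models have closed-form predictions when the only independent variable is a group dummy. The sketch below uses made-up incomes, not the prestige data, and the group labels are only illustrative:

```python
import math
from statistics import mean

# Hypothetical incomes for two groups of professions (NOT the prestige data).
income = {
    "men-dominated": [4000, 6000, 8000, 12000, 25000],
    "women-dominated": [3000, 4000, 5000, 6000, 9000],
}

# With only a group dummy, OLS on log(income) fits the mean of the logs in
# each group, so back-transforming the prediction gives the geometric mean.
log_ols_prediction = {
    group: math.exp(mean(math.log(v) for v in values))
    for group, values in income.items()
}

# With only a group dummy, a log-link Poisson GLM (or Poisson QML) reproduces
# the arithmetic sample mean of each group exactly.
glm_prediction = {group: mean(values) for group, values in income.items()}

for group in income:
    print(group, round(log_ols_prediction[group]), round(glm_prediction[group]))

# Ratio of men-dominated to women-dominated predictions under each model.
ratio_log_ols = log_ols_prediction["men-dominated"] / log_ols_prediction["women-dominated"]
ratio_glm = glm_prediction["men-dominated"] / glm_prediction["women-dominated"]
print(f"ratio from log-OLS: {ratio_log_ols:.3f}, ratio from GLM: {ratio_glm:.3f}")
```

The back-transformed log-OLS prediction is a geometric mean, which can never exceed the arithmetic mean, so it sits at or below the sample mean in every group. In these made-up numbers the ratio between the groups is also smaller under log-OLS, but the direction of that second difference depends on the data.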

There is a nice blog post about this from William Gould, who is the founder of Stata. He makes a strong case, with some nice references, that this is actually how you should do it. So don't log transform the dependent variable. Use the Poisson GLM or QML estimate instead, with robust standard errors. That gives you better estimates than the regression on the transformed dependent variable.

So what are the practical recommendations? Once you have decided that you want to use one of these transformations, what is the modeling technique that you should apply? For a linear additive model, use least squares, always. There's no reason to use anything else. OLS is best; weighted least squares could be slightly more efficient in some scenarios, but it's not worth the effort to do that.

If you have an exponential model with multiplicative relationships, and you know the distribution of the dependent variable given the fitted values, then use maximum likelihood estimation of the generalized linear model with the correct distribution. So if you know that it's Poisson, or negative binomial, or something else, then apply the normal GLM. If you don't know what the distribution is, or you're uncertain about the distribution of the dependent variable, or you know that it doesn't follow any of the distributions that your statistical software supports, then apply Poisson quasi maximum likelihood estimation with robust standard errors. It's as safe a choice for the exponential model as OLS is for the linear model.
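What Poisson quasi maximum likelihood with robust standard errors computes can be sketched with the standard library alone. This is an illustration under stated assumptions, not a substitute for a statistical package: the data are hypothetical, there is one independent variable plus an intercept, the estimates come from plain Newton-Raphson steps, and the standard errors use the basic sandwich formula without a small-sample correction:

```python
import math

# Hypothetical data: x could be years of education, y a non-negative outcome.
# Note the zero in y, which a log transformation could not handle.
x = [0, 1, 2, 3, 4, 5, 6, 7]
y = [0, 2, 3, 4, 6, 7, 10, 14]


def inv_2x2(a):
    """Invert a 2x2 matrix given as nested lists."""
    det = a[0][0] * a[1][1] - a[0][1] * a[1][0]
    return [[a[1][1] / det, -a[0][1] / det],
            [-a[1][0] / det, a[0][0] / det]]


def matmul_2x2(a, b):
    """Multiply two 2x2 matrices."""
    return [[sum(a[i][k] * b[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]


def weighted_cross_products(weights):
    """Return sum_i w_i * z_i z_i' for the regressor vector z_i = (1, x_i)."""
    s0 = sum(weights)
    s1 = sum(w * xi for w, xi in zip(weights, x))
    s2 = sum(w * xi * xi for w, xi in zip(weights, x))
    return [[s0, s1], [s1, s2]]


# Start from the intercept-only fit, where exp(b0) is the sample mean of y.
b0, b1 = math.log(sum(y) / len(y)), 0.0
for _ in range(50):
    mu = [math.exp(b0 + b1 * xi) for xi in x]
    # Score of the Poisson log-likelihood with a log link.
    score = [sum(yi - mi for yi, mi in zip(y, mu)),
             sum(xi * (yi - mi) for xi, yi, mi in zip(x, y, mu))]
    # Inverse of the information matrix, sum_i mu_i * z_i z_i'.
    bread = inv_2x2(weighted_cross_products(mu))
    step0 = bread[0][0] * score[0] + bread[0][1] * score[1]
    step1 = bread[1][0] * score[0] + bread[1][1] * score[1]
    b0, b1 = b0 + step0, b1 + step1
    if max(abs(step0), abs(step1)) < 1e-12:
        break

# Robust (sandwich) covariance: bread * meat * bread, evaluated at the estimates.
mu = [math.exp(b0 + b1 * xi) for xi in x]
bread = inv_2x2(weighted_cross_products(mu))
meat = weighted_cross_products([(yi - mi) ** 2 for yi, mi in zip(y, mu)])
robust_cov = matmul_2x2(matmul_2x2(bread, meat), bread)
se_b0, se_b1 = math.sqrt(robust_cov[0][0]), math.sqrt(robust_cov[1][1])

print(f"intercept = {b0:.3f} (robust SE {se_b0:.3f})")
print(f"slope     = {b1:.3f} (robust SE {se_b1:.3f})")
print(f"one more unit of x multiplies the expected y by {math.exp(b1):.3f}")
```

Only the mean function exp(b0 + b1 x) has to be correct for the estimates to be consistent. The "meat" of the sandwich uses the observed squared residuals instead of assuming that the variance equals the mean, which is what makes the standard errors robust to y not being Poisson distributed.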

For the S-curve models it's the same thing. If you know the distribution, for example you are using fractional response data and you know that the dependent variable is beta distributed given the predicted values, then use a beta regression analysis, that is, maximum likelihood GLM with the correct distribution. Otherwise, if you don't know the distribution of the dependent variable, use Bernoulli quasi maximum likelihood with robust standard errors. So if you have fractional response data, I would basically always recommend that you just use the normal logistic regression analysis, because it works. You would think that it doesn't, but it actually does, as long as your statistical software has been programmed to allow this.
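The reason normal logistic regression also works for fractional data can be sketched as follows, in assumed notation that is not taken from the slides. The Bernoulli log-likelihood remains well defined when the dependent variable takes any value between 0 and 1:

```latex
\begin{align*}
p_i &= \frac{1}{1 + \exp(-x_i \beta)} \\
\ell(\beta) &= \sum_{i=1}^{n} \bigl[\, y_i \log p_i + (1 - y_i) \log(1 - p_i) \,\bigr],
  \qquad 0 \le y_i \le 1
\end{align*}
```

Maximizing this function gives consistent estimates of β as long as the mean E[y_i | x_i] = p_i is correctly specified, whatever the actual distribution of y_i. The robust standard errors then account for the fact that the data are not really Bernoulli distributed.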

Now this has nothing to do with the transformation of the independent variables. So this is about the dependent variable. Transforming independent variables is okay, and you can consider the log transformation or sometimes even exponential transformation of the independent variables to get a model that you think explains your data well based on your theory and then you estimate it with either OLS or GLM. This is about what you do with the dependent variable.

The final question is whether this choice between the GLM, which transforms the fitted value, and transforming the dependent variable is a big thing. Let's do an empirical example. We have two models here, and we are using the prestige data. We have years of education here, and the predictions from these two models, transformed dependent variable and GLM, for the effect on income. When we look at the regression coefficients, we can see that there's a 7.5 % difference: this is 0.119 and this is 0.128. A 7.5 % difference is substantial. In many methodological papers we think that a 5 % bias is something that you can ignore, but a 7.5 % difference is something that we should care about. Also, when we look at the predictions, we can see that the transformed dependent variable model systematically underpredicts how much the professions that require high education actually make, and this blue line is a much better fit to the data. So empirically it's not a huge difference, but it's something that I think we should be concerned about, because the fix is rather simple.

Now the final question is what I do if and when I get papers to review where the authors use a transformation of the dependent variable. Do I recommend that those papers be rejected because they don't use the GLM approach, or quasi maximum likelihood estimation of Poisson, instead of the transformation of the dependent variable? No. I would not say that this red line is worthless. I'm saying that the blue line is better, and I would probably recommend that the authors take a look at some of the articles cited here, which explain why the blue line is better than the red line, and then tell them to make an informed decision.

TU-L0022 - Statistical Research Methods D, Lecture, 2.11.2021-6.4.2022