TU-L0022 - Statistical Research Methods D, Lecture, 2.11.2021-6.4.2022
Interpretation by plotting marginal predictions (6:29)
This video explains the challenges in interpreting results of non-linear models like logistic regression models. The video explains how to use marginal prediction plots to interpret results of such models.
One way to interpret the results of a logistic regression analysis is to draw marginal prediction plots. This is a very useful technique because it is generic: instead of having to memorize how every possible nonlinear regression model is interpreted, you just need one tool. Another advantage is that this tool gives you the effects on the original scale of the dependent variable. In the case of logistic regression analysis, you directly see the effect of each independent variable on the predicted probability.
To do plotting we need some data. And, I will use the Hosmer and Lemeshow data.
So this is from a widely cited regression analysis book. The data are about babies born to different kinds of mothers. The dependent variable is whether the baby was born with low birth weight, defined as less than 2.5 kilos. And we will be looking at the weight of the mother at the last menstrual period, the race of the mother, and whether the mother smoked during pregnancy, as the independent variables of interest.
We are first going to fit a linear probability model and a logistic regression model to these data. And I'm using Stata here. We have the linear probability model here, and we have the logistic regression model here. The dependent variable is low birth weight. The linear probability model is easy to interpret: the coefficients are differences in the predicted probability of having a low birth weight baby. It is 0.22 higher for black women than for white women, who are the reference category. It is about 15 percentage points higher for smokers than for non-smokers. So we can directly interpret the effects.
Here, with the odds ratios, we can say that the odds for a black mother are 3.5 times greater than for a white mother. But that doesn't really tell us anything about the increase in probability, because an odds ratio is a proportional effect; it is a relative effect. You have to know what the original odds are that are being multiplied by 3.5.
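To make the point about baseline odds concrete, here is a minimal sketch of how the same odds ratio of 3.5 translates into different changes in probability. The baseline probabilities are hypothetical values chosen only for illustration, and the function name `apply_odds_ratio` is not from the lecture.

```python
def apply_odds_ratio(baseline_prob: float, odds_ratio: float) -> float:
    """Return the probability after multiplying the baseline odds by odds_ratio."""
    baseline_odds = baseline_prob / (1.0 - baseline_prob)
    new_odds = baseline_odds * odds_ratio
    return new_odds / (1.0 + new_odds)


# Hypothetical baseline probabilities (assumptions, not values from the data).
for p0 in (0.05, 0.25, 0.50):
    p1 = apply_odds_ratio(p0, 3.5)
    print(f"baseline {p0:.2f} -> {p1:.3f} (increase {p1 - p0:.3f})")
```

The increase in probability differs from one baseline to the next even though the odds ratio is the same, which is why the odds ratio alone does not tell us the effect on the probability scale.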
Plotting is very useful for understanding what these effects look like. When we compare the effects of race and smoking across the two models, they are not really comparable. It is difficult to say whether a 3.5-fold increase in odds is a larger effect than a 0.22 increase in probability, because they are expressed on different scales. And we are usually interested in the original scale of the variable. Also, we can't say directly from the logistic model what the expected difference between black smokers and white non-smokers is. In the linear probability model, the whites are the base category, so being a black mother adds 0.22 and being a smoker adds 0.16. So it is about a 0.40 difference in probability between black smokers and white non-smokers. That is easy to see from this model. In the logistic model, black mothers have 3.5 times greater odds, and smokers have 2.5 times greater odds. So we multiply these together and it is about eight or nine, something like that, times higher odds for black smokers than for white non-smokers. But that's difficult to interpret.
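The arithmetic for the combined comparison can be sketched as follows, using the rounded figures quoted in the lecture. The variable names are illustrative only. In the linear probability model the effects add up; in the logistic model the odds ratios multiply, which is the same as adding on the log-odds scale.

```python
import math

# Rounded figures quoted in the lecture.
lpm_black, lpm_smoke = 0.22, 0.16   # differences in probability
or_black, or_smoke = 3.5, 2.5       # odds ratios

# Linear probability model: effects are additive on the probability scale.
lpm_difference = lpm_black + lpm_smoke

# Logistic model: odds ratios multiply, i.e. log-odds add.
combined_odds_ratio = or_black * or_smoke
log_odds_shift = math.log(or_black) + math.log(or_smoke)

print(f"LPM difference in probability: {lpm_difference:.2f}")
print(f"Combined odds ratio: {combined_odds_ratio:.2f}")
print(f"Same thing via log-odds: {math.exp(log_odds_shift):.2f}")
```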
So how we can do that is, we can apply marginal prediction plots. Stata's margins command or R's effects package will do that for you quite easily. This is from Stata, and these are the linear predictions. We can see from the linear model that the effect of the mother's weight is the same for all kinds of mothers. We have three races here, and the effect of weight at the last menstrual period is the same for all of them. So the mothers only differ with respect to the base level, that is, the intercept, because we estimated the effect of race.
For the logistic regression model, we can see the same kind of base difference between the races, but the shapes of the curves are different. This curve flattens more here, and these are a lot steeper curves. When we have a mother who doesn't weigh much (the weights are in pounds), then for all races the probability of having a low birth weight baby is large. And we can see that for all races the probability gets smaller as weight increases. But the predicted probabilities also converge here. So if you are a very big mother, then you're going to have a very big child. Which one of these fits the data better is partly an empirical question. One way to understand which of these plots works better is to plot the data over them and see which of the two sets of lines explains the data better.
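The difference between parallel lines and converging curves can be sketched numerically. The coefficients below are hypothetical, chosen only to mimic the shapes described here; they are not the estimates from the Hosmer and Lemeshow data, and the function names are illustrative.

```python
import math

# Hypothetical coefficients (assumptions, NOT the fitted estimates).
LPM = {"intercept": 0.50, "weight": -0.002, "black": 0.22}
LOGIT = {"intercept": 0.80, "weight": -0.015, "black": math.log(3.5)}


def lpm_prob(weight_lb: float, black: int) -> float:
    """Linear probability model: the prediction is the linear index itself."""
    return LPM["intercept"] + LPM["weight"] * weight_lb + LPM["black"] * black


def logit_prob(weight_lb: float, black: int) -> float:
    """Logistic model: the linear index is passed through the inverse logit."""
    index = LOGIT["intercept"] + LOGIT["weight"] * weight_lb + LOGIT["black"] * black
    return 1.0 / (1.0 + math.exp(-index))


def race_gap(predict, weight_lb: float) -> float:
    """Difference in predicted probability between black and white mothers."""
    return predict(weight_lb, 1) - predict(weight_lb, 0)


# Marginal predictions over a grid of mother's weights in pounds.
for w in (100, 150, 200, 250):
    print(f"{w} lb  LPM gap {race_gap(lpm_prob, w):.3f}  "
          f"logit gap {race_gap(logit_prob, w):.3f}")
```

With these assumed numbers the gap between the groups stays constant in the linear model, while in the logistic model it shrinks as weight increases, because all the curves are being pushed toward zero.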
We can see here that the linear probability model predicts a negative probability for some heavy white mothers, whereas the logistic model always predicts between 0 and 1. So the logistic model is statistically more appealing. But if we don't have any mothers here, so if all white mothers in the data are quite light, then the fact that we predict implausible values when we go beyond our data is not really a problem.
So which one of these is better, you can justify based on theory, but you can also check empirically which one fits the data better. Logistic regression analysis is typically used by default, because it's the safer choice. But the linear probability model can be used as well, as long as it doesn't produce negative predictions or predictions that exceed 1 for any of the cases in your sample.
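The in-sample check mentioned here can be sketched as a simple filter over the fitted values. The list of fitted values below is hypothetical, and `out_of_range` is an illustrative helper rather than a function from any statistics package.

```python
def out_of_range(predictions):
    """Return (index, value) pairs for predictions outside the interval [0, 1]."""
    return [(i, p) for i, p in enumerate(predictions) if p < 0.0 or p > 1.0]


# Hypothetical fitted values from a linear probability model (assumption).
fitted = [0.31, 0.12, -0.04, 0.58, 1.02, 0.27]
problems = out_of_range(fitted)
print(f"{len(problems)} of {len(fitted)} in-sample predictions fall outside [0, 1]")
```

If the list of problems is empty for the cases in the sample, the linear probability model is not producing impossible probabilities for the data at hand.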