TU-L0022 - Statistical Research Methods D, Lecture, 2.11.2021-6.4.2022
Logistic regression (7:55)
What is logistic regression? When and how should you make use of it?
Transcript
Logistic regression analysis is a commonly used tool for binary dependent variables.
A binary variable is a variable that takes the values 1 and 0, and it is very commonly used for yes-or-no outcomes, whether something happens or not.
Whether a company decides to expand internationally or to stay in its home market, whether a person is sick or not, and that kind of data.
To illustrate the logistic regression analysis technique, we need some example data, and this example data set is about girls from Warsaw.
The girls range from about 10 to about 18 years of age, and the dependent variable, called menarche, indicates whether a girl has had her first period or not.
We can see here that girls at the age of 10 normally have not yet had their first period, whereas at 18 pretty much everyone has had it.
And we want to explain this relationship between age and menarche using regression analysis.
There are a couple of problems when we apply normal regression analysis to this kind of data set.
The first problem is that the regression line here goes over 1.
So the regression line gives the expected value of the dependent variable given age.
And in this case, because the dependent variable consists of 0s and 1s, the expected value is the expected probability of having had menarche.
When we draw the line, we have a problem here because the predicted probability for girls who are 18 exceeds 1, and probabilities are bounded between 0 and 1.
We also have negative predicted probabilities at the low end.
This also causes a problem for regression analysis, because when we have small fitted values, all residuals are positive, so the error term can't be independent of the fitted value.
So with regression analysis we are violating at least the no-endogeneity assumption, and the predictions don't make any sense.
So using a linear model for this kind of data is problematic for these two reasons. Using this kind of linear model would be acceptable if most girls were in the middle of the age range, because then the linear approximation would be okay: within the range of the data it wouldn't really predict any negative values or values above one. But if we have negative predictions or predictions that exceed one within the range of the data, then we have problems.
This model is called the linear probability model, and it can be used, but there are typically better alternatives.
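To make this concrete, here is a minimal sketch of a linear probability model in Python with statsmodels. The data are simulated and only loosely resemble the Warsaw example; the ages, the "true" probability curve, and all variable names are illustrative, not the actual data from the lecture.

```python
# A minimal sketch of a linear probability model (LPM) with statsmodels,
# on simulated data (everything here is illustrative).
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n = 4000
age = rng.uniform(10, 18, n)                   # ages roughly 10 to 18
p_true = 1 / (1 + np.exp(-1.8 * (age - 13)))   # an S-shaped "true" probability
menarche = rng.binomial(1, p_true)             # binary dependent variable (0/1)

lpm = sm.OLS(menarche, sm.add_constant(age)).fit()   # OLS on a 0/1 outcome = LPM
fitted = lpm.fittedvalues
print(lpm.params)
print("fitted values below 0:", int((fitted < 0).sum()),
      "| above 1:", int((fitted > 1).sum()))   # the boundary problem in action
```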
To start discovering better alternatives, we need to think about what the relationship looks like, and for that we can do a nonparametric analysis, for example by taking a rolling average of the data.
So the idea of a rolling average is that we have here about 4,000 girls. We take the first 500, calculate the mean for those 500, and mark a small dot here.
The average for these girls is zero because none of them has had menarche yet. Then we shift the window to the right a bit and check the next 500 girls, going from the second girl to the 501st, calculate the average, and mark it here.
Then we go from the third girl to the 502nd and calculate the average for that sub-sample.
As we continue, we reach a point where the mean value is about 50%, and finally we calculate the mean for all possible windows.
We get this kind of a nonparametric curve. It's nonparametric because we can't express this curve as a simple function.
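A short sketch of the rolling-average procedure just described, assuming the simulated `age` and `menarche` arrays from the earlier sketch; the window size of 500 follows the lecture's description.

```python
# Sort by age, slide a 500-observation window, and average the 0/1 outcome.
# (age and menarche are the simulated arrays from the first sketch.)
import pandas as pd

df = pd.DataFrame({"age": age, "menarche": menarche}).sort_values("age")
df["rolling_share"] = df["menarche"].rolling(window=500).mean()
print(df[["age", "rolling_share"]].dropna().head())
# Plotting rolling_share against age reproduces the S-shaped nonparametric curve.
```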
We can see that this is an s-shaped curve.
So first, when girls get a little bit older, some girls start to have menarche, but not many. Once you hit about 13 or 14, the rate of having had menarche increases rapidly, until the increase starts to slow down at about 15, when pretty much everyone has had menarche except for a couple of exceptions. Then it flattens out at one.
This curve is called a logistic curve.
So here is the logistic curve, and the idea of logistic regression analysis is that instead of fitting a line, we fit this logistic curve. With the logit curve, the interpretation of the result stays the same: the curve gives us the expected probability of a girl having had menarche given her age. But as we saw on the previous slide, this curve is a much better fit for the data.
So the relationship is not linear; rather, it follows an S shape, and the logit curve is one such S-shaped curve that we could use, and it is very commonly used.
So we get the probability of having had menarche given the age from the model.
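A sketch of fitting the logistic curve instead of a line, again on the simulated data from the first sketch; statsmodels' `Logit` is one possible implementation (a GLM with a binomial family would give the same fit).

```python
# Fit the logistic curve and get predicted probabilities.
import numpy as np
import statsmodels.api as sm

logit_fit = sm.Logit(menarche, sm.add_constant(age)).fit()
print(logit_fit.params)

# Expected probability of having had menarche at a few illustrative ages
new_ages = np.array([10.0, 13.0, 15.0, 18.0])
print(logit_fit.predict(sm.add_constant(new_ages)))
```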
The model can be expressed mathematically, because all models are just equations, and the mathematical expression for this logistic regression model is as follows.
First you have the linear regression model.
So that's the linear probability model, because we have a binary dependent variable, and the logistic model extends the normal regression model by taking a function of this fitted value.
So we calculate the linear prediction using the observed data, and then we apply a function to it, which gives us the logit curve.
The inverse of this function is called the link function, and that is the logit function; what we apply here is its inverse. Whether we call it an inverse function or a function doesn't matter much.
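Written out, the model described above looks roughly as follows (generic notation; the symbols on the lecture slides may differ):

```latex
\begin{align*}
  \eta_i &= \beta_0 + \beta_1\,\mathrm{age}_i
      && \text{(linear predictor, as in the linear probability model)} \\
  \Pr(\text{menarche}_i = 1 \mid \mathrm{age}_i)
      &= \frac{e^{\eta_i}}{1 + e^{\eta_i}}
      && \text{(inverse logit, giving the S-shaped curve)} \\
  \ln\frac{p_i}{1 - p_i} &= \eta_i
      && \text{(the logit link function)}
\end{align*}
```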
The important thing for you to understand is that instead of using the predictions directly, we apply a function to the predictions that transforms them from a line to a curve. Okay, so how do we estimate the model? We can apply OLS estimation. So we apply OLS estimation and then we do diagnostics.
So we get the residuals here, there is a residual, so we can calculate it. Then we can plot residuals versus fitted values, which is one of the standard diagnostic plots, and then we can check the normality of the residuals. We have two violations of regression assumptions. First of all, the residuals are not normally distributed, but that's not really a big deal; it is only relevant in very small samples. Then we have a heteroscedasticity problem, because the variation of the residuals here is a lot higher than the variation there, since the variance is the square of the residual. So we have a heteroscedasticity problem, and we are in violation of the MLR 5 and MLR 6 assumptions.
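A sketch of those diagnostics on the simulated linear probability model `lpm` fitted earlier; the two plots correspond to the residual-versus-fitted and normality checks mentioned above.

```python
# Standard OLS diagnostics on the simulated LPM from the first sketch.
import matplotlib.pyplot as plt
import statsmodels.api as sm

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
ax1.scatter(lpm.fittedvalues, lpm.resid, s=5)   # residuals vs. fitted values
ax1.axhline(0, color="grey")
ax1.set_xlabel("Fitted values")
ax1.set_ylabel("Residuals")
sm.qqplot(lpm.resid, line="45", ax=ax2)         # normality check for residuals
plt.show()
# With a 0/1 outcome the residuals fall on two parallel bands, which makes the
# non-normality and the heteroscedasticity easy to see.
```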
Whether that's a big deal or not, we could use robust standard errors, but there are also some computational difficulties when we try to apply a least squares approach to this kind of problem.
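If one nevertheless stays with the linear probability model, robust standard errors can be obtained directly; a minimal sketch on the same simulated model (HC1 is just one common choice of robust covariance estimator).

```python
# Heteroscedasticity-robust (HC1) standard errors for the simulated LPM.
import statsmodels.api as sm

lpm_robust = sm.OLS(menarche, sm.add_constant(age)).fit(cov_type="HC1")
print(lpm_robust.bse)   # robust standard errors for the intercept and age
```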
And because of those computational difficulties, and because OLS is not ideal here anyway due to the violation of these assumptions, we estimate this model using a different approach called maximum likelihood estimation.
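To show what maximum likelihood estimation means here, a hand-rolled sketch that minimizes the negative log-likelihood of the logistic model on the simulated data; its estimates should be close to the `Logit` fit above.

```python
# Maximum likelihood estimation by hand: choose the coefficients that maximize
# the log-likelihood of the observed 0/1 data.
# (age and menarche are the simulated arrays from the first sketch.)
import numpy as np
from scipy.optimize import minimize

X = np.column_stack([np.ones_like(age), age])

def neg_log_likelihood(beta):
    eta = X @ beta                      # linear predictor
    p = 1.0 / (1.0 + np.exp(-eta))      # inverse logit
    eps = 1e-12                         # guard against log(0)
    return -np.sum(menarche * np.log(p + eps)
                   + (1 - menarche) * np.log(1 - p + eps))

result = minimize(neg_log_likelihood, x0=np.zeros(2), method="BFGS")
print(result.x)   # should be close to the Logit estimates above
```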