TU-L0022 - Statistical Research Methods D, Lecture, 2.11.2021-6.4.2022
This course space end date is set to 06.04.2022
Perfect collinearity of independent variables (3:31)
In this video, the concept of perfect collinearity is explained with the help of an example.
In this video, I will demonstrate the no perfect collinearity assumption of regression analysis. The no perfect collinearity assumption means that each independent variable must bring unique information to the model. So it must not be possible to exactly infer the values of one independent variable from the other independent variables.
Let's see what that means. Here is the data for the prestige data set and we have a categorical variable type. If we create the dummy variables for the categorical variable type, we know that if type blue collar and type professional are both 0, then the observation must be type white collar. On the other hand, if either type blue collar or type professional is 1, then type white collar must be 0. So type white collar doesn't give us any new information if we know type blue collar and type professional.
In practice, when we do a regression analysis where we specify the type as a categorical variable, we will get two estimates. We will get the estimates of type professional and type white collar, and one of the three categories is left out as a reference category. The reason for leaving it out is that including it in the model leads to a violation of the no perfect collinearity assumption.
So let's try and see what happens when we force all three dummies to be included in the model. I have to create the dummies manually and then specify them in the model manually. So I specify the model like that. We have type blue collar, type white collar, type professional here, and we try to estimate the model. We get a warning that '1 not defined because of singularities'. That warning tells us that we are in violation of the no perfect collinearity assumption, and in the results, we will see that one of these variables was dropped from the analysis. So the effect of type professional couldn't be estimated. And this is very common behaviour: basically, you cannot estimate a regression model that includes all three of these dummies, because with perfect collinearity there is no unique solution, it's just mathematically impossible. So the software has two alternatives: either it refuses to even try, or it drops one of the variables to make the remaining ones estimable.
Which variable is dropped probably depends on the order of entry of the variables, but it's not documented behavior of R.
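The dummy variable example can be sketched in a few lines of code. This is only an illustration under stated assumptions: it uses Python with NumPy rather than the R session shown in the video, and the `types` values are made up for the sketch, not taken from the prestige data set. It checks that the third dummy is fully determined by the other two, and that a design matrix with an intercept and all three dummies has fewer independent columns than it has columns.

```python
import numpy as np

# Hypothetical occupation types, standing in for the categorical variable `type`.
types = np.array(["bc", "prof", "wc", "bc", "prof", "wc", "bc", "prof"])

# One 0/1 dummy column per category, created manually.
bc = (types == "bc").astype(float)
prof = (types == "prof").astype(float)
wc = (types == "wc").astype(float)
intercept = np.ones(len(types))

# The white collar dummy carries no new information:
# it can be computed exactly from the other two dummies.
wc_is_redundant = np.array_equal(wc, 1 - bc - prof)

# Intercept plus all three dummies: 4 columns, but not 4 independent ones.
X_all = np.column_stack([intercept, bc, wc, prof])
rank_all = np.linalg.matrix_rank(X_all)

# Leaving one dummy out as the reference category restores full column rank.
X_ref = np.column_stack([intercept, wc, prof])
rank_ref = np.linalg.matrix_rank(X_ref)

print("columns:", X_all.shape[1], "rank:", rank_all)
print("columns:", X_ref.shape[1], "rank:", rank_ref)
```

When the rank is lower than the number of columns, the regression coefficients are not uniquely determined, which is the situation that makes the software drop a variable.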
So when you see that you don't get estimates for some of the variables, that is a pretty good indication that you are in violation of the no perfect collinearity assumption, and you need to do something about your model. It could also be an indication that you have made a data coding or data preparation mistake: for example, you accidentally copied one variable to your data set twice under different names. So that can happen as well. When you encounter these 'no estimates', shown as NAs or periods depending on the statistical software, it indicates a problem with your model specification or your data.
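The data preparation mistake mentioned above, the same variable appearing twice under different names, can be checked for before estimating the model. The following is a minimal sketch in Python with NumPy, not part of the video. The variable names (`education`, `income`, `income_copy`), the simulated values, and the helper `duplicated_columns` are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50

# Hypothetical data set; `income_copy` is an accidental copy of `income`.
data = {
    "education": rng.normal(size=n),
    "income": rng.normal(size=n),
}
data["income_copy"] = data["income"].copy()


def duplicated_columns(columns):
    """Return pairs of column names whose values are exactly identical."""
    names = list(columns)
    pairs = []
    for i, first in enumerate(names):
        for second in names[i + 1:]:
            if np.array_equal(columns[first], columns[second]):
                pairs.append((first, second))
    return pairs


dupes = duplicated_columns(data)

# The duplicate also shows up as a rank-deficient design matrix.
X = np.column_stack([np.ones(n)] + [data[name] for name in data])
rank = np.linalg.matrix_rank(X)

print("duplicated pairs:", dupes)
print("columns:", X.shape[1], "rank:", rank)
```

A check for exact duplicates only catches this one kind of mistake. The rank comparison is more general, because it also flags a variable that is an exact linear combination of several others.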