TU-L0022_aalto-CUR-166088-3086795: Categorical independent variables (5:15)




This video explains what categorical variables are and how to deal with them when they are used as independent variables.


Categorical variables are variables whose values don't really have any order. So, for example, country would be a categorical variable with values such as Finland, Sweden and Norway.

Our Prestige dataset has a categorical variable of Occupation Type, and the categories are - 'blue collar', 'white collar' and 'professional' workers.

So, how do we deal with these kinds of variables? When you have a categorical variable as an independent variable, things are relatively straightforward. When you have a categorical variable as a dependent variable, things get a bit more complicated. We will now cover the case of a categorical variable as an independent variable.

So our data looks like that. When we take a summary of the Prestige dataset, for the variable 'type' we see only frequencies. There are no means or standard deviations or such, just the frequencies of the different values. We have 44 blue collar occupations, 31 professional occupations and 23 white collar occupations, and then we have four missing values.
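As a rough illustration of that kind of summary, here is a minimal Python sketch. The values in `occupation_type` are made up for the example and are not the Prestige data; the labels 'bc', 'prof' and 'wc' and the use of `None` for a missing value are assumptions.

```python
from collections import Counter

# Hypothetical toy column, NOT the real Prestige data; None marks a missing value.
occupation_type = ["bc", "prof", "wc", "bc", None, "prof", "bc", "wc", None, "bc"]

# For a categorical variable the sensible summary is a count per category,
# with missing values counted separately.
counts = Counter("missing" if t is None else t for t in occupation_type)

for category, n in sorted(counts.items()):
    print(f"{category}: {n}")
```

Note that there is nothing like a mean to compute here, because the labels are not numbers.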

So how do we deal with that in a regression analysis? We can't put the variable in as an independent variable as it is, because a unit difference doesn't make sense. We can't say that the difference between 'blue collar' and 'professional' is one unit, or that the difference between 'professional' and 'white collar' is one unit, and the difference between 'blue collar' and 'white collar' is two units. It doesn't make sense, because we can't say that there is a magnitude of difference between these values, and we can't say that there's an order.

So how do we deal with that? We use something called dummy variables. We code our data like that. So that is a subset of our data. And we have the variable - 'type' here.

Then each observation in the data set gets a code for one of the dummy variables.

So the dummy variables are 'type blue collar', 'type professional' and 'type white collar'. The dummies indicate that this first occupation is a professional occupation, so 'type professional' gets one and the others get zeros. Then for this blue collar occupation, 'type blue collar' is one and the others are zero. So these are dummy or indicator variables, and they indicate which category each occupation belongs to.
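The coding described above could be sketched in Python roughly like this. The toy values and the helper name `dummy_code` are assumptions made for the example; they are not part of the course material or of any library.

```python
# Hypothetical toy values, not the actual Prestige rows.
occupation_type = ["prof", "bc", "wc", "bc", "prof"]
categories = sorted(set(occupation_type))  # one dummy per distinct category


def dummy_code(value, categories):
    """Return one 0/1 indicator per category for a single observation."""
    return {f"type_{c}": int(value == c) for c in categories}


dummies = [dummy_code(v, categories) for v in occupation_type]

for value, row in zip(occupation_type, dummies):
    print(value, row)
```

Each observation gets a one in exactly one dummy and zeros in the others.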

Then we add the dummies to regression analysis. Stata or R can do that automatically for you, so you don't have to do that coding manually. If you want to use SPSS, you have to manually create the dummies. So SPSS doesn't simplify your life that much in that regard.

Then let's take a look at the regression results. When you add a categorical variable to a regression analysis in R or Stata, the categorical variable shows up here. R will automatically notice that this is a categorical variable and it will produce two dummy variables, 'type professional' and 'type white collar'.

So how do we interpret those results, and where is the 'blue collar' category?

Well, the first thing that we need to understand is that every time you use a categorical variable in a regression analysis through dummy variables, one of the categories is left out. So we are leaving out the 'blue collar' category here. These effects are not the average prestige of 'professional' occupations. Instead they tell how large the average difference is between a professional occupation and a blue collar occupation, and how large the difference is between a white collar occupation and a blue collar occupation.
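To see the idea of a left-out category in numbers, here is a minimal sketch in plain Python. It is not the R output from the video: the data are made up, the model contains only the two dummies and no other predictors, and the helpers `solve` and `ols` are hypothetical names written just for this example.

```python
# Hypothetical toy data (made-up numbers, not the Prestige dataset).
types = ["bc", "bc", "bc", "prof", "prof", "wc", "wc", "wc"]
prestige = [30.0, 34.0, 38.0, 66.0, 70.0, 40.0, 42.0, 47.0]


def solve(a, b):
    """Solve the linear system a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def ols(x_rows, y):
    """Ordinary least squares via the normal equations (X'X) b = X'y."""
    k = len(x_rows[0])
    xtx = [[sum(r[i] * r[j] for r in x_rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(x_rows, y)) for i in range(k)]
    return solve(xtx, xty)


# Design matrix: intercept, 'prof' dummy, 'wc' dummy. 'bc' is the omitted reference.
x = [[1.0, float(t == "prof"), float(t == "wc")] for t in types]
intercept, b_prof, b_wc = ols(x, prestige)

print("intercept (mean of the 'bc' group):", round(intercept, 3))
print("prof coefficient (prof mean minus bc mean):", round(b_prof, 3))
print("wc coefficient (wc mean minus bc mean):", round(b_wc, 3))
```

In this dummies-only setting the intercept is the mean of the left-out group and each coefficient is a difference from that mean. With other predictors in the model, the coefficients are differences after adjusting for those predictors.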

So these regression coefficients refer to differences between the occupation types. One category of the categorical variable is always used as a reference category, so these coefficients can be interpreted only against 'blue collar'. If we want to compare 'professional' and 'white collar' directly, then we can manually include the dummies, or manually indicate which of the categories is left out. That is more advanced.
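Choosing which category is left out could be sketched like this. The sketch assumes a model with only the type dummies, in which case each coefficient equals the group mean minus the reference group mean. The data are made up, and `dummy_columns` and `contrasts` are hypothetical helper names.

```python
from statistics import mean

# Hypothetical toy data (made-up numbers, not the Prestige dataset).
types = ["bc", "bc", "bc", "prof", "prof", "wc", "wc", "wc"]
prestige = [30.0, 34.0, 38.0, 66.0, 70.0, 40.0, 42.0, 47.0]


def dummy_columns(types, reference):
    """Dummies that enter the model: every category except the reference."""
    return [c for c in sorted(set(types)) if c != reference]


def contrasts(types, y, reference):
    """Coefficients of a dummies-only model: group mean minus reference mean."""
    group_mean = {c: mean(v for t, v in zip(types, y) if t == c) for c in set(types)}
    return {c: group_mean[c] - group_mean[reference]
            for c in dummy_columns(types, reference)}


vs_bc = contrasts(types, prestige, reference="bc")  # compares prof and wc to bc
vs_wc = contrasts(types, prestige, reference="wc")  # compares bc and prof to wc

print("reference bc:", vs_bc)
print("reference wc:", vs_wc)
```

With 'wc' as the reference, the 'prof' entry gives the professional versus white collar comparison directly.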

One more thing that we note here for the first time is that R tells us that there are missing observations in the data. Four observations were dropped because they didn't have a value for the type variable. Quite often, when you have some missing data, the default action is just to omit those cases for which a variable doesn't have a value.

There are other, more advanced techniques, but if the number of observations that you drop is small compared to the overall number of observations, then dropping the cases doesn't really matter.
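Dropping the incomplete cases could look roughly like this in Python. The rows are invented for the example, and `None` standing for a missing value is an assumption.

```python
# Hypothetical toy rows (not the Prestige data); None marks a missing 'type'.
rows = [
    {"occupation": "a", "type": "prof", "prestige": 68.0},
    {"occupation": "b", "type": None, "prestige": 35.0},
    {"occupation": "c", "type": "bc", "prestige": 30.0},
    {"occupation": "d", "type": "wc", "prestige": 42.0},
    {"occupation": "e", "type": None, "prestige": 50.0},
]

# Keep only the cases where every variable has a value.
complete = [r for r in rows if all(v is not None for v in r.values())]
dropped = len(rows) - len(complete)

print(f"kept {len(complete)} of {len(rows)} rows, dropped {dropped}")
```

Reporting how many rows were dropped makes it easy to check whether the number is small compared to the whole data.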


TU-L0022 - Statistical Research Methods D, Lecture, 25.10.2022-29.3.2023
