TU-L0022 - Statistical Research Methods D, Lecture, 2.11.2021-6.4.2022
Categorical independent variables (5:15)
This video explains in detail what categorical variables are and how to
deal with them when they are used as independent variables.
Categorical variables are variables whose values don't really have any
order. For example, Country would be a categorical variable with the
values Finland, Sweden and Norway. Our Prestige dataset has a
categorical variable, occupation type, and its categories are 'blue
collar', 'white collar' and 'professional' workers.

So, how do we deal with these kinds of variables? When you have a
categorical variable as an independent variable, things are relatively
straightforward. When you have a categorical variable as the dependent
variable, things get a bit more complicated. We will now cover the case
of a categorical variable as an independent variable.

So this is what our data looks like. When we take a summary of the
Prestige dataset, for the variable 'type' we can see that there are
only frequencies.
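As a rough illustration of what such a summary contains, the sketch
below tabulates a categorical variable in Python. The list
`occupation_type` and its labels are made up for illustration and are
not the Prestige data; `None` stands in for a missing value.

```python
from collections import Counter

# Hypothetical toy data, not the Prestige dataset; None marks a missing value.
occupation_type = ["prof", "bc", "wc", "bc", "prof", None, "bc", "wc"]

# A categorical variable is summarized by counting how often each value occurs.
frequencies = Counter(occupation_type)

for category, count in frequencies.items():
    label = "missing" if category is None else category
    print(f"{label}: {count}")
```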
There are no means or standard deviations or such, just the frequencies
of the different values. We have 44 blue collar occupations, 31
professional occupations and 23 white collar occupations, and then we
have four missing values.

So how do we deal with that in a regression analysis? We can't enter
the variable as a single numeric independent variable, because a unit
difference doesn't make sense.
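One way to see the problem is to fit a one-variable least-squares slope
under two equally arbitrary integer codings of the same categories. The
sketch below does that in Python. The prestige scores, the labels `bc`,
`wc` and `prof`, and both codings are hypothetical and chosen only for
illustration.

```python
from statistics import mean

# Hypothetical toy data: (occupation type, prestige score).
data = [("bc", 30.0), ("bc", 40.0), ("wc", 40.0), ("wc", 50.0),
        ("prof", 60.0), ("prof", 80.0)]

def slope(coding):
    """Least-squares slope of prestige on the integer code of the type."""
    x = [coding[t] for t, _ in data]
    y = [p for _, p in data]
    x_bar, y_bar = mean(x), mean(y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    return sxy / sxx

# Two codings that are equally hard to defend, because the categories
# have no natural order or spacing.
coding_a = {"bc": 1, "prof": 2, "wc": 3}
coding_b = {"bc": 1, "wc": 2, "prof": 3}

print(slope(coding_a))
print(slope(coding_b))
```

With these toy numbers the "effect of type" is 5.0 under the first
coding and 17.5 under the second, even though the data are identical.
The estimate reflects the arbitrary numbering rather than anything
about the occupations.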
We can't say that the difference between 'blue collar' and
'professional' is one unit, that the difference between 'professional'
and 'white collar' is one unit, and that the difference between 'blue
collar' and 'white collar' is two units. It doesn't make sense, because
we can't say that there is a magnitude of difference between these
values, and we can't say that there's an order.

So how do we deal with that? We use something called dummy variables.
We code our data like this. Here is a subset of our data, with the
variable 'type'. Each observation in the dataset gets a code of one on
exactly one of the dummy variables and zero on the others.
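The coding itself can be sketched in a few lines of Python. The
function name `make_dummies` and the short list of types are
hypothetical; the labels `bc`, `prof` and `wc` are shorthand assumed
here for blue collar, professional and white collar.

```python
def make_dummies(values, categories):
    """Return one 0/1 indicator column per category."""
    return {
        f"type_{cat}": [1 if v == cat else 0 for v in values]
        for cat in categories
    }

# Hypothetical subset of occupations, not the actual Prestige rows.
occupation_type = ["prof", "bc", "wc", "bc"]

dummies = make_dummies(occupation_type, ["bc", "prof", "wc"])

for name, column in dummies.items():
    print(name, column)
```

Each row has a one in exactly one column, so the three columns together
carry the same information as the original 'type' variable.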
The dummy variables are 'type blue collar', 'type professional' and
'type white collar'. The dummies indicate that this first occupation is
a professional occupation, so 'type professional' gets a one and the
others get zeros. For this blue collar occupation, 'type blue collar'
is one and the others are zero. These are dummy or indicator variables,
and they indicate which category each occupation belongs to.

Then we add the dummies to the regression analysis. Stata or R can do
that automatically for you, so you don't have to do the coding
manually. If you want to use SPSS, you have to create the dummies
manually, so SPSS doesn't simplify your life that much in that regard.

Then let's take a look at the regression results. When you add a
categorical variable to a regression analysis in R or Stata, the
categorical variable appears here. R will automatically note that this
is a categorical variable and will produce two dummy variables, 'type
professional' and 'type white collar'. So how do we interpret those
results, and where is 'blue collar'?

The first thing that we need to understand is that every time you use a
categorical variable through dummy variables in a regression analysis,
one of the categories is left out. We are leaving out the 'blue collar'
category here, so these effects are not the average prestige of
'professional' or 'white collar' occupations. Instead they answer two
questions. How much is the average difference between a professional
occupation and a blue collar occupation? How much is the difference
between a white collar occupation and a blue collar occupation?
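The interpretation as differences can be checked numerically. The
sketch below is a minimal Python illustration under two assumptions
that are not from the lecture: the prestige scores are hypothetical toy
values, and the model contains only the type dummies with no other
predictors. In that special case the intercept equals the mean of the
left-out category and each dummy coefficient equals a difference in
group means. With additional predictors in the model, the coefficients
would instead be differences adjusted for those predictors. The helper
names `solve` and `ols` are also made up for this sketch.

```python
from statistics import mean

# Hypothetical toy data: (occupation type, prestige score).
data = [("bc", 30.0), ("bc", 40.0), ("wc", 40.0), ("wc", 50.0),
        ("prof", 60.0), ("prof", 80.0)]

def solve(a, b):
    """Solve the linear system a x = b by Gauss-Jordan elimination."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        # Partial pivoting: move the largest remaining entry to the diagonal.
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]

def ols(rows, y):
    """Ordinary least squares via the normal equations (X'X) b = X'y."""
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    return solve(xtx, xty)

# Design matrix: intercept, 'prof' dummy, 'wc' dummy. The 'bc' dummy is
# left out, which makes blue collar the reference category.
X = [[1.0, float(t == "prof"), float(t == "wc")] for t, _ in data]
y = [p for _, p in data]

intercept, b_prof, b_wc = ols(X, y)

group_mean = {g: mean(p for t, p in data if t == g) for g in ("bc", "prof", "wc")}

print(intercept, group_mean["bc"])                    # intercept vs. reference mean
print(b_prof, group_mean["prof"] - group_mean["bc"])  # 'prof' dummy vs. mean difference
print(b_wc, group_mean["wc"] - group_mean["bc"])      # 'wc' dummy vs. mean difference
```

Leaving out a different dummy from the design matrix would change the
reference category, and the same fitted values would then be expressed
as differences from that category instead.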
These regression coefficients refer to differences between the
occupation types. One category of the categorical variable is always
used as the reference category, so these coefficients can be
interpreted only against 'blue collar'. If we want to compare
'professional' and 'white collar', we can include the dummies manually,
or indicate manually which of the categories is left out. That is more
advanced.

One more thing that we note here for the first time is that R tells us
that there are missing observations in the data. Four observations were
dropped because they didn't have a value for the type variable. Quite
often when you have some missing data, the default action is just to
omit those cases for which a variable doesn't have a value. There are
other, more advanced techniques, but if the number of observations that
you drop is small compared to the total number of observations, then
dropping the cases usually doesn't matter much.
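The default behaviour described above, often called listwise deletion,
can be sketched as follows. The rows are hypothetical toy values, not
the Prestige data, and `None` is assumed to mark a missing value.

```python
# Hypothetical toy data: (occupation type, prestige score); None is missing.
data = [("prof", 70.0), ("bc", 35.0), (None, 50.0), ("wc", 45.0), (None, 60.0)]

# Listwise deletion: keep only the cases where every model variable is observed.
complete_cases = [row for row in data if all(v is not None for v in row)]

n_dropped = len(data) - len(complete_cases)
print(f"{n_dropped} of {len(data)} observations dropped due to missingness")
```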