TU-L0022 - Statistical Research Methods D, Lecture, 25.10.2022-29.3.2023
Non-linear effects with log transformation (13:17)
This video explains how the log transformation technique can be used to
model non-linear relationships between one dependent variable and one or
more independent variables. The video introduces the technique and goes
through its use.
Regression analysis tells us about the relationship between one dependent
variable and one or more independent variables. One of the limitations of
regression analysis is that it focuses on
linear relationships only. However, many relationships in nature and
social life are nonlinear. One very useful technique for
dealing with that kind of relationship is the log transformation, or
logarithm transformation if we write the log in long form.
What does log transformation do? Many papers contain
statements like this: "We use the log of the revenue since revenue for
our firms is highly skewed." That's very common: the researchers say
that something is skewed, and they take a log of it to make it more
normal. That kind of statement has a couple of issues. But
let's first look at what log transformation does to address skewness. These
are the data on the 500 largest Finnish companies in 2005, specifically
the revenues of those companies. We have one very large company here, then
some companies here, and most are here at around a few hundred million
euros of revenue. We have a couple of billion-euro companies, and most
companies are in the hundreds of millions range. This distribution is
highly skewed, which means that there is a long tail here: most
observations are clustered here, and then some extend into
this long tail. This kind of skewed distribution is sometimes
problematic, but we must understand that, for example, regression
analysis makes no assumptions about how the observed variables are
distributed. It makes some assumptions, but the distribution of
observed variables is not one of those.
If we take a logarithm of every revenue here, we get a
distribution that looks like that. We get something that doesn't have
as long a tail as before, so now the observations are more closely
clustered around the mean. There is still some tail here, but it is not as
severe. These units here are now logarithms. I'm using base 10 here
for ease of exposition, but normally we use the natural logarithm; it
doesn't really make a difference for your analysis. This is the 100
million threshold, this is the 1 billion threshold, then we have the 10
billion and then the 100 billion threshold here. We change the scale of
the variable by taking a logarithm.
What does the logarithm transformation do? It changes the shape of the
distribution, so this is highly skewed, this is still skewed but less
so. In some cases, it reduces the skewness of data, but that's not the
reason why we actually use it. We don't need our data to be normal.
Instead, thinking in terms of relative units sometimes makes a lot more
sense than thinking in terms of absolute units. Absolute units here
mean that the difference between 0 and 1 billion is the same as the
difference between 1 billion and 2 billion.
Let's think for a while: does it make sense to say that when a company grows
from 0 to 1 billion, it is the same kind of transformation for the company
as when it grows from 1 billion to 2 billion? No, that doesn't make
any sense. Also, companies rarely say that they grew by this
many euros; instead, they say they grew by 10% or 15% compared to the previous
year's revenue. Quite often we like to compare things in relative
terms. Your salary increases are based on labor union negotiations, and
they are hardly ever fixed euro amounts; they are 1 % or 2 %, something
related to your current salary level. They are relative units. Here
the relative units mean that the difference between 100
million and 1 billion is relatively the same as the difference between 1
billion and 10 billion. The space between two ticks doesn't
refer to a unit increase; instead, it refers to a tenfold
increase. Things increase relative to the previous level.
Let's take a look at what it means to run a regression analysis with log
transformation, and why we would want to do that. Transforming the
variables to be less skewed is not the right reason to use log
transformation. If you want to reduce skewness, you can, of course,
do a log transformation, but you have to understand that there are other,
more important reasons to use it, and that it also influences
how you interpret your results. This example uses the
Prestige data set: these are occupations from the Canadian census of
1970-something. We have the prestige score of each occupation and
the average income of the occupation. We're interested in learning
how much income depends on prestige. We can see that there is an
effect here, but it is not purely linear: as prestige goes from 20 to 80,
income first increases and then starts to increase in a nonlinear fashion. If we were to
draw a line or a curve, it would first go flat and then it would curve
up a bit. The line here is not the best description of the data. We
can see here that these observations are below the regression line, and
these are above the regression line. Instead of fitting a line, fitting
some kind of curve that bends up would be better, something like
that.
Instead of saying that these observations are characterized by a line, we
say that they are characterized by this blue curve here. That is
what the log transformation does for us, and it's the important reason
why we use it. Instead of saying that income increases by a constant
amount for each unit of prestige, we say that income increases by an
amount relative to its current level for each unit of
prestige. Let's take a log transformation of income and run a
regression analysis.
Here's my regression analysis. This is the income model, done with R using this
data. We can see that a one-unit increase in prestige leads to 176
Canadian dollars more per year. When we use the log of
income instead, the log of income increases by 0.03 for every additional
unit of prestige. We can see that the log model has
a slightly higher R-squared and also a slightly higher adjusted
R-squared than the income model. Based on that metric, we can make an
informed judgment that this could be a better model. It's not
certain that a higher R-squared means that it's a better
model, but it could be. How we judge models will come up in later
videos.
How do we interpret this? What does this 0.03 increase in the
log of income mean? For most people, the metric of a log of income
doesn't have any meaning. If someone tells me that the logarithm of my
income will increase by 0.01, I know what it means because I've done
this and I've read my statistics books, but most people don't. So how do we
interpret it? There are two ways of interpreting the log transformation
results. One is the general way of interpreting any nonlinear effects,
and that is plotting. Here are the regression results
for the log transformation model. What we do here is calculate
the fitted values of the logarithm of income based on prestige. This is
simply applying the formula: for a prestige of 20, the intercept 7.46 plus 0.025 times
20. That provides us with the fitted log income. The hat here denotes
that this is a fitted value from the regression analysis. Then we take
exponentials of these fitted log incomes. When you take a logarithm of a number,
you get another number. When you apply the exponential to that other
number, you get back your original number. We say that the exponential
is the inverse function of the logarithm, and the logarithm is the inverse
function of the exponential, because we can apply one to get back the
original number that was used as an input for the other. The exponential
transformation allows us to undo the log transformation, and we
get these predicted incomes for each prestige level.
Then we plot the data: we plot these exponentiated predicted
logs of income as a function of prestige, and we get this
curve. Whenever you don't know how to interpret a particular
regression estimate that has been calculated based on some
transformation, one very good way of doing that is to plot the
effect. You can also plot the linear model effect and then
compare which one looks more reasonable. Here the blue curve, the
log-transformed result, looks like a much more reasonable explanation for
the data than the red line. That is the general way that you
can interpret any nonlinear effect. This kind of plot, where you
draw the predictions as a line, is called a marginal prediction plot. We will cover this
later in the course.
Another way of interpreting regression analysis results after log
transformation is to interpret them directly. Log transformation is a
special case of transformations because it has a natural
interpretation. These interpretations are given in Wooldridge's book
here. When we take the log of the dependent variable, each of the
regression coefficients, here only the one for prestige, changes its
meaning. The effect of a unit increase of prestige is now expressed as a
relative increase. Beta1 of prestige here doesn't tell us by how many
units income changes for a one-unit increase of prestige.
Instead, it tells us the effect of a one-unit increase of
prestige on income in relative terms. If the regression coefficient of
prestige is 0.025, like it is here, then it means that a one-unit increase
in prestige leads to approximately a 2.5 % increase in salary compared to the current
salary level. It's an exponential growth model; that's why we use the
exponential function. Every time the prestige of your occupation increases
by one, your salary goes up about 2.5 % compared to the previous
level. Calculating how much, for example, ten units would mean can be
a bit difficult because we have to take 2.5 % and then apply that ten
times. So it's 1.025 to the power of 10, which is about 1.28, and that gives you
the effect of a 10-unit increase of prestige: roughly a 28 % increase. In practice, your
statistical software will do the calculations of the marginal effect for
you.
Doing a plot like that would simplify the interpretation because you can see
directly what the effect of moving from a prestige of 40 to a prestige
of 60 is by reading it from the curve. Also, the software will give you the numbers
behind these plots. That's how you calculate marginal effects. The
actual calculation is covered in a different video.
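The back-transformation and the percentage interpretation discussed in the transcript can be sketched in a few lines of Python. This is only an illustration, not the R analysis shown in the video: it plugs in the rounded coefficients quoted in the lecture (intercept 7.46, slope 0.025) as assumptions, and the function names are hypothetical.

```python
import math

# Rounded coefficients quoted in the lecture for the model
# log(income) = intercept + slope * prestige.
# Assumption: illustrative values only; the exact estimates are not in the transcript.
INTERCEPT = 7.46
SLOPE = 0.025


def fitted_log_income(prestige: float) -> float:
    """Fitted value on the log scale: intercept + slope * prestige."""
    return INTERCEPT + SLOPE * prestige


def predicted_income(prestige: float) -> float:
    """Undo the log transformation by exponentiating the fitted log income."""
    return math.exp(fitted_log_income(prestige))


def relative_change(prestige_increase: float) -> float:
    """Relative change in predicted income for a given increase in prestige."""
    return math.exp(SLOPE * prestige_increase) - 1.0


if __name__ == "__main__":
    # Points on the curve that a marginal prediction plot would draw.
    for prestige in (20, 40, 60, 80):
        print(f"prestige {prestige}: predicted income {predicted_income(prestige):.0f}")

    # Direct interpretation of the coefficient as a relative effect.
    print(f"+1 unit of prestige:   {relative_change(1):.2%}")
    print(f"+10 units of prestige: {relative_change(10):.2%}")

    # Compounding 2.5 % ten times gives nearly the same answer.
    print(f"1.025 ** 10 - 1:       {1.025 ** 10 - 1:.2%}")
```

Because the model is linear on the log scale, the relative change depends only on the size of the prestige increase, not on the starting level: moving from 40 to 60 multiplies predicted income by the same factor as moving from 20 to 40.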