TU-L0022 - Statistical Research Methods D, Lecture, 2.11.2021-6.4.2022
Idea of regression analysis (8:33)
Regression analysis is one of the most commonly used statistical analysis techniques in the social sciences because it can be used to rule out alternative explanations for observed associations (i.e., correlations). Regression analysis has one dependent variable and one or more independent variables that are assumed to be linearly related. The objective of regression analysis is to explain the dependent variable as well as possible using the independent variables. The video presents the Venn diagram, the path diagram, and the graphical interpretation of regression analysis.
Transcript
I will next explain the idea of regression analysis. Regression analysis is one of the most used analysis tools in quantitative research, and most applications of quantitative techniques can be thought of as special cases or extensions of this particular analysis. Regression analysis results are typically presented as a table like this. Here we have four different regression models, different regression coefficients, and different model indices. There are certain assumptions behind this table that you need to understand, and you also need to understand what the numbers tell us. We will be looking at these kinds of tables in the next couple of videos, but I will first explain what regression analysis is about.
In regression analysis, we have two kinds of variables. We have one dependent variable that we want to explain; for example, a company's profitability, measured as return on assets (ROA), could be a dependent variable. Then we have multiple independent variables. The independent variables are variables that we use to explain the dependent variable; for example, we could have CEO gender, company size, and company industry. Regression analysis then answers the question of how much these variables together explain the variation of the dependent variable, and which of the variables are the most important for explaining it. Regression analysis allows us to control for alternative explanations for an observed correlation. In the case of the paper by Hekman, from which the previous table was taken, they explained patient satisfaction scores with, for example, physician productivity, physician quality, and physician accessibility. You have one thing that you explain with multiple things to see which of those potential explanatory variables matter.
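To make this concrete, here is a minimal sketch in Python (not from the lecture): it fits a multiple regression on simulated data, with made-up variable names and effect sizes, and reports how much of the dependent variable the predictors explain together.

```python
# Minimal sketch with simulated data: one dependent variable explained by
# several independent variables. Names and effect sizes are hypothetical.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(42)
n = 200

# Simulated independent variables (standardized for simplicity)
productivity = rng.normal(size=n)
quality = rng.normal(size=n)
accessibility = rng.normal(size=n)

# Simulated dependent variable: a weighted sum of the predictors plus noise
satisfaction = (0.4 * productivity + 0.3 * quality
                + 0.2 * accessibility + rng.normal(size=n))

X = sm.add_constant(np.column_stack([productivity, quality, accessibility]))
fit = sm.OLS(satisfaction, X).fit()

print(fit.params)     # intercept followed by one coefficient per predictor
print(fit.rsquared)   # share of the dependent variable's variation explained
```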
The idea of regression analysis is commonly presented as a Venn diagram. The Venn diagram is useful for illustrating some properties of regression analysis, but it doesn't illustrate all of them; it's a good starting point, nevertheless. The idea of these circles is that this circle here represents the variation of company performance, or return on assets. This is the variation of the dependent variable. This is the variation of the independent variable that we're interested in, which in this case is CEO gender. And this is the variation in company size. Now we are interested in how much of this covariation, or correlation, between gender and performance is due to gender, and how much is due to the effect of size, because size and gender are correlated. We could say that the correlation between gender and performance is partly due to the presumed causal influence of CEO gender on performance, partly because smaller companies tend to be more profitable, which is this correlation here, and also because smaller companies are more likely to hire women CEOs, which is this correlation here. Now we want to use regression analysis to parcel out the part that is shared by gender, size, and performance, to get the unique effect of gender on performance. We could think of regression analysis as doing something like this: it eliminates the effect of company size on the relationship between gender and performance. Of course, we are not limited to just two independent variables; we can have multiple competing explanations for the dependent variable in the model. Typically, we would have in the ballpark of 10 or 20 variables. We can take additional bites away to get a cleaner estimate of this correlation between gender and performance that is free of any third causes. Ultimately, we would get a clean causal effect of gender on performance if we have included all relevant controls in the model. That, of course, is easier said than done.
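The "taking a bite away" idea can be demonstrated with a small simulation (a sketch with made-up numbers, not the actual CEO data): when the confounder is left out, its effect shows up in the coefficient of the variable we care about; adding it as a control removes that shared part.

```python
# Sketch with simulated data: how adding a control variable changes the
# estimate for the variable of interest. All numbers are hypothetical.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n = 1000

size = rng.normal(size=n)                          # company size (standardized)
# In this simulation, smaller companies are more likely to have a woman CEO
woman_ceo = rng.binomial(1, 1 / (1 + np.exp(size)))
# Performance depends on size but, by construction, not on CEO gender
roa = -0.5 * size + rng.normal(size=n)

# Without the control, the gender coefficient picks up the effect of size
m1 = sm.OLS(roa, sm.add_constant(woman_ceo)).fit()
# With size in the model, the gender coefficient shrinks toward zero
m2 = sm.OLS(roa, sm.add_constant(np.column_stack([woman_ceo, size]))).fit()

print(m1.params[1], m2.params[1])
```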
Regression analysis is a statistical model, and a model is an equation, so whenever you hear the term model, it means that there is some math. The model can also be presented as a path diagram like this. I will first talk about the path diagram. The path diagram here has one dependent variable y and three independent variables x, and the x's are allowed to be freely correlated. Free correlation is shown as this double-headed curved arrow, which means that we don't really care how these different explanatory variables, usually denoted with x, are related to one another; we are interested in estimating how they explain or predict the dependent variable y. The strength of influence of each variable is quantified by a regression coefficient beta. We have one beta for each x here, and then we have beta 0, or the intercept, which tells us the base level of y when all of these explanatory or independent variables are at 0. And then we have some variation u that the model doesn't explain. This is the remaining variation that is not explained by the model. Let's say that the model explains 20% of the variation of the dependent variable, which is fairly typical for business research. Then the unexplained variation would account for 80% of the true variation of y in the data.
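In symbols, the three-predictor model described here takes the form (u denoting the variation the model doesn't explain):

```latex
y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + u
```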
In equation form, we can see that y is a weighted sum of the x's, and the weights are the regression coefficients. Each of these regression coefficients quantifies the influence of one of the independent variables on the dependent variable. For example, we can model patient satisfaction as a weighted sum of physician productivity, physician quality, physician accessibility, and some variation that the model doesn't explain. What's important to understand is that these effects are independent: when an x increases by one unit, its beta tells us the effect of that one-unit increase independently of the other variables. They are also linear, so we assume that a one-unit increase in x is always associated with the same amount of increase in y, which is quantified by the beta.
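As a tiny numeric illustration (the coefficient values here are made up), the prediction is a weighted sum, and a one-unit increase in one x changes the prediction by exactly that x's beta, regardless of the other x's:

```python
# Hypothetical coefficients for a three-predictor model:
# y-hat = b0 + b1*x1 + b2*x2 + b3*x3
b0, b1, b2, b3 = 1.0, 0.5, -0.25, 0.75

def predict(x1, x2, x3):
    return b0 + b1 * x1 + b2 * x2 + b3 * x3

print(predict(2, 3, 1))                      # 1.0 + 1.0 - 0.75 + 0.75 = 2.0
print(predict(3, 3, 1) - predict(2, 3, 1))   # 0.5 = b1, whatever x2 and x3 are
```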
Graphically, regression analysis can be understood as a line. I will show you a two-variable regression analysis; this is also called simple regression, because we have only one independent variable. Here the independent variable is, let's say, years of education, and the dependent variable is, let's say, salary. We are interested in knowing what the linear relationship is, so what is the best line that explains this data. In this simple regression with one independent variable, you can basically think of regression analysis as plotting all the data as a scatterplot (we will show some scatterplots a bit later) and then drawing a line through the data, and that gives us the regression line. The slope of this line, how strongly it goes up or down, is quantified by the regression coefficient.
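Here is a minimal sketch of fitting such a line on simulated education and salary data (all numbers made up) and reading off its slope and intercept:

```python
# Minimal sketch with simulated data: fit the line
# salary = intercept + slope * education + residual.
import numpy as np

rng = np.random.default_rng(0)
education = rng.uniform(8, 20, size=300)                  # years of education
salary = 2000 + 250 * education + rng.normal(scale=500, size=300)

# np.polyfit with degree 1 returns the slope and intercept of the best line
slope, intercept = np.polyfit(education, salary, deg=1)
print(slope, intercept)   # should be close to 250 and 2000
```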
We make some assumptions when we run a regression analysis. One of the key assumptions justifying regression analysis is that the observations are equally and normally distributed around the regression line: when we have a regression line here, the most likely case is that the observations are close to the line; there can be some observations that are far from the line, but they should be relatively rare.
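One rough way to look at this assumption is to inspect the residuals, the vertical distances of the observations from the fitted line; a sketch on simulated data (same made-up numbers as above):

```python
# Rough check on simulated data: residuals should be centered on zero and
# mostly small, with points far from the line relatively rare.
import numpy as np

rng = np.random.default_rng(0)
education = rng.uniform(8, 20, size=300)
salary = 2000 + 250 * education + rng.normal(scale=500, size=300)

slope, intercept = np.polyfit(education, salary, deg=1)
residuals = salary - (intercept + slope * education)

print(round(residuals.mean(), 1))                                   # close to 0
print(round((np.abs(residuals) > 2 * residuals.std()).mean(), 2))   # about 0.05 if roughly normal
```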