TU-L0022 - Statistical Research Methods D, Lecture, 2.11.2021-6.4.2022
Instrumental variable solution to endogeneity (11:03)
The video presents an instrumental variable as a solution to endogeneity
and provides an example of the instrumental variable application.
In this video, I will explain the instrumental variable solution to the endogeneity problem. To understand the instrumental variable solution, we first need to understand the endogeneity problem itself. I explain the problem in more detail in another video, but here is a quick recap.
Endogeneity occurs when we have a regression model, such as the one here, shown graphically as a path diagram. The error term U represents all other causes of Y that are not included as explanatory variables. If any of those other causes of Y is correlated with any of the included variables X, then we have an endogeneity problem.
For example, suppose we are trying to explain a company's performance, say ROA, with whether the company invests in a new manufacturing plant or not. Both the investment decision and profitability probably depend on the company's strategy, in which case strategy is an omitted variable that is correlated with X1, and that leads to an endogeneity problem.
More generally, if we look at the bivariate case of X and Y in more detail, the correlation between X and Y is this direct path plus the correlation between X and U times 1, because the path from U to Y is constrained to be 1. So the correlation between X and Y is the direct regression path plus a spurious correlation that arises because X correlates with the omitted cause U.
The problem is that we observe only this one correlation, so we have one unit of information from the data, but we want to estimate two different parameters. This is an underidentified model, with -1 degrees of freedom, which means the model cannot be meaningfully estimated: we cannot estimate two different things from one thing.
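The bias described above can be seen in a minimal simulation (not from the lecture; all variable names and coefficient values are illustrative): when an omitted cause U drives both X and Y, the OLS slope of Y on X picks up a spurious component on top of the true effect.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
true_beta = 0.5

u = rng.normal(size=n)                # omitted cause of Y
x = 0.8 * u + rng.normal(size=n)      # X correlates with U -> endogeneity
y = true_beta * x + u                 # U also affects Y directly

# OLS slope of Y on X: cov(X, Y) / var(X)
ols_beta = np.cov(x, y)[0, 1] / np.var(x)
print(ols_beta)  # clearly above the true 0.5 because of the spurious path
```

With these numbers the OLS slope lands near 1 instead of the true 0.5, because the data cannot separate the direct effect from the spurious correlation through U.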
To solve this problem, we can apply instrumental variables. The idea is that we get a third variable Z that is correlated with X, a correlation we can test empirically, and that we can assume is uncorrelated with U, the other causes of Y. Finding such instruments is a difficult problem because we generally cannot test the correlation between Z and U empirically; we have to argue it based on theory. I'll show you an example soon, but let's take a look at the principle first.
So let's assume we have a valid instrumental variable: the only reason why Z and Y are correlated is that Z is correlated with X, and Z is not correlated with U. By the path analysis tracing rules, the correlation between Z and Y is the correlation between X and Z times the direct path from X to Y. So the correlation between Z and Y is beta times the correlation between X and Z. From here we can solve for beta using the correlation of X with Z and the correlation of Z with Y, which are both observable quantities, and that gives us a consistent estimate of beta. So that is a way to estimate beta.
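Continuing the simulation idea with a valid instrument (again an illustrative sketch, not the lecture's data): Z shifts X but is unrelated to U, so the ratio cov(Z, Y) / cov(Z, X) recovers the true effect. (For standardized variables this is the same as the correlation ratio from the path diagram.)

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100_000
true_beta = 0.5

u = rng.normal(size=n)                        # omitted cause of Y
z = rng.normal(size=n)                        # instrument: independent of U
x = 1.0 * z + 0.8 * u + rng.normal(size=n)    # Z is relevant for X
y = true_beta * x + u

# IV estimate from observable quantities only
iv_beta = np.cov(z, y)[0, 1] / np.cov(z, x)[0, 1]
ols_beta = np.cov(x, y)[0, 1] / np.var(x)
print(iv_beta, ols_beta)  # IV is close to 0.5; OLS is biased upward
```

The IV ratio works because Z reaches Y only through X, so every unit of covariance between Z and Y must have passed through the direct path beta.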
Variable Z qualifies as an instrumental variable if it meets two criteria. First, it must be relevant for X: X and Z must be correlated, which can be checked empirically by calculating the correlation and running a statistical test on it. Second, it must meet the exclusion criterion, which has to be argued based on theory. Because we don't observe U, we can't test whether Z and U are uncorrelated; that has to be argued based on theory, and it is difficult to do.
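The asymmetry between the two criteria can be made concrete with a short sketch (illustrative data, not from the lecture): relevance is a routine correlation test, while exclusion has no test at all.

```python
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(2)
n = 1_000
z = rng.normal(size=n)
x = 0.5 * z + rng.normal(size=n)   # Z is relevant for X by construction

# Relevance criterion: testable from data
r, p_value = pearsonr(z, x)
print(r, p_value)  # a sizable correlation with a very small p-value

# Exclusion criterion: NOT testable, because U is unobserved.
# cor(Z, U) = 0 must be argued from theory, not from the data.
```

In applied work the first-stage relationship is usually reported as a regression (with an F-statistic) rather than a plain correlation, but the logic is the same.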
Let's take a look at an example. In Mochon's paper, they apply instrumental variables. To understand the instrumental variable used here, we first have to understand the endogeneity problem they are addressing: what is the issue, and why instrumental variables? Their dependent variable was point acquisition, so people are acquiring points in a service, and they are testing whether the decision to like the Facebook page of that service leads to more point acquisition. They ran an experiment with a randomization step: they invited some people to like the page that they were studying, and the rest served as controls. This randomization is exogenous, because there is no reasonable way that a random number generated on a computer would be correlated with the behavior of actual people. It is very implausible to claim that it would not be exogenous. So randomization is exogenous.
Then we have endogenous selection. The reason this selection is endogenous is that, when you are invited to like the Facebook page of a service, whether you accept the invitation probably depends on how much you like the service, how much you use it, and so on. So there are probably multiple causes that influence whether you choose to accept the invitation that also influence how active you are in the service, acquiring points. Comparing those who chose not to like the page against those who did is therefore not a valid comparison, because these two groups of people are not comparable: we have endogenous selection here.

So we have basically a few options. We can compare treatment and control, but that does not really give us the effect of the like, because some people in the treatment group chose not to like the page, and some people in the control group could have liked the page anyway. So comparing treatment and control on point acquisition does not answer our question. We can't compare those who chose to like against those who chose not to, because that is an endogenous selection. And we can't compare those who chose to like against the control group, because the control group contains people who would have chosen not to like, had they been asked. These two groups are not comparable either.
What we can do here, and what Mochon et al. did, is apply the instrumental variable technique. The idea is that the treatment, the randomization, is correlated with choosing to like: if you ask some people to like a Facebook page and you don't ask the other group, then those you asked are more likely to actually like the page. This can be established empirically: they can calculate this correlation and establish that the treatment is a relevant instrument for choosing to like. So it fulfills the relevance criterion. The treatment also fulfills the exclusion criterion: because the treatment is randomized, it is very unlikely to correlate with any other reason an individual person might have for liking the page. When a random number on a computer assigns people to treatment or control, that assignment is independent of any attribute of the people being randomized. So it fulfills the exclusion criterion.
Then they can apply these equations to calculate the effect of a Facebook like. In practice, we don't work with these equations directly, because we usually have multiple variables: we have controls, and we can have multiple instrumental variables as well. Instead, we use some other technique, and one of the simplest is called two-stage least squares. The idea of two-stage least squares is that we take the instrumental variable Z and, instead of just saying that Z and X are correlated, we regress X on Z and then calculate things based on these regressions.
Let's see how it works. This is an endogenous regression analysis: if we regress Y on X, we have an endogeneity problem because some causes of X are correlated with some causes of Y. Then we have the instrumental variable Z. We express X as Z multiplied by beta2 plus the error term from the first regression, the regression of X on Z. Substituting this into the second regression and multiplying out gives beta1 times beta2 times Z, which is the effect. This is typically implemented by running two sets of regressions: beta2 times Z is the fitted value from the regression of X on Z. So in practice we implement this model by first regressing X on Z, taking the fitted values of X, and then regressing Y on those fitted values from the first regression. We run the first regression to get fitted values, then run the second regression on the fitted values, and that gives us consistent estimates of this relationship.

If we have more than one independent variable, say five, then we regress each of those five independent variables on the instruments separately. Variables that are not endogenous qualify as instruments as well. We take the fitted values of each of those five regressions and use those fitted values to explain Y. That will produce consistent estimates of the betas, under the assumption that Z is relevant and does not correlate with the omitted causes of Y.
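The two-stage procedure described above can be sketched with numpy (an illustrative simulation, not the authors' code; the data-generating values are assumptions): stage 1 regresses X on Z, stage 2 regresses Y on the fitted values of X.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 100_000
true_beta = 0.5

u = rng.normal(size=n)                        # omitted cause of Y
z = rng.normal(size=n)                        # instrument
x = 1.0 * z + 0.8 * u + rng.normal(size=n)    # endogenous regressor
y = true_beta * x + u

def ols(X, target):
    """OLS coefficients via least squares; X includes an intercept column."""
    return np.linalg.lstsq(X, target, rcond=None)[0]

# Stage 1: regress X on Z and keep the fitted values of X
Z = np.column_stack([np.ones(n), z])
x_hat = Z @ ols(Z, x)

# Stage 2: regress Y on the fitted values from stage 1
X_hat = np.column_stack([np.ones(n), x_hat])
beta_2sls = ols(X_hat, y)[1]                  # slope on fitted X
print(beta_2sls)  # close to the true 0.5
```

Note that running the two stages by hand like this gives the right point estimate but wrong standard errors; in real analyses one would use a dedicated 2SLS routine from a statistics package, which corrects them.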