TU-L0022_aalto-CUR-141790-3063741: Instrumental variable solution to endogeneity (11:03)

The video presents an instrumental variable as a solution to endogeneity and provides an example of the instrumental variable application.

In this video, I will explain the instrumental variable solution to the endogeneity problem. To understand the instrumental variable solution, we first need to understand the endogeneity problem. I explain the problem in more detail in another video. But this is just a quick recap.

Endogeneity occurs when we have a regression model, such as the one shown graphically here as a path diagram. The error term U represents any other causes of Y that are not included as explanatory variables. So if any other cause of Y is correlated with any of the included variables X, then we have an endogeneity problem.

For example, say we are trying to explain a company's performance, let's say ROA, with whether the company invests in a new manufacturing plant or not. Both investment and profitability probably depend on the company's strategy, in which case strategy is an omitted variable that is correlated with X1, and that leads to an endogeneity problem.

More generally, let's look at the problem in more detail in just the bivariate case of X and Y. The correlation between X and Y is this direct path here, plus the correlation between X and U times 1, because the path from U to Y is constrained to be 1. So the correlation between X and Y is the direct regression path plus the spurious correlation that arises because X correlates with the omitted cause U.
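The tracing-rule argument above can be written out as a short equation. This is a sketch under an assumption not stated in the video: X is scaled to unit variance, so that the "correlations" can be read as covariances and the direct path contributes exactly beta.

```latex
\begin{align}
Y &= \beta X + U, \qquad \operatorname{Var}(X) = 1 \\
\operatorname{Cov}(X, Y)
  &= \underbrace{\beta}_{\text{direct path}}
   + \underbrace{\operatorname{Cov}(X, U) \cdot 1}_{\text{spurious part}}
\end{align}
```

Only the left-hand side is observed, while both beta and Cov(X, U) on the right are unknown.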

The problem is that we observe just this one correlation, so we have one unit of information from the data, and we want to estimate two different parameters. This is an underidentified model: the degrees of freedom is -1, which means that the model can't be meaningfully estimated. We can't estimate two different things from one thing.
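A small simulation can make the consequence concrete. The sketch below is not from the video: the coefficients (a true beta of 0.5, and X loading 0.8 on U) are made-up values, and the helper `cov` is defined here only for illustration. It regresses Y on X while U is left out.

```python
import random


def cov(a, b):
    """Sample covariance of two equal-length sequences."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    return sum((p - ma) * (q - mb) for p, q in zip(a, b)) / (len(a) - 1)


def simulate(n=20000, beta=0.5, seed=1):
    """Generate X and Y where the omitted cause U also drives X (assumed values)."""
    rng = random.Random(seed)
    xs, ys = [], []
    for _ in range(n):
        u = rng.gauss(0, 1)            # omitted cause of Y (e.g. strategy)
        x = 0.8 * u + rng.gauss(0, 1)  # X depends on U, so X is endogenous
        y = beta * x + u               # path from U to Y is fixed at 1
        xs.append(x)
        ys.append(y)
    return xs, ys


xs, ys = simulate()
# Bivariate OLS slope: cov(X, Y) / var(X). It mixes beta with the spurious part.
ols_slope = cov(xs, ys) / cov(xs, xs)
print(f"true beta = 0.5, OLS slope = {ols_slope:.3f}")
```

With these assumed values the population slope is 0.5 + 0.8 / 1.64, roughly 0.99, so the regression overstates the effect by about a factor of two, and nothing in the X and Y data alone reveals that.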

To solve this problem, we can apply instrumental variables. The idea of an instrumental variable is that we get a third variable Z that is correlated with X, which is a correlation that we can test empirically, and that we can assume is uncorrelated with U, any other causes of Y. Finding variables that qualify as instruments is a difficult problem, because we cannot generally test the correlation between Z and U empirically; we have to argue it based on theory. I'll show you an example soon, but let's take a look at the principle first.

So let's assume we have a valid instrumental variable. That means the only reason why Z and Y are correlated is that Z is correlated with X, so Z can't be correlated with U. When we have these correlations, the correlation between Z and Y comes from the path analysis tracing rules: we take the correlation between X and Z, and then follow the direct path from X to Y. So the correlation between Y and Z is beta times the correlation between X and Z. From here we can solve for beta by dividing the correlation between Z and Y by the correlation between X and Z. Both are observable quantities, and that gives us a consistent estimate of beta.
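As a rough illustration of this ratio, the sketch below simulates a valid instrument and compares the instrumental variable estimate, cov(Z, Y) / cov(Z, X), with the plain regression slope. The data-generating values (beta of 0.5, loadings 0.6 and 0.8) are assumptions chosen for the demo, and covariances are used in place of correlations so that no standardization is needed.

```python
import random


def cov(a, b):
    """Sample covariance of two equal-length sequences."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    return sum((p - ma) * (q - mb) for p, q in zip(a, b)) / (len(a) - 1)


rng = random.Random(2)
BETA = 0.5  # assumed true effect of X on Y

zs, xs, ys = [], [], []
for _ in range(20000):
    z = rng.gauss(0, 1)                      # instrument, independent of U by construction
    u = rng.gauss(0, 1)                      # omitted cause of Y
    x = 0.6 * z + 0.8 * u + rng.gauss(0, 1)  # X depends on both Z and U
    y = BETA * x + u
    zs.append(z)
    xs.append(x)
    ys.append(y)

iv_estimate = cov(zs, ys) / cov(zs, xs)   # beta = cov(Z, Y) / cov(Z, X)
ols_estimate = cov(xs, ys) / cov(xs, xs)  # biased because X correlates with U

print(f"IV estimate:  {iv_estimate:.3f}")
print(f"OLS estimate: {ols_estimate:.3f}")
```

In this setup the ratio should land near the assumed 0.5, while the plain slope is pulled toward 0.9 by the correlation between X and U.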

Variable Z qualifies as an instrumental variable if it meets two criteria. First, it must have relevance for X. So X and Z must be correlated. That can be checked empirically: you just calculate the correlation and do a statistical test for it. Then there is the exclusion criterion, which has to be argued based on theory. Because we don't observe U, we can't test whether Z and U are uncorrelated. That is difficult to do.
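The relevance check could look something like the following sketch. It computes the Pearson correlation between Z and X and the usual t-statistic for testing that correlation against zero, t = r * sqrt((n - 2) / (1 - r^2)). The simulated data and the helper `pearson_r` are illustrative assumptions, and comparing the statistic with 1.96 is only a large-sample rule of thumb.

```python
import math
import random


def pearson_r(a, b):
    """Pearson correlation of two equal-length sequences."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    sab = sum((p - ma) * (q - mb) for p, q in zip(a, b))
    saa = sum((p - ma) ** 2 for p in a)
    sbb = sum((q - mb) ** 2 for q in b)
    return sab / math.sqrt(saa * sbb)


rng = random.Random(5)
zs, xs = [], []
for _ in range(500):
    z = rng.gauss(0, 1)
    u = rng.gauss(0, 1)
    zs.append(z)
    xs.append(0.6 * z + 0.8 * u + rng.gauss(0, 1))  # assumed first-stage relationship

n = len(zs)
r = pearson_r(zs, xs)
t_stat = r * math.sqrt((n - 2) / (1 - r * r))  # t-statistic with n - 2 degrees of freedom

print(f"r = {r:.3f}, t = {t_stat:.1f}")
print("relevant instrument" if abs(t_stat) > 1.96 else "weak or irrelevant instrument")
```

No comparable check exists for the exclusion criterion, since it would require the unobserved U.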

Let's take a look at an example. In the paper by Mochon et al., they apply instrumental variables. To understand the instrumental variable used here, we first have to understand the endogeneity problem that they are addressing. So what's the issue, and why instrumental variables? Their dependent variable was point acquisition, so people are acquiring points in a service. And they are testing whether the decision to like the Facebook page of that service leads to more point acquisition.

They did an experiment, so they have this randomization step here: they invited some people to like the page that they were studying, and the rest were the control group. This randomization is exogenous, because there's no reasonable way that a random number generated on my computer would be correlated with the behavior of actual people. It's very implausible to claim that this would not be exogenous.

Then we have endogenous selection. The reason why this selection is endogenous is that, when you're invited to like the Facebook page of a service, whether you accept the invitation or not probably depends on how much you like the service, how much you use the service, and so on. So there are probably multiple causes that influence whether you choose to accept the invitation and that also influence how active you are in the service, acquiring points. Comparing those that chose not to like against those that did like the page is therefore not a valid comparison, because these two groups of people are not comparable.

So we have basically a few options. We can compare treatment and control, but that doesn't really give us the effect of the like, because some of the people in the treatment group chose not to like the Facebook page. Also, some people in the control group could have liked the page anyway. So comparing treatment and control on point acquisition doesn't really allow us to do what we want to do. We can't compare those that chose to like against those that chose not to like, because this is an endogenous selection. And we can't compare those that chose to like against the control group, because the control group contains people that would have chosen not to like, had they been asked. So these two are not comparable either.

What Mochon et al. did here was apply the instrumental variable technique. The idea is that the treatment, the randomization, is correlated with choosing to like. If you ask some people to like a Facebook page and you don't ask the other group, then the people that you ask are more likely to actually like the page. This can be established empirically: they can calculate this correlation and establish that the treatment is a relevant instrumental variable for choosing to like. So it fulfills the relevance criterion. The treatment also fulfills the exclusion criterion. Because the treatment is randomized, it is very unlikely that it correlates with any other reason that an individual person would have had for liking the page. When a random number on our computer assigns people to treatment or control, it is independent of any attribute of the people that we randomized.
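To see why the randomized invitation helps, here is a simulated sketch of such a design. Nothing in it comes from the Mochon et al. data: the effect size of 2.0, the "enthusiasm" variable, and the selection rule are all hypothetical. It contrasts the naive comparison of likers and non-likers, the comparison of treatment and control, and the instrumental variable ratio, which with a binary instrument reduces to the outcome difference divided by the difference in like rates.

```python
import random


def group_mean(values, flags, flag):
    """Mean of `values` over the rows where `flags` equals `flag`."""
    picked = [v for v, f in zip(values, flags) if f == flag]
    return sum(picked) / len(picked)


rng = random.Random(3)
TRUE_EFFECT = 2.0  # hypothetical effect of a like on points, chosen for the demo

invited, liked, points = [], [], []
for _ in range(40000):
    z = 1 if rng.random() < 0.5 else 0  # randomized invitation (the instrument)
    enthusiasm = rng.gauss(0, 1)        # unobserved: drives both liking and points
    # Endogenous choice: enthusiastic users like more often, invitations raise the odds.
    d = 1 if enthusiasm + 1.5 * z + rng.gauss(0, 1) > 1.5 else 0
    y = TRUE_EFFECT * d + enthusiasm + rng.gauss(0, 1)
    invited.append(z)
    liked.append(d)
    points.append(y)

naive = group_mean(points, liked, 1) - group_mean(points, liked, 0)
treatment_vs_control = group_mean(points, invited, 1) - group_mean(points, invited, 0)
first_stage = group_mean(liked, invited, 1) - group_mean(liked, invited, 0)
iv_estimate = treatment_vs_control / first_stage

print(f"likers vs non-likers:  {naive:.2f}")
print(f"treatment vs control:  {treatment_vs_control:.2f}")
print(f"change in like rate:   {first_stage:.2f}")
print(f"IV estimate of a like: {iv_estimate:.2f}")
```

In this simulation the naive comparison overstates the assumed effect, the treatment versus control comparison understates it because many invited users never like the page, and the ratio should come out close to 2.0.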

Then they can apply these equations to calculate the effect of a Facebook like. In practice, we don't work with these equations directly, because we usually have multiple variables: we have controls, and we can have multiple instrumental variables as well. So we use some other technique, and one of the simplest techniques is called two-stage least squares. The idea of two-stage least squares is that, instead of just saying that the instrumental variable Z and X are correlated, we regress X on Z and then calculate things based on these regressions.

Let's see how it works. This is the regression analysis with the endogeneity problem: if we regress Y on X, some causes of X are correlated with some causes of Y. Then we have the instrumental variable Z. So we say that X is actually Z multiplied by beta2, plus the error term from that regression analysis. That is the first regression, of X on Z, and the regression of Y on X is the second regression. Then we can multiply this out, so we have beta1 times beta2 times Z, and that's the effect of Z on Y.

This is typically implemented by running two sets of regressions. This beta2 Z is the fitted value of a regression analysis of X on Z. So in practice, we implement this model by first regressing X on Z, then we take the fitted values of X, and then we regress Y on those fitted values. So we run the first regression to get fitted values, then we run the second regression on the fitted values, and that gives us consistent estimates of this relationship.

If we have more than one independent variable, say five independent variables, then we regress each one of those five on the instruments separately. If we have variables that are not endogenous, then they qualify as instruments as well. We take the fitted values from each of those five regression analyses and use those fitted values to explain Y. And that will produce consistent estimates of the betas in the regression explaining Y, under the assumption that Z is relevant and does not correlate with the omitted causes of Y.
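The two regressions can be sketched by hand for the simplest case of one endogenous X and one instrument Z. The simulated data, the coefficient values (beta1 of 0.5, beta2 of 0.6), and the helper `simple_ols` are assumptions made for this illustration, not part of the lecture.

```python
import random


def simple_ols(x, y):
    """Return (intercept, slope) of the least-squares line of y on x."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    slope = sxy / sxx
    return my - slope * mx, slope


rng = random.Random(4)
BETA1 = 0.5  # assumed effect of X on Y
BETA2 = 0.6  # assumed effect of Z on X

zs, xs, ys = [], [], []
for _ in range(20000):
    z = rng.gauss(0, 1)                        # instrument
    u = rng.gauss(0, 1)                        # omitted cause of Y
    x = BETA2 * z + 0.8 * u + rng.gauss(0, 1)  # first-stage relationship
    y = BETA1 * x + u                          # structural equation
    zs.append(z)
    xs.append(x)
    ys.append(y)

# Stage 1: regress X on Z and keep the fitted values of X.
a2, b2 = simple_ols(zs, xs)
x_hat = [a2 + b2 * z for z in zs]

# Stage 2: regress Y on the fitted values of X.
_, beta_2sls = simple_ols(x_hat, ys)

# For comparison: the regression of Y on the observed X.
_, beta_ols = simple_ols(xs, ys)

print(f"first-stage slope: {b2:.3f}")
print(f"2SLS estimate:     {beta_2sls:.3f}")
print(f"OLS estimate:      {beta_ols:.3f}")
```

Running the two stages by hand like this gives the point estimate, but the standard errors reported by an ordinary second-stage regression are not correct, because they ignore that the fitted values were themselves estimated. A dedicated instrumental variable routine in a statistics package handles that adjustment.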

TU-L0022 - Statistical Research Methods D, Lecture, 2.11.2021-6.4.2022
