TU-L0022_aalto-CUR-141790-3063741: Modeling the within effect with OLS regression (14:53)

This video explains how OLS regression can be applied to estimate the within effect from multi-level data.

Normal regression analysis can be used to estimate models with multi-level data. While it is not always the ideal technique for doing so, there are a couple of simple strategies that can be applied to estimate models with such data. These techniques provide a starting point for understanding more complex analysis techniques. Let's take a look at how OLS regression can be applied to estimate the within effect from multi-level data.

Our example data is these 15 companies observed over 10 years that we looked at in a previous video, and we can see that the within effect and the between effect are not the same. The companies that invest only a little in R&D are less profitable than the companies that invest heavily in R&D. Nevertheless, the within effect in a company is negative. So when a company increases its R&D investment, such as this company here, its profitability goes down. So the between effect is positive and the within effect is negative, and we want to understand how to estimate the within effect from this data. In other words, we want to take these two effects apart and estimate the within effect.

The within effect would be important, for example, for informing policy on a firm level. Should a firm increase or decrease its R&D investments if it cares about its profitability? That is a question that the within effect could answer in this case. We could of course estimate a separate regression model for each company. We have 15 companies, so let's split the data into 15 subsamples and run a regression analysis on each company, which is done here, with the regression lines shown graphically. The problem is that we then have only ten observations for each company, which is a very small number, and we also get 15 different regression coefficients when typically we just want to report one. So how do we get the within effect? There are two very easy strategies for doing so.
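The split-sample idea can be sketched in a few lines. This is only an illustration on simulated data, not the lecture's data set: the names `firm`, `rd` and `profit`, and the numbers used to generate them (a positive between-firm pattern and a within-firm slope of -0.4), are assumptions made for the example.

```python
import numpy as np

# Simulated panel (an assumption for illustration): 15 firms, 10 years each.
rng = np.random.default_rng(0)
n_firms, n_years = 15, 10
firm = np.repeat(np.arange(n_firms), n_years)
firm_rd = rng.uniform(5, 20, n_firms)            # typical R&D level of each firm
rd = firm_rd[firm] + rng.normal(0, 2, firm.size)
profit = firm_rd[firm] - 0.4 * rd + rng.normal(0, 1, firm.size)

# One simple regression per firm; np.polyfit with degree 1 returns [slope, intercept].
slopes = np.array([
    np.polyfit(rd[firm == f], profit[firm == f], 1)[0]
    for f in range(n_firms)
])

print("per-firm slopes:", np.round(slopes, 2))
print("observations per regression:", n_years)
```

Each slope rests on only ten observations, so the 15 estimates scatter quite a bit around the value used in the simulation, which is the practical problem described above.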

The first strategy is to use dummy variables. The idea of dummy variables is that if we have 15 companies, then we create 15 variables, and those variables indicate which company an observation is for. Originally we have the variable firm, which takes 15 different values, and we create 15 new variables, firm 1 to firm 15. The firm 1 dummy receives the value 1 for observations of the first firm and 0 for the other firms. The second dummy variable receives the value 1 for the second firm and 0 otherwise. So these dummy variables indicate which firm an observation belongs to. Dummy variables are defined so that, for each observation, exactly one of them receives the value 1 and all the others are 0. So this pattern indicates that this observation belongs to firm 1 and not to any other firm, so all the others are zeros. How do we apply these in a regression analysis, and what's the outcome?
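As a small illustration of this coding scheme, the sketch below builds the indicator columns by hand for a toy `firm` variable with three firms. The data and the helper name `make_dummies` are hypothetical and only meant to show the pattern of ones and zeros.

```python
# Hypothetical firm identifiers for six observations (not the lecture's data).
firm = [1, 1, 2, 2, 3, 3]


def make_dummies(ids):
    """Return {level: list of 0/1 indicators, one per observation}."""
    levels = sorted(set(ids))
    return {level: [1 if obs == level else 0 for obs in ids] for level in levels}


dummies = make_dummies(firm)
for level, column in dummies.items():
    print(f"firm{level}: {column}")

# For every observation exactly one dummy equals 1, so each row sums to 1.
row_sums = [sum(column[i] for column in dummies.values()) for i in range(len(firm))]
print("row sums:", row_sums)
```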

When we add all the dummies to a regression model, your regression software will typically drop one of them from the model. Here firm 1 has been omitted. The reason is that if you add all of the dummies to the model, together they are perfectly collinear with the intercept. So in practice we typically omit the first dummy. We only have the firm 2 to firm 15 dummies, and firm 1 is the reference category. The idea is that the profitability of firm 1 when R&D is 0 is given by the intercept, and the firm 2 dummy gives the average difference between firm 1 and firm 2 when R&D investments are held constant. So these dummies don't indicate any absolute levels; they indicate the difference between the focal firm, firm 2 for example, and the reference category, firm 1.

Quite often we wouldn't interpret these dummies, because there are quite a few of them and typically we are not interested in specific cases. We're interested in how the regression line goes, controlling for the fact that we have data from multiple different companies. So this is the first strategy. We estimate the dummy variable model, so we basically allow each company to have its own intercept that is estimated from the data, and these companies have regression lines with the same slope. Each company basically receives the same regression line except that the intercept can be different. So that is one easy strategy: we model the differences between these companies.
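The dummy variable strategy, including the collinearity that forces one dummy to be dropped, can be sketched as follows. The data are simulated (an assumption, not the lecture's data), with a within-firm slope of -0.4 built in, and the variable names are chosen only for this example.

```python
import numpy as np

# Simulated panel (an assumption for illustration): 15 firms, 10 years each.
rng = np.random.default_rng(0)
n_firms, n_years = 15, 10
firm = np.repeat(np.arange(n_firms), n_years)
firm_rd = rng.uniform(5, 20, n_firms)
rd = firm_rd[firm] + rng.normal(0, 2, firm.size)
profit = firm_rd[firm] - 0.4 * rd + rng.normal(0, 1, firm.size)

# One indicator column per firm.
dummies = (firm[:, None] == np.arange(n_firms)).astype(float)
ones = np.ones(firm.size)

# Intercept plus all 15 dummies: the dummies sum to the intercept column,
# so the design matrix has rank one less than its number of columns.
X_all = np.column_stack([ones, rd, dummies])
rank_all, n_cols = np.linalg.matrix_rank(X_all), X_all.shape[1]
print("rank", rank_all, "with", n_cols, "columns")

# Dropping the first dummy makes the first firm the reference category.
X = np.column_stack([ones, rd, dummies[:, 1:]])
coef, *_ = np.linalg.lstsq(X, profit, rcond=None)
within_slope = coef[1]
print("common slope (within effect):", round(within_slope, 3))
```

The remaining coefficients in `coef[2:]` are the firm dummies, each read as a difference from the reference firm rather than as an absolute level.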

The second strategy is within-firm centering. In this strategy we don't model the constant, or stable, differences between companies. Instead, we eliminate the differences between the firms before the actual regression analysis. What we do is take R&D, the explanatory variable, and profitability, the dependent variable, and calculate the cluster mean of both of these variables. So we have R&D m, which stands for R&D mean. For the first company the mean R&D is 18%, for the second company it's 6.4%, and so on. Then we center R&D by subtracting the cluster mean from the original value. So this centered R&D, R&D c, is how much that observation differs from the mean value of the company.

So all these R&D c values sum to 0 within a company. We do the same for profitability, so we have the mean profitability and then the mean-centered profitability. This eliminates any systematic differences between companies, because after the within-firm centering all variables have a mean of zero within each firm. So the between-firm differences disappear from the data. Then we run a regression analysis where we just use the mean-centered dependent variable and the mean-centered independent variable, and we get the same regression estimate as before, which is the within effect. So this is a regression analysis where all between effects and all contextual effects have been eliminated from the data. What remains is the within effect, which is estimated.
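The centering steps can be sketched like this, again on simulated data that is only an assumption for illustration. The helper `cluster_mean` is a hypothetical name. The last lines check that the slope from the centered regression matches the slope from a regression with firm dummies.

```python
import numpy as np

# Simulated panel (an assumption for illustration): 15 firms, 10 years each.
rng = np.random.default_rng(0)
n_firms, n_years = 15, 10
firm = np.repeat(np.arange(n_firms), n_years)
firm_rd = rng.uniform(5, 20, n_firms)
rd = firm_rd[firm] + rng.normal(0, 2, firm.size)
profit = firm_rd[firm] - 0.4 * rd + rng.normal(0, 1, firm.size)


def cluster_mean(values, groups):
    """Mean of `values` within each group, repeated for every observation."""
    return (np.bincount(groups, weights=values) / np.bincount(groups))[groups]


# Center both variables within firm.
rd_c = rd - cluster_mean(rd, firm)
profit_c = profit - cluster_mean(profit, firm)

# Simple regression slope of centered profitability on centered R&D.
slope_centered = (rd_c @ profit_c) / (rd_c @ rd_c)

# Dummy variable regression for comparison (first firm is the reference).
dummies = (firm[:, None] == np.arange(n_firms)).astype(float)
X = np.column_stack([np.ones(firm.size), rd, dummies[:, 1:]])
slope_dummy = np.linalg.lstsq(X, profit, rcond=None)[0][1]

print("centered:", round(slope_centered, 4), "dummies:", round(slope_dummy, 4))
```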

Let's compare the three models. First, we have a model that ignores clustering. We just run a normal regression analysis of profitability on R&D. Then we have the dummy variable model, and then we have the within-firm centering model. We can see that the coefficients for the dummy variable model and for the centering model are the same, -0.418, and this is the within effect. So both of these techniques produce exactly the same estimate, and that is the estimate of the within effect. If we ignore clustering, we get the population average effect. The population average effect just gives us the regression coefficient ignoring clustering, and it's very difficult to give any causal interpretation to that effect. The within effect has a causal interpretation: how much can we expect the profitability of one firm to change if that firm increases its R&D investment by one unit. But there are some interesting features when we compare the models, particularly the dummy variable model and the within-firm centering model.

The first is that the R square values are quite different. For the first model it is 31%, for the second model 70%, and for the third model 20%. So why such large differences? Well, the R square of the first model roughly quantifies how much the within effect and the between effect together explain the data. It doesn't quantify that precisely, because if the within effect and the between effect are not the same, then estimating two different effects would give you a higher R square. But roughly, it is how much R&D generally explains profitability.

Then we have the 70% here in the dummy variable model. So what is this 70% R square? It quantifies how much the unobserved heterogeneity term, the contextual effect and the within effect together explain the data. So if we eliminate all three of those sources of variance in the data, there is still 30% of the variation that is unexplained. Then the within-firm centering gives us a 20% R square, and this is roughly how much R&D explains within-firm variation. So if we want to understand how much R&D investment influences the variation of an individual company's performance, then this R square of 20% would answer that question.

So which one should you report? You should really understand why these are different, but if you don't know which one to report, the within-firm centering R square is typically the most useful, because it has a clear interpretation as the R square of one particular effect: how much R&D influences the variation of company performance within a firm. The dummy variable and ignore-clustering R squares, in contrast, combine explanation on at least two different levels.
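To see how the three R square values come about, the sketch below computes them for simulated data. The data are an assumption for illustration, so the numbers will not match the 31%, 70% and 20% from the lecture, and `r_squared` is a hypothetical helper.

```python
import numpy as np

# Simulated panel (an assumption for illustration): 15 firms, 10 years each.
rng = np.random.default_rng(0)
n_firms, n_years = 15, 10
firm = np.repeat(np.arange(n_firms), n_years)
firm_rd = rng.uniform(5, 20, n_firms)
rd = firm_rd[firm] + rng.normal(0, 2, firm.size)
profit = firm_rd[firm] - 0.4 * rd + rng.normal(0, 1, firm.size)


def r_squared(X, y):
    """R square of an OLS fit of y on X (X must include the intercept column)."""
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ coef
    total = y - y.mean()
    return 1 - (resid @ resid) / (total @ total)


def cluster_mean(values, groups):
    """Mean of `values` within each group, repeated for every observation."""
    return (np.bincount(groups, weights=values) / np.bincount(groups))[groups]


ones = np.ones(firm.size)
dummies = (firm[:, None] == np.arange(n_firms)).astype(float)
rd_c = rd - cluster_mean(rd, firm)
profit_c = profit - cluster_mean(profit, firm)

r2_pooled = r_squared(np.column_stack([ones, rd]), profit)                   # ignores clustering
r2_dummy = r_squared(np.column_stack([ones, rd, dummies[:, 1:]]), profit)    # firm dummies
r2_centered = r_squared(np.column_stack([ones, rd_c]), profit_c)             # within-firm only

print(f"pooled {r2_pooled:.2f}, dummies {r2_dummy:.2f}, centered {r2_centered:.2f}")
```

The dummy variable R square can never be lower than the pooled one, because the dummy model contains the pooled model as a special case. The centered R square is measured against within-firm variation only, which is why it answers a different question.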

Then there is another interesting feature. While the estimates from the dummy variable model and within-firm centering are exactly the same, the standard errors are not the same. So what does that mean? The standard error quantifies how much we expect the coefficient to vary if we repeat the same analysis over and over on repeated samples from the same population. The dummy variable model and the within-firm centering model have been proven to produce the same estimates. So the real variation of the estimate from one sample to another is exactly the same. So how come the standard errors are different? If the variation of the dummy variable coefficient and the within-firm centering coefficient is actually the same, then one of these standard errors must be incorrect, because they both quantify the same variation in the hypothetical scenario of repeated analysis.

It turns out that the within-firm centering standard error is actually biased and inconsistent. It underestimates the variability of the regression coefficient. The reason is that when we within-firm center, we also take out some variation of the error term, and the variation of the error term is used to estimate the standard error. So the within-firm centering strategy should actually never be applied in practice to the dependent variable, because the standard errors will be inconsistent. If you do so, you have to apply a correction. There are analysis techniques, such as generalized least squares, that do this kind of centering, but those techniques also apply the correction to the standard errors. So if you want to center the dependent variable, you should always do so by using one of the canned procedures of your statistical software.
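The lecture does not spell out the correction, so the following is only a sketch under stated assumptions: conventional (homoskedastic) OLS standard errors and simulated data. Under those assumptions the two regressions leave the same residuals, but the centered regression divides the residual sum of squares by N - 2 degrees of freedom, while the dummy variable regression divides by N - 15 - 1, because the 15 firm means have been estimated as well. Rescaling by the ratio of the degrees of freedom recovers the dummy variable standard error. The helper `ols_se` is a hypothetical name.

```python
import numpy as np

# Simulated panel (an assumption for illustration): 15 firms, 10 years each.
rng = np.random.default_rng(0)
n_firms, n_years = 15, 10
firm = np.repeat(np.arange(n_firms), n_years)
firm_rd = rng.uniform(5, 20, n_firms)
rd = firm_rd[firm] + rng.normal(0, 2, firm.size)
profit = firm_rd[firm] - 0.4 * rd + rng.normal(0, 1, firm.size)


def ols_se(X, y):
    """Conventional OLS coefficients and standard errors."""
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ coef
    dof = X.shape[0] - X.shape[1]          # observations minus estimated coefficients
    sigma2 = (resid @ resid) / dof
    return coef, np.sqrt(np.diag(sigma2 * np.linalg.inv(X.T @ X)))


def cluster_mean(values, groups):
    """Mean of `values` within each group, repeated for every observation."""
    return (np.bincount(groups, weights=values) / np.bincount(groups))[groups]


ones = np.ones(firm.size)
dummies = (firm[:, None] == np.arange(n_firms)).astype(float)
rd_c = rd - cluster_mean(rd, firm)
profit_c = profit - cluster_mean(profit, firm)

_, se_d = ols_se(np.column_stack([ones, rd, dummies[:, 1:]]), profit)
_, se_c = ols_se(np.column_stack([ones, rd_c]), profit_c)
se_dummy, se_centered = se_d[1], se_c[1]

# Degrees-of-freedom rescaling: N - 2 for the centered fit, N - n_firms - 1 for the dummies.
n_obs = firm.size
se_corrected = se_centered * np.sqrt((n_obs - 2) / (n_obs - n_firms - 1))

print(f"dummy {se_dummy:.4f}, centered {se_centered:.4f}, corrected {se_corrected:.4f}")
```

In this sketch the uncorrected centered standard error is the smaller of the two, which is the underestimation described above.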

So these are two simple strategies. There is also a third simple strategy, running a separate regression analysis for each company, but that has the problem that you have a large number of models, each with a very small sample size, and it is unclear how you would aggregate the results for interpretation. So this is typically not something that people would consider. The dummy variable regression is actually a useful technique if you have a small number of cases. The problem with it is that the R square is difficult to interpret. The centering technique is something that you should not use; at least, you should never center the dependent variable.

So how should you actually model this data? The dummy variables are okay, but there are also other techniques. The more advanced techniques for multi-level modeling, which are actually more commonly used for multi-level data than normal regression analysis, can be categorized based on one assumption. If you can assume that there are no contextual effects of the variables of interest (econometricians say that the random effects assumption holds, and I have another video about that assumption), then you can apply some of these techniques. You can apply generalized least squares random effects estimation, maximum likelihood estimation of random intercept models, generalized estimating equations, or normal regression analysis with cluster-robust standard errors.

If you cannot assume that the contextual effects are zero, that is, if you know or have an idea that they may be non-zero, then you can use generalized least squares fixed-effects regression analysis. Alternatively, you can use any of these analysis techniques and add the cluster means of the variables of interest as controls. Recall that the cluster means were the means of the variables within clusters that you calculate when you do the cluster mean centering procedure.
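The cluster means approach can be sketched with plain OLS, which is only one of the techniques listed above, and with simulated data that is an assumption for illustration. Adding the firm mean of R&D as a control makes the coefficient of R&D itself equal to the within effect. The coefficient of the firm mean is then usually read as the contextual effect.

```python
import numpy as np

# Simulated panel (an assumption for illustration): 15 firms, 10 years each.
rng = np.random.default_rng(0)
n_firms, n_years = 15, 10
firm = np.repeat(np.arange(n_firms), n_years)
firm_rd = rng.uniform(5, 20, n_firms)
rd = firm_rd[firm] + rng.normal(0, 2, firm.size)
profit = firm_rd[firm] - 0.4 * rd + rng.normal(0, 1, firm.size)

# Cluster (firm) mean of R&D, repeated for every observation of the firm.
rd_mean = (np.bincount(firm, weights=rd) / np.bincount(firm))[firm]

# OLS of profitability on R&D with the firm mean of R&D as a control.
X = np.column_stack([np.ones(firm.size), rd, rd_mean])
coef, *_ = np.linalg.lstsq(X, profit, rcond=None)
slope_with_means = coef[1]

# Within effect from within-firm centered R&D, for comparison.
rd_c = rd - rd_mean
slope_within = (rd_c @ profit) / (rd_c @ rd_c)

print("R&D coefficient with cluster mean control:", round(slope_with_means, 4))
print("within effect from centering:", round(slope_within, 4))
print("coefficient of the cluster mean:", round(coef[2], 4))
```

Only the coefficients are computed here. The standard errors would still need to account for clustering, for example with cluster-robust standard errors from a statistical package.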

TU-L0022 - Statistical Research Methods D, Lecture, 2.11.2021-6.4.2022
