TU-L0022 - Statistical Research Methods D, Lecture, 2.11.2021-6.4.2022
Modeling the within effect with OLS regression (14:53)
This video explains how OLS regression can be applied to estimate the within effect from multi-level data.
Normal regression analysis can be used to estimate models with
multi-level data. While normal regression analysis is not always the
ideal technique for doing so, there are a couple of simple strategies
that can be applied to estimate models with such data. These techniques
provide a starting point for understanding more complex analysis
techniques. Let's take a look at how OLS regression can be applied to
estimate the within effect from multi-level data.
Our example data is the 15 companies observed over 10 years that we
looked at in a previous video, and we can see that the within effect and
the between effect are not the same. The companies that invest only a
little in R&D are less profitable than the companies that invest
heavily in R&D. Nevertheless, the within effect in a company is
negative. So when a company increases its R&D investment, such as this
company here, its profitability goes down. So the between effect is
positive and the within effect is negative, and we want to understand
how to estimate the within effect from this data. So we want to take
these two effects apart and estimate the within effect.
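To make the difference between the two effects concrete, here is a minimal sketch on a made-up toy panel (3 hypothetical firms with 3 observations each, not the 15 companies from the video). The numbers are assumptions chosen so that firms with more R&D are more profitable overall, while profitability falls inside each firm as its R&D rises.

```python
# Toy panel (hypothetical numbers): firm -> (R&D values, profitability values).
# Each firm has its own baseline profitability; within a firm,
# profitability falls by 0.5 for every extra unit of R&D.
data = {
    "A": ([1, 2, 3], [2.5, 2.0, 1.5]),
    "B": ([4, 5, 6], [7.0, 6.5, 6.0]),
    "C": ([7, 8, 9], [11.5, 11.0, 10.5]),
}


def slope(x, y):
    """OLS slope of y on x (simple regression with an intercept)."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx


# Pooled regression that ignores which firm each observation comes from.
all_x = [v for x, _ in data.values() for v in x]
all_y = [v for _, y in data.values() for v in y]
pooled_slope = slope(all_x, all_y)

# Separate regression per firm: the within-firm slopes.
firm_slopes = {firm: slope(x, y) for firm, (x, y) in data.items()}

print("pooled slope:", round(pooled_slope, 3))
print("per-firm slopes:", {f: round(s, 3) for f, s in firm_slopes.items()})
```

In this toy data the pooled slope comes out positive even though every firm's own slope is negative, which is the pattern described above.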
The within effect would be important, for example, for informing policy
on a firm level. Should a firm increase or decrease its R&D investments
if it cares about its profitability? That is a question that the within
effect could answer in this case. We could of course estimate a separate
regression model for each company. We have 15 companies, so let's split
the data into 15 subsamples and run a regression analysis on each
company, which is done here with the lines shown graphically. But the
problem is that then we only have ten observations for each company,
which is a very small number, and we also get 15 different regression
coefficients when typically we just want to report one. So how do we get
the within effect? There are two very easy strategies for doing so.
The first strategy is to use dummy variables. The idea of a dummy
variable is that if we have 15 companies then we create 15 variables,
and those variables indicate which company an observation is for. So we
originally have the variable firm, which takes 15 different values, and
then we create 15 new variables, firm 1 to firm 15. The firm 1 variable
receives the value of 1 for the first firm and 0 for the other firms.
The second dummy variable receives the value of 1 for the second firm
and 0 otherwise. So these dummy variables indicate which firm an
observation belongs to. The dummy variables are defined in such a way
that for any observation just one of them receives one and all the
others are zeros. So this indicates that this observation belongs to
firm one and not any other firm, so all these are zeros. And how do we
apply these in a regression analysis and what's the outcome?
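The dummy coding described above can be sketched in a few lines. The firm labels below are hypothetical and only three firms are used to keep the output short.

```python
# Hypothetical firm identifiers, one per observation.
firm = ["firm1", "firm1", "firm2", "firm3", "firm2"]

# One indicator column per distinct firm, in sorted order.
levels = sorted(set(firm))
dummies = [[1 if f == level else 0 for level in levels] for f in firm]

for f, row in zip(firm, dummies):
    print(f, row)
```

Every row contains exactly one 1, in the column of the firm that the observation belongs to.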
When we add all the dummies to a regression model, your regression
software will typically drop one from the model. Here firm 1 has been
omitted. The reason is that if you add all of the dummies to the model,
they sum to one for every observation and so are perfectly collinear
with the intercept. So in practice we typically omit the first dummy. We
only have the firm 2 to firm 15 dummies, and firm 1 is the reference
category. The idea is that the profitability of firm 1 when R&D is 0 is
given by the intercept, and the firm 2 dummy gives the average
difference between firm 1 and firm 2 when R&D investments are held
constant. So these dummies don't indicate any absolute levels; they
indicate the difference between the focal firm, firm 2 for example, and
the reference category, firm 1.
Quite often we wouldn't interpret these dummies because there are quite
a few of them and typically we are not interested in specific cases.
We're interested in how the regression line goes, controlling for the
fact that we have data from multiple different companies. So this is the
first strategy. We estimate the dummy variable model, so we basically
allow each company to have a specific intercept that is estimated from
the data, and these companies have regression lines with the same slope.
So each company basically receives the same regression line except that
the intercept can be different. So that is one easy strategy: we model
the differences between these companies.
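As a rough sketch of the dummy variable strategy, the code below fits profitability on R&D plus firm dummies for a made-up, noise-free toy panel (3 hypothetical firms, firm A as the reference category). The `ols` helper is a small illustrative solver written for this example with the standard library only; in practice you would use your statistical software's regression routine.

```python
def ols(X, y):
    """Solve the normal equations (X'X) b = X'y by Gauss-Jordan elimination."""
    k = len(X[0])
    # Augmented matrix [X'X | X'y].
    A = [[sum(r[i] * r[j] for r in X) for j in range(k)]
         + [sum(r[i] * yi for r, yi in zip(X, y))] for i in range(k)]
    for c in range(k):
        p = max(range(c, k), key=lambda r: abs(A[r][c]))  # partial pivoting
        A[c], A[p] = A[p], A[c]
        for r in range(k):
            if r != c:
                f = A[r][c] / A[c][c]
                A[r] = [a - f * b for a, b in zip(A[r], A[c])]
    return [A[i][k] / A[i][i] for i in range(k)]


# Toy panel (hypothetical numbers): profit = firm intercept - 0.5 * R&D.
firms = ["A"] * 3 + ["B"] * 3 + ["C"] * 3
rd = [1, 2, 3, 4, 5, 6, 7, 8, 9]
profit = [2.5, 2.0, 1.5, 7.0, 6.5, 6.0, 11.5, 11.0, 10.5]

# Design matrix: intercept, R&D, and a dummy for every firm except the
# reference firm A (the first level is dropped to avoid collinearity).
levels = sorted(set(firms))[1:]
X = [[1.0, x] + [1.0 if f == lv else 0.0 for lv in levels]
     for f, x in zip(firms, rd)]

coef = ols(X, profit)
intercept, within_slope = coef[0], coef[1]
firm_effects = dict(zip(levels, coef[2:]))

print("intercept (firm A at R&D = 0):", round(intercept, 3))
print("common slope (within effect):", round(within_slope, 3))
print("differences from firm A:", {f: round(v, 3) for f, v in firm_effects.items()})
```

All firms share one slope, and each dummy coefficient is that firm's intercept minus the intercept of the reference firm.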
The second strategy is within-firm centering. In this strategy we don't
model the constant or stable differences between companies; instead we
eliminate the differences between the firms before the actual regression
analysis. What we do is take R&D, the explanatory variable, and
profitability, the dependent variable, and calculate the cluster mean of
both of these variables. So we have R&D m, which stands for R&D mean,
and we calculate the mean R&D. For the first company it's 18%, then we
calculate the mean R&D for the second company and it's 6.4%, and so on.
Then we center the R&D by subtracting the cluster mean from the original
value. So this centered R&D, R&D C, is how much that observation
differs from the mean value of the company.
So all these R&D C's sum to 0 within a company. We do the same for
profitability. So we have the mean profitability and then the
mean-centered profitability. This eliminates any systematic differences
between companies because after the within-firm centering all variables
have a mean of zero within a firm. So the between-firm differences
disappear from the data. Then we run a regression analysis using just
the mean-centered dependent variable and the mean-centered independent
variable, and we get the same regression estimate as before, which is
the within effect. So this is a regression analysis where all between
effects and all contextual effects have been eliminated from the data.
What remains is the within effect, which is estimated.
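The centering steps above can be sketched as follows, again on a made-up toy panel of 3 hypothetical firms. The helper name `center_within` is an assumption for illustration. Note that this only reproduces the point estimate; it says nothing about the standard error.

```python
# Toy panel (hypothetical numbers): profit = firm intercept - 0.5 * R&D.
firms = ["A"] * 3 + ["B"] * 3 + ["C"] * 3
rd = [1, 2, 3, 4, 5, 6, 7, 8, 9]
profit = [2.5, 2.0, 1.5, 7.0, 6.5, 6.0, 11.5, 11.0, 10.5]


def center_within(values, groups):
    """Return (centered values, cluster means) for values grouped by cluster."""
    totals, counts = {}, {}
    for v, g in zip(values, groups):
        totals[g] = totals.get(g, 0.0) + v
        counts[g] = counts.get(g, 0) + 1
    means = {g: totals[g] / counts[g] for g in totals}
    return [v - means[g] for v, g in zip(values, groups)], means


rd_c, rd_m = center_within(rd, firms)
profit_c, profit_m = center_within(profit, firms)

# Both centered variables have mean zero, so the OLS slope is simply
# sum(x * y) / sum(x * x) and the intercept is zero.
within_slope = (sum(x * y for x, y in zip(rd_c, profit_c))
                / sum(x * x for x in rd_c))

print("cluster means of R&D:", rd_m)
print("within effect:", round(within_slope, 3))
```

For this toy data the slope matches the per-firm slope of -0.5 that was built into the numbers.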
Let's compare the three models. First, we have a model that ignores
clustering: we just run a normal regression analysis of profitability on
R&D. Then we have the dummy variable model, and then we have the
within-firm centering model. We can see that the coefficients for the
dummy variable model and for the centering model are the same, -0.418,
and this is the within effect. So both of these techniques produce the
exact same estimate, and that is the estimate of the within effect. If
we ignore clustering we get the population average effect. The
population average effect just gives us the regression coefficient
ignoring clustering, and it's very difficult to give any causal
interpretation to that effect. The within effect has a causal
interpretation: how much we can expect the profitability of one firm to
change if that firm increases its R&D investments by one unit. But
there are some interesting features when we compare the dummy variable
model and particularly the within-firm centering model.
The first is that the R square values are quite different. For the
first model it is 31%, for the second model 70% and for the third model
20%. So why such large differences? Well, the first R square quantifies,
in a way, how much the within effect and between effect together explain
the data. It doesn't quantify that precisely, because if the within
effect and between effect are not the same then estimating two different
effects will give you a higher R square. But roughly, it is how much
R&D generally explains profitability.
Then we have the 70% R square of the dummy variable model. What is this
70% R square? It quantifies how much the unobserved heterogeneity term,
the contextual effect and the within effect together explain the data.
So if we eliminate all three of those sources of variance in the data,
there is still 30% of the variation that is unexplained. Then the
within-firm centering gives us a 20% R square, and this is roughly how
much R&D explains within-firm variation. So if we want to understand how
much R&D investment influences the variation of an individual company's
performance, then this R square of 20% would answer that question.
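One way to see why the three R square values differ is to compute them on a small example. The sketch below uses a made-up toy panel with a little noise (3 hypothetical firms, not the video's data). It relies on a standard result (the Frisch-Waugh-Lovell theorem) that the dummy variable model leaves the same residuals as the centered regression, so the two R squares differ only in the total variation they are measured against.

```python
# Toy panel (hypothetical numbers) with a small amount of noise.
firms = ["A"] * 3 + ["B"] * 3 + ["C"] * 3
rd = [1, 2, 3, 4, 5, 6, 7, 8, 9]
profit = [2.7, 1.6, 1.7, 6.7, 6.6, 6.2, 11.6, 11.3, 10.1]


def mean(values):
    return sum(values) / len(values)


def center_within(values, groups):
    """Subtract each group's mean from the values belonging to that group."""
    means = {g: mean([v for v, h in zip(values, groups) if h == g])
             for g in set(groups)}
    return [v - means[g] for v, g in zip(values, groups)]


def fit(x, y):
    """Simple OLS of y on x with an intercept: (slope, residual sum of squares)."""
    mx, my = mean(x), mean(y)
    b = (sum((a - mx) * (c - my) for a, c in zip(x, y))
         / sum((a - mx) ** 2 for a in x))
    rss = sum((c - my - b * (a - mx)) ** 2 for a, c in zip(x, y))
    return b, rss


rd_c = center_within(rd, firms)
profit_c = center_within(profit, firms)

# Total variation of profitability, and the part of it that is within firms.
tss_total = sum((y - mean(profit)) ** 2 for y in profit)
tss_within = sum(y ** 2 for y in profit_c)

b_pooled, rss_pooled = fit(rd, profit)      # ignores clustering
b_within, rss_within = fit(rd_c, profit_c)  # within-firm centering

r2_pooled = 1 - rss_pooled / tss_total
r2_centering = 1 - rss_within / tss_within
# Dummy variable model: same residuals as the centered regression,
# but compared against the total variation instead of the within variation.
r2_dummy = 1 - rss_within / tss_total

print("ignore clustering:", round(r2_pooled, 3))
print("dummy variables:  ", round(r2_dummy, 3))
print("centering:        ", round(r2_centering, 3))
```

The dummy variable R square is always at least as large as the centering R square, because the firm intercepts also get credit for explaining the differences between firms.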
So which one should you report? You should really understand why these
are different, but if you don't know which one to report, typically the
within-firm centering R square is the most useful because it has a clear
interpretation as the R square of a particular effect: how much R&D
influences variation of company performance within that firm. The dummy
variable and ignore-clustering R squares, in contrast, combine
explanation on at least two different levels.
Then there is another interesting feature. While the estimates from the
dummy variable model and within-firm centering are exactly the same, the
standard errors are not the same. So what does that mean? The standard
error quantifies how much we expect the coefficient to vary if we repeat
the same analysis over and over on repeated samples from the same
population. The dummy variable model and the within-firm centering model
have been proven to produce the same estimates, so their real variation
from one sample to another is exactly the same. So how come the standard
errors are different? If the variation of the dummy variable coefficient
and the within-firm centering coefficient is actually the same, then one
of these standard errors must be incorrect, because they both quantify
the same variation in the hypothetical scenario of repeated analysis.
It turns out that the within-firm centering standard error is actually
biased and inconsistent: it underestimates the variability of the
regression coefficient. The reason is that when we within-firm center we
also take out some variation of the error term, and the variation of the
error term is used to estimate the standard error. So the within-firm
centering strategy should actually never be applied in practice to the
dependent variable, because the standard errors will be inconsistent. If
you do so, you have to apply a correction. There are analysis
techniques, such as generalized least squares, that do this kind of
centering, but those techniques also apply the correction to the
standard errors. So if you want to center the dependent variable, you
should always do so by using one of the canned procedures of your
statistical software.
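The video does not spell out the correction, so the following is only a sketch under one assumption: that the correction is the usual degrees-of-freedom adjustment. A regression run on centered data counts only the intercept and the slope as estimated parameters, whereas the dummy variable model also counts one intercept per firm. The toy data (3 hypothetical firms) and variable names are made up for illustration.

```python
import math

# Toy panel (hypothetical numbers) with a small amount of noise.
firms = ["A"] * 3 + ["B"] * 3 + ["C"] * 3
rd = [1, 2, 3, 4, 5, 6, 7, 8, 9]
profit = [2.7, 1.6, 1.7, 6.7, 6.6, 6.2, 11.6, 11.3, 10.1]


def center_within(values, groups):
    """Subtract each group's mean from the values belonging to that group."""
    means = {}
    for g in set(groups):
        members = [v for v, h in zip(values, groups) if h == g]
        means[g] = sum(members) / len(members)
    return [v - means[g] for v, g in zip(values, groups)]


rd_c = center_within(rd, firms)
profit_c = center_within(profit, firms)

n_obs, n_firms = len(rd), len(set(firms))
sxx = sum(x * x for x in rd_c)
b_within = sum(x * y for x, y in zip(rd_c, profit_c)) / sxx
rss = sum((y - b_within * x) ** 2 for x, y in zip(rd_c, profit_c))

# What a plain regression on the centered data would report:
# it subtracts only 2 degrees of freedom (intercept and slope).
se_naive = math.sqrt(rss / (n_obs - 2) / sxx)

# Assumed correction: also subtract the firm means that were estimated
# during centering, which matches the dummy variable model's
# degrees of freedom (one intercept per firm plus one slope).
se_corrected = math.sqrt(rss / (n_obs - n_firms - 1) / sxx)

print("within effect:", round(b_within, 3))
print("naive SE:     ", round(se_naive, 4))
print("corrected SE: ", round(se_corrected, 4))
```

The corrected standard error is larger than the naive one, consistent with the point that the centered regression understates the variability of the coefficient.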
So these are two simple strategies. Well, there is a third simple
strategy: run a separate regression analysis for each company. But that
has the problem that you have a large number of models, each with a very
small sample size, and it is unclear how you would aggregate the results
for interpretation. So this is typically not something that people would
consider. The dummy variable regression is actually a useful technique
if you have a small number of cases. The problem with it is that the R
square is difficult to interpret. The centering technique is something
that you should not use; at least you should never center the dependent
variable.
So how should you actually model this data? The dummy variables are
okay, but there are also other techniques. The more advanced techniques
for multi-level modeling, which are actually more commonly used for
multi-level data than normal regression analysis, can be categorized
based on one assumption. If you can assume that there are no contextual
effects of the variables of interest (econometricians say that the
random effects assumption holds, and I have another video about that
assumption), then you can apply some of these techniques: generalized
least squares random effects estimation, maximum likelihood estimation
of random intercept models, the generalized estimating equations
technique, or normal regression analysis with cluster-robust standard
errors.
If you cannot assume that the contextual effects are zero, that is, if
you know or have an idea that they may be non-zero, then you can use
generalized least squares fixed-effects regression analysis or,
alternatively, you can use any of these analysis techniques and then use
cluster means of the interesting variables as controls. Recall that
cluster means are the means of the variables within clusters that you
calculate when you do the cluster mean centering procedure.
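As a minimal sketch of the cluster-means-as-controls idea, the code below runs a plain regression of profitability on R&D and the firm's mean R&D, using a made-up, noise-free toy panel of 3 hypothetical firms. The `ols` helper is an illustrative standard-library solver written for this example, not a routine from any package. With OLS, the coefficient on R&D then equals the within effect, and the coefficient on the cluster mean picks up the difference between the between effect and the within effect.

```python
def ols(X, y):
    """Solve the normal equations (X'X) b = X'y by Gauss-Jordan elimination."""
    k = len(X[0])
    # Augmented matrix [X'X | X'y].
    A = [[sum(r[i] * r[j] for r in X) for j in range(k)]
         + [sum(r[i] * yi for r, yi in zip(X, y))] for i in range(k)]
    for c in range(k):
        p = max(range(c, k), key=lambda r: abs(A[r][c]))  # partial pivoting
        A[c], A[p] = A[p], A[c]
        for r in range(k):
            if r != c:
                f = A[r][c] / A[c][c]
                A[r] = [a - f * b for a, b in zip(A[r], A[c])]
    return [A[i][k] / A[i][i] for i in range(k)]


# Toy panel (hypothetical numbers): profit = firm intercept - 0.5 * R&D.
firms = ["A"] * 3 + ["B"] * 3 + ["C"] * 3
rd = [1, 2, 3, 4, 5, 6, 7, 8, 9]
profit = [2.5, 2.0, 1.5, 7.0, 6.5, 6.0, 11.5, 11.0, 10.5]

# Cluster mean of R&D for each firm, repeated on every row of that firm.
rd_mean = {}
for g in set(firms):
    members = [x for x, h in zip(rd, firms) if h == g]
    rd_mean[g] = sum(members) / len(members)

# Design matrix: intercept, R&D, and the firm's mean R&D as a control.
X = [[1.0, x, rd_mean[f]] for f, x in zip(firms, rd)]
intercept, b_rd, b_rd_mean = ols(X, profit)

print("coefficient on R&D (within effect):", round(b_rd, 3))
print("coefficient on mean R&D:           ", round(b_rd_mean, 3))
```

In this toy data the between effect is 1.5 and the within effect is -0.5, so the coefficient on the cluster mean comes out as their difference.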