TU-L0022 - Statistical Research Methods D, Lecture, 2.11.2021-6.4.2022
The effect of non-independence of observations on regression (13:08)
In this video, the impact of non-independence of observations on regression analysis, and the kinds of problems it can cause for empirical analysis, is discussed with the help of several examples.
Transcript
Regression analysis assumes that the sample you are analyzing is a random sample from the population. That assumption could be violated, for example, if you have 100 observations but those observations are measured from only five different people, each of whom is measured 20 times. What is the impact of non-independence of observations on regression analysis, and what kind of problems could it cause for empirical analysis? Let's take a look. Here are the six regression assumptions according to Wooldridge, and the second assumption is the independence of observations. So what will happen if the observations are not independent? I will go through this with a couple of examples.
Let's take a simple example where we are interested in estimating the mean of the population. Our sample is 100 observations, and these 100 observations come from five clusters. Say we are observing five companies over 20 years, or we are measuring reaction times from five people, each measured 20 times, and we want to know the population mean. If the intraclass correlation is zero, so there is no dependence between the observations within a cluster, we get a very precise estimate of 0.08 for the mean. The actual population mean is 0 and the population variance is 1.
What will happen if we increase the intraclass correlation? We make the yellow, green, and purple observations closer to one another, so the data start to cluster: the yellow observations cluster here, the purple observations go here, and the green observations go somewhere in the middle. When we increase the intraclass correlation of these data, while maintaining the variance of the data, we can see that the sample mean becomes a less and less accurate estimator of the population mean. Originally, when we had 100 independent observations, our estimate was 0.08; after we have strongly clustered the data, it is 0.61. When the intraclass correlation is 1, we have a special case where there is no within-cluster variance. We have 100 observations but only 5 unique values. If we only have 5 unique values, then it makes no difference whether we have each of those 5 values 1000 times or just once, because we gain no new information about where the population mean is: after we have the first observation from a cluster, the other observations bring no new information into the analysis.
So the idea here is that when our data are independent, each observation brings the same amount of new information to the analysis. When the observations are dependent, that is, when there is intraclass correlation, the first observation from a cluster brings a lot of new, unique information, but once we have that first observation, the other observations from the same cluster tell us less and less about where the population mean is. For example, if we want to measure the average height of people in the university and our measurement tape contains some measurement error, then it is better to measure 100 people than to measure the same 10 people 10 times. And of course, if you have no measurement error, then measuring the same 10 or 5 people over and over will not improve the precision of your estimate.
So the problem is that when the intraclass correlation increases, when there is a lack of independence, our estimates will be less precise. They are still consistent and still unbiased, but they are less precise.
Okay, so that was one variable. What if we have two variables and we want to run a regression analysis? We have X and we have Y. We still have 100 observations nested in five clusters, so we have 20 observations for each cluster. Initially, the intraclass correlation is 0, so all of these observations are independent. There is no particular pattern in the colors, and our regression estimates are quite precise: the actual intercept is 0 and our estimate is 0.1; the actual slope is 1 and our estimate is 1.07, so it is pretty close. That is what you can expect from 100 observations with one explanatory variable in a regression analysis.
When we increase the intraclass correlation of both of these variables, we can again see some clustering: the yellow observations go here, the purple observations go here, and the green observations go here. Ultimately, when the intraclass correlation is one, we are in a scenario where we have just 5 observations that are repeated, and again, just repeating the same observations gives us no new information for the estimation problem. The outcome is that when both of these variables have clustering effects, both regression coefficients, the intercept and the slope, will be less and less precise. They are still consistent and still unbiased, but the effect is the same as when we estimated the mean from clustered data.
So in effect, intraclass correlation decreases our effective sample size. If we have 100 observations that are strongly clustered, it is possible that we actually have only 5 observations' worth of information. In less extreme cases, we could have something like 100 observations that actually give us information worth only about 20 observations, and so on.
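A standard way to quantify this, known as the Kish design effect (my addition, not stated in the lecture), approximates the effective sample size for equally sized clusters as

\[
n_{\text{eff}} = \frac{n}{1 + (m - 1)\rho},
\]

where \(n\) is the total number of observations, \(m\) is the cluster size, and \(\rho\) is the intraclass correlation. With \(n = 100\) and \(m = 20\), \(\rho = 1\) gives \(n_{\text{eff}} = 100/20 = 5\), and \(\rho \approx 0.21\) gives \(n_{\text{eff}} \approx 20\), matching the two examples above.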
Things get more interesting if only X is clustered, or only the error term is clustered, but not the other. Let's first take a look at what happens when our X is clustered but the error terms are independent.
We can see that as the intraclass correlation increases, the X values again become more and more clustered, until we have just five unique values. In this case, when X is clustered but the error term is not, the clustering actually has no effect. The intercept and slope estimates will be slightly different as the clustering changes, but that is just because estimating the same quantity from different samples gives different results. There is no systematic effect of the estimates getting worse and worse as the intraclass correlation increases. The reason for this is that regression analysis actually makes no assumptions about the independent variables: everything is estimated conditionally on the observed values. A researcher could even set these X values. For example, in an experimental context, we actually assign people to the treatment group and the control group, so those X values are not random variables; they are something that we set as researchers, and we could, of course, set them however we want and regression analysis would not be affected.
What if our X is not clustered but the error term is? This would be quite an unusual case, but it is nevertheless useful to understand what happens. When we cluster the error term, we effectively reduce the variation, or the number of unique values, in the error term, and that has one implication: the intercept is going to be estimated less precisely, but the slope estimate is going to stay about the same. One way to understand why is that these error term values, even if we have just one value for each cluster, still give us very useful information about the direction of the line, but not about how high the line is. As you can see, when the errors are exactly the same within each cluster, so the intraclass correlation is 1, all of the clusters form an exact line that is parallel to the population regression line, but the intercept is estimated less efficiently.
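The following Python sketch (my own illustration; the assumed parameter values, an intercept of 0, a slope of 1, and five clusters of 20 observations, match the lecture's setup) simulates many samples under each of the three scenarios and compares how much the intercept and slope estimates vary:

import numpy as np

rng = np.random.default_rng(seed=2)
n_clusters, per_cluster, n_reps, icc = 5, 20, 5_000, 0.9

def clustered(icc_val):
    # Draw a (5, 20) array whose within-cluster (row) correlation is icc_val.
    u = rng.standard_normal(n_clusters)[:, None]
    e = rng.standard_normal((n_clusters, per_cluster))
    return np.sqrt(icc_val) * u + np.sqrt(1 - icc_val) * e

for label, icc_x, icc_err in [("both clustered", icc, icc),
                              ("X only", icc, 0.0),
                              ("error only", 0.0, icc)]:
    est = np.empty((n_reps, 2))
    for r in range(n_reps):
        x = clustered(icc_x).ravel()  # rows are clusters, so membership aligns
        y = 0.0 + 1.0 * x + clustered(icc_err).ravel()
        est[r] = np.polynomial.polynomial.polyfit(x, y, 1)  # [intercept, slope]
    print(f"{label:14s}: sd(intercept)={est[:, 0].std():.3f}, "
          f"sd(slope)={est[:, 1].std():.3f}")

The expected pattern follows the lecture: "both clustered" inflates both standard deviations, "X only" leaves them essentially at their independent-data level, and "error only" inflates the intercept's while the slope's stays about the same.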
The error-clustered scenario would of course be very unusual in practice. Typically, if you cannot assume that your error term, the unobserved sources of variation in the dependent variable, is independent, then your explanatory variables cannot be assumed to be independent either. So we either have the case where the error term is independent, which would be the case in random sampling, while X could be non-independent, for example due to manipulation, or we have the scenario where both of these variables correlate within clusters.
So why would this be a problem? Why is non-independence of observations a problem, and what does it cause? As we saw, non-independence of observations does not lead to bias, and it does not lead to inconsistency, but it does lead to less precise estimates, and that is something we just cannot do anything about: if we do not have much information, we cannot estimate things precisely. But that is not really a problem per se, because we can simply state that we have an estimate but it is not very precise, and sometimes we just have to live with that.
The real problem is apparent if we look at the standard error formula, which is derived from the variance formula by plugging in the estimated variance of the error term for the sigma, with the total sum of squares of the predictor in the denominator. This equation depends only on the variance of the error term, the variance of the predictor variable, and the sample size.
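Written out for the simple-regression slope, the formula in question (the standard one from Wooldridge's textbook) is

\[
\operatorname{Var}(\hat{\beta}_1) = \frac{\sigma^2}{\mathrm{SST}_x}, \qquad
\operatorname{se}(\hat{\beta}_1) = \frac{\hat{\sigma}}{\sqrt{\mathrm{SST}_x}}, \qquad
\mathrm{SST}_x = \sum_{i=1}^{n} (x_i - \bar{x})^2,
\]

where \(\hat{\sigma}^2\) is the estimated variance of the error term. Since \(\mathrm{SST}_x = (n-1)s_x^2\), the formula indeed depends only on the error variance, the variance of the predictor, and the sample size.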
But as we saw, if we have a clustering effect in the data, the estimates will be less precise even if the variance of the error term, the variance of the predictor, and the sample size are the same, and this equation does not take the clustering into account. So regardless of whether we have five observations that are each replicated 20 times, so that our sample size appears to be 100 but contains far less information, or we actually have 100 unique observations, this formula gives us the same result. The outcome is that when you have clustering, the standard errors are generally estimated inconsistently and they will be negatively biased. You will overstate the precision of the estimates, which will cause incorrect inference; in particular, it can lead to false positive findings, rejecting the null hypothesis when in fact it should not be rejected.
So what can we do about this problem? There are a couple of strategies. One is to use a model that specifically includes terms that model the non-independence of the error term, which can be quite difficult to do if the pattern of dependency between observations is complex. Another approach is to use cluster-robust standard errors, which allow you to take an arbitrary correlation structure between observations into account. That is a very general strategy, and I will explain it in another video.
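As a quick sketch of the second strategy (my own illustration using the statsmodels library, not code from the lecture), the snippet below fits the same OLS model twice on clustered data, once with conventional standard errors and once with cluster-robust standard errors, so the two can be compared:

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(seed=3)
n_clusters, per_cluster, icc = 5, 20, 0.9
groups = np.repeat(np.arange(n_clusters), per_cluster)  # cluster id per row

def clustered():
    # One clustered variable: a cluster effect shared within each group.
    u = rng.standard_normal(n_clusters)[groups]
    e = rng.standard_normal(n_clusters * per_cluster)
    return np.sqrt(icc) * u + np.sqrt(1 - icc) * e

x = clustered()
y = 0.0 + 1.0 * x + clustered()  # both X and the error are clustered

X = sm.add_constant(x)
conventional = sm.OLS(y, X).fit()
robust = sm.OLS(y, X).fit(cov_type="cluster", cov_kwds={"groups": groups})

print("conventional se:", conventional.bse)  # typically too small here
print("cluster-robust se:", robust.bse)      # accounts for the clustering

One caveat worth knowing: with only five clusters, as in this example, the cluster-robust estimator is itself quite noisy; it is most reliable when the number of clusters is large.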