TU-L0022 - Statistical Research Methods D, Lecture, 2.11.2021-6.4.2022
The effect of non-independence of observations on regression (13:08)
In this video, the impact of non-independence of observations on regression analysis, and the kinds of problems it can cause for empirical analysis, is discussed with the help of several examples.
Transcript
Regression analysis assumes that the sample you are analyzing is a random sample from the population. That assumption could be violated, for example, if you have 100 observations but those observations are measured from only five different people, each of whom is measured 20 times. What is the impact of non-independence of observations on regression analysis, and what kind of problems could it cause for empirical analysis? Let's take a look. Here are the six regression assumptions according to Wooldridge, and the second assumption is the independence of observations. So what will happen if the observations are not independent? I will go through this with a couple of examples.
Let's take a simple example where we are interested in estimating the mean of the population. Our sample is 100 observations, and these 100 observations come from five clusters. Say we are observing five companies over 20 years, or we are measuring reaction times from five people, each measured 20 times, and we want to know the population mean. If the intraclass correlation is zero, so there is no dependence between the observations within a cluster, we get a very precise estimate of 0.08 for the mean. The actual population mean is 0 and the population variance is 1.
What will happen if we increase the intraclass correlation? We make the yellow, green, and purple observations closer to one another, so the data start to cluster: the yellow observations cluster here, the purple observations go here, and the green observations go somewhere in the middle. When we increase the intraclass correlation of these data, while maintaining the variance of the data, we can see that the sample mean becomes a less and less accurate estimator of the population mean. Originally, when we had 100 independent observations, our estimate was 0.08; after we have strongly clustered the data, it is 0.61. When the intraclass correlation is 1, we have a special case where there is no within-cluster variance. We have 100 observations but only 5 unique values. If we only have 5 unique values, then it makes no difference whether we have each of those 5 values 1000 times or just once, because we gain no new information about where the population mean is: after we have the first observation from a cluster, the other observations bring no new information into the analysis.
So the idea here is that when our data are independent, each observation brings the same amount of new information to the analysis. When the observations are dependent, that is, when there is intraclass correlation, the first observation from a cluster brings a lot of new, unique information, but once we have that first observation, the other observations from the same cluster tell us less and less about where the population mean is. For example, if we want to measure the average height of people in the university and our measurement tape contains some measurement error, then it is better to measure 100 people than to measure the same 10 people 10 times. And of course, if you have no measurement error, then measuring the same 10 or 5 people over and over will not improve the precision of your estimate.
So the problem is that when the intraclass correlation increases, when there is a lack of independence, our estimates will be less precise. They are still consistent and still unbiased, but they are less precise.
Okay, so that was one variable. What if we have two variables and we want to run a regression analysis? We have X and we have Y. We still have 100 observations nested in five clusters, so we have 20 observations for each cluster. Initially, the intraclass correlation is 0, so all of these observations are independent. There is no particular pattern in the colors, and our regression estimates are quite precise: the actual intercept is 0 and our estimate is 0.1; the actual slope is 1 and our estimate is 1.07, so it is pretty close. That is what you can expect from 100 observations with one explanatory variable in a regression analysis.
When we increase the intraclass correlation of both of these variables, we can again see some clustering: the yellow observations go here, the purple observations go here, and the green observations go here. Ultimately, when the intraclass correlation is one, we are in a scenario where we have just 5 observations that are repeated, and again, just repeating the same observations gives us no new information for the estimation problem. The outcome is that when both of these variables have clustering effects, both regression coefficients, the intercept and the slope, will be less and less precise. They are still consistent and still unbiased, but the effect is the same as when we estimated the mean from clustered data.
So in effect, intraclass correlation decreases our effective sample size. If we have 100 observations that are strongly clustered, it is possible that we actually have only 5 observations' worth of information. In less extreme cases, we could have something like 100 observations that actually give us information worth only about 20 observations, and so on.
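A standard way to quantify this, known as the Kish design effect (my addition, not stated in the lecture), approximates the effective sample size for equally sized clusters as

\[
n_{\text{eff}} = \frac{n}{1 + (m - 1)\rho},
\]

where \(n\) is the total number of observations, \(m\) is the cluster size, and \(\rho\) is the intraclass correlation. With \(n = 100\) and \(m = 20\), \(\rho = 1\) gives \(n_{\text{eff}} = 100/20 = 5\), and \(\rho \approx 0.21\) gives \(n_{\text{eff}} \approx 20\), matching the two examples above.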
Things get more interesting if only X is clustered, or only the error term is clustered, but not the other. Let's first take a look at what happens when our X is clustered but the error terms are independent.
We can see that as the intraclass correlation increases, the X values again become more and more clustered, until we have just five unique values. In this case, when X is clustered but the error term is not, the clustering actually has no effect. The intercept and slope estimates will be slightly different as the clustering changes, but that is just because estimating the same quantity from different samples gives different results. There is no systematic effect of the estimates getting worse and worse as the intraclass correlation increases. The reason for this is that regression analysis actually makes no assumptions about the independent variables: everything is estimated conditionally on the observed values. A researcher could even set these X values. For example, in an experimental context, we actually assign people to the treatment group and the control group, so those X values are not random variables; they are something that we set as researchers, and we could, of course, set them however we want and regression analysis would not be affected.
What if our X is not clustered but the error term is? This would be quite an unusual case, but it is nevertheless useful to understand what happens. When we cluster the error term, we effectively reduce the variation, or the number of unique values, in the error term, and that has one implication: the intercept is going to be estimated less precisely, but the slope estimate is going to stay about the same. One way to understand why is that these error term values, even if we have just one value for each cluster, still give us very useful information about the direction of the line, but not about how high the line is. As you can see, when the errors are exactly the same within each cluster, so the intraclass correlation is 1, all of the clusters form an exact line that is parallel to the population regression line, but the intercept is estimated less efficiently.
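The following Python sketch (my own illustration; the assumed parameter values, an intercept of 0, a slope of 1, and five clusters of 20 observations, match the lecture's setup) simulates many samples under each of the three scenarios and compares how much the intercept and slope estimates vary:

import numpy as np

rng = np.random.default_rng(seed=2)
n_clusters, per_cluster, n_reps, icc = 5, 20, 5_000, 0.9

def clustered(icc_val):
    # Draw a (5, 20) array whose within-cluster (row) correlation is icc_val.
    u = rng.standard_normal(n_clusters)[:, None]
    e = rng.standard_normal((n_clusters, per_cluster))
    return np.sqrt(icc_val) * u + np.sqrt(1 - icc_val) * e

for label, icc_x, icc_err in [("both clustered", icc, icc),
                              ("X only", icc, 0.0),
                              ("error only", 0.0, icc)]:
    est = np.empty((n_reps, 2))
    for r in range(n_reps):
        x = clustered(icc_x).ravel()  # rows are clusters, so membership aligns
        y = 0.0 + 1.0 * x + clustered(icc_err).ravel()
        est[r] = np.polynomial.polynomial.polyfit(x, y, 1)  # [intercept, slope]
    print(f"{label:14s}: sd(intercept)={est[:, 0].std():.3f}, "
          f"sd(slope)={est[:, 1].std():.3f}")

The expected pattern follows the lecture: "both clustered" inflates both standard deviations, "X only" leaves them essentially at their independent-data level, and "error only" inflates the intercept's while the slope's stays about the same.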
The error-clustered scenario would of course be very unusual in practice. Typically, if you cannot assume that your error term, the unobserved sources of variation in the dependent variable, is independent, then your explanatory variables cannot be assumed to be independent either. So we either have the case where the error term is independent, which would be the case in random sampling, while X could be non-independent, for example due to manipulation, or we have the scenario where both of these variables correlate within clusters.
So why would this be a problem? Why is non-independence of observations a problem, and what does it cause? As we saw, non-independence of observations does not lead to bias, and it does not lead to inconsistency, but it does lead to less precise estimates, and that is something we just cannot do anything about: if we do not have much information, we cannot estimate things precisely. But that is not really a problem per se, because we can simply state that we have an estimate but it is not very precise, and sometimes we just have to live with that.
The real problem is apparent if we look at the standard error formula, which is derived from the variance formula by plugging in the estimated variance of the error term for the sigma, with the total sum of squares of the predictor in the denominator. This equation depends only on the variance of the error term, the variance of the predictor variable, and the sample size.
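Written out for the simple-regression slope, the formula in question (the standard one from Wooldridge's textbook) is

\[
\operatorname{Var}(\hat{\beta}_1) = \frac{\sigma^2}{\mathrm{SST}_x}, \qquad
\operatorname{se}(\hat{\beta}_1) = \frac{\hat{\sigma}}{\sqrt{\mathrm{SST}_x}}, \qquad
\mathrm{SST}_x = \sum_{i=1}^{n} (x_i - \bar{x})^2,
\]

where \(\hat{\sigma}^2\) is the estimated variance of the error term. Since \(\mathrm{SST}_x = (n-1)s_x^2\), the formula indeed depends only on the error variance, the variance of the predictor, and the sample size.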
But as we saw, if we have a clustering effect in the data, the estimates will be less precise even if the variance of the error term, the variance of the predictor, and the sample size are the same, and this equation does not take the clustering into account. So regardless of whether we have five observations that are each replicated 20 times, so that our sample size appears to be 100 but contains far less information, or we actually have 100 unique observations, this formula gives us the same result. The outcome is that when you have clustering, the standard errors are generally estimated inconsistently and they will be negatively biased. You will overstate the precision of the estimates, which will cause incorrect inference; in particular, it can lead to false positive findings, rejecting the null hypothesis when in fact it should not be rejected.
So what can we do about this problem? There are a couple of strategies. One is to use a model that specifically includes terms that model the non-independence of the error term, which can be quite difficult to do if the pattern of dependency between observations is complex. Another approach is to use cluster-robust standard errors, which allow you to take an arbitrary correlation structure between observations into account. That is a very general strategy, and I will explain it in another video.
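As a quick sketch of the second strategy (my own illustration using the statsmodels library, not code from the lecture), the snippet below fits the same OLS model twice on clustered data, once with conventional standard errors and once with cluster-robust standard errors, so the two can be compared:

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(seed=3)
n_clusters, per_cluster, icc = 5, 20, 0.9
groups = np.repeat(np.arange(n_clusters), per_cluster)  # cluster id per row

def clustered():
    # One clustered variable: a cluster effect shared within each group.
    u = rng.standard_normal(n_clusters)[groups]
    e = rng.standard_normal(n_clusters * per_cluster)
    return np.sqrt(icc) * u + np.sqrt(1 - icc) * e

x = clustered()
y = 0.0 + 1.0 * x + clustered()  # both X and the error are clustered

X = sm.add_constant(x)
conventional = sm.OLS(y, X).fit()
robust = sm.OLS(y, X).fit(cov_type="cluster", cov_kwds={"groups": groups})

print("conventional se:", conventional.bse)  # typically too small here
print("cluster-robust se:", robust.bse)      # accounts for the clustering

One caveat worth knowing: with only five clusters, as in this example, the cluster-robust estimator is itself quite noisy; it is most reliable when the number of clusters is large.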