TU-L0022 - Statistical Research Methods D, Lecture, 2.11.2021-6.4.2022
Testing linear hypotheses after regression (8:57)
This video explains why a
researcher would find linear hypothesis testing useful, how to use the
Wald test, and which statistical test should be used in which situation.
It also explains how the covariance between estimates enters the
calculation and when comparing regression coefficients makes sense.
Transcript
Regression analysis will give you estimates of regression
coefficients and statistical tests of whether those coefficients are
different from zero in the population. Sometimes, however, it is very
useful to be able to test other hypotheses as well, for example whether a
coefficient differs from some value other than zero, or whether two
coefficients are the same in the population. To do that, we need to
understand how to test a linear hypothesis after regression analysis.
Let's take an example: a regression of prestige on education, share of
women, and type of occupation, using the Prestige data that we have been
using before. We get some regression estimates, and we will be focusing
on the dummy variables. The coefficients of professional and white collar
tell us the expected difference between professional occupations and blue
collar occupations, and between white collar occupations and blue collar
occupations. So these regression coefficients are differences relative to
a reference category, which here is the blue collar category.
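As a concrete reference point, here is a minimal sketch of how such a
model might be fitted in R. It assumes the Prestige data from the car
package (regressing prestige on education, women, and type); the
coefficient names typeprof and typewc follow from how that data codes the
type variable, and the exact numbers shown in the video may come from a
slightly different specification.

    library(car)   # makes the Prestige data available and provides linearHypothesis()
    model <- lm(prestige ~ education + women + type, data = Prestige)
    summary(model) # typeprof and typewc are expected differences from the
                   # reference category, blue collar occupations ("bc")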
However, sometimes knowing the difference between each category and the
reference category is not enough. What if we wanted to know the
difference between professional and white collar occupations, and whether
it is statistically significant? The difference between professional and
white collar is simply the difference between these two estimates, which
here works out to about 10. But is that difference statistically
significant? We need to get a p-value.
We can see that the p-value for professionals is about 0.08 for an
estimate of about 7. Based on that, and considering that the difference
between professionals and white collars is about 10, we could reason that
if a difference of 7 is close to significant, then a difference of 10 is
perhaps significant. However, we need to do a proper test to assess
whether that is actually the case.
To do that, we use the Wald test. Here the null hypothesis is that the
type professional coefficient is the same as the type white collar
coefficient. To calculate the Wald test statistic, we take the estimate
squared and divide it by the standard error squared. So how do we do
that? We have to define what the estimate is here, and what the standard
error is here. To define the estimate, we write the null hypothesis in a
slightly different way: instead of saying that type professional equals
type white collar, we say that type professional minus type white collar
equals zero. Now we have something that we compare against zero in the
population. That difference is our estimate: the estimated difference
between type professional and type white collar, which we then raise to
the second power. That is easy enough.
How about the standard error squared? We have to understand what the
standard error quantifies. The standard error is an estimate of the
standard deviation of this estimate if we drew the same kind of random
sample over and over from the same population; in other words, how much
the estimate varies because of sampling fluctuations. In our case, the
standard error squared is the estimated standard deviation squared, and a
standard deviation squared is the same as a variance. So we have the
estimate squared divided by the variance of the estimate. How do we
calculate the variance of the estimate? We have the estimate, which is
type professional minus type white collar. We can plug in the numbers and
get a difference of about 10 in magnitude, and when we raise it to the
second power we get about 100. Then we divide it by the variance of that
estimate. But how do we do that?
We need this kind of equation. The numerator, the estimate, is easy
enough. When we take the difference of two quantities that both vary,
here the type professional and type white collar estimates, the variance
of the difference is the sum of the variances of the two quantities minus
two times the covariance between them. You can check this rule for
variances and covariances from the Wikipedia link here, or any good
regression book will also explain how these are calculated. We know the
variance of type professional and the variance of type white collar:
those are just the standard errors squared.
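In symbols, the rule just described and the Wald statistic it feeds into
look like this (writing b_prof and b_wc for the two estimated
coefficients; this notation is mine, not the video's):

    Var(b_prof - b_wc) = Var(b_prof) + Var(b_wc) - 2 * Cov(b_prof, b_wc)
    W = (b_prof - b_wc)^2 / Var(b_prof - b_wc)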
But what is this term here, the covariance between the two estimates? We
can think of it this way: what happens if, in a particular sample, the
prestige of the blue collar occupations that we use as the reference
category happens to be a bit lower? If the blue collar occupations'
prestige is a bit lower, then both the type professional and the type
white collar coefficients, which are evaluated against the blue collar
prestige, increase a bit. So when these two estimates vary over repeated
samples, they also covary; they will be correlated over repeated samples
most of the time. The variance-covariance matrix of the estimates is
something that the regression analysis will provide for you, and here is
the covariance matrix of the estimates for our example. The square root
of this variance here is the standard error for type professional, which
you can verify with your hand calculator, and the square root of this
variance here is the standard error for type white collar. And here is
the covariance between the two estimates. This is something that the
regression software provides for you; you don't have to understand how it
is calculated.
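The same calculation can be done by hand from the model object. Below is
a minimal sketch in R, continuing the hypothetical model fitted above;
the coefficient names typeprof and typewc are assumptions that depend on
how the type variable is coded.

    b <- coef(model)                       # estimated coefficients
    V <- vcov(model)                       # variance-covariance matrix of the estimates
    est     <- b["typeprof"] - b["typewc"] # the estimated difference
    var_est <- V["typeprof", "typeprof"] + V["typewc", "typewc"] -
               2 * V["typeprof", "typewc"] # variance of the difference
    W <- est^2 / var_est                   # the Wald test statistic
    W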
Then we take the numbers here, plug them into this equation, and we get
an answer of 12.325. We compare that 12.325 against the chi-square
distribution with one degree of freedom, or against the appropriate
F-distribution, because this is regression analysis and we know how
regression analysis behaves in small samples. If we didn't, we would use
the chi-square distribution. So whether you compare against the
F-distribution or the chi-square distribution depends on the same
consideration as whether you would use a z-test or a t-test: if you are
using statistics that have only been justified in large samples, you use
the z-test and the chi-square; if you are using statistics whose
small-sample behavior is known, you use the t-test and the F-test.
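For completeness, here is a sketch of how those p-values could be
obtained in R for a statistic of 12.325; the residual degrees of freedom
for the F comparison would come from the fitted model, and with a single
restriction the F statistic equals the Wald statistic.

    W <- 12.325
    pchisq(W, df = 1, lower.tail = FALSE)                         # large-sample chi-square p-value
    pf(W, df1 = 1, df2 = df.residual(model), lower.tail = FALSE)  # small-sample F p-value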
But you don't have to check that from your statistics book, because your
computer software will do all of this calculation for you. In R, we can
just use the linearHypothesis function and specify the hypothesis; R will
calculate the test statistic for you, 12.325, which is the same as we got
manually, and it will give you the proper p-value against the proper
F-distribution. So this is a highly significant difference.
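A minimal sketch of that call, again assuming the hypothetical model and
coefficient names used above:

    library(car)
    linearHypothesis(model, "typeprof = typewc")  # F test of the single linear restriction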
This kind of comparison is not restricted to comparing two categories of
a categorical variable. You can also test, for example, whether the
effects of women and education are the same, or whether the effect of
education is different from, let's say, five.
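In the same hypothetical setup, those hypotheses could be written like
this (whether they are meaningful is a separate question, discussed
next):

    linearHypothesis(model, "women = education")  # are the two coefficients equal?
    linearHypothesis(model, "education = 5")      # does the education coefficient equal 5?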
But comparing two regression coefficients comes with a big caveat: it
only makes sense if those two coefficients quantify the effects of
variables that are somehow comparable. You can't really compare the
number of years of education to the share of women; those are on
incomparable scales, and in many cases these kinds of comparisons don't
make much sense. Here, because we are comparing categories of the same
categorical variable, the coefficients are comparable and the comparison
makes sense. In some other scenarios it doesn't. So you really have to
think about whether the comparison makes sense before you do this kind of
statistical test. Your statistical software will run any test for you; it
will not tell you whether the test makes sense. You have to think for
yourself.