TU-L0022 - Statistical Research Methods D, Lecture, 2.11.2021-6.4.2022
Regression example with R (15:03)
In this video, data exploration and basic regression analysis using the statistical software R are demonstrated with the Prestige data set.
Transcript
I will now demonstrate regression analysis using R. To demonstrate regression, we need a data set to work with, and we will be using the Prestige data set that is used in one of the assignments. This data set covers 102 occupations from the census of Canada in the early 1970s. The units that we are observing are occupations, and the variables are education, the average number of years of education of people in the occupation; income, the average income of the occupation; women, the percentage of people in the occupation who are women, from 0 to 100; a prestige score, measured on a scale that we don't really know much about; census code, which is just an identifier that we don't care about; and type, a categorical variable distinguishing white-collar, blue-collar, and professional occupations.
This data set is available in the Companion to Applied Regression, or car, package in R, and you can access the package and the help file that I'm showing here with a couple of simple R commands. When we start analyzing a data set, the first thing that we need to do is to understand what the data are about: what are the scales of the variables, what are the means and standard deviations, and how are the variables related?
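The exact commands are not shown in the transcript, but a minimal sketch of loading the package and opening the data and its help file looks like this:

    # Load the car package, which contains the Prestige data set
    library(car)

    # Open the help file describing the variables
    ?Prestige

    # Inspect the first rows of the data
    head(Prestige)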
Calculating means and standard deviations is straightforward, so I will not cover that, but I'll show you one very useful data exploration tool before I go to the actual regression analysis. My favorite tool for exploratory analysis is the scatter plot matrix. A scatter plot matrix is a collection of scatter plots, one for each pair of variables: here we have income on the y-axis and education on the x-axis, with one point per observation, so we can see graphically how education and income are related. They clearly have a positive relationship, but the relationship is not exactly linear: initially income increases slowly as education increases, and then at some point it starts to increase more rapidly, so the relationship curves up a bit. We can also see one outlier here, an observation that is different from the others; it could be a problem for our analysis, or it could be an interesting case that we want to study more. I have another video about outliers. What we do with a scatter plot matrix is look at the data for patterns, both expected and unexpected ones. Here we have an interesting pattern between income and women: when the share of women is low, we have both high-income and low-income professions, but when the share of women goes up, we only have low-income professions. So left is less, right is more, up is more, down is less. Being a woman basically guarantees that you don't have access to these high-paying professions, but being a man doesn't guarantee you a high pay; instead, there are many male-dominated occupations that don't pay that much. The upper triangle here contains the same plots, just transposed: this plot is the exact same as that plot, only mirrored. I will not go through this matrix in more detail, but it's a very useful tool to work through when you start analyzing a data set yourself.
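The command used on the slides is not shown in the transcript, but one simple way to produce a scatter plot matrix like the one described is the pairs function in base R:

    # Scatter plot matrix of the four substantive variables;
    # each panel is a scatter plot of one pair of variables
    pairs(Prestige[, c("education", "income", "women", "prestige")])

The car package also provides a scatterplotMatrix function with more decoration, but pairs is enough for this kind of exploration.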
Let's go to the actual regression model. To do regression analysis, we need a research question, and our research question is: does the prestige of an occupation depend on education, income, and the share of women in the occupation? Expressed as a regression model, prestige = beta0 + beta1 * education + beta2 * income + beta3 * women + u. Here beta0 is some base level, the prestige when education, income, and the share of women are all 0; of course, such an occupation doesn't exist, but that's the base level in our regression analysis. Then beta1 is the effect of education, beta2 is the effect of income, beta3 is the effect of women, and the error term u represents variation in the data that the model doesn't explain. Here is a link to an explanation of the output, and I will now explain the output.
So I ran the regression analysis, and running it is straightforward: I specify the model in the lm command, where the tilde states that prestige is modeled as a weighted sum of education, income, and women, we use the Prestige data, and then we print out the summary of the regression model, which I stored as the reg1 object. We get some output, and what is the output about? The regression analysis output from your software is always some text and a lot of numbers, and we need to understand what these numbers mean, what they tell us, and which numbers are relevant for you and which ones are not, because some of these numbers are not very meaningful unless you want to do, for example, model comparisons. The output contains a couple of parts. First, we have a summary of the model that we ran; this is just a reminder of which regression produced these estimates, which is useful if we run, for example, five or ten different regressions and print the results at the end. Second are the residual statistics, describing the residual, the part that the model doesn't explain. The residual is assumed to be normally distributed, so we can check that the median is close to zero (the residuals have a mean of exactly zero because that's how they are defined), that the first and third quartiles are about equally far from zero, and that the minimum and maximum are also about equally far from zero. Here they roughly are: the difference between the two extremes is about two units, with the maximum at 17 units and the minimum at 19 units, so about a 10% difference, which is not a big deal. Then we have the actual regression estimates. These are the main results; they tell us the individual effects, and I will explain this part in more detail on the next slide. Finally, we have some model indices, most importantly the R-squared, along with some other statistics that tell us something about the goodness of the overall model; some of these can also be used for comparing regression models, which I will address in another video.
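The commands described above look roughly like this (reg1 is the name the video uses for the fitted model object):

    # Regress prestige on education, income, and share of women
    reg1 <- lm(prestige ~ education + income + women, data = Prestige)

    # Print the estimates, residual statistics, and model indices
    summary(reg1)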
So how do we interpret this? Let's look at the regression coefficients first. We have a couple of things here: the regression coefficient itself, then the standard error, which is estimated somehow, and then the t value, which is simply the test statistic for testing the null hypothesis. The t value is defined as the estimate divided by the standard error, and you can verify with a hand calculator, if you wish, that -6.79 divided by 3.24 is indeed about -2.098. Then we have the p-value. The p-value here is calculated based on some assumptions; the assumptions are not important at this point (they are important, but I'll cover them later). The null hypothesis for the p-value is that the regression coefficient is 0. So what is the probability of obtaining an estimate of -6.8 if there is no effect in the population? The probability is 0.03, which is less than 0.05, so we conclude that the coefficient is statistically significant, as p is less than 0.05, the conventional threshold. The single star here indicates that the p-value is below the 0.05 threshold; this line is a legend for the p-values, and three stars mean that a p-value is below 0.001.
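If you want to verify these numbers yourself, the t value and its two-sided p-value can be recomputed from the coefficient table. This is a sketch, assuming the model was stored as reg1 as above; coefs, t_value, and p_value are just illustrative names:

    # Extract the coefficient table (estimate, std. error, t value, p-value)
    coefs <- coef(summary(reg1))

    # t value = estimate / standard error, here for the intercept row
    t_value <- coefs["(Intercept)", "Estimate"] / coefs["(Intercept)", "Std. Error"]

    # Two-sided p-value from the t distribution with the residual degrees of freedom
    p_value <- 2 * pt(-abs(t_value), df = df.residual(reg1))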
Those were the regression coefficients. There are some other things here too, but notably we have this very small regression coefficient for income. Why is the estimate for income so small? Does it mean that income doesn't really matter when we consider the prestige of occupations? To understand what this coefficient means, we have to consider the scales of the variables. Income is expressed in dollars, so a one-dollar increase in income increases the prestige by 0.0013 units, and because the incomes are in the thousands of dollars, a $1 increase doesn't really make a difference. So maybe it would make more sense to rescale the income so that, instead of being expressed in individual dollars, it is expressed in thousands of dollars. If we multiply this coefficient by 1000, we get the effect of increasing the income by a thousand Canadian dollars, which is more meaningful. The fact that the effect is small in absolute magnitude, ignoring the scale, doesn't mean that it's non-significant: because of the scaling issue, this coefficient is actually highly significant, and it's a pretty large effect when you consider that the variable ranges from a few thousand Canadian dollars to about twenty-five thousand. The obvious thing to do here is to recode the income into thousands, so we get estimates that are more comparable and easier to interpret.
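One way to do this rescaling, as a sketch (reg2 is an illustrative name, not from the video), is to transform income directly inside the model formula:

    # I() lets us rescale income inside the formula; the income coefficient
    # is now the effect of a 1000-dollar increase in income on prestige
    reg2 <- lm(prestige ~ education + I(income / 1000) + women, data = Prestige)
    summary(reg2)

The fit is identical; only the income coefficient and its standard error are multiplied by 1000, so the t value and p-value stay the same.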
We will next be looking at the model quality, and this is done here. The model quality indices tell us something about the overall model fit. The most important one is the R-squared statistic, which tells us how much the model explains: the three variables together, income, the share of women, and education, explain about 80% of the variation of prestige. So we can say that the prestige of an occupation is mostly determined by the share of women, income, and education. To judge which one of those is the most important determinant, we would have to look at the individual regression coefficients, but together they explain about 80% of the variation.
Then we have the adjusted R-squared, which is 0.79, only slightly smaller than the R-squared, because we had only three explanatory variables and more than 100 observations. With more than 30 observations per explanatory variable, the adjustment made by the adjusted R-squared is small, because the bias of the R-squared can be expected to be small with that good a ratio of observations to variables. Adjusted R-squared is useful for comparing models that are non-nested (what that means I'll cover later), but it's also useful for interpretation: whenever you are unsure whether you should be looking at the R-squared or the adjusted R-squared, it's a better idea to interpret the model using the adjusted R-squared. If your sample size is large, it doesn't make a difference; the two are about the same, so there's no meaningful difference between them. If your sample size is small, then the adjusted R-squared is typically the more relevant metric for judging how well your model explains the data.
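Both statistics can be read from the summary object, and the adjustment itself is a simple function of the R-squared, the sample size n, and the number of explanatory variables k. A sketch, continuing from reg1 above (s, n, and k are illustrative names):

    s <- summary(reg1)
    s$r.squared        # R-squared, about 0.80
    s$adj.r.squared    # adjusted R-squared, about 0.79

    # The adjustment by hand: 1 - (1 - R^2) * (n - 1) / (n - k - 1)
    n <- nobs(reg1)    # 102 observations
    k <- 3             # three explanatory variables
    1 - (1 - s$r.squared) * (n - 1) / (n - k - 1)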
Then we have some other statistics. The residual standard error is the standard deviation of the residuals, and it is an estimate of the standard deviation of the error term. It is not typically interpreted directly, because it depends on the scale of the dependent variable, but it is used in some calculations; for example, the R-squared is calculated using this number.
Then we have 98 degrees of freedom. That tells us how complex the model is relative to our data: we could add 98 more terms to the model and still be able to estimate it. The degrees of freedom are not interpreted directly, but they are used for model comparisons and for calculating some statistics; for example, the distribution of the F-statistic shown here depends on the degrees of freedom.
The F-statistic is here, and it's useful for model comparisons. It can be calculated based on the R-squared, for example, and it is not interpreted directly, but it has a distribution that we can use for testing the null hypothesis that the R-squared is exactly zero. Just as the t statistic provides a test statistic for a regression coefficient being zero, which we compare against the t distribution, the F-statistic is a test statistic for the R-squared being zero, which we compare against the correct F-distribution, and here is the p-value for that comparison. Getting this kind of result, if all the independent variables were completely linearly unrelated to the dependent variable, would be very unlikely, so we reject the null hypothesis and conclude that these variables also explain the dependent variable in the population.
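To make the link between the R-squared and the F-statistic concrete, the F value reported by summary can be recomputed from the R-squared. A sketch, again assuming reg1 from before (R2, F_value, and df2 are illustrative names):

    s <- summary(reg1)
    R2 <- s$r.squared
    k <- 3                      # number of explanatory variables
    df2 <- df.residual(reg1)    # 98 residual degrees of freedom

    # F = (R^2 / k) / ((1 - R^2) / df2); compare with s$fstatistic
    F_value <- (R2 / k) / ((1 - R2) / df2)

    # p-value for the null hypothesis that R-squared is zero in the population
    pf(F_value, df1 = k, df2 = df2, lower.tail = FALSE)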
So we get all kinds of things from the model. The most important part to interpret among these indices is the R-squared or the adjusted R-squared; if you don't know which one to use, use the adjusted one. The other statistics are used for model comparisons and for calculating other things, and they will become relevant when you do model comparisons, which I'll explain in another video.