TU-L0022 - Statistical Research Methods D, Lecture, 2.11.2021-6.4.2022
Idea of regression analysis (8:33)
Regression analysis is one of the most commonly used statistical analysis techniques in the social sciences because it can be used to rule out alternative explanations for observed associations (i.e. correlations). Regression analysis has one dependent variable and one or more independent variables that are assumed to be linearly related. The objective of regression analysis is to explain the dependent variable as well as possible using the independent variables. The video presents the Venn diagram, the path diagram, and the graphical interpretation of regression analysis.
Transcript
I will next explain the idea of regression analysis. Regression analysis is one of the most commonly used analysis tools in quantitative research, and most applications of quantitative techniques can be thought of as special cases or extensions of this particular analysis. Regression analysis results are typically presented in a table like this. Here we have four different regression models, different regression coefficients, and different model indices. There are certain assumptions behind this table that you need to understand, and you need to understand what the numbers tell us. We will be looking at these kinds of tables in the next couple of videos, but I will first explain what regression analysis is about.

In regression analysis, we have two kinds of variables. We have one dependent variable that we want to explain; for example, a company's profitability, ROA, could be a dependent variable. Then we have multiple independent variables. The independent variables are the variables that we use to explain the dependent variable; for example, we could have CEO gender, company size, and company industry. Regression analysis then answers the question of how much these variables together explain the variation of the dependent variable, and which of the variables are most important for explaining it. Regression analysis allows us to control for alternative explanations for an observed correlation. In the case of the paper by Hekman, from which the previous table was taken, the authors explained patient satisfaction scores with, for example, physician productivity, physician quality, and physician accessibility. You have one thing that you explain with multiple things to see which of those potential explanatory variables matters.
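As a sketch of what such a model looks like in practice, here is a small Python example with simulated data; the variable names roa, ceo_gender, firm_size, and industry are made up for illustration and are not the data discussed in the lecture.

    import numpy as np
    import pandas as pd
    import statsmodels.api as sm

    rng = np.random.default_rng(1)
    n = 500

    # Simulated firm-level data (purely illustrative)
    firm_size = rng.normal(size=n)                      # e.g. log of total assets
    ceo_gender = rng.binomial(1, 0.3, size=n)           # 1 = female CEO
    industry = rng.binomial(1, 0.5, size=n)             # dummy for one industry
    roa = 0.5 - 0.2 * firm_size + 0.1 * ceo_gender + 0.05 * industry + rng.normal(scale=0.5, size=n)

    # Dependent variable: roa; independent variables: gender, size, industry
    X = sm.add_constant(pd.DataFrame({"ceo_gender": ceo_gender,
                                      "firm_size": firm_size,
                                      "industry": industry}))
    model = sm.OLS(roa, X).fit()

    print(model.params)      # regression coefficients (betas)
    print(model.rsquared)    # share of the variation in ROA explained by the model

A full coefficient table, similar in spirit to the kind of table shown at the start of the video, can be printed with model.summary().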
The idea of regression analysis is commonly presented as a Venn diagram. The Venn diagram is useful for illustrating some properties of regression analysis; it doesn't illustrate all of them, but it's a good starting point nevertheless. The idea of these circles is that this circle represents the variation of company performance, or return on assets. This is the variation of the dependent variable. This is the variation of the independent variable that we're interested in, which in this case is CEO gender. And this is the variation in company size. Now we are interested in how much of this covariation or correlation between gender and performance is due to gender, and how much is due to the effect of size, because size and gender are correlated. We could say that the correlation between gender and performance is partly due to the presumed causal influence of CEO gender on performance, partly because smaller companies tend to be more profitable, which is this correlation here, and also because smaller companies are more likely to hire women CEOs, which is this correlation here. We want to use regression analysis to parcel out the part that is shared by gender, size, and performance, to get the unique effect of gender on performance. We could think of regression analysis as doing something like this: it eliminates the effect of company size on the relationship between gender and performance. Of course, we are not limited to just two independent variables; we can have multiple competing explanations for the dependent variable in the model. Typically, we would have in the ballpark of 10 or 20 variables. We can take additional bites away to get a cleaner estimate of the correlation between gender and performance, one that is free of any third causes. Ultimately, we would get a clean causal effect of gender on performance if we had included all relevant controls in the model. That, of course, is easier said than done.
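The idea of ruling out an alternative explanation can be demonstrated with a small simulation; this is only a sketch with made-up numbers, not the CEO data discussed in the lecture. In the simulation, size drives both the probability of hiring a woman CEO and profitability, so the simple correlation between gender and performance is partly spurious, and adding size as a control moves the estimated gender coefficient toward the true value used in the simulation.

    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(2)
    n = 2000

    size = rng.normal(size=n)                                  # company size
    # Smaller companies are more likely to hire women CEOs
    gender = rng.binomial(1, 1 / (1 + np.exp(2 * size)))       # 1 = female CEO
    # Smaller companies are also more profitable; the true gender effect is 0.1
    perf = -0.5 * size + 0.1 * gender + rng.normal(scale=1.0, size=n)

    # Simple regression: gender only (size is an omitted third cause)
    simple = sm.OLS(perf, sm.add_constant(gender)).fit()
    # Multiple regression: gender with size controlled for
    controlled = sm.OLS(perf, sm.add_constant(np.column_stack([gender, size]))).fit()

    print(simple.params[1])       # overstates the gender effect in this simulation
    print(controlled.params[1])   # close to the true effect of 0.1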
Regression analysis is a statistical model, and a model is an equation, so whenever you hear the term model, it means that there is some math. The model can also be presented as a path diagram like this. I will first talk about the path diagram. The path diagram here has one dependent variable y and three independent variables x, and the x's are allowed to be freely correlated. Free correlation is this double-headed curved arrow, which means that we don't really care how these different explanatory variables, usually denoted with x, are related to one another; we are interested in estimating how they explain or predict the dependent variable y. The strength of the influence of each variable is quantified by a regression coefficient beta. We have one beta for each x here, and then we have beta 0, or the intercept, which tells us the base level of y when all of these explanatory or independent variables are at 0. And then we have some variation u that the model doesn't explain. This is the remaining variation that is not explained by the model. Let's say that the model explains 20% of the variation of the dependent variable, which is fairly typical for business research. Then the unexplained variation would account for 80% of the total variation of y in the data.
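The split between explained and unexplained variation can be read directly off a fitted model. Below is a minimal sketch with simulated data, where the noise level is chosen so that the model explains roughly 20% of the variation; the numbers are illustrative only.

    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(3)
    n = 1000

    x = rng.normal(size=(n, 3))                            # three independent variables
    beta = np.array([0.3, 0.2, 0.1])
    y = 1.0 + x @ beta + rng.normal(scale=0.75, size=n)    # noise chosen so R^2 is roughly 0.2

    fit = sm.OLS(y, sm.add_constant(x)).fit()

    print(fit.rsquared)                          # share of variation explained, about 0.2
    print(np.var(fit.resid) / np.var(y))         # unexplained share, about 0.8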
In equation form, we can see that y is a weighted sum of the x's, and the weights are the regression coefficients. Each of these regression coefficients quantifies the influence of one of the independent variables on the dependent variable.
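Written out for three independent variables, the generic form of the equation is:

    y = β0 + β1 x1 + β2 x2 + β3 x3 + u

where β0 is the intercept, each β is the regression coefficient of the corresponding x, and u is the variation that the model does not explain.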
For example, we can model patient satisfaction as a weighted sum of physician productivity, physician quality, physician accessibility, and some variation that the model doesn't explain. What's important to understand is that these effects are independent: when an x increases by one unit, its beta tells us the effect of that one-unit increase independently of the other variables. They are also linear: we assume that a one-unit increase in x is always associated with the same amount of increase in y, which is quantified by the beta.
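As a quick numeric check of this interpretation, the sketch below fits a small simulated model and shows that two observations differing by one unit in a single x, with everything else held fixed, always differ in their predicted y by exactly that variable's beta.

    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(4)
    x = rng.normal(size=(200, 2))
    y = 1.0 + 2.0 * x[:, 0] - 1.0 * x[:, 1] + rng.normal(size=200)
    fit = sm.OLS(y, sm.add_constant(x)).fit()

    # Two hypothetical observations that differ only by +1 in the first x
    a = np.array([[1.0, 0.5, 3.0]])      # [constant, x1, x2]
    b = np.array([[1.0, 1.5, 3.0]])
    print(fit.predict(b)[0] - fit.predict(a)[0])   # equals the estimated beta of x1
    print(fit.params[1])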
Graphically, regression analysis can be understood as a line. I will show you a two-variable regression analysis. This is also called simple regression because we have only one independent variable. Here the independent variable is, let's say, years of education, and the dependent variable is, let's say, salary. We are interested in knowing what the linear relationship is, so what is the best line that explains these data. In simple regression with one independent variable, you can basically think of it as plotting all the data as a scatterplot, we will show some scatterplots a bit later, and then drawing a line through the data; that gives us the regression line. The slope of this line, how strongly it goes up or down, is quantified by the regression coefficient.
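A two-variable regression along these lines can be fitted in a few lines of Python; the education and salary numbers below are simulated purely for illustration and are not data from the course.

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(5)
    education = rng.integers(8, 22, size=100)                          # years of education
    salary = 1500 + 250 * education + rng.normal(scale=800, size=100)  # salary, e.g. euros per month

    result = stats.linregress(education, salary)
    print(result.slope)       # regression coefficient: salary change per extra year of education
    print(result.intercept)   # base salary level when education is 0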
We make some assumptions when we run a regression analysis. One of the key assumptions justifying regression analysis is that the observations are equally and normally distributed around the regression line. That is, when we have a regression line here, the most likely case is that the observations are close to the line; there can be some observations that are far from the line, but they should be relatively rare.
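One informal way to look at this assumption is to inspect the residuals of a fitted model, that is, the vertical distances of the observations from the regression line. Below is a minimal sketch using the same kind of simulated education and salary data as above.

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(5)
    education = rng.integers(8, 22, size=100)
    salary = 1500 + 250 * education + rng.normal(scale=800, size=100)
    result = stats.linregress(education, salary)

    # Residuals: how far each observation falls from the regression line
    residuals = salary - (result.intercept + result.slope * education)

    print(np.round(residuals.mean(), 2))   # roughly zero by construction
    print(stats.shapiro(residuals))        # formal test of whether the residuals look normal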