TU-L0022_aalto-CUR-141790-3063741: Basic descriptive statistics (14:27)

According to the course settings, the course ended on 06.04.2022.

The video explains basic descriptive statistics: central tendency (mean and median), dispersion (standard deviation and variance), covariance, and correlation. Measures of central tendency quantify where the observations are located, and measures of dispersion quantify how much the observations are spread around their location. The video also explains covariance and its standardized form, correlation. Both are measures of linear association. Understanding these basic concepts is important if we want to understand more complex analysis techniques.

We will next cover a couple of basic statistical concepts that are related to data.

Descriptive statistics are numbers that we calculate from our data. They are like summaries of our data. Understanding these basic concepts is important, because quite often our more complicated models try to explain differences in means, try to explain variation, or assume something about the variances, and so on. To understand what these more complicated things do, we have to understand the basics. This is partly high school mathematics, but it's useful to revise it now before we go into more complicated things.

The first important thing to know is the concept of central tendency. These are data about 3,171 working-aged males from the United States. We have data on their heights. This shows the distribution of the heights: we have some people that are very short, we have some people that are very tall, and most people fall into these bins somewhere in the middle.

Each bar here represents a group of people: how many people fall into a category, for example 170 to 175 centimeters of height. So the height of the bar presents the number of people. This is a histogram; it presents how these heights are distributed. Then we have the kernel density plot of the same data. The kernel density plot shows us the distribution in another way. We just have a curve, and the curve estimates the probability density function, that's what it is called formally. The height of the curve tells what is the relative probability of observing a person here versus a person there, for example. The area under the curve is always one, so if the scale on the horizontal axis is in the tens, then the scale on the vertical axis must be 0.0-something. So that shows us that most of the people are in the middle, and then there are a few short people and a few tall people.
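As a rough sketch of the point about the area being one, the bar counts of a histogram can be rescaled into densities. The heights below are made up for illustration and are not the lecture data; the 5 cm bin width and the bin edges are also assumptions.

```python
# Hypothetical heights in centimeters (invented for illustration).
heights = [158, 163, 167, 169, 171, 172, 174, 175, 175, 176, 178, 179, 181, 184, 192]

bin_width = 5    # each bar covers 5 cm, e.g. 170 to 175
bin_start = 155  # left edge of the first bin
n_bins = 8       # bins cover 155 up to 195

# Count how many people fall into each bin (what the histogram bars show).
counts = [0] * n_bins
for h in heights:
    counts[(h - bin_start) // bin_width] += 1

# Rescale counts to densities so that the total bar area equals one.
densities = [c / (len(heights) * bin_width) for c in counts]
total_area = sum(d * bin_width for d in densities)

print(counts)      # number of people per bin
print(densities)   # small values, because the horizontal scale is in centimeters
print(total_area)  # approximately 1.0
```

Because each bar is 5 units wide and the areas must sum to one, every density value ends up well below 0.1, which matches the "0.0-something" vertical scale.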

Now the concept of central tendency tells us where this distribution is located. Are the people roughly distributed around 175 centimeters, or are they perhaps around 160 centimeters or 180 centimeters? So it tells us the location, which is another commonly used term for where the distribution actually is on the axis. We have two measures of central tendency that are the most important.

The mean is the most commonly used. The mean is just the average: you take the sum of all these people's heights and you divide it by the number of people. Then we have the median, which is the height of a typical person. The median is calculated by putting these people in a line so that the shortest person is in the front and the tallest person is in the back, and everyone is ordered based on their height, and then you take the person who is right in the middle. So it's the value of the middle observation. The median is a useful statistic for quantifying what a typical person in the population is like, because it's not sensitive to a few people who are very tall or very short. For example, if we had a person here who was 1 million centimeters tall, which of course is impossible, then the mean would be strongly affected but the median would hardly change. The mean and the median tell us what a typical person, a typical company, or whatever you're studying is like.
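The outlier example can be tried out with a few numbers. This is only a sketch: the nine heights are invented for illustration, and Python's standard `statistics` module is used for the two measures.

```python
from statistics import mean, median

# Hypothetical heights in centimeters (invented for illustration).
heights = [162, 168, 171, 174, 175, 177, 180, 183, 190]

# Add the impossible person from the example: 1 million centimeters tall.
with_outlier = heights + [1_000_000]

mean_before = mean(heights)
median_before = median(heights)      # the middle of nine ordered values
mean_after = mean(with_outlier)
median_after = median(with_outlier)  # average of the two middle values

print(mean_before, median_before)
print(mean_after, median_after)
```

With these numbers the mean jumps from about 176 to more than 100,000 centimeters, while the median only moves from 175 to 176.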

The other important concept is dispersion. Dispersion tells us how wide this distribution is: is everyone about the same size, say between 174 and 176 centimeters, or are people between 150 centimeters and 2 meters? Dispersion tells how widely these persons are spread out. The most commonly used measure of dispersion is the standard deviation. I'm not going to present you the definition, but it's important to know that for roughly bell-shaped data like these, plus or minus one standard deviation covers about two-thirds of the data, and plus or minus two standard deviations cover about 95% of the data. These green lines show the two standard deviations, and about 95% of the people fit into this area. Here the standard deviation is about 7.4 centimeters. If it was larger, say 10 centimeters, it would mean that plus two standard deviations would be a bit less than 2 meters, and minus two standard deviations would be a bit more than 150 centimeters. So it would tell us that people's heights vary more. Standard deviation tells how much the observations vary. There is this joke about why standard deviation is important: there are two statisticians, one is 150 centimeters tall and one is 160 centimeters tall, and they are about to cross a river that has a mean depth of 120 centimeters, and they're debating whether they should cross or not. They decide not to, because the mean doesn't tell what the deepest part is. So we also have to understand how much the depth of the river varies, instead of just knowing the average depth of the river. Standard deviation tells us how much variation there is in the observations.
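The two-thirds and 95% rules of thumb can be checked with simulated data. The sketch below assumes that heights are roughly normally distributed, which the lecture does not state explicitly; the mean of 175 and the standard deviation of 7.4 are the values mentioned in the lecture, and the sample itself is simulated rather than the real data.

```python
import random
from statistics import mean, pstdev

rng = random.Random(42)  # fixed seed so the sketch is repeatable
heights = [rng.gauss(175, 7.4) for _ in range(20_000)]  # simulated, not real data

m = mean(heights)
sd = pstdev(heights)

def share_within(k):
    """Share of observations within k standard deviations of the mean."""
    return sum(abs(h - m) <= k * sd for h in heights) / len(heights)

within_1 = share_within(1)  # roughly two-thirds
within_2 = share_within(2)  # roughly 95%
print(round(within_1, 3), round(within_2, 3))

# The lecture's "what if the standard deviation was 10" example, around a mean of 175:
two_sd_band = (175 - 2 * 10, 175 + 2 * 10)
print(two_sd_band)  # a bit more than 150 and a bit less than 200 centimeters
```

For distributions that are far from bell-shaped, these shares can differ noticeably, so the rule is best read as an approximation.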

Then there's the concept of standardization, which is also important. Standardization can be useful and it can be harmful, depending on the context, so it's important to understand why we standardize and when. For example, correlation, which I have mentioned before, is a standardized measure. It applies the idea of standardization: you take the observations that are distributed like that, where the mean is at 175 and the standard deviation is about seven centimeters. You subtract the mean from every observation, and you divide the result by the standard deviation. That gives you a new variable that has a mean of zero and a standard deviation of exactly one. We are basically throwing away the information about the location and the dispersion. We are just retaining the information on where each individual is located relative to the other individuals, and we also retain the overall shape of the distribution. This can sometimes make things easier to interpret. For example, if I say that I'm 176 centimeters tall, it tells you something about my height only if you know what the heights in the population are. If I say that my height is at the mean, then everyone understands that about 50% of Finnish males are taller than me and about 50% are shorter than me, so I'm of average height. Standardization can make things easier to interpret, but it can also make things harder to interpret, depending on the context. Standardization destroys information by eliminating the central tendency, or the location, and the dispersion from the data.
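Standardization as described here, subtracting the mean and dividing by the standard deviation, might be sketched as follows. The heights are invented for illustration, and the population form of the standard deviation (`pstdev`, dividing by the number of observations) is an assumption; many software packages divide by n − 1 instead.

```python
from statistics import mean, pstdev

# Hypothetical heights in centimeters (invented for illustration).
heights = [162, 168, 171, 174, 175, 177, 180, 183, 190]

m = mean(heights)
sd = pstdev(heights)  # population standard deviation (an assumption, see above)

# Subtract the mean from every observation and divide by the standard deviation.
z_scores = [(h - m) / sd for h in heights]

print([round(z, 2) for z in z_scores])
print(mean(z_scores))    # approximately 0: the location is gone
print(pstdev(z_scores))  # approximately 1: the dispersion is gone
```

The order of the individuals is unchanged, so each person's position relative to the others is retained, but the centimeters can no longer be recovered from the standardized values alone.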

Then there's also variance, which is another measure of dispersion, and variance is related to standard deviation. It's used because it's more convenient for some computations, and sometimes variance is easier to interpret. For example, in regression analysis, we assess how much of the variation of the dependent variable the model explains. We don't do that in the standard deviation metric; we do it in the variance metric. The standard deviation has the same unit as the original variable. So if the standard deviation is seven, then we know that these bars are 7 centimeters from the mean, and if we multiply the variable by 2 then the standard deviation doubles, so that's convenient. The variance measures the same thing, dispersion, but on a different metric. Variance is defined as the mean of squared differences from the mean: we take each observation, we subtract the mean, we square the difference, that is raise it to the second power, and then we take the mean of those squares. That gives us the variance. Variance and standard deviation are related, so that the standard deviation is the square root of the variance, and the variance is the square of the standard deviation. Typically, if we just want to interpret how a variable is distributed, we look at the standard deviation, because it is in a metric that is easier to understand. If the standard deviation is 7 centimeters, we can immediately say that 60-something percent of the people are between about 168 and 182 centimeters. That's how standard deviations are used. The variance here is 54.79 square centimeters, so that doesn't really tell us where people are located. But variance is useful for some other purposes, and particularly in more complicated models we use variances. Sometimes you report both, so that's possible as well.

The concept of variance is important for understanding the concept of covariance. The idea of variance was that it is the mean of the squared differences of each observation from the mean. So it's the same as the difference from the mean multiplied by the difference from the mean, averaged over the observations.
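The definition of variance as the mean of squared differences from the mean, and its relation to the standard deviation, can be written out directly. This is a minimal sketch: the helper name `variance` and the heights are made up for illustration, and the division by the number of observations follows the verbal definition above (software often divides by n − 1 instead).

```python
from math import sqrt

def variance(values):
    """Mean of squared differences from the mean (dividing by n)."""
    m = sum(values) / len(values)
    return sum((v - m) * (v - m) for v in values) / len(values)

# Hypothetical heights in centimeters (invented for illustration).
heights = [162, 168, 171, 174, 175, 177, 180, 183, 190]

var = variance(heights)
sd = sqrt(var)  # the standard deviation is the square root of the variance

# Multiplying the variable by 2 doubles the standard deviation
# and quadruples the variance.
doubled = [2 * h for h in heights]
var_doubled = variance(doubled)
sd_doubled = sqrt(var_doubled)

# The variance reported in the lecture, converted back to centimeters.
lecture_sd = sqrt(54.79)

print(var, sd)
print(var_doubled, sd_doubled)
print(round(lecture_sd, 2))  # about 7.4
```

The last line connects the two numbers mentioned in the lecture: a variance of 54.79 corresponds to a standard deviation of about 7.4 centimeters.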

Then we have another statistic called covariance. Here we have data on height and weight. The covariance tells us how strongly a person's height is related to the person's weight. We can see here that people who are taller also tend to be heavier, so there's a covariance here. The covariance measures how much two variables vary together, and it's defined similarly to variance, except that you don't multiply one variable's difference from its mean with itself. Instead, you multiply it with the other variable's difference from its mean, and you take the mean of those products.
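The parallel between variance and covariance is perhaps easiest to see in code. In this sketch the helper name `covariance` and the height and weight values are invented for illustration, and the mean is again taken by dividing by the number of observations.

```python
def covariance(xs, ys):
    """Mean of the products of the two variables' differences from their means."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / len(xs)

# Hypothetical data (invented for illustration): heights in cm, weights in kg.
heights = [162, 168, 171, 174, 175, 177, 180, 183, 190]
weights = [58, 66, 70, 69, 78, 75, 84, 82, 95]

cov_height_weight = covariance(heights, weights)

# The covariance of a variable with itself is its variance.
var_height = covariance(heights, heights)

print(cov_height_weight)  # positive: taller people tend to be heavier here
print(var_height)
```

Note that the covariance is expressed in the product of the two units (here centimeters times kilograms), which is one reason the standardized version is often easier to read.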

Then the concept of correlation, which many of you probably know, is just the covariance between standardized variables, and correlation varies between minus 1 and plus 1. So correlation is a standardized measure of linear association. When the correlation is 1, you know that two things are perfectly positively related; when it's minus 1, you know that two things are perfectly negatively related. When it's zero, they are linearly unrelated. So correlation measures how strongly the observations are clustered around a line.

These are scatter plots of two variables. With a correlation of 1, the observations are on a line. At 0.8 the observations are very closely clustered around the line, and 0.4 is something that we can still observe with the naked eye. Zero means that there is no linear relationship, and negative correlations mean that when one variable increases, the other one decreases. So that's the same, except the direction is opposite.

The correlation doesn't tell us the magnitude of the change. This is a correlation of 1, where there is a huge effect of the X variable on the Y variable. This is a correlation of 1 as well, but there is a small effect of the X variable on the Y variable: here the Y variable doesn't increase as strongly with the X variable. So correlation doesn't tell us about the magnitude of the effect. It just tells us how strong the association is. And here there is no correlation because the Y variable doesn't vary (strictly speaking, the correlation is undefined when a variable has zero standard deviation), and then we have the negative correlations here.

Importantly, correlation is a measure of linear association. Here we have two variables that are clearly associated, so there's a clear pattern, but it's nonlinear. Here is another pattern that's nonlinear: this one has a weak positive correlation, and this one is a clear association, but it's nonlinear. So correlation only tells us if we can describe the data with a line. There could be some other kinds of relationships as well. So saying that two variables are uncorrelated doesn't mean that they are not statistically related; it just means that the relationship cannot be expressed as a line.
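The three points about correlation (it is the covariance of standardized variables, it ignores the magnitude of the effect, and it only captures linear association) can be illustrated with a small sketch. The helper names and the data are invented for illustration and are not the plots shown in the video.

```python
from math import sqrt

def standardize(values):
    """Subtract the mean and divide by the standard deviation (dividing by n)."""
    m = sum(values) / len(values)
    sd = sqrt(sum((v - m) ** 2 for v in values) / len(values))
    return [(v - m) / sd for v in values]

def correlation(xs, ys):
    """Covariance between the standardized versions of two variables."""
    zx, zy = standardize(xs), standardize(ys)
    return sum(a * b for a, b in zip(zx, zy)) / len(xs)

x = [-3, -2, -1, 0, 1, 2, 3]  # hypothetical X values

steep = [10 * v for v in x]       # large effect of X on Y
shallow = [0.1 * v for v in x]    # small effect of X on Y
falling = [-2 * v + 5 for v in x] # Y decreases as X increases
parabola = [v * v for v in x]     # clear pattern, but nonlinear

r_steep = correlation(x, steep)        # approximately 1
r_shallow = correlation(x, shallow)    # approximately 1 as well
r_falling = correlation(x, falling)    # approximately -1
r_parabola = correlation(x, parabola)  # approximately 0

print(r_steep, r_shallow, r_falling, r_parabola)
```

The steep and the shallow line both give a correlation of about 1, and the parabola gives a correlation of about zero even though Y is completely determined by X. A variable that does not vary at all would make `standardize` divide by zero in this sketch, which reflects that the correlation is not defined in that case.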

TU-L0022 - Statistical Research Methods D, Lecture, 2.11.2021-6.4.2022
