TU-L0022 - Statistical Research Methods D, Lecture, 25.10.2022-29.3.2023
This course space end date is set to 29.03.2023
Basic descriptive statistics (14:27)
The
video explains basic descriptive statistics: measures of central tendency
(mean and median), measures of dispersion (standard deviation and
variance), and measures of linear association (covariance and
correlation). The measures of central tendency quantify where the
observations are located, and the measures of dispersion quantify how
much the observations are spread around their location. The video also
explains covariance and its standardized form, the correlation, which
are both measures of linear association. Understanding these basic
concepts is important if we want to understand more complex analysis
techniques.
Transcript
We will next cover a couple of basic statistical concepts that are related to data. Descriptive
statistics are numbers that we calculate from our data; they are like
summaries of the data. Understanding these basic concepts is important,
because quite often our more complicated models try to explain
differences in means, try to explain variation, or assume something
about the variances, and so on. To understand what these more
complicated techniques do, we have to understand the basics. This is
partly high school mathematics, but it's useful to revise it now before
we go into more complicated things. The first important
thing to know is the concept of central tendency. These are data on the
heights of 3,171 working-aged males from the United States. The figure
shows the distribution of the heights: we have some people that are very
short, some people that are very tall, and most people fall into the
bins somewhere in the middle. Each bar represents a group of people, the
number of people who fall into a category, for example 170 to 175
centimeters of height. This is a histogram; it presents how these
heights are distributed. Then we have the kernel density plot of the
same data. The kernel density plot shows us the distribution in another
way: we just have a curve, formally called the probability density
function, and its height tells us the relative probability of observing
a person at one point versus another. The area under the curve is always
one, so if the horizontal scale is in the tens, then the vertical scale
must be in the order of 0.0-something. The plot shows us that most of
the people are in the middle, and then there are a few short people and
a few tall people. Now the concept of central tendency tells us
where this distribution is located. Are the people roughly
distributed around 175 centimeters, or are they perhaps around 160
centimeters or 180 centimeters? So it tells us the location, and that's
another commonly used term for where this distribution actually sits on
the axis. We have two measures of central tendency that are the most
important. The mean is the most commonly used. The mean is just the
average: you take the sum of all these people's heights and you divide
it by the number of people. Then we have the median, which is the height
of a typical person. The median is calculated by putting these people in
a line so that the shortest person is in front and the tallest person is
in the back, everyone ordered by height, and then you take the person
who is right in the middle. So it's the value of the mid-most observation.
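The two measures can be sketched in a few lines of code. This is a minimal illustration with made-up heights, not the course data.

```python
# Mean vs. median on a small made-up sample of heights (cm).
heights = [158, 163, 170, 172, 175, 178, 181, 185, 190]

def mean(xs):
    # Sum of all values divided by the number of values.
    return sum(xs) / len(xs)

def median(xs):
    # Line the values up from smallest to largest and take the mid-most one.
    xs = sorted(xs)
    mid = len(xs) // 2
    if len(xs) % 2 == 1:
        return xs[mid]                      # odd count: the middle value
    return (xs[mid - 1] + xs[mid]) / 2      # even count: average of the two middle values

print(mean(heights))    # 174.67 (approximately)
print(median(heights))  # 175

# An absurd outlier shifts the mean dramatically but barely moves the median.
with_outlier = heights + [1_000_000]
print(mean(with_outlier))    # about 100,157
print(median(with_outlier))  # 176.5
```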
The median is a useful statistic for quantifying what a typical person
in the population is like, because it's not sensitive to extreme
observations. For example, if we had a person here who was 1 million
centimeters tall, which of course is impossible, then the mean would be
affected but the median wouldn't. The mean and the median tell us what a
typical person, or a typical company, or whatever you are studying, is
like. The other important concept is
dispersion. Dispersion tells us how wide this distribution is. Is
everyone about the same size, for example everyone between 174 and 176
centimeters, or are people anywhere between 150 centimeters and 2
meters? Dispersion tells how widely these observations are spread. The
most commonly used measure of dispersion is the standard deviation. I'm
not going to present the definition here, but it's important to know
that plus or minus one standard deviation around the mean covers about
two-thirds of the data, and plus or minus two standard deviations cover
about 95% of the data. These green lines show the two standard
deviations, and about 95% of the people fit into this area. Here the
standard deviation is about 7.4 centimeters; if it were 10 centimeters,
then plus two standard deviations would be a bit less than 2 meters, and
minus two standard deviations would be a bit more than 150 centimeters.
So it would tell us that people's heights vary more. The standard
deviation tells how much the observations vary. There is this joke about why
the standard deviation is important: there are two statisticians, one
150 centimeters tall and one 160 centimeters tall, and they are crossing
a river that has a mean depth of 120 centimeters. They are debating
whether they should cross or not, and they decide not to, because the
mean doesn't tell what the deepest part is. We have to understand how
much the depth of the river varies instead of just knowing the average
depth. The standard deviation tells us how much variation there is in
the observations. Then there's
the concept of standardization, which is also important. Standardization
can be useful, and it can be harmful, depending on the context, but it's
important to understand why we standardize and when. For example,
correlation, which I have mentioned before, is a standardized measure.
It applies the idea of standardization: you take observations that are
distributed like this, with the mean at 175 and a standard deviation of
about seven centimeters, you subtract the mean from every observation,
and you divide by the standard deviation. That gives you a new variable
that has a mean of zero and a standard deviation of exactly one. We are
basically throwing away the information about the location and
dispersion, and we are just retaining the information on where each
individual is located relative to the other individuals; we also retain
the overall shape of the distribution.
This can sometimes make things easier to interpret. For example, if I
say that I'm 176 centimeters tall, it may tell you something about my
height, if you know what the heights of others in the population are. If
I instead say that my height is at the mean, then everyone understands
that about 50% of Finnish males are taller than me and 50% are shorter,
so I'm of average height. Standardization can make things easier to
interpret, but it can also make things harder to interpret, depending on
the context.
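Standardization as described here can be checked numerically. This is a sketch with made-up heights chosen so that the mean is 175; the transcript's figures (mean 175, standard deviation about seven) are the idea, not these exact numbers.

```python
# Standardize a variable: subtract the mean, divide by the standard deviation.
import statistics

heights = [160.0, 168.0, 172.0, 175.0, 178.0, 182.0, 190.0]  # made-up sample

m = statistics.mean(heights)     # 175.0
s = statistics.pstdev(heights)   # population standard deviation

z = [(x - m) / s for x in heights]

# The standardized variable has mean 0 and standard deviation exactly 1 ...
print(statistics.mean(z))    # ~0 (up to floating point)
print(statistics.pstdev(z))  # ~1

# ... but each person keeps their relative position, and the shape of the
# distribution is preserved. Location and dispersion are thrown away.
```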
Standardization destroys information: it eliminates the central
tendency, or location, and the dispersion from the data. Then there's
also variance, which is
another measure of dispersion, and variance is related to the standard
deviation. It's used because it's more convenient for some computations,
and sometimes variance is easier to interpret. For example, in
regression analysis we assess how much of the variation of the dependent
variable the model explains; we don't do that in the standard deviation
metric, we do it in the variance metric. The standard deviation has the
same unit as the original variable. So if the standard deviation is
seven, then we know that these bars are 7 centimeters from the mean, and
if we multiply the variable by 2, then the standard deviation doubles,
so that's convenient. The variance measures the same thing, dispersion,
but on a different metric. Variance is defined as the mean of squared
differences from the mean: we take each observation, subtract the mean,
raise the difference to the second power, and then take the mean of
those squares. That gives us the variance. Variance and standard
deviation are related so that the standard deviation of the data is the
square root of the variance, and the variance is the square of the
standard deviation. Typically, if you just want to interpret how a
variable is distributed, you look at the standard deviation, because it
is in a metric that is easier to understand. If the standard deviation
is 7 centimeters, we can immediately say that sixty-something percent of
the people are between about 168 and 182 centimeters. That's how
standard deviations are used.
The variance here is 54.79, so that doesn't really tell us where people
are located. But variance is useful for some other purposes, and
particularly in more complicated models we use variances. Sometimes you
report both, so that's possible as well. The concept of variance is
important for understanding the concept of covariance. The idea of
variance was that it is the mean of the squared differences of each
observation from the mean, so it's the same as the difference from the
mean multiplied by the difference from the mean. Then
we have another statistic called covariance. Here we have data on height
and weight, and the covariance tells us how strongly a person's height
is related to the person's weight. We can see here that people who are
taller tend to also be heavier, so there's covariance here. The
covariance measures how much two variables vary together, and it's
defined similarly to variance, except that you don't multiply a
variable's difference from the mean with itself; instead, you multiply
one variable's difference with another variable's difference, and you
take the mean of that. Then
the concept of correlation, which many of you probably know, is just the
covariance between standardized variables, and correlation varies
between minus 1 and plus 1. So correlation is a standardized measure of
linear association. When the correlation is 1, you know that two things
are perfectly positively related; when it's minus 1, you know that they
are perfectly negatively related; when it's zero, they are linearly
unrelated. So correlation is a measure of linear association, which
means that it measures how strongly the observations are clustered on a
line. These are scatter plots of two variables with different
correlations. At a correlation of 0.8 the observations are very closely
clustered on the line; a correlation of 0.4 is something that we can
still observe with the naked eye. Zero means that there is no linear
relationship, and negative correlations mean that when one variable
increases, the other one decreases.
So that's the same, except the direction is opposite. The correlation
doesn't tell us the magnitude of the change: here is a correlation of 1
where there is a huge effect of the X variable on the Y variable, and
here is a correlation of 1 as well where there is only a small effect of
X on Y. Here the Y variable doesn't increase as strongly with the X
variable, so correlation doesn't tell us about the magnitude of the
effect; it just tells us how strong the association is. And this is a
zero correlation because the Y variable doesn't vary at all, and then we
have the negative correlations here.
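This point about magnitude can be demonstrated with two made-up data sets: a steep and a shallow perfectly linear relationship both give a correlation of 1. The correlation function below implements the definition used in the transcript (covariance divided by the product of the standard deviations).

```python
# Correlation measures how tightly points sit on a line, not how steep the line is.
import statistics

x = [1.0, 2.0, 3.0, 4.0, 5.0]
y_steep = [10 * v for v in x]     # large effect of x on y
y_shallow = [0.1 * v for v in x]  # small effect of x on y

def correlation(xs, ys):
    # Covariance of the two variables divided by the product of their
    # standard deviations (population versions).
    mx, my = statistics.mean(xs), statistics.mean(ys)
    sx, sy = statistics.pstdev(xs), statistics.pstdev(ys)
    cov = sum((a - mx) * (b - my) for a, b in zip(xs, ys)) / len(xs)
    return cov / (sx * sy)

print(correlation(x, y_steep))    # 1.0 (up to floating-point rounding) ...
print(correlation(x, y_shallow))  # ... and 1.0 here too, despite the tiny slope
```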
Importantly, correlation is a measure of linear association. Here we
have two variables that are clearly associated: there is a clear
pattern, but it's nonlinear. Here is another pattern that's nonlinear;
this one shows a weak positive correlation, and this one a clear
association that is nonlinear. So correlation only tells us whether we
can describe the data with a line; there could be other kinds of
relationships as well. Saying that two variables are uncorrelated
doesn't mean that they are statistically unrelated; it just means that
the relationship cannot be expressed as a line.
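This last point can be made concrete with a made-up example: y is completely determined by x through a nonlinear rule, yet the covariance (and hence the correlation) is zero.

```python
# "Uncorrelated" does not mean "unrelated": a perfect nonlinear pattern
# can still have zero covariance. Example: y = x^2 on a symmetric range of x.
import statistics

x = [-2.0, -1.0, 0.0, 1.0, 2.0]
y = [v * v for v in x]  # y is a deterministic function of x

mx, my = statistics.mean(x), statistics.mean(y)
cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / len(x)

print(cov)  # 0.0: no linear association, despite the exact relationship
```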