TU-L0022 - Statistical Research Methods D, Lecture, 2.11.2021-6.4.2022
Statistical inference (12:38)
This video explains statistical inference, which is the task of making statistical claims about the population.
Now we start talking about statistical inference, which refers to the task of making statistical claims about the population. Our question now is: we have an observation that the return on assets difference is 4.7% points. Is it a big deal? Does it matter? What do the data tell us, and what kind of inferences can we make from this sample? 4.7% points is a pretty big difference, but so what? What does it mean? What the data tell us directly is that at one point in time, in one sample, firms led by women are more profitable. The question now is: can we generalize? Can we say something beyond that particular sample? Can we say that this generalizes to other years, or is it just one year? If it's just one year, and the women-led companies happen to be more profitable but it wouldn't generalize to other years, then it's not a big deal. If it generalizes to other years, then it probably is a big deal. The second question is: does it generalize to other firms? Is it just these 500 companies in which the women-led companies are more profitable, or does it generalize to the thousand largest companies, or all companies in Finland, or all companies in all countries? How widely can we generalize?
We don't yet discuss causality or causal claims; we are just making claims that there is an association between two variables, or a difference between two groups, in a population, based on sample data.
Our example comes from the Talouselämä 500 magazine that I covered in a previous video. This is a Finnish business magazine that follows the 500 largest Finnish companies. In one particular year, 2005, there were big headlines in Finnish newspapers because on this list the return on assets of women-led companies was 4.7% points higher than that of men-led companies.
The first question we need to ask when we start discussing the generalizability of a sample statistic (a number calculated from a sample) is: does it generalize to the population? We have to ask: could this be by chance only? Is it possible that, because of sampling variation, the companies led by women just happened to have a better year than the companies led by men? Could it be just a random occurrence, or is it evidence of a systematic difference?
To answer whether it could be by chance only, we have to ask two important questions. The first is: is 4.7% points a large difference? Large differences rarely occur by chance only; small differences occur by chance only frequently. When we calculate something from a sample, the sample estimate is hardly ever exactly the population value; it's somewhere close. So is the estimate far enough away to say that it's improbable that this kind of result could occur by chance only, or is it close enough to the population value that it actually makes no difference?
Then we have to look at whether it is a large effect. The mean ROA is about 10% in this sample, and a 4.7% point difference would mean that if the men-led companies have, let's say, an 8% ROA, then the women-led companies have about a 13% ROA. So they are more than 50% more profitable than the men-led companies. That's a big difference.
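The effect-size arithmetic above can be checked directly. This is a minimal sketch; the 8% baseline is the lecture's illustrative figure, not the exact sample mean:

```python
# Effect-size check: men-led companies assumed at 8% ROA,
# women-led companies 4.7% points higher (figures from the lecture).
men_roa = 8.0
women_roa = men_roa + 4.7               # roughly 13% ROA
relative_gain = (women_roa - men_roa) / men_roa
print(round(women_roa, 1), round(relative_gain * 100))  # → 12.7 59
```

In relative terms the women-led firms would be about 59% more profitable, which is where the "more than 50% more profitable" claim comes from.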
The second important question relates to sample size. We know that the full sample is 500 companies, but that's not the full story. We also have to consider how many women-led companies there are. If there are just five, or if there are 250, those two situations would lead to very different conclusions. It happens that there were 22 women-led companies in the sample, so that's a fairly small number of observations.
Now to the question of statistical inference. We want to see whether this return on assets difference of 4.7% points is large enough that we can conclude that there probably is a systematic difference, and that it is not due to sampling fluctuations only.
We have to ask: what would be the probability of getting this kind of difference by chance only? You watched the video about John Rauser. What would John Rauser do in this scenario?
We have 500 companies, and we want to know whether the difference between the women-led companies and the men-led companies could occur by chance only. One strategy for answering that question is a permutation analysis, or permutation test, which is a fairly intuitive way of understanding statistical testing. We take the list of the largest companies (I got the data from a database, so these may not be exactly the same 500 companies, but that doesn't matter for the example), choose 22 companies at random, and compare their mean against the mean of the remaining 478 companies, calculating the difference. We repeat this 10,000 times and look at the differences. What is the probability of getting at least a 4.7% point difference in these comparisons?
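The permutation test just described can be sketched in a few lines of Python. The ROA values here are simulated (the actual Talouselämä data are not reproduced), so the resulting p-value illustrates the mechanics rather than the lecture's exact numbers:

```python
import numpy as np

rng = np.random.default_rng(42)

# Simulated ROA values for 500 companies: hypothetical data with a
# mean around 10, as in the lecture; not the real Talouselämä figures.
roa = rng.normal(loc=10.0, scale=8.0, size=500)

observed_diff = 4.7   # difference reported in the lecture (% points)
n_women = 22          # number of women-led companies in the sample
n_reps = 10_000       # number of permutation replications

diffs = np.empty(n_reps)
for i in range(n_reps):
    perm = rng.permutation(roa)
    # Mean of 22 randomly drawn companies minus the mean of the other 478
    diffs[i] = perm[:n_women].mean() - perm[n_women:].mean()

# p-value: share of random splits whose difference is at least as large
p_value = (diffs >= observed_diff).mean()
print(p_value)
```

Because both groups are drawn from the same pool, the permuted differences center on zero; the p-value is simply the fraction of random splits that reach 4.7% points by chance.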
So let's take a look at the results. I did the analysis; here are the first 200 comparisons. We can see that quite often, when we take 22 companies at random and compare them against the 478 remaining companies, the difference is very close to zero: no difference. Sometimes we get a negative difference. There cannot be any systematic difference, because I chose the companies randomly, and two random samples are always comparable.
But we also get differences larger than 4.7: in 9 of the first 200 comparisons using this permutation testing strategy. So the probability of getting a 4.7% point difference or larger in this test is 0.045 for the first 200 comparisons. Is that enough evidence to conclude that the 4.7% point difference is unlikely to be by chance only? Let's take a look at the bigger picture: the distribution of the estimates over 10,000 repeated samples. Sometimes we get a large negative estimate, sometimes a large positive estimate, but typically we get an estimate close to zero, because there should not be any difference: we are comparing one random sample from the population against another, so because of randomization there shouldn't be any systematic differences.
The probability of getting a 4.7% point or larger difference is 0.0347 over the 10,000 replications. This probability is called the p-value.
It is the probability of observing an effect equally large or greater when there is in fact no effect. We don't actually have to do the permutation testing and random sampling, because this shape looks familiar: it's the normal distribution. The differences are normally distributed, and many quantities in statistics follow a normal distribution. So instead of approximating this distribution by taking random samples, we only need to find out which normal distribution is the right one, that is, where to draw it, and then compare against that normal distribution.
So here is the normal distribution, overlaid on the observed distribution of estimates. The mean of the normal distribution is at zero; that's our base case of no difference. For the normal distribution we also need to know the dispersion, the standard deviation, and this standard deviation is estimated using the standard error, which the statistical software will print out for us. So we draw a normal distribution with its mean at 0, the null hypothesis value of no difference, and with its dispersion quantified by the standard error.
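The normal-approximation version of the test can be sketched as follows. The standard error of 2.7 is an assumed value chosen so that the numbers roughly match the lecture's 0.04; the actual standard error would come from the statistical software:

```python
from math import erf, sqrt

estimate = 4.7    # observed ROA difference (% points)
se = 2.7          # standard error (assumed for illustration)
null_value = 0.0  # null hypothesis: no difference

# Standardize the estimate, then take the upper-tail probability
# of the standard normal distribution.
z = (estimate - null_value) / se
p_value = 0.5 * (1.0 - erf(z / sqrt(2.0)))
print(round(p_value, 3))  # → 0.041
```

This replaces the 10,000 random splits with a single area calculation under the normal curve, which is what statistical software does behind the scenes.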
Then we compare: what is the size of the area in the tail? How probable is it to get an estimate of 4.7% points or higher given the null hypothesis? About 0.04, which is less than 0.05, the usual criterion for statistical significance.
Could it be by chance only? P is less than 0.05. If this were a research paper, we would conclude that there is a statistically significant difference, and we would write a paper. We would hopefully get it published somewhere, because we have a statistically significant result.
Of course, we have to consider that in this particular scenario there are probably reporters who want to say something positive about women. So they could do multiple comparisons: comparisons of growth, profitability, and other important statistics.
And if they happen to find one statistic that makes women look better, then they'd write a newspaper article about it. P-values work well when you do just one comparison, but because of the nature of the test, we will eventually get large effects by chance only. If we repeat this study every year, for example, checking profitability, liquidity, and growth over ten years, we have 30 comparisons. One of those comparisons will almost certainly give us p less than 0.05 by chance only.
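The 30-comparisons point can be quantified with a quick back-of-the-envelope calculation, assuming for simplicity that the comparisons are independent:

```python
# Chance of at least one p < 0.05 "discovery" across 30 independent
# comparisons (3 statistics over 10 years, as in the lecture's example)
# when there is no real effect in any of them.
alpha = 0.05
n_comparisons = 30
p_at_least_one = 1.0 - (1.0 - alpha) ** n_comparisons
print(round(p_at_least_one, 2))  # → 0.79
```

So even with no real effects anywhere, there is roughly a 79% chance of finding at least one "significant" result, which is why a single p < 0.05 from a mined comparison is weak evidence.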
So p less than 0.05 is not very strong evidence. It is some evidence if it is just one comparison, but if we do multiple comparisons, we can do this kind of data mining and always eventually get a p that is less than 0.05. If the p-value were less than 0.001, then I would buy the claim that there actually is an effect in the population.