MS-C1620 Statistical Inference

Homework exercise

To be solved at home before the exercise session.

1. Assume that we have an iid. random sample \(x_1, \ldots , x_{1000}\) and we’d like to use the normal Q-Q plot to assess whether the sample came from a normal distibution. How do you expect the normal Q-Q plot to roughly look like (i.e. what general features do you expect it to have and why), if the true distribution of the data is
  1. a normal distribution,
  2. a right-skew distribution,
  3. a left-skew distribution,
  4. a bimodal distribution,
  5. a distribution with light tails,
  6. a distribution with heavy tails?
2. Recall the differences between the interpretations of the \(\chi^2\) homogeneity test and \(\chi^2\) test for independence. Come up with a practical situation where the collected data can be expressed as a 2-by-2 table and a related research question for which the correct interpretation is through
  1. the \(\chi^2\) homogeneity test,
  2. the \(\chi^2\) test for independence.

Class exercise

To be solved at the exercise session.

Note: all the needed data sets are either given below or available in base R.

The data set rock contains measurements on 48 rock samples from a petroleum reservoir. Treat the data as an iid. random sample from some distribution and test whether the distribution of shape is normal.
1. Visualize the data to obtain a preliminary idea of the possible normality of the data.
2. Use the normal Q-Q plot to gain more evidence on the normality/non-normality of the data.
3. Conduct the Bowman-Shenton (Jarque-Bera) and the Shapiro-Wilk tests of normality on significance level 0.05.
4. After all the previous, would you conclude the data to be normal (or normal enough for methods with normality assumptions)?
5. Why is the data not really iid.?

The data set randu contains 400 triples of successive random numbers from the random number generator RANDU. Use the \(\chi^2\) goodness-of-fit test to assess whether the first elements in the triplets really obey the uniform distribution on \([0, 1]\).
1. Extract the first elements in the triplets and visualize their sample distribution.
2. Discretize the values into a suitable number of categories and calculate the observed category frequencies.
3. Compute the corresponding expected category probabilites under the uniform distribution on \([0, 1]\).
4. Recall the hypotheses of the test and conduct it on significance level 0.05.
5. What are the conclusions of the test? Compare your results with someone who used a different choice of categories for the discretization.

The data set Titanic contains information on the fate of passengers on the fatal maiden voyage of the ocean liner “Titanic”. We use the data to study whether there is a connection between the sex (Male/Female) of a passenger and surviving from the ship (No/Yes).
1. Extract a marginal table containing only the cross-tabulation of the variables Sex and Survived.
2. Find a suitable way to visualize the data.
3. Which test is appropriate for these data (and why?), \(\chi^2\) homogeneity test or the \(\chi^2\) test for independence?
4. Conduct your chosen test on significance level 0.05 and state your conclusions.

(Optional) Choose your favorite non-normal distribution and use simulations to study the Type II error probabilities of the Bowman-Shenton (Jarque-Bera) and Shapiro-Wilk tests of normality for that distribution on different sample sizes (e.g. \(n = 10, 100, 1000, 10000\)). That is, find out the probabilty of falsely concluding that the data comes from a normal distribution when it does not.