Homework exercise

To be solved at home before the exercise session.

  1. Visit the website https://datavizproject.com/ and pick one data visualization/plot that interests you. Find out how it is drawn and what aspects of the data its different components represent. Be prepared to explain in class how your chosen visualization works.

  2. Type the command data() in R to list the data sets available in the packages currently attached to your R session. Go through the data sets and pick one that interests you. Check the help file of the data set using the command ?datasetname (for example, ?rivers) for more detailed information. Be prepared to present your answers to the following questions in class:
    • What is the purpose of the data? What kind of phenomenon does it describe?
    • What kind of study is behind the data (observational, controlled, simulation, survey or something else)?
    • How is the data represented in R (univariate, multivariate, time series…)?
    • What kind of plots would you use to best summarize the data?
    • What kind of numerical statistics would you use to best summarize the data?
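
A minimal sketch of how this exploration might start. The data set rivers is only an illustrative choice here, and the object name available is made up for this example; any data set from the listing can be inspected the same way.

```r
# Titles of the data sets in the base package "datasets"
# (plain data() shows the same kind of listing for all attached packages)
available <- data(package = "datasets")$results[, c("Item", "Title")]
head(available)

# ?rivers            # opens the help page of the chosen data set

# How is the data stored in R?
str(rivers)          # structure of the object
class(rivers)        # here: a plain numeric vector
length(rivers)       # number of observations

# A first numerical summary
summary(rivers)
```

The output of str() and class() already hints at the answer to the question about representation: a plain vector suggests univariate data, a data frame multivariate data, and an object of class ts a time series.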

Class exercise

To be solved at the exercise session.

Note: all the needed data sets are available in base R.

  1. The data set rivers contains the lengths of 141 major rivers in North America.
    1. Find a suitable way to visualize the data and plot it.
    2. How are the lengths distributed based on your plot?
    3. Discretize the lengths into six classes: [min, 250], (250, 500], (500, 750], (750, 1000], (1000, 1250], (1250, max]. The function cut may prove helpful.
    4. Find a suitable way to visualize the discretized data and plot it.
    5. Which of the two visualizations is more informative?
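
A hedged sketch of how cut could be used for the discretization above. The variable names breaks, length_class and class_counts are chosen for this example only, and the bar plot is just one possible way to display the result.

```r
# Break points for the six classes; the end points come from the data itself
breaks <- c(min(rivers), 250, 500, 750, 1000, 1250, max(rivers))

# right = TRUE gives intervals of the form (a, b];
# include.lowest = TRUE closes the first interval, so the minimum is not lost
length_class <- cut(rivers, breaks = breaks, right = TRUE, include.lowest = TRUE)

# Number of rivers in each class
class_counts <- table(length_class)
print(class_counts)

# One possible plot of the discretized data
barplot(class_counts, las = 2, ylab = "Number of rivers")
```

Without include.lowest = TRUE the shortest river would fall outside the first interval and be coded as NA, because that interval would be open on the left.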

  2. The data set islands contains the areas of all landmasses in the world which exceed a certain threshold.
    1. Find a suitable way to visualize the data and plot it.
    2. How are the landmasses distributed based on your plot?
    3. Compute both robust and non-robust measures of location and scatter for the data.
    4. Remove some of the outliers from the data, think of a possible justification for doing so, and compute the same measures as in part 3.
    5. Compare the results of parts 3 and 4.
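
A minimal sketch of how the robust and non-robust measures could be collected in one place. The helper location_scatter is a hypothetical name introduced for this example, and the choice of measures (mean and standard deviation versus median, MAD and IQR) is one common option, not the only one.

```r
# Hypothetical helper: location and scatter measures for a numeric vector
location_scatter <- function(x) {
  c(
    mean   = mean(x),     # non-robust location
    sd     = sd(x),       # non-robust scatter
    median = median(x),   # robust location
    mad    = mad(x),      # robust scatter (scaled median absolute deviation)
    iqr    = IQR(x)       # robust scatter (interquartile range)
  )
}

full_data <- location_scatter(islands)
round(full_data, 1)
```

The same helper can be applied to a reduced version of the data, which makes the comparison between the two sets of results a matter of printing them side by side, for instance with rbind.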

  3. The data set Nile contains yearly measurements of the flow of the river Nile.
    1. Find a suitable way to visualize the data and plot it.
    2. How has the flow of the river changed during the years 1871-1970 based on the plot?
    3. Calculate the values of the following statistics for the flow: mean, standard deviation, variance, minimum, maximum, median, median absolute deviation, mode, skewness and kurtosis.
    4. How is each of the statistics in part 3 visible in the plot of part 1?
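
Most of the statistics listed above have ready-made functions in base R (mean, sd, var, min, max, median, mad), but mode, skewness and kurtosis do not. The following is a sketch under stated assumptions: the helper names sample_mode, central_moment, skewness and kurtosis are defined here for illustration, the moment-based formulas divide by n, and the kurtosis is not the excess kurtosis (so a normal distribution would give a value close to 3).

```r
flow <- as.numeric(Nile)   # drop the time series attributes

# Sample mode: the most frequent value (the first one in case of ties)
sample_mode <- function(x) {
  counts <- table(x)
  as.numeric(names(counts)[which.max(counts)])
}

# k-th central moment, dividing by n
central_moment <- function(x, k) mean((x - mean(x))^k)

# Moment-based skewness and (non-excess) kurtosis
skewness <- function(x) central_moment(x, 3) / central_moment(x, 2)^1.5
kurtosis <- function(x) central_moment(x, 4) / central_moment(x, 2)^2

c(mode = sample_mode(flow), skewness = skewness(flow), kurtosis = kurtosis(flow))
```

For measurements on a continuous scale the sample mode depends heavily on rounding, so it is worth checking how many times the most frequent value actually occurs before interpreting it.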

  4. (Optional) Try out the 3D visualization tools in the package rgl. The following code plots an interactive 3D scatter plot of the first three variables in the iris data. Find out how you can colour the points in the plot according to the variable Species.

library(rgl)  # install first with install.packages("rgl") if needed
# Opens in a new window
plot3d(iris[, 1:3])

  5. (Optional) Polished plots are often cumbersome to produce with base R, and numerous packages offer more convenient approaches. Try out the package ggplot2 by running the following code.

library(ggplot2)  # also provides the mpg data set
ggplot(data = mpg, aes(x = hwy, y = cty)) +
  geom_point() +
  labs(x = "Highway miles per gallon", y = "City miles per gallon")

What does the plot represent? Experiment with the code and find out how you can colour the points according to the class of the car.

Numerous tutorials about ggplot2 can be found online. Check out at least https://r4ds.had.co.nz/data-visualisation.html and http://r-statistics.co/ggplot2-Tutorial-With-R.html.