--- title: "Multivariate Statistical Analysis - Exercise Session 4" date: "`r format(Sys.time(), '%d.%m.%Y')`" output: bookdown::pdf_document2: latex_engine: xelatex number_sections: no toc: no header-includes: - \usepackage{amsmath} --- # Problem 1: Robust PCA ## a) Install the package \texttt{rrcov} and read the data Install the package \texttt{rrcov}. ```{r eval=FALSE} install.packages("rrcov") ``` Import package \texttt{rrcov} and read the data. ```{r message=FALSE} library(rrcov) wood <- read.table("wood.txt", header = TRUE, sep = "\t") head(wood) ``` The data set in this exercise is a modified version from Draper and Smith (1966) and it was used to determine the influence of anatomical factors on wood specific gravity, with five explanatory variables. The data is contaminated by replacing a few observations with outliers. ## b) Plot the variables pairwise Outliers are somewhat visible from Figure \@ref(fig:pairsWood). Still, pairwise scatter plots are quite hard to read when data is high dimensional. ```{r pairsWood, fig.cap="Pairwise scatter plots of variables."} # Color possible outliers color_outliers <- rep("black", nrow(wood)) color_outliers[c(10, 12, 13, 15)] <- "grey" # Plotting pairs(wood, pch = 16, upper.panel = NULL, col = color_outliers) ``` ## c) Sample covariance and MCD MCD based location and scatter estimates are calculated with \texttt{CovMcd} function. Parameter \texttt{alpha} controls the size of the subsets over which the determinant is minimized. Function \texttt{CovMcd} returns an S4 object. Most relevant fact about S4 objects regarding this course is that slots of S4 object are accessed with \texttt{@} not with \texttt{\\$}. ```{r} cov_regular <- cov(wood) cov_mcd <- CovMcd(wood, alpha = 0.5) cov_regular cov_mcd@cov ``` ## d) Regular and robust Mahalanobis distances Note that function \texttt{mahalanobis} returns squared mahalanobis distances. (ref:captionMaha) Regular and robust Mahalanobis distances for each observation of the \texttt{wood} data set. ```{r maha, fig.cap="(ref:captionMaha)"} maha_regular <- sqrt(mahalanobis(wood, center = colMeans(wood), cov = cov_regular)) maha_robust <- sqrt(mahalanobis(wood, center = cov_mcd@center, cov = cov_mcd@cov)) n <- nrow(wood) plot(rep(1:n, 2), c(maha_regular, maha_robust), pch = 16, col = c(rep("green", n), rep("blue", n)), xlab = "Observation", ylab = "Mahal. distance") legend("topleft", col = c("green", "blue"), cex = 0.8, legend = c("Regular", "MCD"), pch = 16) ``` Figure \@ref(fig:maha) shows that with MCD estimates we can spot potential outliers, however, by using sample mean and sample covariance outliers are not visible. ## e) It seems that scales of variables are different. Additionally, we do not know if variables have the same units. Thus we perform correlation based PCA. ```{r} apply(wood, 2, range) ``` The standard PCA is computed using princomp and the MCD PCA using PcaCov. By setting \texttt{scale = TRUE} we perform the correlation based PCA. We notice that the loadings are different implying different interpretations for the principal components. ```{r} pca_regular <- princomp(wood, cor = TRUE) pca_mcd <- PcaCov(wood, scale = TRUE, cov.control = CovControlMcd(alpha = 0.5)) pca_regular\$loadings[, ] pca_mcd@loadings ``` ```{r pairsPcaRegular, fig.cap="Pairwise scatter plots of scores for regular PCA."} pairs(pca_regular\$scores, pch = 16, upper.panel = NULL, col = color_outliers) ``` ```{r pairsPcaMcd, fig.cap="Pairwise scatter plots of scores for robust PCA."} pairs(pca_mcd@scores, pch = 16, upper.panel = NULL, col = color_outliers) ``` # Problem 2: Estimators of scatter ```{r} library(mvtnorm) set.seed(123) n <- 200 ``` ## a) Bivariate normal distribution ```{r data1, fig.cap="Scatter plot of a sample from a bivariate normal distribution."} data1 <- rmvnorm(n, mean = c(0, 0), sigma = diag(2)) plot(data1, pch = 16) ``` ```{r} cov(data1) CovMcd(data1, alpha = 0.5)@cov ``` ## b) Bivariate \$t\$-distribution (ref:data2Caption) Scatter plot a of sample from a bivariate \$t\$-distribution. ```{r data2, fig.cap="(ref:data2Caption)"} data2 <- rmvt(n, df = 5, sigma = diag(2)) plot(data2, pch = 16) ``` ```{r} cov(data2) CovMcd(data2, alpha = 0.5)@cov ``` ## c) Bivariate Weibull-Gamma ```{r data3, fig.cap="Scatter plot of a sample from a bivariate distribution where the first component follows the Weibull distribution and the second component follows the Gamma distribution."} x1 <- rweibull(n, shape = 1, scale = 2) x2 <- rgamma(n, shape = 2, scale = 1) data3 <- cbind(x1, x2) plot(data3, pch = 16) ``` ```{r} cov(data3) CovMcd(data3, alpha = 0.5)@cov ``` # Hint for Homework 4 - Chapter 10.9 of the book "Introduction to Mathematical Statistics" by Hogg et al. has good discussion about influence function and breakdown point.