---
title: "Multivariate Statistical Analysis - Exercise Session 3"
date: "`r format(Sys.time(), '%d.%m.%Y')`"
output:
  bookdown::pdf_document2:
    latex_engine: xelatex
    number_sections: no
    toc: no
header-includes:
- \usepackage{amsmath}
---

# Problem 1: Principal component analysis

First we read the data.

```{r}
decat <- read.table("DECATHLON.txt", header = TRUE, sep = "\t",
                    row.names = 1)
```

The data contains the results of 48 decathletes ("kymmenottelija" in Finnish). We remove the variables \texttt{Points}, \texttt{Height} and \texttt{Weight} from the analysis.

```{r}
decat <- decat[, -c(1, 12, 13)]
head(decat)
dim(decat)
```

Next, let us visualize the correlation matrix. A correlation matrix can be visualized, for example, with a heat map. Figure \@ref(fig:corrgram1) shows one way of making a heat map of the correlation matrix.

(ref:captionCorrgram1) Heat map of the correlation matrix with the \texttt{corrgram} package using colored panels.

```{r corrgram1, fig.cap="(ref:captionCorrgram1)"}
# install.packages("corrgram") # Install the corrgram package if needed
library(corrgram)
corrgram(decat, upper.panel = NULL)
```

The \texttt{corrgram} package offers several options for visualizing a correlation matrix. In Figure \@ref(fig:corrgram2) we use pie charts instead of colored panels.

(ref:captionCorrgram2) Heat map of the correlation matrix with the \texttt{corrgram} package using pie charts.

```{r corrgram2, fig.cap="(ref:captionCorrgram2)"}
colors <- c("blue4", "blue3", "blue2", "blue1", "blue",
            "red", "red1", "red2", "red3", "red4")
corrgram(decat, lower.panel = panel.pie, diag.panel = panel.minmax,
         upper.panel = NULL, col.regions = colorRampPalette(colors))
```

One can also use base R for plotting a heat map, as in Figure \@ref(fig:base).

```{r base, fig.cap="Heat map of the correlation matrix with base R."}
heatmap(cor(decat), Rowv = NA, Colv = NA, symm = TRUE,
        col = colorRampPalette(c("red", "white", "blue"))(50))
```

## a) How much variation is explained by $k$ principal components?

First, we perform PCA using the correlation matrix.

```{r}
decat_pca <- princomp(decat, cor = TRUE)
```

The summary shows how much of the variation is explained by the first $k$ principal components.

```{r}
summary(decat_pca)
```
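As a quick sanity check, note that performing PCA on the correlation matrix amounts to the eigendecomposition of that matrix. The following minimal sketch (base R only) verifies that the squared standard deviations returned by \texttt{princomp} equal the eigenvalues of \texttt{cor(decat)}.

```{r}
# The squared standard deviations of the principal components are the
# eigenvalues of the sample correlation matrix.
eig_values <- eigen(cor(decat))$values
all.equal(unname(decat_pca$sdev^2), eig_values)
```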
Figure \@ref(fig:cumprop) visualizes how much of the variation is explained by the first $k$ principal components.

```{r cumprop, fig.cap="Cumulative proportion of variance explained by $k$ principal components."}
vars <- decat_pca$sdev^2
var_prop <- vars / sum(vars)
var_prop_cum <- cumsum(var_prop)
plot(var_prop_cum, type = "b", pch = 21, lty = 3, bg = "skyblue",
     cex = 1.5, ylim = c(0, 1), xlab = "Principal component",
     ylab = "Cumulative proportion of variance explained",
     xaxt = "n", yaxt = "n")
axis(1, at = 1:10)
axis(2, at = 0:10 / 10, las = 2)
```

## b) Interpreting principal components

Figure \@ref(fig:biplot) shows the scores and the loadings of the first two principal components in a single biplot.

```{r biplot, fig.cap="Biplot of scores and loadings."}
pc12 <- decat_pca$scores[, 1:2]
load12 <- decat_pca$loadings[, 1:2]
pc_axis <- c(-max(abs(pc12)), max(abs(pc12)))
ld_axis <- c(-0.8, 0.8)
plot(pc12, xlim = pc_axis, ylim = pc_axis, pch = 21, bg = 8, cex = 1.25,
     xlab = paste0("PC 1 (", round(100 * var_prop[1], 2), "%)"),
     ylab = paste0("PC 2 (", round(100 * var_prop[2], 2), "%)"))
par(new = TRUE)
plot(load12, axes = FALSE, type = "n", xlab = "", ylab = "",
     xlim = ld_axis, ylim = ld_axis)
axis(3, col = 2)
axis(4, col = 2)
arrows(0, 0, load12[, 1], load12[, 2], length = 0.1, col = 2)
text(load12[, 1], load12[, 2], rownames(load12), pos = 3)
abline(h = 0, lty = 3)
abline(v = 0, lty = 3)
```

By looking at the loadings, one can interpret the principal components.

```{r}
round(decat_pca$loadings[, 1:4], 2)
```

For example, a possible interpretation of the first component is strength. The interpretations of the first four principal components are the same as in Session 2.

## c) PCA and outliers

Now, let's add an outlier to the original data.

```{r}
# The first s variables get the value 1200, the remaining ones 18000.
s <- 9
decat[49, ] <- c(rep(1200, s), rep(18000, 10 - s))
rownames(decat)[49] <- "outlier"
```

Figure \@ref(fig:pairs) visualizes the contaminated data.

```{r pairs, fig.cap="Pairwise scatter plots of the contaminated data."}
pairs(decat, gap = 0, upper.panel = NULL)
```

Now let's perform PCA on the contaminated data.

```{r}
decat_pca <- princomp(decat, cor = TRUE)
vars <- decat_pca$sdev^2
var_prop <- vars / sum(vars)
```

Figure \@ref(fig:biplotOutlier) shows that the first principal component detects the outlier best. It also illustrates that PCA is quite a nonrobust method; that is, outliers have a significant effect on the results of PCA.

```{r biplotOutlier, fig.cap="Biplot of scores and loadings for the contaminated data with respect to the 1st and 2nd components. The score of the outlier is colored black."}
pc12 <- decat_pca$scores[, 1:2]
load12 <- decat_pca$loadings[, 1:2]
pc_axis <- c(-max(abs(pc12)), max(abs(pc12)))
ld_axis <- c(-0.8, 0.8)
plot(pc12, xlim = pc_axis, ylim = pc_axis, pch = 21,
     bg = c(rep(8, 48), 1), cex = 1.25,
     xlab = paste0("PC 1 (", round(100 * var_prop[1], 2), "%)"),
     ylab = paste0("PC 2 (", round(100 * var_prop[2], 2), "%)"))
par(new = TRUE)
plot(load12, axes = FALSE, type = "n", xlab = "", ylab = "",
     xlim = ld_axis, ylim = ld_axis)
axis(3, col = 2)
axis(4, col = 2)
arrows(0, 0, load12[, 1], load12[, 2], length = 0.1, col = 2)
text(load12[, 1], load12[, 2], rownames(load12), pos = 3)
abline(h = 0, lty = 3)
abline(v = 0, lty = 3)
```
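Besides visual inspection, the outlier can also be flagged programmatically from the scores. The sketch below simply picks the observation with the largest absolute score on the first principal component; given how extreme the contaminated row is, this should return the row named \texttt{outlier}.

```{r}
# Observation with the largest absolute score on the 1st component.
pc1 <- decat_pca$scores[, 1]
names(which.max(abs(pc1)))
```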
Figure \@ref(fig:biplotOutlier2) shows that the outlier is not detected as well by the 2nd and 3rd principal components.

```{r biplotOutlier2, fig.cap="Biplot of scores and loadings for the contaminated data with respect to the 2nd and 3rd components. The score of the outlier is colored black."}
pc23 <- decat_pca$scores[, 2:3]
load23 <- decat_pca$loadings[, 2:3]
pc_axis <- c(-max(abs(pc23)), max(abs(pc23)))
ld_axis <- c(-0.8, 0.8)
plot(pc23, xlim = pc_axis, ylim = pc_axis, pch = 21,
     bg = c(rep(8, 48), 1), cex = 1.25,
     xlab = paste0("PC 2 (", round(100 * var_prop[2], 2), "%)"),
     ylab = paste0("PC 3 (", round(100 * var_prop[3], 2), "%)"))
text(pc23["outlier", 1], pc23["outlier", 2], labels = "outlier", pos = 2)
par(new = TRUE)
plot(load23, axes = FALSE, type = "n", xlab = "", ylab = "",
     xlim = ld_axis, ylim = ld_axis)
axis(3, col = 2)
axis(4, col = 2)
arrows(0, 0, load23[, 1], load23[, 2], length = 0.1, col = 2)
text(load23[, 1], load23[, 2], rownames(load23), pos = 3)
abline(h = 0, lty = 3)
abline(v = 0, lty = 3)
```

# Problem 2: Affine equivariance

## a)

In the following, let $X$ denote an $n \times p$ data matrix of $n$ i.i.d. $p$-variate observations $x_1, x_2, \ldots, x_n$ from some continuous distribution with a finite covariance matrix $\Sigma$. Furthermore, consider the transformation
$$
y_i = Ax_i + b,
$$
where $A$ is a nonsingular $p \times p$ matrix and $b$ is a $p$-variate location vector.

::: {.proposition #mean}
The sample mean $T(\cdot)$ is affine equivariant. In other words, if you transform your data $X \to Y$ such that
$$
y_i = Ax_i + b,
$$
then
$$
T(Y) = AT(X) + b,
$$
for all nonsingular $p \times p$ matrices $A$ and for all $p$-vectors $b$.
:::

::: {.proof name="Proof of Proposition \@ref(prp:mean)"}
Remember that
$$
T(X) = \frac{1}{n} \sum_{i=1}^n x_i.
$$
Then
$$
T(Y) = \frac{1}{n} \sum_{i=1}^n (Ax_i + b) = A\left(\frac{1}{n}\sum_{i=1}^n x_i\right) + \frac{1}{n} n b = AT(X) + b.
$$
:::

## b)

::: {.proposition #cov}
The sample covariance matrix $S(\cdot)$ is affine equivariant. In other words, if you transform your data $X \to Y$ such that
$$
y_i = Ax_i + b,
$$
then
$$
S(Y) = AS(X)A^T,
$$
for all nonsingular $p \times p$ matrices $A$ and for all $p$-vectors $b$.
:::

::: {.proof name="Proof of Proposition \@ref(prp:cov)"}
Remember that
$$
S(X) = \frac{1}{n-1} \sum_{i=1}^n \left(x_i - T(X)\right)\left(x_i - T(X)\right)^T.
$$
Then, since $T(Y) = AT(X) + b$ by Proposition \@ref(prp:mean),
$$
\begin{split}
S(Y) &= \frac{1}{n-1} \sum_{i=1}^n \left(Ax_i + b - T(Y)\right)\left(Ax_i + b - T(Y)\right)^T \\
&= \frac{1}{n-1} \sum_{i=1}^n \left(Ax_i + b - (AT(X) + b)\right)\left(Ax_i + b - (AT(X) + b)\right)^T \\
&= \frac{1}{n-1} \sum_{i=1}^n \left(A(x_i - T(X))\right)\left(A(x_i - T(X))\right)^T \\
&= A \left(\frac{1}{n-1} \sum_{i=1}^n \left(x_i - T(X)\right)\left(x_i - T(X)\right)^T \right) A^T \\
&= AS(X)A^T.
\end{split}
$$
:::
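Both propositions can also be verified numerically. The following minimal sketch uses simulated data; the matrix $A$ (nonsingular) and the vector $b$ below are arbitrary illustrative choices.

```{r}
set.seed(1)
n <- 100
p <- 3
X <- matrix(rnorm(n * p), n, p)
A <- matrix(c(2, 0, 1, 1, 3, 0, 0, 1, 1), p, p) # nonsingular 3 x 3 matrix
b <- c(1, -2, 0.5)
# Row-wise transformation y_i = A x_i + b.
Y <- t(A %*% t(X) + b)
# Affine equivariance of the sample mean: T(Y) = A T(X) + b.
all.equal(colMeans(Y), c(A %*% colMeans(X) + b))
# Affine equivariance of the sample covariance: S(Y) = A S(X) A^T.
all.equal(cov(Y), A %*% cov(X) %*% t(A))
```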