--- title: "Multivariate Statistical Analysis - Exercise Session 2" date: "`r format(Sys.time(), '%d.%m.%Y')`" output: bookdown::pdf_document2: latex_engine: xelatex number_sections: no toc: no header-includes: - \usepackage{amsmath} --- # Problem 1: Principal component analysis First we read the data. ```{r} decat <- read.table("DECATHLON.txt", header = TRUE, sep = "\t", row.names = 1) ``` Data includes results of 48 decathletes ("kymmenottelija" in Finnish). We remove variables \texttt{Points}, \texttt{Height} and \texttt{Weight} from the analysis. ```{r} decat <- decat[, -c(1, 12, 13)] head(decat) dim(decat) ``` ## a) Visualize original data and familiarize yourself with function \texttt{princomp} First we visualize the original data. One way to visualize the data is a pairwise scatter plot. It is hard to get any sense from Figure \@ref(fig:pairs). ```{r pairs, fig.cap="Pairwise scatter plots of all variables."} pairs(decat, gap = 0, upper.panel = NULL) ``` Figure \@ref(fig:scatter) shows just one of the scatter plots. In this particular scatter plot points are replaces with names of decathletes. ```{r scatter, fig.cap="Scatter plot of two specific variables."} plot(decat$R100m, decat$R400m, xlab = "Running 100m", ylab = "Running 400m", type = "n") text(decat$R100m, decat$R400m, labels = rownames(decat)) ``` Next we familiarize ourselves with \texttt{princomp} function. Good place to start is the help pages. ```{r} ?princomp ``` Now let us perform principal component analysis with decathlon data. We set argument \texttt{cor} as \texttt{FALSE}. This means that we perform PCA with covariance matrix. ```{r} decat_pca <- princomp(decat, cor = FALSE) names(decat_pca) ``` Function princomp returns a list of objects that are specified on the help pages. Below are explanations about objects that are returned: - Standard deviations of each principal component \texttt{sdev}. Remember that these are just the square roots of eigenvalues of the sample covariance matrix corresponding to original data set. Note that function \texttt{princomp} uses divisor $n$ instead of $n-1$ when calculating the sample covariance. ```{r} n <- nrow(decat) decat_pca$sdev sqrt(eigen((n - 1) / n * cov(decat))$values) ``` - Loadings, that is, eigenvector matrix $G$ corresponding sample covariance matrix. Notice that this is not a matrix object but a special object of class loadings. Values very close to zero are showed as empty. ```{r} load <- decat_pca$loadings class(load) load # One can access elements of loadings object similarly to a matrix load[1, 1] ``` - Mean vector corresponding to original data set. ```{r} decat_pca$center ``` - Scales are relevant when \texttt{cor = TRUE}. ```{r} decat_pca$scale ``` - Number of observations ```{r} decat_pca$n.obs ``` - Scores, that is, transformed variables given by $$ y_i = G^T(x_i - \bar x), \quad i = 1, 2, \ldots, 10, $$ where $G$ is the matrix of eigenvectors. ```{r} score <- decat_pca$scores head(score) dim(score) ``` - Lastly, \texttt{call} just gives the functions call. ```{r} decat_pca$call ``` ## b) How much variation is explained by $k$ principal components? This can be seen straight away with the \texttt{summary} function. See the "Cumulative proportion" row in the summary ```{r} summary(decat_pca) ``` One can also calculate this manually ```{r} vars <- decat_pca$sdev^2 var_prop <- vars / sum(vars) var_prop_cum <- cumsum(var_prop) var_prop # 2nd row of summary var_prop_cum # 3rd row of summary ``` Figure \@ref(fig:scree) shows so called scree plot. 
## b) How much variation is explained by $k$ principal components?

This can be seen straight away with the \texttt{summary} function; see the "Cumulative Proportion" row in the output.

```{r}
summary(decat_pca)
```

One can also calculate this manually.

```{r}
vars <- decat_pca$sdev^2
var_prop <- vars / sum(vars)
var_prop_cum <- cumsum(var_prop)
var_prop # 2nd row of summary
var_prop_cum # 3rd row of summary
```

Figure \@ref(fig:scree) shows a so-called scree plot. The scree plot can be used as a tool for choosing a sufficient number of components.

```{r scree, fig.cap="Scree plot."}
plot(decat_pca, las = 2, main = NULL)
```

Figure \@ref(fig:cumprop) can also be useful for choosing how many principal components to use.

```{r cumprop, fig.cap="Cumulative proportion of variance explained by $k$ principal components."}
plot(var_prop_cum, type = "b", pch = 21, lty = 3, bg = "skyblue", cex = 1.5,
     ylim = c(0, 1), xlab = "Principal component",
     ylab = "Cumulative proportion of variance explained",
     xaxt = "n", yaxt = "n")
axis(1, at = 1:10)
axis(2, at = 0:10 / 10, las = 2)
```

There are many more or less heuristic methods for choosing the number of principal components, for example,

- include enough components to explain $x\%$ of the total variation, where $x\%$ can be chosen to be, e.g., $90\%$,
- Kaiser criterion: exclude those principal components whose eigenvalues are less than the average eigenvalue,
- the elbow method,

among many others. The first two criteria are applied in the sketch below.
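The following chunk (our own addition, not part of the original exercise) applies the $90\%$ rule and the Kaiser criterion using the quantities \texttt{vars} and \texttt{var\_prop\_cum} computed above.

```{r}
# 90% rule: smallest number of components explaining at least 90% of variation
which(var_prop_cum >= 0.9)[1]
# Kaiser criterion: keep components whose eigenvalue (variance) exceeds the mean
which(vars > mean(vars))
```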
## c) Interpreting principal components

```{r biplot, fig.cap="Biplot of scores and loadings."}
pc12 <- score[, 1:2]
load12 <- load[, 1:2]
pc_axis <- c(-max(abs(pc12)), max(abs(pc12)))
ld_axis <- c(-0.8, 0.8)
plot(pc12, xlim = pc_axis, ylim = pc_axis, pch = 21, bg = 8, cex = 1.25,
     xlab = paste0("PC 1 (", round(100 * var_prop[1], 2), "%)"),
     ylab = paste0("PC 2 (", round(100 * var_prop[2], 2), "%)"))
par(new = TRUE)
plot(load12, axes = FALSE, type = "n", xlab = "", ylab = "",
     xlim = ld_axis, ylim = ld_axis)
axis(3, col = 2)
axis(4, col = 2)
arrows(0, 0, load12[, 1], load12[, 2], length = 0.1, col = 2)
text(load12[, 1], load12[, 2], rownames(load12), pos = 3)
abline(h = 0, lty = 3)
abline(v = 0, lty = 3)
```

You can try the command below by yourself. It gives results similar to the plot above.

```{r eval=FALSE}
biplot(decat_pca)
```

Loadings can be used to interpret the principal components. More precisely, we want to determine which variables contribute to which components. Note that expert knowledge is often needed for interpreting principal components, and the interpretations are subjective. Based on Figure \@ref(fig:biplot), it seems that the variables \texttt{Discus\_throw} and \texttt{Shot\_put} have significant positive contributions to the first principal component. On the other hand, \texttt{R1500m} has a significant negative contribution to the first component. Thus the first principal component indicates that decathletes who are good at running long distances are very different from decathletes who are good at the discus throw and shot put. Consequently, we could interpret the first principal component as strength or bulkiness.

## d) Calculate sample mean and covariance matrix from the score matrix

The sample mean vector of the scores is the zero vector.

```{r}
colMeans(score)
```

The transformed variables $y_i$ are uncorrelated. Thus the sample covariance matrix is a diagonal matrix.

```{r}
cov(score)
```

The diagonal elements of the covariance matrix are just the variances of the principal components. Note again the divisor $n$ used by \texttt{princomp}.

```{r}
(n - 1) / n * diag(cov(score))
decat_pca$sdev^2
```

# Problem 2: Eigendecomposition of a symmetric matrix

::: {.proposition #eigen}
Let $A$ be a symmetric matrix with distinct eigenvalues. Show that the eigenvector matrix of $A$ is orthogonal.
:::

::: {.proof name="Proof of Proposition \@ref(prp:eigen)"}
Let $\lambda_i$ be the $i$th eigenvalue and $v_i$ the corresponding eigenvector of a symmetric $p \times p$ matrix $A$. The goal is to show that $v_i^T v_j = 0$, $i \neq j$. We can order the eigenvalues such that $\lambda_1 > \lambda_2 > \ldots > \lambda_p$. The eigenvalues and eigenvectors satisfy
$$
\begin{cases}
A v_i = \lambda_i v_i \\
A v_j = \lambda_j v_j.
\end{cases}
$$
First, we multiply the first equation by $v_j^T$ from the left and use the symmetry of $A$, that is, $A = A^T$:
$$
\begin{split}
&v_j^T A v_i = \lambda_i v_j^T v_i \\
&v_j^T A^T v_i = \lambda_i v_j^T v_i \\
&(A v_j)^T v_i = \lambda_i v_j^T v_i \\
&\lambda_j v_j^T v_i = \lambda_i v_j^T v_i \\
\Rightarrow \ &(\lambda_j - \lambda_i) v_j^T v_i = 0.
\end{split}
$$
Since $\lambda_i \neq \lambda_j$, the vectors $v_j$ and $v_i$ have to be orthogonal: $v_j^T v_i = 0$, $i \neq j$. Hence, the eigenvector matrix $V$ of $A$ satisfies $V V^T = I$, provided we choose the eigenvectors of $A$ to have length one.
:::
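The proposition is also easy to check numerically. The following sketch (our own addition) constructs a random symmetric matrix, extracts its eigenvectors with \texttt{eigen}, and verifies that $V^T V$ equals the identity matrix; the eigenvalues of such a random matrix are distinct with probability one.

```{r}
set.seed(123)
# B + t(B) is always symmetric
B <- matrix(rnorm(25), nrow = 5)
A <- B + t(B)
# eigen() returns eigenvectors of unit length
V <- eigen(A)$vectors
# V^T V should equal the identity matrix up to numerical precision
round(t(V) %*% V, digits = 10)
```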