--- title: "Multivariate Statistical Analysis - Exercise Session 7" date: "`r format(Sys.time(), '%d.%m.%Y')`" output: bookdown::pdf_document2: extra_dependencies: ["float"] latex_engine: xelatex number_sections: no toc: no header-includes: - \usepackage{amsmath} - \usepackage{siunitx} --- ```{r, echo=FALSE} # Figure placement knitr::opts_chunk$set(fig.pos = "H", out.extra = "") ``` - Book "Correspondence analysis in practice" by Michael Greenacre gives an excellent introduction to correspondence analysis/multiple correspondence analysis. Book focuses on interpretations and provides many examples. Book also includes an appendix about the package \texttt{ca}. Of course, it is not necessary to read the book for the course but it can be a good reference if, for example, you decide to use CA/MCA in your project. # Problem 1: Multiple correspondence analysis Read the data and import package \texttt{ca}. ```{r} library(ca) tea <- read.table("TEA.txt", header = TRUE, sep = "\t") dim(tea) head(tea) ``` Now we perform *multiple correspondence analysis* (MCA) for the data set \texttt{tea} with the function \texttt{mjca} from the package \texttt{ca}. There are multiple almost equivalent ways to define MCA. One way to define MCA is that it is CA performed for *complete disjunctive table* (*indicator matrix*). We can perform this version of MCA by setting \texttt{lambda = "indicator"}. Argument \texttt{reti} controls whether the complete disjunctive table is returned. ```{r} tea_mca <- mjca(tea, lambda = "indicator", reti = TRUE) names(tea_mca) ``` Explanations for most of the returned objects are already explained on file \texttt{6session.pdf}. Objects \texttt{Burt}, \texttt{Burt.upd}, \texttt{subinertia} and \texttt{JCA.iter} are related to other definitions of MCA and are not relevant here. By default \texttt{summary(tea\_mca)} only gives summary for columns. By setting \texttt{rows = TRUE} one can see also summary for rows. More info can be found from the help pages \texttt{?summary.mjca}. ```{r} s <- summary(tea_mca) s ``` Figure \@ref(fig:scree) shows that only `r s$scree[2,4]`\% of variation is explained by the first two components. Nevertheless, we proceed to analyze the first two components. ```{r scree, fig.cap="Scree plot."} barplot(s$scree[, 3], ylim = c(0, 20), names.arg = paste("PC", 1:11), las = 2, xlab = "Component", ylab = "% of variation explained", col = "skyblue") ``` By default \texttt{plot.mjca} plots only column scores. By modifying argument \texttt{what} one can specify whether row/column scores are plotted. Again, for more information see help pages \texttt{?plot.mjca}. ```{r col, fig.cap="First two column principal coordinates."} plot(tea_mca, arrows = c(TRUE, TRUE)) ``` From relation $$ d_{p_1l_1, p_2l_2} \approx 1 + \sum_{h=1}^2 \psi_{h,p_1l_1}\psi_{h,p_2l_2} $$ we get interpretation for Figure \@ref(fig:col): - angle between modalities less than 90 degrees = attraction, - angle between modalities more than 90 degrees = repulsion and - angle between modalities 90 degrees = independent. Remember that interpretations are only valid when modalities are represented well in two dimensions. Thus we could modify Figure \@ref(fig:col) in such a way that point size represents quality of representation of corresponding modality. ```{r qlt, fig.cap="First two column principal coordinates. 
```{r qlt, fig.cap="First two column principal coordinates. Point sizes are scaled according to quality of representation."}
# Function for scaling values from 0 to 1 (this is for visualization purposes):
normalize <- function(x) {
  (x - min(x)) / (max(x) - min(x))
}
# Generate the scatter plot. Point size is now scaled according to qlt:
qlt <- s$columns[, 3]
tea_covariates <- tea_mca$colpcoord[, 1:2]
plot(tea_covariates, xlim = c(-2.5, 1), ylim = c(-1.5, 2.5), pch = 21, bg = "red",
     cex = normalize(qlt) + 1,
     xlab = paste0("Dimension 1", " (", s$scree[1, 3], "%", ")"),
     ylab = paste0("Dimension 2", " (", s$scree[2, 3], "%", ")"))
# Add arrows. Slight transparency is added to increase visibility.
arrows(rep(0, 17), rep(0, 17), tea_covariates[, 1], tea_covariates[, 2],
       length = 0, col = rgb(1, 0, 0, 0.25))
# "Cross-hair" is added, i.e., dotted lines crossing the x and y axes at 0.
abline(h = 0, v = 0, lty = 3)
# Add variable:category names to the plot.
text(tea_covariates, tea_mca$levelnames, pos = 2, cex = 0.75)
```

Figure \@ref(fig:ca) illustrates that MCA is just CA performed on the complete disjunctive table. That is, Figures \@ref(fig:col) and \@ref(fig:ca) are identical.

```{r ca, fig.cap="Correspondence analysis performed for the complete disjunctive table."}
plot(ca(tea_mca$indmat), arrows = c(FALSE, TRUE), what = c("none", "all"))
```

Rows can be analyzed similarly to columns. From the relation
$$
d_{i_1, i_2} \approx 1 + \sum_{h=1}^2 \phi_{h, i_1}\phi_{h, i_2}
$$
we get an interpretation for Figure \@ref(fig:row):

- angle between individuals less than 90 degrees = similar profiles and
- angle between individuals more than 90 degrees = profiles differ.

For the sake of clarity, observation labels are dropped from Figure \@ref(fig:row) and points are plotted instead of arrows.

```{r row, fig.cap="First two row principal coordinates."}
plot(tea_mca, arrows = c(FALSE, FALSE), what = c("all", "none"), labels = c(0, 0))
```

Lastly, the relation
$$
d_{i, pl} \approx 1 + \sum_{h=1}^2 \hat\phi_{h, i}\psi_{h, pl}
$$
gives an interpretation for Figure \@ref(fig:colrow). Notice that, since the columns are in principal coordinates, we can also interpret the angles between columns in Figure \@ref(fig:colrow) as in Figure \@ref(fig:col).

```{r colrow, fig.cap="Columns in principal coordinates and rows in standard coordinates."}
plot(tea_mca, arrows = c(FALSE, TRUE), what = c("all", "all"),
     map = "colprincipal", labels = c(0, 2))
```

# Problem 2: The trace of matrix $V$

First, let us review some notation:
$$
\begin{split}
&n = \textrm{sample size}, \quad K = \textrm{total number of modalities}, \quad K_p = \textrm{number of modalities of $p$th variable}, \\
&P = \textrm{number of qualitative variables}, \quad n_{pl} = \textrm{number of individuals having modality $l$ of variable $Y_p$}, \\
&x_{ipl} =
\begin{cases}
1, \ \textrm{if individual $i$ has modality $l$ of variable $Y_p$} \\
0, \ \textrm{otherwise} \\
\end{cases}.
\end{split}
$$
Notice that we have
$$
\begin{split}
&\sum_{p=1}^{P} \sum_{l=1}^{K_p} x_{ipl} = P, \quad \sum_{p=1}^{P} \sum_{l=1}^{K_p} n_{pl} = nP \quad\textrm{and} \\
&\sum_{i=1}^{n} x_{ipl} = n_{pl}.
\end{split}
$$
The above relations will be useful in the proof; they can also be checked numerically, as shown in the sketch below.
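The following sketch (an illustrative addition, not part of the original derivation) checks the three identities on the tea data, using the complete disjunctive table returned by \texttt{mjca} above.

```{r}
# Sketch: numerical check of the identities on the complete disjunctive table.
X <- as.matrix(tea_mca$indmat)
n <- nrow(tea)
P <- ncol(tea)
all(rowSums(X) == P)      # every individual has exactly one modality per variable
sum(colSums(X)) == n * P  # the modality counts n_{pl} sum to nP
head(colSums(X))          # colSums(X) gives the counts n_{pl}
```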
Remember also that the matrix $T \in\mathbb{R}^{n \times K}$ is defined as
$$
T =
\begin{pmatrix}
t_{1,1} & \cdots & t_{1,K} \\
\vdots & \ddots & \vdots \\
t_{n,1} & \cdots & t_{n,K}
\end{pmatrix},
\quad\textrm{where} \ t_{i,pl} = \frac{x_{ipl} - n_{pl}/n}{\sqrt{Pn_{pl}}}.
$$
We have that $V = T^TT$ and
$$
T^T =
\begin{pmatrix}
t_{1,1} & \cdots & t_{n,1} \\
\vdots & \ddots & \vdots \\
t_{1,K} & \cdots & t_{n,K}
\end{pmatrix}.
$$
Thus
$$
\textrm{diag}(V) = \textrm{diag}(T^TT) =
\begin{pmatrix}
t_{1,1}^2 + t_{2,1}^2 + \dots + t_{n,1}^2 \\
t_{1,2}^2 + t_{2,2}^2 + \dots + t_{n,2}^2 \\
\vdots \\
t_{1,K}^2 + t_{2,K}^2 + \dots + t_{n,K}^2
\end{pmatrix}.
$$
Then,
$$
\begin{split}
&\textrm{Trace}(V) = \sum_{m=1}^{K}\sum_{i=1}^{n} t_{i,m}^2 = \sum_{i=1}^{n}\sum_{m=1}^{K} t_{i,m}^2 = \sum_{i=1}^{n}\sum_{p=1}^{P}\sum_{l=1}^{K_p} t_{i,pl}^2 \\
&= \sum_{i=1}^{n}\sum_{p=1}^{P}\sum_{l=1}^{K_p} \left(\frac{x_{ipl} - n_{pl}/n}{\sqrt{Pn_{pl}}}\right)^2 = \sum_{i=1}^{n}\sum_{p=1}^{P}\sum_{l=1}^{K_p} \left(\frac{x_{ipl}^2 - 2x_{ipl}\frac{n_{pl}}{n} + \frac{n_{pl}^2}{n^2}}{Pn_{pl}}\right) \\
&= \frac{1}{P} \sum_{i=1}^{n}\sum_{p=1}^{P}\sum_{l=1}^{K_p} \left(\frac{x_{ipl}^2}{n_{pl}} - 2\frac{x_{ipl}}{n} + \frac{n_{pl}}{n^2} \right).
\end{split}
$$
Next, consider the terms of the sum separately. For the second term, we have
$$
\frac{1}{P} \sum_{i=1}^{n}\sum_{p=1}^{P}\sum_{l=1}^{K_p} \left(-2\frac{x_{ipl}}{n}\right) = \frac{-2}{Pn} \sum_{i=1}^{n}\sum_{p=1}^{P}\sum_{l=1}^{K_p} x_{ipl} = \frac{-2}{Pn} nP = -2.
$$
Likewise, for the third term we have
$$
\frac{1}{P} \sum_{i=1}^{n}\sum_{p=1}^{P}\sum_{l=1}^{K_p} \frac{n_{pl}}{n^2} = \frac{1}{Pn^2} \sum_{i=1}^{n}\sum_{p=1}^{P}\sum_{l=1}^{K_p} n_{pl} = \frac{1}{Pn^2}\sum_{i=1}^{n} nP = 1.
$$
The first term is the most difficult one. Note that $x_{ipl} = x_{ipl}^2$, since $x_{ipl}\in \{0,1\}$. By expanding the sums we get
$$
\begin{split}
&\frac{1}{P} \sum_{i=1}^{n}\sum_{p=1}^{P}\sum_{l=1}^{K_p} \frac{x_{ipl}}{n_{pl}} = \frac{1}{P} \sum_{i=1}^{n}\sum_{p=1}^{P} \left( \frac{x_{ip1}}{n_{p1}} + \frac{x_{ip2}}{n_{p2}} + \dots + \frac{x_{ipK_p}}{n_{pK_p}}\right) \\
&= \frac{1}{P} \sum_{i=1}^{n} \left( \frac{x_{i11}}{n_{11}} + \frac{x_{i12}}{n_{12}} + \dots + \frac{x_{i1K_1}}{n_{1K_1}} + \frac{x_{i21}}{n_{21}} + \dots + \frac{x_{iPK_P}}{n_{PK_P}} \right) \\
&= \frac{1}{P} \left( \frac{1}{n_{11}}\sum_{i=1}^n x_{i11} + \frac{1}{n_{12}}\sum_{i=1}^n x_{i12} + \dots + \frac{1}{n_{PK_P}}\sum_{i=1}^n x_{iPK_P}\right) \\
&= \frac{1}{P}\left(\frac{n_{11}}{n_{11}} + \frac{n_{12}}{n_{12}} + \dots + \frac{n_{PK_P}}{n_{PK_P}}\right) = \frac{K}{P}.
\end{split}
$$
By combining all the terms we get
$$
\mathrm{Trace}(V) = \frac{K}{P} - 2 + 1 = \frac{K}{P} - 1.
$$
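As a final sanity check (an illustrative sketch, not part of the original exercise), the result can be verified numerically on the tea data by constructing $T$ from the complete disjunctive table returned by \texttt{mjca} and comparing $\textrm{Trace}(V)$ with $K/P - 1$.

```{r}
# Sketch: numerical check of Trace(V) = K/P - 1 on the tea data.
X <- as.matrix(tea_mca$indmat)  # complete disjunctive table (n x K)
K <- ncol(X)                    # total number of modalities
P <- ncol(tea)                  # number of qualitative variables
n <- nrow(X)                    # sample size
npl <- colSums(X)               # modality counts n_{pl}
# t_{i,pl} = (x_{ipl} - n_{pl}/n) / sqrt(P * n_{pl}), built column by column:
T_mat <- sweep(X, 2, npl / n, "-")
T_mat <- sweep(T_mat, 2, sqrt(P * npl), "/")
V <- crossprod(T_mat)           # V = T^T T
c(trace = sum(diag(V)), `K/P - 1` = K / P - 1)
```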