---
title: "Multivariate Statistical Analysis - Exercise Session 7"
date: "`r format(Sys.time(), '%d.%m.%Y')`"
output:
  bookdown::pdf_document2:
    extra_dependencies: ["float"]
    latex_engine: xelatex
    number_sections: no
    toc: no
header-includes:
  - \usepackage{amsmath}
  - \usepackage{siunitx}
---
```{r, echo=FALSE}
# Figure placement
knitr::opts_chunk$set(fig.pos = "H", out.extra = "")
```
- The book "Correspondence analysis in practice" by Michael Greenacre gives an excellent introduction to correspondence analysis and multiple correspondence analysis. The book focuses on interpretations and provides many examples, and it also includes an appendix about the package \texttt{ca}. Of course, it is not necessary to read the book for the course, but it can be a good reference if, for example, you decide to use CA/MCA in your project.
# Problem 1: Multiple correspondence analysis
Read the data and load the package \texttt{ca}.
```{r}
library(ca)
tea <- read.table("TEA.txt", header = TRUE, sep = "\t")
dim(tea)
head(tea)
```
Now we perform *multiple correspondence analysis* (MCA) for the data set \texttt{tea} with the function \texttt{mjca} from the package \texttt{ca}. There are several almost equivalent ways to define MCA. One way is to define MCA as CA performed on the *complete disjunctive table* (*indicator matrix*). We can perform this version of MCA by setting \texttt{lambda = "indicator"}. The argument \texttt{reti} controls whether the complete disjunctive table is returned.
```{r}
tea_mca <- mjca(tea, lambda = "indicator", reti = TRUE)
names(tea_mca)
```
Most of the returned objects are already explained in the file \texttt{6session.pdf}. The objects \texttt{Burt}, \texttt{Burt.upd}, \texttt{subinertia} and \texttt{JCA.iter} are related to other definitions of MCA and are not relevant here.
By default, \texttt{summary(tea\_mca)} gives a summary only for the columns. By setting \texttt{rows = TRUE}, one can also see a summary for the rows. More information can be found on the help page \texttt{?summary.mjca}.
```{r}
s <- summary(tea_mca)
s
```
Figure \@ref(fig:scree) shows that only `r s$scree[2,4]`\% of the variation is explained by the first two components. Nevertheless, we proceed to analyze the first two components.
```{r scree, fig.cap="Scree plot."}
barplot(s$scree[, 3], ylim = c(0, 20), names.arg = paste("PC", 1:11), las = 2,
        xlab = "Component", ylab = "% of variation explained", col = "skyblue")
```
By default, \texttt{plot.mjca} plots only the column scores. By modifying the argument \texttt{what}, one can specify whether row and/or column scores are plotted. Again, for more information see the help page \texttt{?plot.mjca}.
```{r col, fig.cap="First two column principal coordinates."}
plot(tea_mca, arrows = c(TRUE, TRUE))
```
From the relation
$$
d_{p_1l_1, p_2l_2} \approx 1 + \sum_{h=1}^2 \psi_{h,p_1l_1}\psi_{h,p_2l_2}
$$
we get the following interpretations for Figure \@ref(fig:col):

- an angle of less than 90 degrees between modalities indicates attraction,
- an angle of more than 90 degrees between modalities indicates repulsion and
- an angle of exactly 90 degrees between modalities indicates independence.
Remember that these interpretations are valid only when the modalities are well represented in two dimensions. Thus we modify Figure \@ref(fig:col) in such a way that the point size represents the quality of representation of the corresponding modality.
```{r qlt, fig.cap="First two column principal coordinates. Point sizes are scaled according to quality of representation."}
# Function for scaling values from 0 to 1 (this is for visualization purposes):
normalize <- function(x) {
  (x - min(x)) / (max(x) - min(x))
}
# Generate the scatter plot. Point size is now scaled according to qlt:
qlt <- s$columns[, 3]
tea_covariates <- tea_mca$colpcoord[, 1:2]
plot(tea_covariates, xlim = c(-2.5, 1), ylim = c(-1.5, 2.5), pch = 21,
     bg = "red", cex = normalize(qlt) + 1,
     xlab = paste0("Dimension 1", " (", s$scree[1, 3], "%", ")"),
     ylab = paste0("Dimension 2", " (", s$scree[2, 3], "%", ")"))
# Add arrows. Slight transparency is added to increase visibility.
arrows(rep(0, 17), rep(0, 17), tea_covariates[, 1], tea_covariates[, 2],
       length = 0, col = rgb(1, 0, 0, 0.25))
# "Cross-hair" is added, i.e., dotted lines crossing x and y axis at 0.
abline(h = 0, v = 0, lty = 3)
# Add variable:category names to the plot.
text(tea_covariates, tea_mca$levelnames, pos = 2, cex = 0.75)
```
Figure \@ref(fig:ca) illustrates that MCA is just CA performed on the complete disjunctive table. That is, Figures \@ref(fig:col) and \@ref(fig:ca) are identical.
```{r ca, fig.cap="Correspondence analysis performed for complete disjunctive table."}
plot(ca(tea_mca$indmat), arrows = c(FALSE, TRUE), what = c("none", "all"))
```
Rows can be analyzed similarly to columns. From the relation
$$
d_{i_1, i_2} \approx 1 + \sum_{h=1}^2 \phi_{h, i_1}\phi_{h, i_2}
$$
we get the following interpretations for Figure \@ref(fig:row):

- an angle of less than 90 degrees between individuals indicates similar profiles and
- an angle of more than 90 degrees between individuals indicates differing profiles.
For the sake of clarity, the observation labels are dropped from Figure \@ref(fig:row) and points are used instead of arrows.
```{r row, fig.cap="First two row principal coordinates."}
plot(tea_mca, arrows = c(FALSE, FALSE), what = c("all", "none"),
     labels = c(0, 0))
```
Lastly, the relation
$$
d_{i, pl} \approx 1 + \sum_{h=1}^2 \hat\phi_{h, i}\psi_{h, pl}
$$
gives the interpretation for Figure \@ref(fig:colrow). Notice that, since the columns are in principal coordinates, we can also interpret the angles between columns in Figure \@ref(fig:colrow) as in Figure \@ref(fig:col).
```{r colrow, fig.cap="Columns in principal coordinates and rows in standard coordinates."}
plot(tea_mca, arrows = c(FALSE, TRUE), what = c("all", "all"),
     map = "colprincipal", labels = c(0, 2))
```
# Problem 2: The trace of matrix $V$
First, let us review some notation:
$$
\begin{split}
&n = \textrm{sample size},
\quad K = \textrm{total number of modalities},
\quad K_p = \textrm{number of modalities of $p$th variable}, \\
&P = \textrm{number of qualitative variables},
\quad n_{pl} = \textrm{number of individuals having modality $l$ of
variable $Y_p$}, \\
&x_{ipl} = \begin{cases}
1, \ \textrm{if individual $i$ has modality $l$ of variable $Y_p$} \\
0, \ \textrm{otherwise} \\
\end{cases}.
\end{split}
$$
Notice that we have
$$
\begin{split}
&\sum_{p=1}^{P} \sum_{l=1}^{K_p} x_{ipl} = P, \quad
\sum_{p=1}^{P} \sum_{l=1}^{K_p} n_{pl} = nP \quad\textrm{and} \\
&\sum_{i=1}^{n} x_{ipl} = n_{pl}.
\end{split}
$$
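These identities can be checked numerically on a small hypothetical complete disjunctive table ($P = 2$ variables with $2$ and $3$ modalities, $n = 4$ individuals; the data below is made up for illustration):

```{r}
# Hypothetical indicator data: columns 1-2 are the modalities of variable 1,
# columns 3-5 are the modalities of variable 2.
X <- rbind(c(1, 0, 1, 0, 0),
           c(0, 1, 0, 1, 0),
           c(1, 0, 0, 0, 1),
           c(0, 1, 1, 0, 0))
n <- nrow(X)
P <- 2
n_pl <- colSums(X) # modality counts n_pl
rowSums(X)         # each row sums to P = 2
sum(n_pl)          # equals n * P = 8
```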
The relations above will be useful in the proof. Remember also that the matrix $T \in\mathbb{R}^{n \times K}$ is defined as
$$
T = \begin{pmatrix}
t_{1,1} & \cdots & t_{1,K} \\
\vdots & \ddots & \vdots \\
t_{n,1} & \cdots & t_{n,K}
\end{pmatrix}, \quad\textrm{where} \
t_{i,pl} = \frac{x_{ipl} - n_{pl}/n}{\sqrt{Pn_{pl}}}.
$$
We have that $V = T^TT$ and
$$
T^T = \begin{pmatrix}
t_{1,1} & \cdots & t_{n,1} \\
\vdots & \ddots & \vdots \\
t_{1,K} & \cdots & t_{n,K}
\end{pmatrix}.
$$
Thus
$$
\textrm{diag}(V) = \textrm{diag}(T^TT) =
\begin{pmatrix}
t_{1,1}^2 + t_{2,1}^2 + \dots + t_{n,1}^2 \\
t_{1,2}^2 + t_{2,2}^2 + \dots + t_{n,2}^2 \\
\vdots \\
t_{1,K}^2 + t_{2,K}^2 + \dots + t_{n,K}^2
\end{pmatrix}.
$$
Then,
$$
\begin{split}
&\textrm{Trace}(V)
= \sum_{m=1}^{K}\sum_{i=1}^{n} t_{i,m}^2
= \sum_{i=1}^{n}\sum_{m=1}^{K} t_{i,m}^2
= \sum_{i=1}^{n}\sum_{p=1}^{P}\sum_{l=1}^{K_p} t_{i,pl}^2 \\
&= \sum_{i=1}^{n}\sum_{p=1}^{P}\sum_{l=1}^{K_p}
\left(\frac{x_{ipl} - n_{pl}/n}{\sqrt{Pn_{pl}}}\right)^2
= \sum_{i=1}^{n}\sum_{p=1}^{P}\sum_{l=1}^{K_p}
\left(\frac{x_{ipl}^2 - 2x_{ipl}\frac{n_{pl}}{n} +
\frac{n_{pl}^2}{n^2}}{Pn_{pl}}\right) \\
&= \frac{1}{P} \sum_{i=1}^{n}\sum_{p=1}^{P}\sum_{l=1}^{K_p}
\left(\frac{x_{ipl}^2}{n_{pl}} - 2\frac{x_{ipl}}{n}
+ \frac{n_{pl}}{n^2} \right).
\end{split}
$$
Now consider the terms of the sum separately. For the second term, we have
$$
\frac{1}{P} \sum_{i=1}^{n}\sum_{p=1}^{P}\sum_{l=1}^{K_p}
\left(-2\frac{x_{ipl}}{n}\right)
= \frac{-2}{Pn} \sum_{i=1}^{n}\sum_{p=1}^{P}\sum_{l=1}^{K_p} x_{ipl}
= \frac{-2}{Pn} nP = -2.
$$
Likewise, for the third term we have
$$
\frac{1}{P} \sum_{i=1}^{n}\sum_{p=1}^{P}\sum_{l=1}^{K_p} \frac{n_{pl}}{n^2}
= \frac{1}{Pn^2} \sum_{i=1}^{n}\sum_{p=1}^{P}\sum_{l=1}^{K_p} n_{pl}
= \frac{1}{Pn^2}\sum_{i=1}^{n} nP = 1.
$$
The first term is the most difficult one here. Note that $x_{ipl}^2 = x_{ipl}$, since $x_{ipl}\in \{0,1\}$. By expanding the sums we get
$$
\begin{split}
&\frac{1}{P} \sum_{i=1}^{n}\sum_{p=1}^{P}\sum_{l=1}^{K_p}
\frac{x_{ipl}}{n_{pl}}
= \frac{1}{P} \sum_{i=1}^{n}\sum_{p=1}^{P} \left(
\frac{x_{ip1}}{n_{p1}} + \frac{x_{ip2}}{n_{p2}} +
\dots + \frac{x_{ipK_p}}{n_{pK_p}}\right) \\
&= \frac{1}{P} \sum_{i=1}^{n} \left(
\frac{x_{i11}}{n_{11}} + \frac{x_{i12}}{n_{12}} +
\dots + \frac{x_{i1K_1}}{n_{1K_1}} + \frac{x_{i21}}{n_{21}} +
\dots + \frac{x_{iPK_P}}{n_{PK_P}} \right) \\
&= \frac{1}{P} \left( \frac{1}{n_{11}}\sum_{i=1}^n x_{i11}
+ \frac{1}{n_{12}}\sum_{i=1}^n x_{i12} + \dots +
\frac{1}{n_{PK_P}}\sum_{i=1}^n x_{iPK_P}\right) \\
&= \frac{1}{P}\left(\frac{n_{11}}{n_{11}} + \frac{n_{12}}{n_{12}} + \dots +
\frac{n_{PK_P}}{n_{PK_P}}\right) = \frac{K}{P}.
\end{split}
$$
By combining all the terms we get
$$
\mathrm{Trace}(V) = \frac{K}{P} - 2 + 1 = \frac{K}{P} - 1.
$$
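As a sanity check, the result can be verified numerically on a small hypothetical complete disjunctive table ($P = 2$ variables, $K = 5$ modalities in total, $n = 4$ individuals; the data is made up for illustration):

```{r}
# Hypothetical indicator data: columns 1-2 belong to variable 1,
# columns 3-5 belong to variable 2.
X <- rbind(c(1, 0, 1, 0, 0),
           c(0, 1, 0, 1, 0),
           c(1, 0, 0, 0, 1),
           c(0, 1, 1, 0, 0))
n <- nrow(X)
K <- ncol(X)
P <- 2
n_pl <- colSums(X)
# Entries t_{i,pl} = (x_{ipl} - n_pl / n) / sqrt(P * n_pl):
T_mat <- sweep(sweep(X, 2, n_pl / n), 2, sqrt(P * n_pl), "/")
V <- t(T_mat) %*% T_mat
sum(diag(V)) # 1.5
K / P - 1    # 1.5
```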