---
title: "Multivariate Statistical Analysis - Exercise Session 6"
date: "`r format(Sys.time(), '%d.%m.%Y')`"
output:
  bookdown::pdf_document2:
    extra_dependencies: ["float"]
    latex_engine: xelatex
    number_sections: no
    toc: no
header-includes:
- \usepackage{amsmath}
- \usepackage{siunitx}
---
```{r, echo=FALSE}
# Figure placement
knitr::opts_chunk$set(fig.pos = "H", out.extra = "")
```
- The book "Correspondence Analysis in Practice" by Michael Greenacre gives an excellent introduction to correspondence analysis. The book focuses on interpretation and provides many examples, and it also includes an appendix about the package \texttt{ca}. Of course, it is not necessary to read the book for the course, but it can be a good reference if, for example, you decide to use CA in your project.
# Problem 1: Bivariate correspondence analysis
Install the package \texttt{ca} if you haven't yet. With the package \texttt{ca} we can perform correspondence analysis.
```{r eval=FALSE}
install.packages("ca")
```
Next, read the data and load the package \texttt{ca}.
```{r}
library(ca)
science <- read.table("SCIENCEDOCTORATES.txt", header = TRUE, sep = "\t",
row.names = 1)
science <- science[-nrow(science), -ncol(science)]
science
```
## $\chi^2$-test
As a first step, we perform a $\chi^2$-test of independence. The null hypothesis and alternative hypothesis of the $\chi^2$-test are
$$
\begin{split}
&H_0: \text{Discipline and year are independent,} \\
&H_1: \text{Discipline and year are not independent.}
\end{split}
$$
First, we perform the $\chi^2$-test manually.
```{r}
n <- sum(science)
v1 <- matrix(rowSums(science), ncol = 1)
v2 <- matrix(colSums(science), nrow = 1)
# Theoretical frequencies under independence
e <- v1 %*% v2 / n
# Calculate chi-squared statistic
chisq_statistic <- sum((science - e)^2 / e)
chisq_statistic
# p-value
i <- nrow(science)
j <- ncol(science)
pchisq(chisq_statistic, df = (i - 1) * (j - 1), lower.tail = FALSE)
```
Of course, R provides a built-in function for performing the $\chi^2$-test.
```{r}
chisq.test(science)
```
The p-value is very small, which suggests that there is dependence between the rows and the columns.
## Correspondence analysis with \texttt{ca} package
```{r}
science_ca <- ca(science)
names(science_ca)
```
Now let's go through the relevant parts of the output.
1. \texttt{sv}: Singular values of the matrix $Z$ from the lecture slides. Let $\sigma_i$ be a singular value of $Z$. Then $\sigma_i^2$ is an eigenvalue of both the matrix $V = Z^TZ$ and the matrix $W = ZZ^T$. The sum of the eigenvalues equals $\chi^2/n$, where $\chi^2$ is the chi-squared test statistic. The value $\chi^2/n$ is called the *inertia* and it describes how much the row/column profiles deviate from their average row/column profile. The eigenvalues also describe how much of the variation is explained by each component.
```{r}
round(sum(science_ca$sv^2), 2) == round(chisq_statistic / n, 2)
science_ca$sv^2 / sum(science_ca$sv^2)
```
2. \texttt{rowmass}/\texttt{colmass}: Row and column masses (marginal relative frequencies). Note that the row masses form the average column profile and the column masses form the average row profile.
```{r}
all(round(science_ca$rowmass, 2) == round(rowSums(science) / sum(science), 2))
all(round(science_ca$colmass, 2) == round(colSums(science) / sum(science), 2))
```
3. \texttt{rowdist}/\texttt{coldist} and \texttt{rowinertia}/\texttt{colinertia}: The $\chi^2$ distances from the row/column profiles to the average row/column profile (centroid) are given by \texttt{rowdist}/\texttt{coldist}. A large $\chi^2$ distance indicates that the corresponding row/column deviates from the average profile. Below we calculate the $\chi^2$ distance between the first row profile and the average row profile manually. We also show that the inertia is a weighted sum of squared $\chi^2$ distances:
$$
\frac{\chi^2}{n} = \sum_i \underbrace{(\text{$i$th row mass}) \times
(\chi^2 \text{ distance from $i$th profile to centroid})^2}
_{\text{$i$th row inertia}}.
$$
```{r}
# Chi-squared distances
science_ca$rowdist
rowprof <- proportions(as.matrix(science), 1)
sqrt(sum((rowprof[1, ] - science_ca$colmass)^2 / science_ca$colmass))
# Inertia is equal to the sum of row inertias
sum(science_ca$rowinertia)
chisq_statistic / sum(science)
# Row inertias are weighted chi-squared distances
science_ca$rowmass * science_ca$rowdist^2
science_ca$rowinertia
```
4. \texttt{rowcoord}/\texttt{colcoord}: Standardized row and column coordinates (scores).
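The standardized coordinates relate to the principal coordinates through the singular values: multiplying the standard coordinates of each dimension by the corresponding singular value gives the principal coordinates, and the principal coordinates in turn reproduce the $\chi^2$ distances to the centroid. A minimal base-R sketch of this relation, using a small made-up contingency table (the table and all object names are illustrative, not part of the exercise data):

```{r}
# Made-up 3x3 contingency table (illustrative only)
N <- matrix(c(20, 10,  5,
               8, 25,  7,
               4,  6, 30), nrow = 3, byrow = TRUE)
P  <- N / sum(N)   # correspondence matrix
r  <- rowSums(P)   # row masses
cm <- colSums(P)   # column masses
# Matrix of standardized residuals Z and its SVD
Z <- diag(1 / sqrt(r)) %*% (P - outer(r, cm)) %*% diag(1 / sqrt(cm))
s <- svd(Z)
# Standard row coordinates, and principal row coordinates obtained by
# scaling each dimension with the corresponding singular value
std_row  <- diag(1 / sqrt(r)) %*% s$u
prin_row <- std_row %*% diag(s$d)
# Squared chi-squared distances of the row profiles to the centroid
rowprof <- P / r
dev     <- sweep(rowprof, 2, cm)
d2      <- rowSums(sweep(dev^2, 2, cm, "/"))
```

Here `rowSums(prin_row^2)` equals `d2`, i.e. the squared norms of the principal coordinates recover the $\chi^2$ distances, and `sum(s$d^2)` equals $\chi^2/n$ for this table.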
## Correspondence analysis manually
First, let us form the matrices $Z$, $V$ and $W$ from the lecture slides.
```{r}
# Observed frequencies
f <- science / n
# Theoretical relative frequencies under independence
f1 <- matrix(rowSums(f), ncol = 1)
f2 <- matrix(colSums(f), nrow = 1)
e <- f1 %*% f2
# Matrices Z, V and W
z <- (f - e) / sqrt(e)
z <- as.matrix(z)
v <- t(z) %*% z
w <- z %*% t(z)
# Save the nonzero eigenvalues and the eigenvectors of V and W
princip_inertia <- matrix(eigen(v)$values[1:7], nrow = 1)
u1 <- eigen(v)$vectors
u2 <- eigen(w)$vectors
```
Then let us check that eigenvalues of $V$ and $W$ correspond to principal inertias.
```{r}
eigen(v)$values
eigen(w)$values
science_ca$sv^2
```
Notice that $V$ and $W$ have the same nonzero eigenvalues. Next, let us form the matrices $R$ and $C$ from the lecture slides.
```{r}
# Matrix R
one <- matrix(rep(1, nrow(science)), ncol = 1)
shifting <- one %*% sqrt(f2)
scaling <- f1 %*% sqrt(f2)
r <- f / scaling - shifting
r <- as.matrix(r)
# Matrix C
one <- matrix(rep(1, ncol(science)), nrow = 1)
shifting <- sqrt(f1) %*% one
scaling <- sqrt(f1) %*% f2
c <- f / scaling - shifting
c <- as.matrix(c)
```
Now we can calculate row and column principal coordinates (scores).
```{r}
# Row principal coordinates
rowcoord <- r %*% u1
# Omit the dimension corresponding to the zero eigenvalue
rowcoord <- rowcoord[, 1:7]
# Column principal coordinates
colcoord <- t(c) %*% u2
# Omit dimensions corresponding to zero eigenvalues
colcoord <- colcoord[, 1:7]
```
Below we calculate the so-called standardized coordinates. These coordinates correspond to the ones returned by the function \texttt{ca}.
```{r}
# Standardized rowcoordinates
stand <- matrix(rep(1, nrow(science)), ncol = 1) %*% princip_inertia
standrowcoord <- rowcoord / sqrt(stand)
standrowcoord[, 1]
science_ca$rowcoord[, 1]
# Standardized colcoordinates
stand2 <- matrix(rep(1, ncol(science)), ncol = 1) %*% princip_inertia
standcolcoord <- colcoord / sqrt(stand2)
standcolcoord[, 1]
science_ca$colcoord[, 1]
```
# Interpretation and \texttt{summary(ca)}
The summary gives additional information, such as the quality of representation and the contributions of the modalities to each axis. Below we go through the most relevant parts of the summary.
```{r}
summary(science_ca)
```
1. \texttt{ctr}: Contributions of the modalities to the construction of the $i$th CA axis, reported in permille. Below we calculate the contribution of the modality \texttt{Engineering} to the second principal axis.
```{r}
1000 * f1[1] * (rowcoord[1, 2])^2 / princip_inertia[2]
```
2. \texttt{cor} and \texttt{qlt}: The column \texttt{cor} gives the quality of representation of the rows/columns with respect to the $i$th axis, whereas the column \texttt{qlt} gives the quality of representation with respect to the plane spanned by the first two axes. Below we calculate the quality of representation of the row \texttt{Engineering} by the second CA axis.
```{r}
1000 * (rowcoord[1, 2])^2 / science_ca$rowdist[1]^2
```
3. \texttt{k=i}: Principal row/column coordinates.
4. \texttt{inr}: Row/column inertias divided by the total inertia.
The function \texttt{plot.ca} provides many options for scaling.
(ref:capsymmetric) \texttt{map == "symmetric"}
(ref:caprowprincipal) \texttt{map == "rowprincipal"}
(ref:caprowgreen) \texttt{map == "rowgreen"}
(ref:caprowgab) \texttt{map == "rowgab"}
1. \texttt{map == "symmetric"}: Here one *cannot* draw any conclusions based on the distances between columns and rows. However, row-to-row distances approximate the $\chi^2$ distances between the rows, and row-to-origin distances approximate the $\chi^2$ distances between the corresponding row and the average row profile. The same interpretations hold for column-to-column distances.
```{r symmetric, fig.cap="(ref:capsymmetric)"}
plot(science_ca, map = "symmetric")
```
2. \texttt{map == "rowprincipal"}: Here the rows are in principal coordinates (representing row profiles) and the columns are in standard coordinates (representing column vertices). Row-to-row distances approximate the $\chi^2$ distances between the rows, and the distances between rows and columns describe the row profiles: the closer a profile is to a vertex, the higher the value of the profile for that category, assuming the first two axes account for a large proportion of the inertia.
```{r rowprincipal, fig.cap="(ref:caprowprincipal)"}
plot(science_ca, map = "rowprincipal")
```
As Figure \@ref(fig:rowprincipal) shows, the disadvantage of the \texttt{rowprincipal}/\texttt{colprincipal} plot is that the profile points tend to cluster in the middle (especially when the inertia is small), which can make the visualization hard to read.
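One way to see why the row-to-column configuration in a \texttt{rowprincipal} map still carries information is the biplot inner-product property: row principal coordinates multiplied by column standard coordinates reconstruct the contingency ratios $p_{ij}/(r_ic_j)$ minus one. A base-R sketch with a small made-up table (the data and object names are illustrative, not the exercise data):

```{r}
# Made-up 3x3 contingency table (illustrative only)
N <- matrix(c(12,  3,  5,
               4, 15,  6,
               7,  5, 18), nrow = 3, byrow = TRUE)
P  <- N / sum(N)
r  <- rowSums(P)
cm <- colSums(P)
# Standardized residuals and their SVD
Z <- diag(1 / sqrt(r)) %*% (P - outer(r, cm)) %*% diag(1 / sqrt(cm))
s <- svd(Z)
prin_row <- diag(1 / sqrt(r)) %*% s$u %*% diag(s$d)  # row principal coordinates
std_col  <- diag(1 / sqrt(cm)) %*% s$v               # column standard coordinates
# Inner products reconstruct p_ij / (r_i * c_j) - 1
recon <- prin_row %*% t(std_col)
```

A value of `recon[i, j]` above zero corresponds to a contingency ratio above one, i.e. attraction between row $i$ and column $j$; a value below zero corresponds to repulsion.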
3. \texttt{map == "rowgreen"}/\texttt{map == "rowgab"}: The rows are in principal coordinates and the columns are in standard coordinates multiplied by the square root of the mass (\texttt{"rowgreen"}) or by the mass (\texttt{"rowgab"}) of the corresponding point. Again, row-to-row distances approximate the $\chi^2$ distances between the rows. There is no interpretation for row-to-column distances. However, an angle of under 90 degrees between a row and a column implies attraction between the modalities, and an angle of over 90 degrees implies repulsion, assuming good quality of display. The point of the options \texttt{"rowgreen"}/\texttt{"rowgab"} is to scale the vertices in such a way that the biplot becomes more readable compared to \texttt{rowprincipal}. Note that with the option \texttt{rowprincipal} the angles can be interpreted in the same way as with the options \texttt{"rowgreen"}/\texttt{"rowgab"}.
```{r rowgreen, fig.cap="(ref:caprowgreen)"}
plot(science_ca, map = "rowgreen", arrows = c(FALSE, TRUE))
```
```{r rowgab, fig.cap="(ref:caprowgab)"}
plot(science_ca, map = "rowgab", arrows = c(FALSE, TRUE))
```
\texttt{colprincipal}, \texttt{colgab} and \texttt{colgreen} can be interpreted similarly to \texttt{rowprincipal}, \texttt{rowgab} and \texttt{rowgreen}, respectively, but from the viewpoint of the column profiles.
From the above plots one can try to interpret the first two CA axes:
- 1st axis: Hard sciences vs. soft sciences
- 2nd axis: Formula-heavy sciences vs. experimental sciences
# Problem 2: Association between column and row profiles
First, we show that $Z^TZ$ and $ZZ^T$ have the same nonzero eigenvalues. From the definitions of an eigenvalue and an eigenvector we get
\begin{equation} (\#eq:eigen1)
Vv_i = Z^TZv_i = \lambda_iv_i
\end{equation}
and
\begin{equation} (\#eq:eigen2)
Ww_i = ZZ^Tw_i = \mu_iw_i.
\end{equation}
Multiply Equation \@ref(eq:eigen1) with $Z$ from the left side and note that $V = Z^TZ$,
$$
\Rightarrow ZVv_i = Z\underbrace{Z^TZv_i}_{=\lambda_iv_i} = \lambda_i Z v_i.
$$
Thus
$$
\begin{split}
&ZZ^T(Zv_i) = \lambda_i (Zv_i) \\
&\Rightarrow ZZ^T v_i^* = \lambda_i v_i^*,
\quad\text{where} \ v_i^* = Zv_i.
\end{split}
$$
Hence, $\lambda_i$ is an eigenvalue of $ZZ^T = W$ with eigenvector $Zv_i$. Since the eigenvectors $v_i$ are normalized to unit length, the squared length of this eigenvector is given by
$$
\| Zv_i \|_2^2 = (Zv_i)^T(Zv_i) = v_i^T\underbrace{Z^TZv_i}_{=\lambda_i v_i}
= \lambda_i v_i^Tv_i = \lambda_i.
$$
Thus
$$
w_i = \frac{1}{\| Zv_i \|_2} Zv_i = \frac{1}{\sqrt{\lambda_i}} Zv_i.
$$
The same proof works in the other direction as well: start by multiplying Equation \@ref(eq:eigen2) from the left by $Z^T$ and proceed similarly.
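The argument above can also be checked numerically: the nonzero eigenvalues of $Z^TZ$ and $ZZ^T$ coincide, and $w_i = Zv_i/\sqrt{\lambda_i}$ is a unit eigenvector of $ZZ^T$. A quick base-R sketch with a random matrix standing in for $Z$ (purely illustrative):

```{r}
set.seed(123)
# Random 5x3 matrix standing in for Z (illustrative only)
Z <- matrix(rnorm(15), nrow = 5)
V <- t(Z) %*% Z   # 3x3
W <- Z %*% t(Z)   # 5x5, rank 3
ev_V <- eigen(V)
ev_W <- eigen(W)
# The three nonzero eigenvalues of W equal the eigenvalues of V;
# the remaining two eigenvalues of W are (numerically) zero
round(ev_V$values, 10)
round(ev_W$values, 10)
# w_1 = Z v_1 / sqrt(lambda_1) is a unit eigenvector of W
lambda1 <- ev_V$values[1]
w1 <- Z %*% ev_V$vectors[, 1] / sqrt(lambda1)
sum(w1^2)                          # squared length of w_1
max(abs(W %*% w1 - lambda1 * w1))  # residual of the eigenvector equation
```

The squared length evaluates to one and the residual to essentially zero, matching the derivation.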