---
title: "Multivariate Statistical Analysis - Exercise Session 8"
date: "`r format(Sys.time(), '%d.%m.%Y')`"
output:
bookdown::pdf_document2:
extra_dependencies: ["float"]
latex_engine: xelatex
number_sections: no
toc: no
header-includes:
- \usepackage{amsmath}
- \usepackage{siunitx}
---
\newcommand{\cor}{\mathrm{Cor}}
\newcommand{\cov}{\mathrm{Cov}}
\newcommand{\var}{\mathrm{Var}}
\newcommand{\std}{\mathrm{Std}}
```{r, echo=FALSE}
# Figure placement
knitr::opts_chunk$set(fig.pos = "H", out.extra = "")
```
# Problem 1: Canonical correlation analysis
First we read the data and define groups $X$ and $Y$.
```{r}
car <- read.table("CAR.txt", header = TRUE, sep = "\t")
dim(car)
head(car)
# X = (Price, Value)
x <- as.matrix(car[, c(6, 5)])
# Y = (Economy, Service, Desing, Sport, Safety, Easy.h)
y <- as.matrix(car[, c(3, 4, 7:10)])
xy <- cbind(x, y)
rownames(xy) <- paste(car$Type, car$Model)
head(xy)
```
## a) Compute the sample canonical vectors with the corrected scaling
Let
$$
z = (x^T,y^T)^T,
$$
and let
$$
\cov(z) = \Sigma =
\begin{pmatrix}
\Sigma_{11} & \Sigma_{12} \\
\Sigma_{21} & \Sigma_{22}
\end{pmatrix}.
$$
Define
$$
\begin{split}
&M_1 = \Sigma_{11}^{-1} \Sigma_{12} \Sigma_{22}^{-1} \Sigma_{21}
\quad\text{and} \\
&M_2 = \Sigma_{22}^{-1} \Sigma_{21} \Sigma_{11}^{-1} \Sigma_{12}.
\end{split}
$$
Now the canonical vectors $\alpha_k$ are the eigenvectors of $M_1$ ($\alpha_k$ corresponds to the $k$th largest eigenvalue), and the canonical vectors $\beta_k$ are the eigenvectors of $M_2$. First we compute $M_1$ and $M_2$.
```{r}
r <- cov(xy)
r11 <- r[1:2, 1:2] # cov(X)
r22 <- r[3:8, 3:8] # cov(Y)
r21 <- r[3:8, 1:2] # cov(Y,X)
r12 <- r[1:2, 3:8] # cov(X,Y)
r11_inv <- solve(r11)
r22_inv <- solve(r22)
m1 <- r11_inv %*% r12 %*% r22_inv %*% r21
m2 <- r22_inv %*% r21 %*% r11_inv %*% r12
```
Now we can compute *unscaled* canonical vectors.
```{r}
alpha1 <- eigen(m1)$vectors[, 1]
alpha2 <- eigen(m1)$vectors[, 2]
beta1 <- eigen(m2)$vectors[, 1]
beta2 <- eigen(m2)$vectors[, 2]
```
We want to scale canonical vectors such that
$$
\var(\alpha_k^T x) = 1 = \var(\beta_k^T y).
$$
- How to get the correct scales?
Let $\tilde\alpha_k$ denote unscaled canonical vector. Then we get correctly scaled canonical vector $\alpha_k$ by
$$
\alpha_k = \frac{1}{\std(\tilde\alpha_k^T x)} \tilde\alpha_k,
$$
since now
$$
\var(\alpha_k x)
= \var\left(\frac{1}{\std(\tilde\alpha_k^T x)} \tilde\alpha_k x\right)
= \left(\frac{1}{\std(\tilde\alpha_k^T x)}\right)^2 \var(\tilde\alpha_k^T x)
= 1.
$$
Additionally, by affine equivariance we have
$$
\begin{split}
&\var(\tilde\alpha_k^T x) = \tilde\alpha_k^T \cov(x) \tilde\alpha_k
= \tilde\alpha_k^T \Sigma_{11} \tilde\alpha_k \\
&\Rightarrow \std(\tilde\alpha_k^T x) = \sqrt{\tilde\alpha_k^T
\Sigma_{11} \tilde\alpha_k}.
\end{split}
$$
Similar calculations can be performed for $\beta_k$. Thus correctly scaled canonical vectors are given by
```{r}
alpha1 <- alpha1 / sqrt((alpha1 %*% r11 %*% alpha1)[1, 1])
alpha2 <- alpha2 / sqrt((alpha2 %*% r11 %*% alpha2)[1, 1])
beta1 <- beta1 / sqrt((beta1 %*% r22 %*% beta1)[1, 1])
beta2 <- beta2 / sqrt((beta2 %*% r22 %*% beta2)[1, 1])
```
## b) Score vectors
Sample scores can be calculated as
$$
\eta_{ki} = \alpha_k^T x_i \quad\text{and}\ \phi_{ki} = \beta_k^T y_i,
$$
where $x_i$ is the $i$th row of $X$ and $y_i$ is the $i$th row of $Y$. More coincisely,
$$
\eta_k = X \alpha_k \quad\text{and}\ \phi_k = Y \beta_k.
$$
```{r}
eta1 <- x %*% alpha1
eta2 <- x %*% alpha2
phi1 <- y %*% beta1
phi2 <- y %*% beta2
```
Now we can check that
$$
\var(\eta_k) = 1 = \var(\phi_k),
$$
$$
\begin{cases}
\cor(\eta_1, \eta_2) = 0, \\
\cor(\phi_1, \phi_2) = 0
\end{cases}
$$
and that
$$
\begin{cases}
\cor(\eta_1, \phi_1) = \sqrt{\lambda_1}, \\
\cor(\eta_2, \phi_2) = \sqrt{\lambda_2}.
\end{cases}
$$
```{r}
c(var(eta1), var(eta2), var(phi1), var(phi2))
c(cor(eta1, eta2), cor(phi1, phi2))
c(cor(eta1, phi1), cor(eta2, phi2))
sqrt(eigen(m1)$values)
sqrt(eigen(m2)$values[1:2])
```
So canonical correlations are $\rho_1 = `r round(cor(eta1, phi1), 2)`$ and $\rho_1 = `r round(cor(eta2, phi2), 2)`$.
## c) Interpret the first pair of canonical variables
```{r, include=FALSE}
a_names <- paste("\\times", c("\\textbf{Price}", "\\textbf{Value}"))
a <- round(alpha1, 2)
a1 <- paste(ifelse(a[-1] >= 0, "+", "-"), abs(a[-1]))
a1 <- c(paste(ifelse(a[1] >= 0, "", "-"), abs(a[1])), a1)
a1 <- paste(a1, a_names, collapse = " ")
b_names <- paste("\\times", c("\\textbf{Economy}", "\\textbf{Service}",
"\\textbf{Design}", "\\textbf{Sport}",
"\\textbf{Safety}", "\\textbf{Easy h.}"))
b <- round(beta1, 2)
b1 <- paste(ifelse(b[-1] >= 0, "+", "-"), abs(b[-1]))
b1 <- c(paste(ifelse(b[1] >= 0, "", "-"), abs(b[1])), b1)
b1 <- paste(b1, b_names, collapse = " ")
```
For the first pair of canonical variables we got
$$
\begin{split}
&\eta_1 = `r a1` \\
&\phi_1 = `r b1`.
\end{split}
$$
Remember that scales for the variables are
$$
1 = \text{very good}\ \text{and}\ 6 = \text{very bad}.
$$
For example, $\textbf{Value} = 1$ means that the car loses its value slowly, which is a good thing. On the contrary, cars with $\textbf{Value} = 6$ lose value very fast.
First let's interpret $x$-axis of Figure \@ref(fig:1score). Based on the weights for \textbf{Price} and \textbf{Value} we have very expensive but valuable cars on the right. On the other hand, cheap cars that lose value fast are on the left. Note also, that \textbf{Value} has almost twice as much weight in the scores compared to \textbf{Price}. All in all, $x$-axis could be interpreted as *Value index of the car* and cars on the right can be considered as worthy investments.
We can interpret $y$-axis similarly. Variable \textbf{Design} has almost negligible weight. However, \textbf{Economy} has positive contribution to scores and other variables have negative weights. Thus uppermost cars on Figure \@ref(fig:1score) have very good \textbf{Service}, \textbf{Sport}, \textbf{Safety} and \textbf{Easy h.} but bad \textbf{Economy}. Therefore, uppermost cars use a lot of fuel but have otherwise good qualities (vice verce for lowest cars). All in all, $y$-axis can be interpreted as *Quality of the car*.
```{r 1score, fig.cap="Scores corresponding to the first pair of canonical variables."}
plot(eta1, phi1, xlab = expression(eta[1]),
ylab = expression(phi[1]), pch = NA)
text(eta1, phi1, labels = paste(car$Type, car$Model))
```
## d) Interpret the second pair of canonical variables
```{r, include=FALSE}
a <- round(alpha2, 2)
a2 <- paste(ifelse(a[-1] >= 0, "+", "-"), abs(a[-1]))
a2 <- c(paste(ifelse(a[1] >= 0, "", "-"), abs(a[1])), a2)
a2 <- paste(a2, a_names, collapse = " ")
b <- round(beta2, 2)
b2 <- paste(ifelse(b[-1] >= 0, "+", "-"), abs(b[-1]))
b2 <- c(paste(ifelse(b[1] >= 0, "", "-"), abs(b[1])), b2)
b2 <- paste(b2, b_names, collapse = " ")
```
For the second pair of canonical variables we got
$$
\begin{split}
&\eta_2 = `r a2` \\
&\phi_2 = `r b2`.
\end{split}
$$
Now let's interpret $x$-axis of Figure \@ref(fig:2score). Notice that \textbf{Price} and \textbf{Value} have almost identical weights. Thus $x$-axis describes mean of \text{Price} and \textbf{Value}. So one interpretation for $x$-axis could be a simple *value for money index*. Cars one the right have very good value for money index but cars on the left have a poor value for money index.
```{r}
# The x-axis reflects the average of Value and Price
sort(rowSums(xy[, 1:2]) / 2)
```
Now the scores $\phi_2$ reflect what kind of qualities cars with certain value for money index have. Weights for \textbf{Economy}, \textbf{Service} and \textbf{Easy h.} are negative, and weight for \textbf{Safety} is positive. On the other hand, weights for \textbf{Design} and \textbf{Sport} are negligible or very close to zero. Thus cars with good value for money index have good services, use little gas, are easy to handle but have maybe a bit worse safety than high-end cars. Below we show the raw numbers for an expensive car, a car with good value for money index and a cheap car.
```{r}
car[c(4, 22, 24), ]
```
All in all, one interpretation for $y$-axis could be *consumer-friendliness*.
```{r 2score, fig.cap="Scores corresponding to the second pair of canonical variables."}
plot(eta2, phi2, xlab = expression(paste(eta[2])),
ylab = expression(paste(phi[2])), pch = NA)
text(eta2, phi2, labels = paste(car$Type, car$Model))
```
# Problem 2: Discussion about course project
- Project work is compulsory and late submissions are not graded.
- See paragraph "About grading of the project work" in MyCourses Assignments tab. There you can find how the project is graded.
+ For example, If your project work is amazing in other parts but you do not include univariate analysis $\Rightarrow$ $5/6$p.
- You can use multiple methods that are learned on the course, however, it is suggested to use only one method for multivariate analysis.
- You can include code in the final pdf but it is not necessary.
- **No finding is a finding!**
- Some possible data sources are given below:
+ [\textcolor{blue}{Kaggle}](https://www.kaggle.com/)
+ [\textcolor{blue}{OECD}](https://www.oecd.org/)
+ [\textcolor{blue}{Statistics Finland}](https://www.stat.fi/index_en.html)
+ [\textcolor{blue}{Our World in Data}](https://ourworldindata.org/)
# Hint for homework 8
Ghost imaginary parts from eigenvectors can be removed with the function \texttt{as.numeric}.
```{r}
v1 <- c(1 + 0i, 2 + 0i, 3 + 0i)
class(v1)
v1
v2 <- as.numeric(v1)
class(v2)
v2
```