---
title: "Multivariate Statistical Analysis - Exercise Session 5"
date: "`r format(Sys.time(), '%d.%m.%Y')`"
output:
  bookdown::pdf_document2:
    latex_engine: xelatex
    number_sections: no
    toc: no
header-includes:
  - \usepackage{amsmath}
---

\newcommand{\cov}{\mathrm{Cov}}
\newcommand{\e}{\mathbb{E}}

# Problem 1: Frequency analysis

First we introduce some notation. Consider a sample of size $n$ described by two variables, $x$ with categories $A_1, \ldots, A_J$ and $y$ with categories $B_1, \ldots, B_K$. Let $n_{jk}$ denote the number of observations having categories $A_j$ and $B_j$. Let
$$
f_{jk} = \frac{n_{jk}}{n}
$$

denote the *relative frequencies* corresponding to $n_{jk}$. Additionally, we denote
$$
  n_{j.} = \sum_{k=1}^K n_{jk}, \quad n_{.k} = \sum_{j=1}^J n_{jk},
$$
and similarly,
$$
  f_{j.} = \sum_{k=1}^K f_{jk}, \quad f_{.k} = \sum_{j=1}^J f_{jk}.
$$

First we read the data and import required packages.
```{r}
library(ggplot2)
library(reshape2)

data <- read.table("SCIENCEDOCTORATES.txt", header = TRUE, row.names = 1)
```

Below we show the *contigency table* of the data. For further analysis let us remove \texttt{Total} row and \texttt{Total} column. With command \texttt{proportions} we can calculate relative frequencies, row profiles and column profiles. Also, function \texttt{proportions} requires matrix as an input.
```{r}
data
data <- as.matrix(data[-13, -9]) # Remove total row and total column
```

Below we calculate the *relative frequencies*.
```{r}
f <- proportions(data)
all(data / sum(data) == f)
round(f, 2)
```

Next we calculate the *row profiles*. They are given by
$$
  f_{k|j} = \frac{n_{jk}}{n_{j.}} = \frac{f_{jk}}{f_{j.}}.
$$
Frequency $f_{k|j}$ estimates the probability
$$
P(y \in B_k | x \in A_j).
$$

```{r}
rowp <- proportions(data, 1)
all(rowp == sweep(data, 1, rowSums(data), "/"))
round(rowp, 2)
```

*Column profiles* can be calculated similarly to row profiles. They are given by

$$
  f_{j|k} = \frac{n_{jk}}{n_{.k}} = \frac{f_{jk}}{f_{.k}}.
$$
Frequency $f_{j|k}$ estimates the probability
$$
P(x \in A_j | y \in B_k).
$$
```{r}
colp <- proportions(data, 2)
all(colp == sweep(data, 2, colSums(data), "/"))
round(colp, 2)
```

Now let us calculate the *attraction repulsion matrix* $D$. Elements of the matrix are given by
$$
  d_{jk} = \frac{n_{jk}}{n_{jk}^*} = \frac{f_{jk}}{f_{jk}^*},
$$
where $n_{jk}^* = \frac{n_{j.} n_{.k}}{n}$ and $f_{jk}^* = f_{j.} f_{.k}$.

```{r}
# v1 and v2 are same as total row and total column
v1 <- matrix(rowSums(data), ncol = 1)
v2 <- matrix(colSums(data), nrow = 1)

# Expected number of observations under independence
e <- (v1 %*% v2) / sum(data) # n_{jk}^{*} from above

# We obtain attraction repulsion matrix D
# simply dividing each n_{jk} by n_{jk}^{*}
d <- data / e

# Values near 1: The year and science are independent
# Values < 1: The science is less frequent in that specific year
# Values > 1: The science is more frequent in that specific year
round(d, 2)
```

Next we visualize attraction repulsion matrix as a heatmap. First we transform the data in appropriate form for the \texttt{ggplot2} with the \texttt{melt} function from the package \texttt{reshape2}.
```{r}
melted <- melt(d, varnames = c("science", "year"), value.name = "ar")
head(melted)
```

Now we are ready to plot the heatmap of attraction repulsion matrix with \texttt{ggplot2}.

```{r heatmap, fig.cap="Heatmap of the attraction repulsion matrix."}
ggplot(melted, aes(x = science, y = year, fill = ar)) +
  geom_tile() +
  scale_fill_gradient2(low = "blue", high = "red", mid = "white",
                       midpoint = median(melted$ar), limits = range(melted$ar),
                       name = "AR value") +
  coord_fixed(ratio = 1) +
  labs(x = "Science", y = "Year") +
  theme(axis.text.x = element_text(angle = 45, size = 9, vjust = 1, hjust = 1),
        panel.background = element_blank())
```


# Problem 2: Covariance matrix

Let $x$ be a $p$-variate continuous random variable. Show that $\cov[x]$ is positive semi-definite.

Recall the definition of positive semi-definitess:

::: {.definition}
A symmetric and real-valued $p \times p$ matrix $A$ is said to be positive semi-definite if the scalar $a^T A a$ is nonnegative for every real-valued column vector $a \in \mathbb{R}^p$.
:::

Now,
\begin{equation*}
  \begin{split}
    &a^T \cov[x] a \\
    &= a^T \e\left[\left(x - \e[x]\right)\left(x - \e[x]\right)^T\right] a \\
    &= \e\left[ \underbrace{a^T\left(x - \e[x]\right)}_{=y \in\mathbb{R}}
    \left(x - \e[x]\right)^T a \right] \\
    &= \e\left[yy^T\right] = \e[y^2] \geq 0.
  \end{split}
\end{equation*}