--- title: "Multivariate Statistical Analysis - Exercise Session 5" date: "`r format(Sys.time(), '%d.%m.%Y')`" output: bookdown::pdf_document2: latex_engine: xelatex number_sections: no toc: no header-includes: - \usepackage{amsmath} --- \newcommand{\cov}{\mathrm{Cov}} \newcommand{\e}{\mathbb{E}} # Problem 1: Frequency analysis First we introduce some notation. Consider a sample of size $n$ described by two variables, $x$ with categories $A_1, \ldots, A_J$ and $y$ with categories $B_1, \ldots, B_K$. Let $n_{jk}$ denote the number of observations having categories $A_j$ and $B_j$. Let $$ f_{jk} = \frac{n_{jk}}{n} $$ denote the *relative frequencies* corresponding to $n_{jk}$. Additionally, we denote $$ n_{j.} = \sum_{k=1}^K n_{jk}, \quad n_{.k} = \sum_{j=1}^J n_{jk}, $$ and similarly, $$ f_{j.} = \sum_{k=1}^K f_{jk}, \quad f_{.k} = \sum_{j=1}^J f_{jk}. $$ First we read the data and import required packages. ```{r} library(ggplot2) library(reshape2) data <- read.table("SCIENCEDOCTORATES.txt", header = TRUE, row.names = 1) ``` Below we show the *contigency table* of the data. For further analysis let us remove \texttt{Total} row and \texttt{Total} column. With command \texttt{proportions} we can calculate relative frequencies, row profiles and column profiles. Also, function \texttt{proportions} requires matrix as an input. ```{r} data data <- as.matrix(data[-13, -9]) # Remove total row and total column ``` Below we calculate the *relative frequencies*. ```{r} f <- proportions(data) all(data / sum(data) == f) round(f, 2) ``` Next we calculate the *row profiles*. They are given by $$ f_{k|j} = \frac{n_{jk}}{n_{j.}} = \frac{f_{jk}}{f_{j.}}. $$ Frequency $f_{k|j}$ estimates the probability $$ P(y \in B_k | x \in A_j). $$ ```{r} rowp <- proportions(data, 1) all(rowp == sweep(data, 1, rowSums(data), "/")) round(rowp, 2) ``` *Column profiles* can be calculated similarly to row profiles. They are given by $$ f_{j|k} = \frac{n_{jk}}{n_{.k}} = \frac{f_{jk}}{f_{.k}}. $$ Frequency $f_{j|k}$ estimates the probability $$ P(x \in A_j | y \in B_k). $$ ```{r} colp <- proportions(data, 2) all(colp == sweep(data, 2, colSums(data), "/")) round(colp, 2) ``` Now let us calculate the *attraction repulsion matrix* $D$. Elements of the matrix are given by $$ d_{jk} = \frac{n_{jk}}{n_{jk}^*} = \frac{f_{jk}}{f_{jk}^*}, $$ where $n_{jk}^* = \frac{n_{j.} n_{.k}}{n}$ and $f_{jk}^* = f_{j.} f_{.k}$. ```{r} # v1 and v2 are same as total row and total column v1 <- matrix(rowSums(data), ncol = 1) v2 <- matrix(colSums(data), nrow = 1) # Expected number of observations under independence e <- (v1 %*% v2) / sum(data) # n_{jk}^{*} from above # We obtain attraction repulsion matrix D # simply dividing each n_{jk} by n_{jk}^{*} d <- data / e # Values near 1: The year and science are independent # Values < 1: The science is less frequent in that specific year # Values > 1: The science is more frequent in that specific year round(d, 2) ``` Next we visualize attraction repulsion matrix as a heatmap. First we transform the data in appropriate form for the \texttt{ggplot2} with the \texttt{melt} function from the package \texttt{reshape2}. ```{r} melted <- melt(d, varnames = c("science", "year"), value.name = "ar") head(melted) ``` Now we are ready to plot the heatmap of attraction repulsion matrix with \texttt{ggplot2}. ```{r heatmap, fig.cap="Heatmap of the attraction repulsion matrix."} ggplot(melted, aes(x = science, y = year, fill = ar)) + geom_tile() + scale_fill_gradient2(low = "blue", high = "red", mid = "white", midpoint = median(melted$ar), limits = range(melted$ar), name = "AR value") + coord_fixed(ratio = 1) + labs(x = "Science", y = "Year") + theme(axis.text.x = element_text(angle = 45, size = 9, vjust = 1, hjust = 1), panel.background = element_blank()) ``` # Problem 2: Covariance matrix Let $x$ be a $p$-variate continuous random variable. Show that $\cov[x]$ is positive semi-definite. Recall the definition of positive semi-definitess: ::: {.definition} A symmetric and real-valued $p \times p$ matrix $A$ is said to be positive semi-definite if the scalar $a^T A a$ is nonnegative for every real-valued column vector $a \in \mathbb{R}^p$. ::: Now, \begin{equation*} \begin{split} &a^T \cov[x] a \\ &= a^T \e\left[\left(x - \e[x]\right)\left(x - \e[x]\right)^T\right] a \\ &= \e\left[ \underbrace{a^T\left(x - \e[x]\right)}_{=y \in\mathbb{R}} \left(x - \e[x]\right)^T a \right] \\ &= \e\left[yy^T\right] = \e[y^2] \geq 0. \end{split} \end{equation*}