---
title: "Multivariate Statistical Analysis - Exercise Session 10"
date: "`r format(Sys.time(), '%d.%m.%Y')`"
output:
  bookdown::pdf_document2:
    extra_dependencies: ["float"]
    latex_engine: xelatex
    number_sections: no
    toc: no
header-includes:
  - \usepackage{amsmath}
  - \usepackage{siunitx}
---
```{r, echo=FALSE}
# Figure placement
knitr::opts_chunk$set(fig.pos = "H", out.extra = "")
```
# Problem 1: Hierarchical Clustering
First we load the required packages and read the data.
```{r, message=FALSE}
library(RColorBrewer)
library(reshape2)
library(dendextend)
poll <- read.table("polls.txt", header = TRUE, sep = "\t", row.names = 1)
dim(poll)
head(poll)
```
# a) Visualization of the data
It is quite hard to spot clusters from Figure \@ref(fig:scatterpoll).
```{r scatterpoll, fig.cap="Pairwise scatter plots of the variables. Each region is colored differently."}
pairs(poll, gap = 0, upper.panel = NULL, bg = brewer.pal(nrow(poll), "Set3"),
pch = 21, cex = 1.25)
```
# b) Euclidean distances between the regions
With the function \texttt{dist} we can calculate the pairwise distances between the rows of the data. The function returns a \texttt{dist} object.
```{r}
poll_dist <- dist(poll, method = "euclidean", diag = TRUE, upper = FALSE)
poll_dist
round(as.matrix(poll_dist), 1)
```
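As a quick sanity check, we can verify on toy data (not the polls data) that \texttt{dist} matches the Euclidean formula computed by hand; the vectors below are made up for illustration.

```{r}
# Toy check: dist() computes sqrt((x_i - x_j)^T (x_i - x_j)) between rows
x <- rbind(c(0, 0), c(3, 4))
d_toy <- dist(x, method = "euclidean")
as.matrix(d_toy)[1, 2]          # 5
sqrt(sum((x[1, ] - x[2, ])^2))  # 5, the same value by hand
```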
# c) Agglomerative hierarchical clustering with minimum distance by hand
Here we present one way to rank Euclidean distances between regions.
```{r}
d <- as.matrix(poll_dist)
d[upper.tri(d)] <- -1
d <- melt(d, varnames = c("Region1", "Region2"), value.name = "dist")
d <- d[d$dist > 0, ]
d <- d[order(d$dist), ]
d
```
Agglomerative hierarchical clustering is performed according to the following rules:

1. Start from the finest partition: $n$ clusters, each containing one data point $x_i$.
2. Calculate distances $d_{ij} = d(x_i,x_j)$, where $d$ is an appropriate distance between individuals. In this case we choose $d$ to be the Euclidean distance, i.e., $d(x_i, x_j) = \sqrt{(x_i - x_j)^T (x_i - x_j)}$.
3. Find the minimal distance and group together the corresponding individuals.
4. Compute distances between the obtained groups. In this case $d(A,B) = \min_{x_i\in A,x_j\in B} d(x_i, x_j)$.
5. Find the minimal distance and group together the corresponding closest groups.
6. Repeat steps 4 and 5 until only a single group remains.

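The steps above can be sketched directly in R. This is a minimal illustration on made-up two-dimensional data; the variable names \texttt{D}, \texttt{clusters}, and \texttt{merges} are ours, not part of the exercise.

```{r}
# Minimal single-linkage sketch following steps 1-6 above (toy data)
x <- rbind(c(0, 0), c(0, 1), c(5, 5), c(5, 6))
D <- as.matrix(dist(x))
clusters <- as.list(seq_len(nrow(x)))  # step 1: finest partition
merges <- list()
while (length(clusters) > 1) {
  # Steps 3-5: find the pair of clusters with minimal single-linkage distance
  best <- c(Inf, NA, NA)
  for (i in seq_along(clusters)) {
    for (j in seq_len(i - 1)) {
      dij <- min(D[clusters[[i]], clusters[[j]]])  # d(A, B) = min over pairs
      if (dij < best[1]) best <- c(dij, i, j)
    }
  }
  i <- best[2]; j <- best[3]
  merges[[length(merges) + 1]] <- c(clusters[[j]], clusters[[i]])
  clusters[[j]] <- c(clusters[[j]], clusters[[i]])  # merge the two clusters
  clusters[[i]] <- NULL
}
merges  # merge order: (1, 2), then (3, 4), then everything together
```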
We refer to the regions by their row numbers, i.e.,
$$
\begin{split}
&\text{Uusimaa} = 1, \\
&\text{Varsinais-Suomi} = 2, \\
&\text{Kanta-Hame} = 3, \\
&\text{Pirkanmaa} = 4, \\
&\text{Paijat-Hame} = 5, \\
&\text{Kymenlaakso} = 6, \\
&\text{Pohjois-Savo} = 7, \\
&\text{Pohjois-Karjala} = 8, \\
&\text{Pohjanmaa} = 9, \\
&\text{Kainuu} = 10, \\
&\text{Lappi} = 11.
\end{split}
$$
Now we are ready to perform clustering by hand with the help of distances \texttt{d}.
$$
\begin{split}
&(1), (2), (3), (4), (5), (6), (7), (8), (9), (10), (11) \\
&(5, 6), (1), (2), (3), (4), (7), (8), (9), (10), (11) \\
&(3, 5, 6), (1), (2), (4), (7), (8), (9), (10), (11) \\
&(3, 5, 6), (2, 4), (1), (7), (8), (9), (10), (11) \\
&(2, 3, 4, 5, 6), (1), (7), (8), (9), (10), (11) \\
&(2, 3, 4, 5, 6), (1), (7), (8), (9), (10, 11) \\
&(2, 3, 4, 5, 6), (7, 8), (9), (10, 11), (1) \\
&(1, 2, 3, 4, 5, 6), (7, 8), (9), (10, 11) \\
&(1, 2, 3, 4, 5, 6), (7, 8, 10, 11), (9) \\
&(1, 2, 3, 4, 5, 6, 7, 8, 10, 11), (9) \\
&(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11) \\
\end{split}
$$
# d) Repeat part c) using the function \texttt{hclust}
Function \texttt{hclust} takes a \texttt{dist} object as an input.
```{r}
poll_clust <- hclust(poll_dist, method = "single")
```
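The returned object records the full merge history. A small toy example (not the polls data) shows how the \texttt{merge} and \texttt{height} components encode it.

```{r}
# Negative entries in $merge refer to single observations,
# positive entries to clusters formed in earlier rows
x <- rbind(c(0, 0), c(0, 1), c(5, 5))
hc <- hclust(dist(x), method = "single")
hc$merge   # row 1 merges observations 1 and 2; row 2 adds observation 3
hc$height  # merge heights: 1 and sqrt(41) (about 6.4)
```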
# e) Plot the classification tree
```{r dendmin, fig.cap="Cluster dendrogram."}
plot(poll_clust, xlab = "Region / Clusters",
ylab = "Tree height [Euclidean distance]", main = NA)
```
# f) Aggregating the clusters using different linkages
Figure \@ref(fig:linkage) shows that with different linkages we get different results.
```{r linkage, fig.cap="Dendrograms with different linkages."}
par(mfrow = c(1, 3))
plot(hclust(poll_dist, method = "single"), main = "Single linkage",
xlab = "Region / Clusters", sub = NA, hang = -1)
plot(hclust(poll_dist, method = "average"), main = "Average linkage",
xlab = "Region / Clusters", sub = NA, hang = -1)
plot(hclust(poll_dist, method = "complete"), main = "Complete linkage",
xlab = "Region / Clusters", sub = NA, hang = -1)
```
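For reference, the three linkages define the distance between clusters $A$ and $B$ as
$$
\begin{split}
&d_{\text{single}}(A, B) = \min_{x_i \in A,\, x_j \in B} d(x_i, x_j), \\
&d_{\text{complete}}(A, B) = \max_{x_i \in A,\, x_j \in B} d(x_i, x_j), \\
&d_{\text{average}}(A, B) = \frac{1}{|A| |B|} \sum_{x_i \in A} \sum_{x_j \in B} d(x_i, x_j).
\end{split}
$$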
# g) Where would you cut the tree?
The difficult part is to choose where to cut the tree, i.e., choose the clusters. This is typically an expert decision as it requires context knowledge. Figure \@ref(fig:dendmin10) shows the minimum distance tree again.
```{r dendmin10, fig.cap="Cluster dendrogram."}
plot(poll_clust,
xlab = "Region / Clusters", ylab = "Tree height [Euclidean distance]",
hang = -1, main = NA, sub = NA)
abline(h = 10, lty = 2, col = 2, lwd = 2)
```
If we cut the tree at height 10, we obtain 5 clusters. The next step would be to annotate the clusters in detail, i.e., to investigate why the regions within each cluster are similar with respect to the covariates used in the analysis. The function \texttt{cutree} cuts the tree at a given height and returns the corresponding cluster memberships.
```{r}
cutree(poll_clust, h = 10)
```
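Instead of a height \texttt{h}, \texttt{cutree} also accepts a desired number of clusters \texttt{k}. A toy example (not the polls data):

```{r}
x <- rbind(c(0, 0), c(0, 1), c(5, 5), c(5, 6))
hc <- hclust(dist(x), method = "single")
cutree(hc, k = 2)  # memberships: 1 1 2 2
```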
The library \texttt{dendextend} provides more options for plotting dendrograms. For example, we can color the region labels according to the cluster they belong to.
```{r dendmincolor, fig.cap="Cluster dendrogram where each region is colored according to the cluster it belongs to."}
dend <- as.dendrogram(poll_clust)
dend <- color_labels(dend, h = 10, col = brewer.pal(5, "Dark2"))
plot(dend)
```
# Problem 2: $k$-means clustering
```{r}
bank <- read.table("BANK.txt", header = TRUE, sep = "\t")
dim(bank)
str(bank)
head(bank)
```
For practical reasons we convert the column \texttt{CODE} to a factor.
```{r}
bank$CODE <- factor(ifelse(bank$CODE == 0, "C1", "C2"))
```
# a) Apply the $k$-means algorithm to obtain two clusters
The $k$-means algorithm involves random initialization, so to make the results reproducible we first set the seed of the random number generator. We can perform $k$-means clustering with the function \texttt{kmeans}, which takes the data as input.
```{r}
set.seed(19032020)
bank_clust <- kmeans(bank[, -1], centers = 2)
bank_clust
```
In the two-dimensional case, the results are easy to illustrate with a scatter plot. Figure \@ref(fig:bankscatter) shows the clustering results.
```{r bankscatter, fig.cap="Scatter plot of clustering results. Color represents the clusters and large points represent the centers used by the algorithm."}
plot(bank[, -1], pch = 16, col = c("blue", "red")[bank_clust$cluster])
# Cluster centers
points(bank_clust$centers[, 1], bank_clust$centers[, 2], pch = 21,
bg = c("blue", "red"), cex = 2)
# You can also add true labels but this makes the image hard to read
# text(bank[, -1], labels = bank$CODE, pos = 3)
```
# b) How many observations are classified to a wrong category?
A simple way to investigate this is to calculate the confusion matrix.
```{r}
table(bank_clust$cluster, bank[, 1])
```
According to the confusion matrix, all observations in class \texttt{C1} are correctly classified. On the other hand, five observations of class \texttt{C2} are misclassified into class \texttt{C1}. Thus $2.5\%$ of the observations are misclassified.
```{r}
5 / nrow(bank)
```
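Since the cluster numbers produced by $k$-means are arbitrary, a general way to compute the misclassification rate is to take the better of the two possible cluster-to-class matchings. A small sketch with made-up labels (\texttt{truth} and \texttt{cluster} below are illustrative, not the bank data):

```{r}
truth   <- c("C1", "C1", "C2", "C2", "C2")
cluster <- c(1, 1, 1, 2, 2)
tab <- table(cluster, truth)
# Errors under each of the two possible matchings; keep the smaller
err <- min(sum(tab) - sum(diag(tab)),
           sum(tab) - sum(diag(tab[, 2:1]))) / sum(tab)
err  # 0.2
```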
# c) Changing the seed number with $k = 2$
You can run the $k$-means algorithm multiple times and see how the results change. In this case only the cluster numbers associated with \texttt{C1} and \texttt{C2} change, but the cluster proportions do not. In general, since the initial centroids of $k$-means are selected at random, the seed of the random number generator affects the results. However, it should not change the results dramatically if the data truly contain a structure that the $k$-means algorithm can detect.
```{r}
# Perform k-means clustering 100 times
bank_clust100 <- replicate(100, kmeans(bank[, -1], centers = 2)$cluster)
# Calculate the cluster proportions
apply(bank_clust100, 2, function(x) table(x) / nrow(bank))
```
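The same idea can be checked on simulated data: on clearly separated clusters, repeated runs should agree up to label switching. A sketch under that assumption (the simulated data and the \texttt{agree} measure are ours):

```{r}
set.seed(1)
# Two well-separated Gaussian clusters of 20 points each
x <- rbind(matrix(rnorm(40), 20, 2), matrix(rnorm(40, mean = 6), 20, 2))
runs <- replicate(50, kmeans(x, centers = 2, nstart = 10)$cluster)
# Agreement of each run with the first, allowing for swapped labels
agree <- apply(runs, 2, function(cl)
  max(mean(cl == runs[, 1]), mean(cl == 3 - runs[, 1])))
all(agree == 1)  # should be TRUE for this well-separated data
```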
# d) Changing the seed number with $k = 3$
Figure \@ref(fig:bankscatter3) shows clustering results for $k = 3$.
```{r bankscatter3, fig.cap="Scatter plot of clustering results with $k = 3$. Color represents the clusters and large points represent the centers used by the algorithm."}
bank_clust3 <- kmeans(bank[, -1], centers = 3)
plot(bank[, -1], pch = 16,
col = c("blue", "red", "yellow")[bank_clust3$cluster])
# Cluster centers
points(bank_clust3$centers[, 1], bank_clust3$centers[, 2], pch = 21,
bg = c("blue", "red", "yellow"), cex = 2)
```
As we can see from the code snippet below, the cluster proportions now change depending on the seed when $k = 3$. Therefore, the $k$-means results are not as stable with three clusters as with two.
```{r}
# Perform k-means clustering 100 times
bank_clust100 <- replicate(100, kmeans(bank[, -1], centers = 3)$cluster)
# Calculate the cluster proportions
apply(bank_clust100, 2, function(x) table(x) / nrow(bank))
```
Note that in clustering problems the labels are typically unknown. Therefore, it is essential to validate the results with expert knowledge and to analyze the obtained clusters carefully. As seen here, the clustering algorithm will find as many clusters as we tell it to find!