Problem 1: Hierarchical Clustering

The data set polls.txt contains voting data of 11 different regions from the 2017 municipal elections. The variables are the percentages of votes each of Finland’s eight largest political parties gained in the elections. We will use clustering methods to see which regions resemble each other most closely.

**Read the data:**

Later, we will use the functions melt (from reshape2) and hist3D (from plot3D), which require the following packages:

# install.packages("reshape2")
# install.packages("plot3D")
library(reshape2)
library(plot3D) 

D <- read.csv('polls.txt', header = T, sep = '\t', row.names = 1)
D
##                  KOK  SDP KESK VIHR   PS  VAS  RKP  KD
## Uusimaa         26.3 17.0  5.4 19.7  8.3  8.0  8.4 3.1
## Varsinais-Suomi 23.7 19.3 14.0 13.2  7.8 11.1  4.9 3.0
## Kanta-Hame      23.2 25.8 16.4  9.4  8.9 10.1  0.0 5.3
## Pirkanmaa       22.5 22.9 11.5 14.8  8.3  9.8  0.2 4.8
## Paijat-Hame     23.6 25.7 11.8  9.3 10.2  5.0  0.0 5.9
## Kymenlaakso     22.1 26.4 14.3 10.0 10.5  6.4  0.7 5.1
## Pohjois-Savo    16.2 19.1 29.7  8.0 10.1 10.2  0.0 5.7
## Pohjois-Karjala 12.4 26.2 30.3  7.6 11.4  6.5  0.0 4.7
## Pohjanmaa        8.1 15.0  5.5  3.3  5.0  4.2 48.1 7.0
## Kainuu          11.9  9.8 39.1  5.9 10.7 18.8  0.0 3.2
## Lappi           13.8 15.0 36.0  4.3  6.8 17.9  0.2 1.6



(a) Scatter plot the variables. Can you spot the different clusters?

region.colors <- colorRampPalette(colors = c('red','blue'))(nrow(D))
pairs(D, upper.panel = NULL,
      pch = 16, col = region.colors, cex = 1.25)



(b) Calculate the Euclidean distances between the regions

D.dist <- dist(D, method = 'euclidean', diag = T, upper = F)
round(as.matrix(D.dist), 1)
##                 Uusimaa Varsinais-Suomi Kanta-Hame Pirkanmaa Paijat-Hame
## Uusimaa             0.0            12.3       19.9      13.6        18.0
## Varsinais-Suomi    12.3             0.0        9.7       7.1        11.7
## Kanta-Hame         19.9             9.7        0.0       7.9         7.0
## Pirkanmaa          13.6             7.1        7.9       0.0         8.2
## Paijat-Hame        18.0            11.7        7.0       8.2         0.0
## Kymenlaakso        18.7            10.7        4.8       7.7         3.6
## Pohjois-Savo       30.3            19.2       16.6      20.9        21.2
## Pohjois-Karjala    33.6            23.1       18.2      23.2        21.8
## Pohjanmaa          47.1            48.7       53.5      52.7        52.7
## Kainuu             42.2            31.7       31.5      34.8        36.7
## Lappi              38.8            27.4       26.3      30.4        31.9
##                 Kymenlaakso Pohjois-Savo Pohjois-Karjala Pohjanmaa Kainuu Lappi
## Uusimaa                18.7         30.3            33.6      47.1   42.2  38.8
## Varsinais-Suomi        10.7         19.2            23.1      48.7   31.7  27.4
## Kanta-Hame              4.8         16.6            18.2      53.5   31.5  26.3
## Pirkanmaa               7.7         20.9            23.2      52.7   34.8  30.4
## Paijat-Hame             3.6         21.2            21.8      52.7   36.7  31.9
## Kymenlaakso             0.0         18.6            18.9      52.3   34.2  29.3
## Pohjois-Savo           18.6          0.0             9.0      55.4   16.7  12.8
## Pohjois-Karjala        18.9          9.0             0.0      56.1   22.4  18.2
## Pohjanmaa              52.3         55.4            56.1       0.0   61.2  59.0
## Kainuu                 34.2         16.7            22.4      61.2    0.0   7.8
## Lappi                  29.3         12.8            18.2      59.0    7.8   0.0
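As a sanity check, the entry for the first two regions can be reproduced directly from the definition of the Euclidean distance (the percentages below are copied from the data set above):

```r
# Euclidean distance = square root of the sum of squared
# coordinate-wise differences; here for Uusimaa vs. Varsinais-Suomi:
uusimaa         <- c(26.3, 17.0,  5.4, 19.7, 8.3,  8.0, 8.4, 3.1)
varsinais.suomi <- c(23.7, 19.3, 14.0, 13.2, 7.8, 11.1, 4.9, 3.0)
sqrt(sum((uusimaa - varsinais.suomi)^2))  # 12.26295, matching the matrix above
```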



(c) Perform the “bottom up” hierarchical clustering by hand. Aggregate two clusters using the minimum distance (single linkage).

This is one way to obtain and rank the Euclidean distances (not necessarily the cleanest). First, convert the distance object to a matrix:

D.dist.matrix <- as.matrix(D.dist)
round(D.dist.matrix, 1)
##                 Uusimaa Varsinais-Suomi Kanta-Hame Pirkanmaa Paijat-Hame
## Uusimaa             0.0            12.3       19.9      13.6        18.0
## Varsinais-Suomi    12.3             0.0        9.7       7.1        11.7
## Kanta-Hame         19.9             9.7        0.0       7.9         7.0
## Pirkanmaa          13.6             7.1        7.9       0.0         8.2
## Paijat-Hame        18.0            11.7        7.0       8.2         0.0
## Kymenlaakso        18.7            10.7        4.8       7.7         3.6
## Pohjois-Savo       30.3            19.2       16.6      20.9        21.2
## Pohjois-Karjala    33.6            23.1       18.2      23.2        21.8
## Pohjanmaa          47.1            48.7       53.5      52.7        52.7
## Kainuu             42.2            31.7       31.5      34.8        36.7
## Lappi              38.8            27.4       26.3      30.4        31.9
##                 Kymenlaakso Pohjois-Savo Pohjois-Karjala Pohjanmaa Kainuu Lappi
## Uusimaa                18.7         30.3            33.6      47.1   42.2  38.8
## Varsinais-Suomi        10.7         19.2            23.1      48.7   31.7  27.4
## Kanta-Hame              4.8         16.6            18.2      53.5   31.5  26.3
## Pirkanmaa               7.7         20.9            23.2      52.7   34.8  30.4
## Paijat-Hame             3.6         21.2            21.8      52.7   36.7  31.9
## Kymenlaakso             0.0         18.6            18.9      52.3   34.2  29.3
## Pohjois-Savo           18.6          0.0             9.0      55.4   16.7  12.8
## Pohjois-Karjala        18.9          9.0             0.0      56.1   22.4  18.2
## Pohjanmaa              52.3         55.4            56.1       0.0   61.2  59.0
## Kainuu                 34.2         16.7            22.4      61.2    0.0   7.8
## Lappi                  29.3         12.8            18.2      59.0    7.8   0.0

As the \(N \times N\) distance matrix is symmetric, we can neglect the upper triangle (set the upper-triangle elements to NA):

D.dist.matrix[upper.tri(D.dist.matrix)] <- NA
round(D.dist.matrix, 1)
##                 Uusimaa Varsinais-Suomi Kanta-Hame Pirkanmaa Paijat-Hame
## Uusimaa             0.0              NA         NA        NA          NA
## Varsinais-Suomi    12.3             0.0         NA        NA          NA
## Kanta-Hame         19.9             9.7        0.0        NA          NA
## Pirkanmaa          13.6             7.1        7.9       0.0          NA
## Paijat-Hame        18.0            11.7        7.0       8.2         0.0
## Kymenlaakso        18.7            10.7        4.8       7.7         3.6
## Pohjois-Savo       30.3            19.2       16.6      20.9        21.2
## Pohjois-Karjala    33.6            23.1       18.2      23.2        21.8
## Pohjanmaa          47.1            48.7       53.5      52.7        52.7
## Kainuu             42.2            31.7       31.5      34.8        36.7
## Lappi              38.8            27.4       26.3      30.4        31.9
##                 Kymenlaakso Pohjois-Savo Pohjois-Karjala Pohjanmaa Kainuu Lappi
## Uusimaa                  NA           NA              NA        NA     NA    NA
## Varsinais-Suomi          NA           NA              NA        NA     NA    NA
## Kanta-Hame               NA           NA              NA        NA     NA    NA
## Pirkanmaa                NA           NA              NA        NA     NA    NA
## Paijat-Hame              NA           NA              NA        NA     NA    NA
## Kymenlaakso             0.0           NA              NA        NA     NA    NA
## Pohjois-Savo           18.6          0.0              NA        NA     NA    NA
## Pohjois-Karjala        18.9          9.0             0.0        NA     NA    NA
## Pohjanmaa              52.3         55.4            56.1       0.0     NA    NA
## Kainuu                 34.2         16.7            22.4      61.2    0.0    NA
## Lappi                  29.3         12.8            18.2      59.0    7.8     0
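As an aside, the smallest off-diagonal distance can also be located directly from such a masked matrix, without melting. A minimal sketch on a toy 3x3 distance matrix (the values are made up for illustration):

```r
# toy symmetric distance matrix (hypothetical values)
m <- matrix(c(0, 3, 5,
              3, 0, 2,
              5, 2, 0), nrow = 3)
diag(m) <- NA                                     # ignore self-distances
which(m == min(m, na.rm = TRUE), arr.ind = TRUE)  # points 2 and 3 are closest
```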

We can reshape our \(N \times N\) matrix into a "stacked" (long-format) matrix so it is easier to analyze:

D.dist.melt <- melt(D.dist.matrix)
head(D.dist.melt, 20) # first 20 rows
##               Var1            Var2     value
## 1          Uusimaa         Uusimaa  0.000000
## 2  Varsinais-Suomi         Uusimaa 12.262952
## 3       Kanta-Hame         Uusimaa 19.857240
## 4        Pirkanmaa         Uusimaa 13.558761
## 5      Paijat-Hame         Uusimaa 17.975261
## 6      Kymenlaakso         Uusimaa 18.708020
## 7     Pohjois-Savo         Uusimaa 30.318311
## 8  Pohjois-Karjala         Uusimaa 33.604315
## 9        Pohjanmaa         Uusimaa 47.125789
## 10          Kainuu         Uusimaa 42.169894
## 11           Lappi         Uusimaa 38.775250
## 12         Uusimaa Varsinais-Suomi        NA
## 13 Varsinais-Suomi Varsinais-Suomi  0.000000
## 14      Kanta-Hame Varsinais-Suomi  9.706184
## 15       Pirkanmaa Varsinais-Suomi  7.104928
## 16     Paijat-Hame Varsinais-Suomi 11.679469
## 17     Kymenlaakso Varsinais-Suomi 10.711209
## 18    Pohjois-Savo Varsinais-Suomi 19.162985
## 19 Pohjois-Karjala Varsinais-Suomi 23.094805
## 20       Pohjanmaa Varsinais-Suomi 48.680592

Finally, sort the Euclidean distances from smallest to largest:

D.dist.melt <- D.dist.melt[order(D.dist.melt$value),]
head(D.dist.melt, 20)
##                Var1            Var2    value
## 1           Uusimaa         Uusimaa 0.000000
## 13  Varsinais-Suomi Varsinais-Suomi 0.000000
## 25       Kanta-Hame      Kanta-Hame 0.000000
## 37        Pirkanmaa       Pirkanmaa 0.000000
## 49      Paijat-Hame     Paijat-Hame 0.000000
## 61      Kymenlaakso     Kymenlaakso 0.000000
## 73     Pohjois-Savo    Pohjois-Savo 0.000000
## 85  Pohjois-Karjala Pohjois-Karjala 0.000000
## 97        Pohjanmaa       Pohjanmaa 0.000000
## 109          Kainuu          Kainuu 0.000000
## 121           Lappi           Lappi 0.000000
## 50      Kymenlaakso     Paijat-Hame 3.558089
## 28      Kymenlaakso      Kanta-Hame 4.808326
## 27      Paijat-Hame      Kanta-Hame 7.028513
## 15        Pirkanmaa Varsinais-Suomi 7.104928
## 39      Kymenlaakso       Pirkanmaa 7.747903
## 110           Lappi          Kainuu 7.838367
## 26        Pirkanmaa      Kanta-Hame 7.925276
## 38      Paijat-Hame       Pirkanmaa 8.203048
## 74  Pohjois-Karjala    Pohjois-Savo 9.041571
# Look at the data and apply the hierarchical clustering algorithm using the
# minimum distance:
D.dist.melt
##                Var1            Var2     value
## 1           Uusimaa         Uusimaa  0.000000
## 13  Varsinais-Suomi Varsinais-Suomi  0.000000
## 25       Kanta-Hame      Kanta-Hame  0.000000
## 37        Pirkanmaa       Pirkanmaa  0.000000
## 49      Paijat-Hame     Paijat-Hame  0.000000
## 61      Kymenlaakso     Kymenlaakso  0.000000
## 73     Pohjois-Savo    Pohjois-Savo  0.000000
## 85  Pohjois-Karjala Pohjois-Karjala  0.000000
## 97        Pohjanmaa       Pohjanmaa  0.000000
## 109          Kainuu          Kainuu  0.000000
## 121           Lappi           Lappi  0.000000
## 50      Kymenlaakso     Paijat-Hame  3.558089
## 28      Kymenlaakso      Kanta-Hame  4.808326
## 27      Paijat-Hame      Kanta-Hame  7.028513
## 15        Pirkanmaa Varsinais-Suomi  7.104928
## 39      Kymenlaakso       Pirkanmaa  7.747903
## 110           Lappi          Kainuu  7.838367
## 26        Pirkanmaa      Kanta-Hame  7.925276
## 38      Paijat-Hame       Pirkanmaa  8.203048
## 74  Pohjois-Karjala    Pohjois-Savo  9.041571
## 14       Kanta-Hame Varsinais-Suomi  9.706184
## 17      Kymenlaakso Varsinais-Suomi 10.711209
## 16      Paijat-Hame Varsinais-Suomi 11.679469
## 2   Varsinais-Suomi         Uusimaa 12.262952
## 77            Lappi    Pohjois-Savo 12.766362
## 4         Pirkanmaa         Uusimaa 13.558761
## 29     Pohjois-Savo      Kanta-Hame 16.563514
## 76           Kainuu    Pohjois-Savo 16.682925
## 5       Paijat-Hame         Uusimaa 17.975261
## 88            Lappi Pohjois-Karjala 18.208514
## 30  Pohjois-Karjala      Kanta-Hame 18.243355
## 62     Pohjois-Savo     Kymenlaakso 18.566367
## 6       Kymenlaakso         Uusimaa 18.708020
## 63  Pohjois-Karjala     Kymenlaakso 18.903968
## 18     Pohjois-Savo Varsinais-Suomi 19.162985
## 3        Kanta-Hame         Uusimaa 19.857240
## 40     Pohjois-Savo       Pirkanmaa 20.877260
## 51     Pohjois-Savo     Paijat-Hame 21.154432
## 52  Pohjois-Karjala     Paijat-Hame 21.816508
## 87           Kainuu Pohjois-Karjala 22.440365
## 19  Pohjois-Karjala Varsinais-Suomi 23.094805
## 41  Pohjois-Karjala       Pirkanmaa 23.210558
## 33            Lappi      Kanta-Hame 26.346727
## 22            Lappi Varsinais-Suomi 27.404379
## 66            Lappi     Kymenlaakso 29.337178
## 7      Pohjois-Savo         Uusimaa 30.318311
## 44            Lappi       Pirkanmaa 30.441748
## 32           Kainuu      Kanta-Hame 31.536804
## 21           Kainuu Varsinais-Suomi 31.694479
## 55            Lappi     Paijat-Hame 31.900940
## 8   Pohjois-Karjala         Uusimaa 33.604315
## 65           Kainuu     Kymenlaakso 34.195760
## 43           Kainuu       Pirkanmaa 34.846808
## 54           Kainuu     Paijat-Hame 36.667833
## 11            Lappi         Uusimaa 38.775250
## 10           Kainuu         Uusimaa 42.169894
## 9         Pohjanmaa         Uusimaa 47.125789
## 20        Pohjanmaa Varsinais-Suomi 48.680592
## 64        Pohjanmaa     Kymenlaakso 52.285275
## 53        Pohjanmaa     Paijat-Hame 52.658618
## 42        Pohjanmaa       Pirkanmaa 52.721153
## 31        Pohjanmaa      Kanta-Hame 53.546148
## 75        Pohjanmaa    Pohjois-Savo 55.384655
## 86        Pohjanmaa Pohjois-Karjala 56.058987
## 99            Lappi       Pohjanmaa 58.976606
## 98           Kainuu       Pohjanmaa 61.244592
## 12          Uusimaa Varsinais-Suomi        NA
## 23          Uusimaa      Kanta-Hame        NA
## 24  Varsinais-Suomi      Kanta-Hame        NA
## 34          Uusimaa       Pirkanmaa        NA
## 35  Varsinais-Suomi       Pirkanmaa        NA
## 36       Kanta-Hame       Pirkanmaa        NA
## 45          Uusimaa     Paijat-Hame        NA
## 46  Varsinais-Suomi     Paijat-Hame        NA
## 47       Kanta-Hame     Paijat-Hame        NA
## 48        Pirkanmaa     Paijat-Hame        NA
## 56          Uusimaa     Kymenlaakso        NA
## 57  Varsinais-Suomi     Kymenlaakso        NA
## 58       Kanta-Hame     Kymenlaakso        NA
## 59        Pirkanmaa     Kymenlaakso        NA
## 60      Paijat-Hame     Kymenlaakso        NA
## 67          Uusimaa    Pohjois-Savo        NA
## 68  Varsinais-Suomi    Pohjois-Savo        NA
## 69       Kanta-Hame    Pohjois-Savo        NA
## 70        Pirkanmaa    Pohjois-Savo        NA
## 71      Paijat-Hame    Pohjois-Savo        NA
## 72      Kymenlaakso    Pohjois-Savo        NA
## 78          Uusimaa Pohjois-Karjala        NA
## 79  Varsinais-Suomi Pohjois-Karjala        NA
## 80       Kanta-Hame Pohjois-Karjala        NA
## 81        Pirkanmaa Pohjois-Karjala        NA
## 82      Paijat-Hame Pohjois-Karjala        NA
## 83      Kymenlaakso Pohjois-Karjala        NA
## 84     Pohjois-Savo Pohjois-Karjala        NA
## 89          Uusimaa       Pohjanmaa        NA
## 90  Varsinais-Suomi       Pohjanmaa        NA
## 91       Kanta-Hame       Pohjanmaa        NA
## 92        Pirkanmaa       Pohjanmaa        NA
## 93      Paijat-Hame       Pohjanmaa        NA
## 94      Kymenlaakso       Pohjanmaa        NA
## 95     Pohjois-Savo       Pohjanmaa        NA
## 96  Pohjois-Karjala       Pohjanmaa        NA
## 100         Uusimaa          Kainuu        NA
## 101 Varsinais-Suomi          Kainuu        NA
## 102      Kanta-Hame          Kainuu        NA
## 103       Pirkanmaa          Kainuu        NA
## 104     Paijat-Hame          Kainuu        NA
## 105     Kymenlaakso          Kainuu        NA
## 106    Pohjois-Savo          Kainuu        NA
## 107 Pohjois-Karjala          Kainuu        NA
## 108       Pohjanmaa          Kainuu        NA
## 111         Uusimaa           Lappi        NA
## 112 Varsinais-Suomi           Lappi        NA
## 113      Kanta-Hame           Lappi        NA
## 114       Pirkanmaa           Lappi        NA
## 115     Paijat-Hame           Lappi        NA
## 116     Kymenlaakso           Lappi        NA
## 117    Pohjois-Savo           Lappi        NA
## 118 Pohjois-Karjala           Lappi        NA
## 119       Pohjanmaa           Lappi        NA
## 120          Kainuu           Lappi        NA



In hierarchical clustering, each observation starts in its own cluster, and pairs of clusters are merged as one moves up the hierarchy. Below, each line lists the clusters formed so far (in parentheses), followed by the Euclidean distance at which the latest merge happened.

1. Start from the finest partition: n clusters, each containing one data point x_i
2. Calculate distances d_ij = d(x_i,x_j), where d is an appropriate distance between individuals. (Such as Euclidean distance; this depends on the context and the distance measure may have a significant effect!)
3. Find the minimal distance and group together the corresponding individuals.
4. Compute distances between the obtained groups.
5. Find the minimal distance and group together the corresponding closest groups.
6. Repeat steps 4 and 5 until only one single group remains.
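The steps above can be sketched directly in R. Here is a minimal, deliberately naive single-linkage implementation on a toy 1-D data set (four hypothetical points at 0, 1, 5 and 7), printing each merge and the distance at which it happens:

```r
x <- c(0, 1, 5, 7)                 # toy data
d <- as.matrix(dist(x))            # step 2: pairwise Euclidean distances
clusters <- as.list(seq_along(x))  # step 1: each point is its own cluster

while (length(clusters) > 1) {
  # steps 3-5: find the closest pair of groups under single linkage
  # (group-to-group distance = minimum pairwise distance)
  best <- c(Inf, NA, NA)
  n <- length(clusters)
  for (i in 1:(n - 1)) for (j in (i + 1):n) {
    dij <- min(d[clusters[[i]], clusters[[j]]])
    if (dij < best[1]) best <- c(dij, i, j)
  }
  cat('merge {', toString(clusters[[best[2]]]), '} and {',
      toString(clusters[[best[3]]]), '} at height', best[1], '\n')
  clusters[[best[2]]] <- c(clusters[[best[2]]], clusters[[best[3]]])
  clusters <- clusters[-best[3]]   # step 6: repeat until one group remains
}
```

On this toy data the merge heights are 1, 2 and 4, which agrees with hclust(dist(x), method = 'single')$height.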

Kymenlaakso and Paijat-Hame have the smallest distance and therefore form the first cluster. Looking at the distances, we see that the next smallest distance is between Kymenlaakso and Kanta-Hame; therefore, Kanta-Hame is attached to the first cluster. The next pair on the list, Paijat-Hame and Kanta-Hame, is already contained in the first cluster. Moving down the list, the next pair is Pirkanmaa and Varsinais-Suomi, which form the second cluster. The third cluster is Lappi and Kainuu, and the fourth is Pohjois-Savo and Pohjois-Karjala.
# (Paijat-Hame, Kymenlaakso) 3.6
# (Kanta-Hame, Paijat-Hame, Kymenlaakso) 4.8
# (Kanta-Hame, Paijat-Hame, Kymenlaakso), (Varsinais-Suomi, Pirkanmaa) 7.1
# (Kanta-Hame, Paijat-Hame, Kymenlaakso, Varsinais-Suomi, Pirkanmaa) 7.7
# (Kanta-Hame, Paijat-Hame, Kymenlaakso, Varsinais-Suomi, Pirkanmaa), (Lappi, Kainuu) 7.8
# (Kanta-Hame, Paijat-Hame, Kymenlaakso, Varsinais-Suomi, Pirkanmaa), (Lappi, Kainuu), (Pohjois-Savo, Pohjois-Karjala) 9.0
# (Kanta-Hame, Paijat-Hame, Kymenlaakso, Varsinais-Suomi, Pirkanmaa, Uusimaa), (Lappi, Kainuu), (Pohjois-Savo, Pohjois-Karjala) 12.3
# (Kanta-Hame, Paijat-Hame, Kymenlaakso, Varsinais-Suomi, Pirkanmaa, Uusimaa), (Lappi, Kainuu, Pohjois-Savo, Pohjois-Karjala) 12.8
# (Kanta-Hame, Paijat-Hame, Kymenlaakso, Varsinais-Suomi, Pirkanmaa, Uusimaa, Lappi, Kainuu, Pohjois-Savo, Pohjois-Karjala) 16.6
# (Kanta-Hame, Paijat-Hame, Kymenlaakso, Varsinais-Suomi, Pirkanmaa, Uusimaa, Lappi, Kainuu, Pohjois-Savo, Pohjois-Karjala, Pohjanmaa) 47.1

(d) Repeat (c) using the function hclust()
The first argument is a distance object, such as one produced by dist() with the Euclidean metric. The argument 'method' controls the agglomerative method used in the fitting. Here, 'single' corresponds to the minimum-distance, single-linkage method we used manually above.

D.hierarchical.clustering <- hclust(D.dist, method = 'single')
D.hierarchical.clustering
## 
## Call:
## hclust(d = D.dist, method = "single")
## 
## Cluster method   : single 
## Distance         : euclidean 
## Number of objects: 11

Plotting

plot(D.hierarchical.clustering,
     xlab = 'Region / Clusters', ylab = 'Tree height [Euclidean distance]')

Do the results match the ones we obtained manually? Let's overlay the manual merge heights as dotted lines:

plot(D.hierarchical.clustering,
     xlab = 'Region / Clusters', ylab = 'Tree height [Euclidean distance]')
abline(h = c(3.55, 7.10, 7.83, 9.04, 4.80, 7.74, 12.26, 12.76, 16.56, 47.12),
       lty = 3, col = 2)

(f) Repeat the steps by aggregating the clusters using the average distance (average linkage) and the maximum distance (complete linkage). Compare the results.
The idea here is to compare different agglomerative algorithms, which you can select with the 'method' argument of the 'hclust' function.
Let's plot the different algorithms side by side:

par(mfrow = c(1,3))
plot(hclust(D.dist, method = 'single'))
plot(hclust(D.dist, method = 'average'))
plot(hclust(D.dist, method = 'complete'))
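The three linkages differ only in how the distance between two groups is defined (minimum, mean, or maximum pairwise distance). On a toy data set of four hypothetical points this shows up directly in the final merge height:

```r
x <- c(0, 1, 5, 7)
hclust(dist(x), method = 'single')$height    # 1, 2, 4.0 -- min over pairs
hclust(dist(x), method = 'average')$height   # 1, 2, 5.5 -- mean over pairs
hclust(dist(x), method = 'complete')$height  # 1, 2, 7.0 -- max over pairs
```

The first two merges ({0, 1} and {5, 7}) are identical; only the final group-to-group merge distance differs.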

(g) Where would you cut the tree?
Growing the tree is a fairly simple and quick task. The difficult part is choosing where to cut it, i.e., choosing the clusters. This is typically an expert decision, as it requires context knowledge.

Let's plot our minimum distance tree again.

plot(D.hierarchical.clustering,
     xlab = 'Region / Clusters', ylab = 'Tree height [Euclidean distance]')
abline(h = 10,
       lty = 1, col = 2)

If we cut the tree at, say, height 10, we obtain 5 clusters. The next step would be to annotate the clusters in detail, i.e., to investigate the underlying reason why the regions within a cluster are similar with respect to the covariates used in the analysis. The function 'cutree' cuts the tree and returns the clusters accordingly.

cutree(D.hierarchical.clustering,
       h = 10)
##         Uusimaa Varsinais-Suomi      Kanta-Hame       Pirkanmaa     Paijat-Hame 
##               1               2               2               2               2 
##     Kymenlaakso    Pohjois-Savo Pohjois-Karjala       Pohjanmaa          Kainuu 
##               2               3               3               4               5 
##           Lappi 
##               5
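Besides a cut height h, cutree also accepts the desired number of clusters via the argument k. A small sketch on toy data (five hypothetical points, the last one far from the rest):

```r
toy <- hclust(dist(c(0, 1, 5, 7, 20)), method = 'single')
cutree(toy, k = 2)   # 1 1 1 1 2 -- the outlying point forms its own cluster
```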

Problem 2: k-Means clustering

Use the data BANK.txt. The first column contains the true class of the observations.

Read the data

D2 <- read.csv('BANK.txt', sep = '\t', header = T)

Take a look at the data type:

str(D2)
## 'data.frame':    200 obs. of  3 variables:
##  $ CODE    : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ BOTTOM  : num  9 8.1 8.7 7.5 10.4 9 7.9 7.2 8.2 9.2 ...
##  $ DIAGONAL: num  141 142 142 142 142 ...

Take a look at the summary statistics:

summary(D2)
##       CODE         BOTTOM          DIAGONAL    
##  Min.   :0.0   Min.   : 7.200   Min.   :137.8  
##  1st Qu.:0.0   1st Qu.: 8.200   1st Qu.:139.5  
##  Median :0.5   Median : 9.100   Median :140.4  
##  Mean   :0.5   Mean   : 9.418   Mean   :140.5  
##  3rd Qu.:1.0   3rd Qu.:10.600   3rd Qu.:141.5  
##  Max.   :1.0   Max.   :12.700   Max.   :142.4
# For practical reasons, change the data type of the 'CODE' column
# to factor and re-label the classes as 'C0' and 'C1':
D2$CODE <- factor(ifelse(D2$CODE == 0, 'C0', 'C1'))

a) Apply the k-means algorithm to obtain 2 clusters. The k-means clustering algorithm has a random component, so to be able to reproduce your results, set the random generator seed.

set.seed(19032020)
# To apply k-means clustering, we can use the 'kmeans' function.
# The first argument is the data; remember to drop the true classes, i.e., the first column.
# The argument 'centers' is the number of clusters.
# In practice, the number of clusters is often unknown.
D2.clusters <- kmeans(D2[,-1], centers = 2)
D2.clusters
## K-means clustering with 2 clusters of sizes 95, 105
## 
## Cluster means:
##      BOTTOM DIAGONAL
## 1 10.660000 139.4663
## 2  8.293333 141.4038
## 
## Clustering vector:
##   [1] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
##  [38] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
##  [75] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 1 2 1 1 1 1 1 1 1 2
## [112] 1 1 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [149] 1 1 1 1 1 1 1 1 1 1 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 1 1
## [186] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## 
## Within cluster sum of squares by cluster:
## [1] 122.30021  90.62381
##  (between_SS / total_SS =  68.7 %)
## 
## Available components:
## 
## [1] "cluster"      "centers"      "totss"        "withinss"     "tot.withinss"
## [6] "betweenss"    "size"         "iter"         "ifault"
# Cluster means: the cluster centers found by the algorithm.
# Clustering vector: the cluster number each observation is assigned to.
# In two-dimensional case, the results can be easily illustrated
# with a scatter plot. (When dimensions are more than two, one can use
# PCA for dimension reduction and plot the first two principal
# components, for example.)
plot(D2[,-1],
     pch = 16, col = D2.clusters$cluster)
title(main = 'Scatter plot -- colors represent the clusters\nLarge points represent the centers found by the algorithm', adj = 0)
# The display looks promising. Let's add the centers to the plot:
# (The cluster centers)
points(x = D2.clusters$centers[,1],
       y = D2.clusters$centers[,2],
       pch = 21, bg = c(1,2), cex = 2, col = c(2,1))

# Since we know the true classes, we can add the true labels to
# the plot (this makes the image busy, but it works well
# for demonstration purposes):
plot(D2[,-1],
     pch = 16, col = D2.clusters$cluster)

points(x = D2.clusters$centers[,1],
       y = D2.clusters$centers[,2],
       pch = 21, bg = c(1,2), cex = 2)
text(D2[,-1],
     labels = D2$CODE, pos = 3)

# Looks like the clustering algorithm works nicely as it separates
# the two classes quite well.
# Next we will count how many observations are clustered correctly.

b) How many observations are classified into the wrong category?

# Simple way to investigate this is to calculate the confusion matrix:
table(D2.clusters$cluster, D2[,1])
##    
##      C0  C1
##   1   0  95
##   2 100   5
# The rows are the clusters; the columns are the true classes.
# Interpretation: class C0 is assigned to the 1st cluster 0 times
# and to the 2nd cluster 100 times, whereas class C1 is assigned
# to the 1st cluster 95 times and to the 2nd cluster 5 times.
# Conclusion: 5 observations are associated with the wrong cluster
# by the k-Means clustering algorithm, i.e., 2.5% of the
# observations.
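Since the cluster numbering is arbitrary, the error count from such a confusion matrix should be taken over the best matching of clusters to classes. A minimal sketch on hypothetical labels (the vectors below are made up for illustration):

```r
truth   <- c('C0', 'C0', 'C0', 'C1', 'C1', 'C1')  # hypothetical true classes
cluster <- c(1, 1, 2, 2, 2, 2)                    # hypothetical cluster output
cm <- table(cluster, truth)
# 2x2 case: try both pairings of clusters to classes, keep the smaller error
errors <- min(cm[1, 2] + cm[2, 1],   # cluster 1 = C0, cluster 2 = C1
              cm[1, 1] + cm[2, 2])   # cluster 1 = C1, cluster 2 = C0
errors / length(truth)               # misclassification rate: 1/6
```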

c) Change the seed number and see if it affects the results.

You can run the k-means clustering algorithm multiple times and see how the results change. In this case, only the cluster numbers that Code 0 and Code 1 are associated with change; the proportions do not. Try it yourself! Overall, as the initial centroids of k-means are randomly selected, the random generator seed changes the results. However, it should not change the results dramatically if there truly is a structure in the data that the k-means algorithm can detect. A word of warning: a clustering algorithm will nearly always find as many clusters as you ask for, but those clusters are not guaranteed to be real clusters representing some underlying phenomenon.
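One standard remedy, rather than hand-picking a seed, is the nstart argument of kmeans, which runs several random initializations and keeps the fit with the smallest total within-cluster sum of squares. A sketch on simulated data (the data here are made up for illustration):

```r
set.seed(1)
# two well-separated hypothetical groups in 2-D
X <- rbind(matrix(rnorm(50, mean = 0), ncol = 2),
           matrix(rnorm(50, mean = 5), ncol = 2))
km <- kmeans(X, centers = 2, nstart = 25)  # keep the best of 25 random starts
km$tot.withinss                            # smallest total within-SS found
```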

d) Apply the k-means algorithm to obtain 3 clusters. Does the seed number affect the results here?

# Clustering with three clusters (i.e., centers)
D2.3.clusters <- kmeans(D2[,-1], centers = 3)
# Scatter plot with colors corresponding to the cluster number:
plot(D2[,-1],
     pch = 16, col = D2.3.clusters$cluster)
# The cluster centers:
points(x = D2.3.clusters$centers[,1],
       y = D2.3.clusters$centers[,2],
       pch = 21, bg = c(1,2,3), col = c(2,1,1), cex = 2)

# Proportion of the clusters:
table(D2.3.clusters$cluster) / 200
## 
##     1     2     3 
## 0.295 0.250 0.455
# Remark that, typically, in a clustering problem the labels are
# unknown. Therefore, it is essential to use expert validation
# and to analyze the obtained clusters carefully. As we can see here,
# the clustering algorithm will find as many clusters as we tell it
# to find!

x) Some analysis and details on clustering (this is extra)

# Moreover, in this case the proportions of the clusters most likely
# won't be as stable as in the two-cluster case in part a.
# Let's repeat the clustering multiple times to demonstrate this.
# Again, this is not the cleanest way to do this, but it works fine for demonstration purposes. (Note that this is not the best way to analyze the consistency of the clusters -- there are better ways, one of which is introduced in the 'Extra self study' part below.)
# Repeat the clustering 100 times (this returns a matrix of cluster vectors)
D2.3.clusters.100.times <- replicate(100, kmeans(D2[,-1], centers = 3)$cluster)
# Calculate the cluster proportions:
apply(D2.3.clusters.100.times, 2, function(x) table(x) / 200)
##    [,1]  [,2]  [,3]  [,4]  [,5]  [,6]  [,7]  [,8]  [,9] [,10] [,11] [,12] [,13]
## 1 0.495 0.495 0.495 0.495 0.270 0.235 0.420 0.455 0.075 0.270 0.495 0.495 0.235
## 2 0.270 0.270 0.235 0.270 0.495 0.270 0.505 0.250 0.420 0.495 0.270 0.235 0.270
## 3 0.235 0.235 0.270 0.235 0.235 0.495 0.075 0.295 0.505 0.235 0.235 0.270 0.495
##   [,14] [,15] [,16] [,17] [,18] [,19] [,20] [,21] [,22] [,23] [,24] [,25] [,26]
## 1 0.270 0.235 0.235 0.495 0.455 0.235 0.235 0.505 0.495 0.270 0.270 0.270 0.395
## 2 0.235 0.495 0.495 0.270 0.295 0.270 0.495 0.420 0.235 0.495 0.235 0.235 0.110
## 3 0.495 0.270 0.270 0.235 0.250 0.495 0.270 0.075 0.270 0.235 0.495 0.495 0.495
##   [,27] [,28] [,29] [,30] [,31] [,32] [,33] [,34] [,35] [,36] [,37] [,38] [,39]
## 1 0.270 0.270 0.235 0.495 0.295 0.455 0.270 0.270 0.495 0.235 0.250 0.495 0.295
## 2 0.235 0.235 0.270 0.270 0.455 0.250 0.495 0.235 0.270 0.270 0.295 0.270 0.455
## 3 0.495 0.495 0.495 0.235 0.250 0.295 0.235 0.495 0.235 0.495 0.455 0.235 0.250
##   [,40] [,41] [,42] [,43] [,44] [,45] [,46] [,47] [,48] [,49] [,50] [,51] [,52]
## 1 0.455 0.505 0.235 0.270 0.455 0.295 0.235 0.505 0.270 0.270 0.295 0.270 0.270
## 2 0.250 0.075 0.495 0.495 0.295 0.250 0.495 0.075 0.495 0.495 0.455 0.495 0.235
## 3 0.295 0.420 0.270 0.235 0.250 0.455 0.270 0.420 0.235 0.235 0.250 0.235 0.495
##   [,53] [,54] [,55] [,56] [,57] [,58] [,59] [,60] [,61] [,62] [,63] [,64] [,65]
## 1 0.235 0.270 0.495 0.270 0.455 0.250 0.270 0.495 0.250 0.495 0.250 0.235 0.235
## 2 0.270 0.235 0.235 0.235 0.295 0.455 0.235 0.270 0.295 0.235 0.455 0.495 0.270
## 3 0.495 0.495 0.270 0.495 0.250 0.295 0.495 0.235 0.455 0.270 0.295 0.270 0.495
##   [,66] [,67] [,68] [,69] [,70] [,71] [,72] [,73] [,74] [,75] [,76] [,77] [,78]
## 1 0.235 0.235 0.270   0.1 0.495 0.505 0.270 0.270 0.270 0.270 0.495 0.235 0.270
## 2 0.270 0.270 0.495   0.5 0.235 0.075 0.495 0.495 0.235 0.495 0.235 0.495 0.495
## 3 0.495 0.495 0.235   0.4 0.270 0.420 0.235 0.235 0.495 0.235 0.270 0.270 0.235
##   [,79] [,80] [,81] [,82] [,83] [,84] [,85] [,86] [,87] [,88] [,89] [,90] [,91]
## 1   0.1 0.495 0.270 0.250   0.4 0.235 0.235 0.235 0.270 0.495 0.075 0.295   0.4
## 2   0.5 0.395 0.495 0.295   0.1 0.270 0.495 0.270 0.495 0.270 0.505 0.455   0.1
## 3   0.4 0.110 0.235 0.455   0.5 0.495 0.270 0.495 0.235 0.235 0.420 0.250   0.5
##   [,92] [,93] [,94] [,95] [,96] [,97] [,98] [,99] [,100]
## 1 0.420 0.075 0.235 0.235 0.235 0.455 0.250 0.250  0.250
## 2 0.505 0.420 0.495 0.270 0.270 0.250 0.455 0.295  0.455
## 3 0.075 0.505 0.270 0.495 0.495 0.295 0.295 0.455  0.295
# Note two things:
# 1. The order of the clusters is arbitrary. (This is normal; remember that the initial centers are randomly selected.)
# 2. The proportions of the clusters change over the 100 repetitions. (Ideally, the proportions should be as stable as possible.)

x) Extra self study

# For those interested in cluster analysis, one further way to analyze / validate the consistency
# of the clusters, without an expert opinion, is the so-called silhouette plot. For reference, see
# Rousseeuw, P.J. (1987) "Silhouettes: A graphical aid to the interpretation and
# validation of cluster analysis". J. Comput. Appl. Math., 20, 53-65. (Google Scholar finds this)
# As a starting point, using the same data as above:
library(cluster)
# Two clusters:
D2.silhouette <- silhouette(D2.clusters$cluster, dist = dist(D2[,-1]))
summary(D2.silhouette)
## Silhouette of 200 units in 2 clusters from silhouette.default(x = D2.clusters$cluster, dist = dist(D2[, -1])) :
##  Cluster sizes and average silhouette widths:
##        95       105 
## 0.5682153 0.6451334 
## Individual silhouette widths:
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## -0.01382  0.58640  0.66394  0.60860  0.71390  0.76918
plot(D2.silhouette)
# The average silhouette width:
abline(v = 0.61, lty = 3, col = 2)

# Three clusters:
D2.3.silhouette <- silhouette(D2.3.clusters$cluster, dist = dist(D2[,-1]))
summary(D2.3.silhouette)
## Silhouette of 200 units in 3 clusters from silhouette.default(x = D2.3.clusters$cluster, dist = dist(D2[, -1])) :
##  Cluster sizes and average silhouette widths:
##        59        50        91 
## 0.2984230 0.3765736 0.5386047 
## Individual silhouette widths:
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## -0.09822  0.32967  0.45618  0.42724  0.57316  0.67932
plot(D2.3.silhouette)
# The average silhouette width:
abline(v = 0.54, lty = 3, col = 2)

# Interpretation in a nutshell:
# The silhouette width ranges from -1 to 1. The higher the value, the better the point
# fits its own cluster and the worse it fits the other clusters. Values below zero suggest
# that the point would be better associated with some other cluster. Note the few outliers
# in the silhouette plot above (a few observations have negative widths).
# Please see the reference above for a more formal definition.
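For concreteness, the silhouette width of a single point i is s(i) = (b(i) - a(i)) / max(a(i), b(i)), where a(i) is its mean distance to the other members of its own cluster and b(i) its mean distance to the members of the nearest other cluster. A tiny numerical sketch with made-up values:

```r
a <- 2                    # hypothetical mean within-cluster distance
b <- 5                    # hypothetical mean distance to nearest other cluster
s <- (b - a) / max(a, b)  # 0.6: the point sits comfortably in its cluster
s
```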