The data set polls.txt contains voting data for 11 regions from the 2017 Finnish municipal elections. The variables are the percentages of votes each of Finland's eight largest political parties received in each region. We will use clustering methods to see which regions resemble each other most closely.
**Read the data:**
Later, we will use the functions melt and hist3D, which come from the reshape2 and plot3D packages, so install (if needed) and load them first.
# install.packages("reshape2")
# install.packages("plot3D")
library(reshape2)
library(plot3D)
D <- read.csv('polls.txt', header = T, sep = '\t', row.names = 1)
D
## KOK SDP KESK VIHR PS VAS RKP KD
## Uusimaa 26.3 17.0 5.4 19.7 8.3 8.0 8.4 3.1
## Varsinais-Suomi 23.7 19.3 14.0 13.2 7.8 11.1 4.9 3.0
## Kanta-Hame 23.2 25.8 16.4 9.4 8.9 10.1 0.0 5.3
## Pirkanmaa 22.5 22.9 11.5 14.8 8.3 9.8 0.2 4.8
## Paijat-Hame 23.6 25.7 11.8 9.3 10.2 5.0 0.0 5.9
## Kymenlaakso 22.1 26.4 14.3 10.0 10.5 6.4 0.7 5.1
## Pohjois-Savo 16.2 19.1 29.7 8.0 10.1 10.2 0.0 5.7
## Pohjois-Karjala 12.4 26.2 30.3 7.6 11.4 6.5 0.0 4.7
## Pohjanmaa 8.1 15.0 5.5 3.3 5.0 4.2 48.1 7.0
## Kainuu 11.9 9.8 39.1 5.9 10.7 18.8 0.0 3.2
## Lappi 13.8 15.0 36.0 4.3 6.8 17.9 0.2 1.6
(a) Scatter plot the variables. Can you spot the different clusters?
region.colors <- colorRampPalette(colors = c('red','blue'))(nrow(D))
pairs(D, upper.panel = NULL,
pch = 16, col = region.colors, cex = 1.25)
(b) Calculate the Euclidean distances between the regions.
D.dist <- dist(D, method = 'euclidean', diag = T, upper = F)
round(as.matrix(D.dist), 1)
## Uusimaa Varsinais-Suomi Kanta-Hame Pirkanmaa Paijat-Hame
## Uusimaa 0.0 12.3 19.9 13.6 18.0
## Varsinais-Suomi 12.3 0.0 9.7 7.1 11.7
## Kanta-Hame 19.9 9.7 0.0 7.9 7.0
## Pirkanmaa 13.6 7.1 7.9 0.0 8.2
## Paijat-Hame 18.0 11.7 7.0 8.2 0.0
## Kymenlaakso 18.7 10.7 4.8 7.7 3.6
## Pohjois-Savo 30.3 19.2 16.6 20.9 21.2
## Pohjois-Karjala 33.6 23.1 18.2 23.2 21.8
## Pohjanmaa 47.1 48.7 53.5 52.7 52.7
## Kainuu 42.2 31.7 31.5 34.8 36.7
## Lappi 38.8 27.4 26.3 30.4 31.9
## Kymenlaakso Pohjois-Savo Pohjois-Karjala Pohjanmaa Kainuu Lappi
## Uusimaa 18.7 30.3 33.6 47.1 42.2 38.8
## Varsinais-Suomi 10.7 19.2 23.1 48.7 31.7 27.4
## Kanta-Hame 4.8 16.6 18.2 53.5 31.5 26.3
## Pirkanmaa 7.7 20.9 23.2 52.7 34.8 30.4
## Paijat-Hame 3.6 21.2 21.8 52.7 36.7 31.9
## Kymenlaakso 0.0 18.6 18.9 52.3 34.2 29.3
## Pohjois-Savo 18.6 0.0 9.0 55.4 16.7 12.8
## Pohjois-Karjala 18.9 9.0 0.0 56.1 22.4 18.2
## Pohjanmaa 52.3 55.4 56.1 0.0 61.2 59.0
## Kainuu 34.2 16.7 22.4 61.2 0.0 7.8
## Lappi 29.3 12.8 18.2 59.0 7.8 0.0
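As a sanity check, the first off-diagonal entry can be reproduced by hand: the Euclidean distance between two regions is the square root of the summed squared differences of their vote percentages. A minimal sketch using the vote shares printed above:

```r
# Vote shares (KOK, SDP, KESK, VIHR, PS, VAS, RKP, KD) copied from the data above
uusimaa   <- c(26.3, 17.0,  5.4, 19.7, 8.3,  8.0, 8.4, 3.1)
varsinais <- c(23.7, 19.3, 14.0, 13.2, 7.8, 11.1, 4.9, 3.0)

# Euclidean distance: square root of the sum of squared componentwise differences
sqrt(sum((uusimaa - varsinais)^2))  # ~12.26, matching the dist() output above
```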
(c) Perform "bottom-up" (agglomerative) hierarchical clustering by hand. Merge clusters using the minimum distance (single linkage).
This is one way to obtain and rank the Euclidean distances (not necessarily the cleanest). First, convert the distance object to a matrix:
D.dist.matrix <- as.matrix(D.dist)
round(D.dist.matrix, 1)
## Uusimaa Varsinais-Suomi Kanta-Hame Pirkanmaa Paijat-Hame
## Uusimaa 0.0 12.3 19.9 13.6 18.0
## Varsinais-Suomi 12.3 0.0 9.7 7.1 11.7
## Kanta-Hame 19.9 9.7 0.0 7.9 7.0
## Pirkanmaa 13.6 7.1 7.9 0.0 8.2
## Paijat-Hame 18.0 11.7 7.0 8.2 0.0
## Kymenlaakso 18.7 10.7 4.8 7.7 3.6
## Pohjois-Savo 30.3 19.2 16.6 20.9 21.2
## Pohjois-Karjala 33.6 23.1 18.2 23.2 21.8
## Pohjanmaa 47.1 48.7 53.5 52.7 52.7
## Kainuu 42.2 31.7 31.5 34.8 36.7
## Lappi 38.8 27.4 26.3 30.4 31.9
## Kymenlaakso Pohjois-Savo Pohjois-Karjala Pohjanmaa Kainuu Lappi
## Uusimaa 18.7 30.3 33.6 47.1 42.2 38.8
## Varsinais-Suomi 10.7 19.2 23.1 48.7 31.7 27.4
## Kanta-Hame 4.8 16.6 18.2 53.5 31.5 26.3
## Pirkanmaa 7.7 20.9 23.2 52.7 34.8 30.4
## Paijat-Hame 3.6 21.2 21.8 52.7 36.7 31.9
## Kymenlaakso 0.0 18.6 18.9 52.3 34.2 29.3
## Pohjois-Savo 18.6 0.0 9.0 55.4 16.7 12.8
## Pohjois-Karjala 18.9 9.0 0.0 56.1 22.4 18.2
## Pohjanmaa 52.3 55.4 56.1 0.0 61.2 59.0
## Kainuu 34.2 16.7 22.4 61.2 0.0 7.8
## Lappi 29.3 12.8 18.2 59.0 7.8 0.0
As the \(N \times N\) distance matrix is symmetric, we can ignore the upper triangle (set the upper-triangle elements to NA):
D.dist.matrix[upper.tri(D.dist.matrix)] <- NA
round(D.dist.matrix, 1)
## Uusimaa Varsinais-Suomi Kanta-Hame Pirkanmaa Paijat-Hame
## Uusimaa 0.0 NA NA NA NA
## Varsinais-Suomi 12.3 0.0 NA NA NA
## Kanta-Hame 19.9 9.7 0.0 NA NA
## Pirkanmaa 13.6 7.1 7.9 0.0 NA
## Paijat-Hame 18.0 11.7 7.0 8.2 0.0
## Kymenlaakso 18.7 10.7 4.8 7.7 3.6
## Pohjois-Savo 30.3 19.2 16.6 20.9 21.2
## Pohjois-Karjala 33.6 23.1 18.2 23.2 21.8
## Pohjanmaa 47.1 48.7 53.5 52.7 52.7
## Kainuu 42.2 31.7 31.5 34.8 36.7
## Lappi 38.8 27.4 26.3 30.4 31.9
## Kymenlaakso Pohjois-Savo Pohjois-Karjala Pohjanmaa Kainuu Lappi
## Uusimaa NA NA NA NA NA NA
## Varsinais-Suomi NA NA NA NA NA NA
## Kanta-Hame NA NA NA NA NA NA
## Pirkanmaa NA NA NA NA NA NA
## Paijat-Hame NA NA NA NA NA NA
## Kymenlaakso 0.0 NA NA NA NA NA
## Pohjois-Savo 18.6 0.0 NA NA NA NA
## Pohjois-Karjala 18.9 9.0 0.0 NA NA NA
## Pohjanmaa 52.3 55.4 56.1 0.0 NA NA
## Kainuu 34.2 16.7 22.4 61.2 0.0 NA
## Lappi 29.3 12.8 18.2 59.0 7.8 0
We can reshape our \(N \times N\) matrix into a long ("stacked") format so it is easier to analyze:
D.dist.melt <- melt(D.dist.matrix)
head(D.dist.melt, 20) # first 20 rows
## Var1 Var2 value
## 1 Uusimaa Uusimaa 0.000000
## 2 Varsinais-Suomi Uusimaa 12.262952
## 3 Kanta-Hame Uusimaa 19.857240
## 4 Pirkanmaa Uusimaa 13.558761
## 5 Paijat-Hame Uusimaa 17.975261
## 6 Kymenlaakso Uusimaa 18.708020
## 7 Pohjois-Savo Uusimaa 30.318311
## 8 Pohjois-Karjala Uusimaa 33.604315
## 9 Pohjanmaa Uusimaa 47.125789
## 10 Kainuu Uusimaa 42.169894
## 11 Lappi Uusimaa 38.775250
## 12 Uusimaa Varsinais-Suomi NA
## 13 Varsinais-Suomi Varsinais-Suomi 0.000000
## 14 Kanta-Hame Varsinais-Suomi 9.706184
## 15 Pirkanmaa Varsinais-Suomi 7.104928
## 16 Paijat-Hame Varsinais-Suomi 11.679469
## 17 Kymenlaakso Varsinais-Suomi 10.711209
## 18 Pohjois-Savo Varsinais-Suomi 19.162985
## 19 Pohjois-Karjala Varsinais-Suomi 23.094805
## 20 Pohjanmaa Varsinais-Suomi 48.680592
Finally, sort the Euclidean distances from smallest to largest:
D.dist.melt <- D.dist.melt[order(D.dist.melt$value),]
head(D.dist.melt, 20)
## Var1 Var2 value
## 1 Uusimaa Uusimaa 0.000000
## 13 Varsinais-Suomi Varsinais-Suomi 0.000000
## 25 Kanta-Hame Kanta-Hame 0.000000
## 37 Pirkanmaa Pirkanmaa 0.000000
## 49 Paijat-Hame Paijat-Hame 0.000000
## 61 Kymenlaakso Kymenlaakso 0.000000
## 73 Pohjois-Savo Pohjois-Savo 0.000000
## 85 Pohjois-Karjala Pohjois-Karjala 0.000000
## 97 Pohjanmaa Pohjanmaa 0.000000
## 109 Kainuu Kainuu 0.000000
## 121 Lappi Lappi 0.000000
## 50 Kymenlaakso Paijat-Hame 3.558089
## 28 Kymenlaakso Kanta-Hame 4.808326
## 27 Paijat-Hame Kanta-Hame 7.028513
## 15 Pirkanmaa Varsinais-Suomi 7.104928
## 39 Kymenlaakso Pirkanmaa 7.747903
## 110 Lappi Kainuu 7.838367
## 26 Pirkanmaa Kanta-Hame 7.925276
## 38 Paijat-Hame Pirkanmaa 8.203048
## 74 Pohjois-Karjala Pohjois-Savo 9.041571
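For those who prefer it, an alternative to melting and sorting is to locate the smallest off-diagonal distance directly with which(..., arr.ind = TRUE); a small self-contained sketch on toy points (assumed values):

```r
# Three toy points: the closest pair is point 1 and point 2 (distance 1)
M <- as.matrix(dist(matrix(c(0, 0, 0, 1, 5, 5), ncol = 2, byrow = TRUE)))
M[upper.tri(M, diag = TRUE)] <- NA                # keep the lower triangle only
which(M == min(M, na.rm = TRUE), arr.ind = TRUE)  # row 2, column 1
```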
# Look at the sorted distances and apply the hierarchical clustering algorithm
# using the minimum distance (single linkage):
D.dist.melt
## Var1 Var2 value
## 1 Uusimaa Uusimaa 0.000000
## 13 Varsinais-Suomi Varsinais-Suomi 0.000000
## 25 Kanta-Hame Kanta-Hame 0.000000
## 37 Pirkanmaa Pirkanmaa 0.000000
## 49 Paijat-Hame Paijat-Hame 0.000000
## 61 Kymenlaakso Kymenlaakso 0.000000
## 73 Pohjois-Savo Pohjois-Savo 0.000000
## 85 Pohjois-Karjala Pohjois-Karjala 0.000000
## 97 Pohjanmaa Pohjanmaa 0.000000
## 109 Kainuu Kainuu 0.000000
## 121 Lappi Lappi 0.000000
## 50 Kymenlaakso Paijat-Hame 3.558089
## 28 Kymenlaakso Kanta-Hame 4.808326
## 27 Paijat-Hame Kanta-Hame 7.028513
## 15 Pirkanmaa Varsinais-Suomi 7.104928
## 39 Kymenlaakso Pirkanmaa 7.747903
## 110 Lappi Kainuu 7.838367
## 26 Pirkanmaa Kanta-Hame 7.925276
## 38 Paijat-Hame Pirkanmaa 8.203048
## 74 Pohjois-Karjala Pohjois-Savo 9.041571
## 14 Kanta-Hame Varsinais-Suomi 9.706184
## 17 Kymenlaakso Varsinais-Suomi 10.711209
## 16 Paijat-Hame Varsinais-Suomi 11.679469
## 2 Varsinais-Suomi Uusimaa 12.262952
## 77 Lappi Pohjois-Savo 12.766362
## 4 Pirkanmaa Uusimaa 13.558761
## 29 Pohjois-Savo Kanta-Hame 16.563514
## 76 Kainuu Pohjois-Savo 16.682925
## 5 Paijat-Hame Uusimaa 17.975261
## 88 Lappi Pohjois-Karjala 18.208514
## 30 Pohjois-Karjala Kanta-Hame 18.243355
## 62 Pohjois-Savo Kymenlaakso 18.566367
## 6 Kymenlaakso Uusimaa 18.708020
## 63 Pohjois-Karjala Kymenlaakso 18.903968
## 18 Pohjois-Savo Varsinais-Suomi 19.162985
## 3 Kanta-Hame Uusimaa 19.857240
## 40 Pohjois-Savo Pirkanmaa 20.877260
## 51 Pohjois-Savo Paijat-Hame 21.154432
## 52 Pohjois-Karjala Paijat-Hame 21.816508
## 87 Kainuu Pohjois-Karjala 22.440365
## 19 Pohjois-Karjala Varsinais-Suomi 23.094805
## 41 Pohjois-Karjala Pirkanmaa 23.210558
## 33 Lappi Kanta-Hame 26.346727
## 22 Lappi Varsinais-Suomi 27.404379
## 66 Lappi Kymenlaakso 29.337178
## 7 Pohjois-Savo Uusimaa 30.318311
## 44 Lappi Pirkanmaa 30.441748
## 32 Kainuu Kanta-Hame 31.536804
## 21 Kainuu Varsinais-Suomi 31.694479
## 55 Lappi Paijat-Hame 31.900940
## 8 Pohjois-Karjala Uusimaa 33.604315
## 65 Kainuu Kymenlaakso 34.195760
## 43 Kainuu Pirkanmaa 34.846808
## 54 Kainuu Paijat-Hame 36.667833
## 11 Lappi Uusimaa 38.775250
## 10 Kainuu Uusimaa 42.169894
## 9 Pohjanmaa Uusimaa 47.125789
## 20 Pohjanmaa Varsinais-Suomi 48.680592
## 64 Pohjanmaa Kymenlaakso 52.285275
## 53 Pohjanmaa Paijat-Hame 52.658618
## 42 Pohjanmaa Pirkanmaa 52.721153
## 31 Pohjanmaa Kanta-Hame 53.546148
## 75 Pohjanmaa Pohjois-Savo 55.384655
## 86 Pohjanmaa Pohjois-Karjala 56.058987
## 99 Lappi Pohjanmaa 58.976606
## 98 Kainuu Pohjanmaa 61.244592
## 12 Uusimaa Varsinais-Suomi NA
## 23 Uusimaa Kanta-Hame NA
## 24 Varsinais-Suomi Kanta-Hame NA
## 34 Uusimaa Pirkanmaa NA
## 35 Varsinais-Suomi Pirkanmaa NA
## 36 Kanta-Hame Pirkanmaa NA
## 45 Uusimaa Paijat-Hame NA
## 46 Varsinais-Suomi Paijat-Hame NA
## 47 Kanta-Hame Paijat-Hame NA
## 48 Pirkanmaa Paijat-Hame NA
## 56 Uusimaa Kymenlaakso NA
## 57 Varsinais-Suomi Kymenlaakso NA
## 58 Kanta-Hame Kymenlaakso NA
## 59 Pirkanmaa Kymenlaakso NA
## 60 Paijat-Hame Kymenlaakso NA
## 67 Uusimaa Pohjois-Savo NA
## 68 Varsinais-Suomi Pohjois-Savo NA
## 69 Kanta-Hame Pohjois-Savo NA
## 70 Pirkanmaa Pohjois-Savo NA
## 71 Paijat-Hame Pohjois-Savo NA
## 72 Kymenlaakso Pohjois-Savo NA
## 78 Uusimaa Pohjois-Karjala NA
## 79 Varsinais-Suomi Pohjois-Karjala NA
## 80 Kanta-Hame Pohjois-Karjala NA
## 81 Pirkanmaa Pohjois-Karjala NA
## 82 Paijat-Hame Pohjois-Karjala NA
## 83 Kymenlaakso Pohjois-Karjala NA
## 84 Pohjois-Savo Pohjois-Karjala NA
## 89 Uusimaa Pohjanmaa NA
## 90 Varsinais-Suomi Pohjanmaa NA
## 91 Kanta-Hame Pohjanmaa NA
## 92 Pirkanmaa Pohjanmaa NA
## 93 Paijat-Hame Pohjanmaa NA
## 94 Kymenlaakso Pohjanmaa NA
## 95 Pohjois-Savo Pohjanmaa NA
## 96 Pohjois-Karjala Pohjanmaa NA
## 100 Uusimaa Kainuu NA
## 101 Varsinais-Suomi Kainuu NA
## 102 Kanta-Hame Kainuu NA
## 103 Pirkanmaa Kainuu NA
## 104 Paijat-Hame Kainuu NA
## 105 Kymenlaakso Kainuu NA
## 106 Pohjois-Savo Kainuu NA
## 107 Pohjois-Karjala Kainuu NA
## 108 Pohjanmaa Kainuu NA
## 111 Uusimaa Lappi NA
## 112 Varsinais-Suomi Lappi NA
## 113 Kanta-Hame Lappi NA
## 114 Pirkanmaa Lappi NA
## 115 Paijat-Hame Lappi NA
## 116 Kymenlaakso Lappi NA
## 117 Pohjois-Savo Lappi NA
## 118 Pohjois-Karjala Lappi NA
## 119 Pohjanmaa Lappi NA
## 120 Kainuu Lappi NA
In hierarchical clustering each observation starts in its own cluster, and pairs of clusters are merged as one moves up the hierarchy. Below, each line lists the clusters formed so far (in parentheses) together with the Euclidean distance at which the latest merge happened.
Kymenlaakso and Paijat-Hame have the smallest distance and therefore form the first cluster. Looking at the distances, we see that the next smallest distance is between Kymenlaakso and Kanta-Hame, so Kanta-Hame is attached to the first cluster. The next pair, Paijat-Hame and Kanta-Hame, is already contained in the first cluster, so nothing changes. Moving down the list, the next pair is Pirkanmaa and Varsinais-Suomi, which form the second cluster. The third cluster is Lappi and Kainuu, and the fourth is Pohjois-Savo and Pohjois-Karjala.
# (Paijat-Hame, Kymenlaakso) 3.6
# (Kanta-Hame, Paijat-Hame, Kymenlaakso) 4.8
# (Kanta-Hame, Paijat-Hame, Kymenlaakso), (Varsinais-Suomi, Pirkanmaa) 7.1
# (Kanta-Hame, Paijat-Hame, Kymenlaakso, Varsinais-Suomi, Pirkanmaa) 7.7
# (Kanta-Hame, Paijat-Hame, Kymenlaakso, Varsinais-Suomi, Pirkanmaa), (Lappi, Kainuu) 7.8
# (Kanta-Hame, Paijat-Hame, Kymenlaakso, Varsinais-Suomi, Pirkanmaa), (Lappi, Kainuu), (Pohjois-Savo, Pohjois-Karjala) 9.0
# (Kanta-Hame, Paijat-Hame, Kymenlaakso, Varsinais-Suomi, Pirkanmaa, Uusimaa), (Lappi, Kainuu), (Pohjois-Savo, Pohjois-Karjala) 12.3
# (Kanta-Hame, Paijat-Hame, Kymenlaakso, Varsinais-Suomi, Pirkanmaa, Uusimaa), (Lappi, Kainuu, Pohjois-Savo, Pohjois-Karjala) 12.8
# (Kanta-Hame, Paijat-Hame, Kymenlaakso, Varsinais-Suomi, Pirkanmaa, Uusimaa, Lappi, Kainuu, Pohjois-Savo, Pohjois-Karjala) 16.6
# (Kanta-Hame, Paijat-Hame, Kymenlaakso, Varsinais-Suomi, Pirkanmaa, Uusimaa, Lappi, Kainuu, Pohjois-Savo, Pohjois-Karjala, Pohjanmaa) 47.1
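The merge sequence above can also be produced programmatically. Below is a naive, self-contained single-linkage sketch (on toy data, purely to demonstrate the algorithm): at every step the two clusters with the smallest minimum pairwise distance are merged, and the resulting merge heights agree with hclust(..., method = 'single').

```r
# Naive single-linkage agglomeration: returns the sequence of merge heights
single_linkage_heights <- function(d) {
  d <- as.matrix(d)
  clusters <- as.list(seq_len(nrow(d)))  # every observation starts alone
  heights <- numeric(0)
  while (length(clusters) > 1) {
    best <- c(Inf, 1, 2)
    for (i in 1:(length(clusters) - 1)) {
      for (j in (i + 1):length(clusters)) {
        # single linkage: minimum pairwise distance between the two clusters
        dij <- min(d[clusters[[i]], clusters[[j]]])
        if (dij < best[1]) best <- c(dij, i, j)
      }
    }
    heights <- c(heights, best[1])
    clusters[[best[2]]] <- c(clusters[[best[2]]], clusters[[best[3]]])
    clusters <- clusters[-best[3]]
  }
  heights
}

# Toy data: two tight pairs that merge first, then join at a larger height
X <- matrix(c(0, 0, 0, 1, 5, 5, 5, 6.5), ncol = 2, byrow = TRUE)
single_linkage_heights(dist(X))            # 1.0, 1.5, then the cross distance
hclust(dist(X), method = "single")$height  # the same heights
```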
(d) Repeat (c) using the function hclust()
The first argument is a distance object, such as the Euclidean distances computed above. The argument 'method' selects the agglomerative method used in the fitting. Here, 'single' corresponds to the minimum-distance, single-linkage method we used manually above.
D.hierarchical.clustering <- hclust(D.dist, method = 'single')
D.hierarchical.clustering
##
## Call:
## hclust(d = D.dist, method = "single")
##
## Cluster method : single
## Distance : euclidean
## Number of objects: 11
Plot the dendrogram:
plot(D.hierarchical.clustering,
xlab = 'Region / Clusters', ylab = 'Tree height [Euclidean distance]')
Do the results match the ones we obtained manually?
plot(D.hierarchical.clustering,
xlab = 'Region / Clusters', ylab = 'Tree height [Euclidean distance]')
abline(h = c(3.55, 7.10, 7.83, 9.04, 4.80, 7.74, 12.26, 12.76, 16.56, 47.12),
lty = 3, col = 2)
(f) Repeat the steps by aggregating the clusters using the average distance (average linkage) and the maximum distance (complete linkage). Compare the results.
Here the idea is to compare different agglomerative algorithms, which you can select with the 'method' argument of the 'hclust' function.
Let's plot the different algorithms side by side:
par(mfrow = c(1,3))
plot(hclust(D.dist, method = 'single'))
plot(hclust(D.dist, method = 'average'))
plot(hclust(D.dist, method = 'complete'))
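The three linkage rules differ only in how they measure the distance between two clusters: the minimum, the mean, or the maximum pairwise distance. A small self-contained sketch on toy points (assumed values) makes this concrete:

```r
# Two toy clusters; the three linkages score their distance differently
A <- matrix(c(0, 0, 0, 1), ncol = 2, byrow = TRUE)
B <- matrix(c(5, 5, 5, 6), ncol = 2, byrow = TRUE)
cross <- as.matrix(dist(rbind(A, B)))[1:2, 3:4]  # pairwise A-to-B distances

min(cross)   # single linkage distance between A and B
mean(cross)  # average linkage distance
max(cross)   # complete linkage distance
```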
(g) Where would you cut the tree?
So, growing the tree is a fairly simple and quick task. The difficult part is choosing where to cut the tree, i.e., choosing the clusters. This is typically an expert decision, as it requires context knowledge.
Let's plot our minimum distance tree again.
plot(D.hierarchical.clustering,
xlab = 'Region / Clusters', ylab = 'Tree height [Euclidean distance]')
abline(h = 10,
lty = 1, col = 2)
If we use the cut-point 10, say (we cut the tree at 'height' 10), we obtain 5 clusters. The next step would be to annotate the clusters in detail, i.e., to investigate the underlying reasons why the regions within a cluster are similar with respect to the covariates used in the analysis. The function 'cutree' cuts the tree and returns the clusters accordingly.
cutree(D.hierarchical.clustering,
h = 10)
## Uusimaa Varsinais-Suomi Kanta-Hame Pirkanmaa Paijat-Hame
## 1 2 2 2 2
## Kymenlaakso Pohjois-Savo Pohjois-Karjala Pohjanmaa Kainuu
## 2 3 3 4 5
## Lappi
## 5
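A quick self-contained illustration (toy points, assumed values) of cutree's two ways of cutting: by height h, or by the requested number of clusters k:

```r
# Toy tree: two tight pairs plus one far-away point
X  <- matrix(c(0, 0, 0, 1, 5, 5, 5, 6, 10, 0), ncol = 2, byrow = TRUE)
hc <- hclust(dist(X), method = "single")

cutree(hc, h = 3)  # cut at height 3 -> three clusters: 1 1 2 2 3
cutree(hc, k = 2)  # ask for two clusters directly -> 1 1 1 1 2
```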
Use the data set BANK.txt. The first column contains the true class of each observation.
**Read the data:**
D2 <- read.csv('BANK.txt', sep = '\t', header = T)
Take a look at the data type:
str(D2)
## 'data.frame': 200 obs. of 3 variables:
## $ CODE : int 0 0 0 0 0 0 0 0 0 0 ...
## $ BOTTOM : num 9 8.1 8.7 7.5 10.4 9 7.9 7.2 8.2 9.2 ...
## $ DIAGONAL: num 141 142 142 142 142 ...
Take a look at the summary statistics:
summary(D2)
## CODE BOTTOM DIAGONAL
## Min. :0.0 Min. : 7.200 Min. :137.8
## 1st Qu.:0.0 1st Qu.: 8.200 1st Qu.:139.5
## Median :0.5 Median : 9.100 Median :140.4
## Mean :0.5 Mean : 9.418 Mean :140.5
## 3rd Qu.:1.0 3rd Qu.:10.600 3rd Qu.:141.5
## Max. :1.0 Max. :12.700 Max. :142.4
# For practical reasons, change the data type of the 'CODE' column
# to factor and re-label the classes as 'C0' and 'C1':
D2$CODE <- factor(ifelse(D2$CODE == 0, 'C0', 'C1'))
a) Apply the k-means algorithm to obtain 2 clusters. The k-means clustering algorithm has a random component, so to make your results reproducible, set the random number generator seed.
set.seed(19032020)
# To apply k-means clustering, we can use the 'kmeans' function.
# The first argument is the data; remember to drop the true classes, i.e., the first column.
# The argument 'centers' is the number of distinct clusters.
# In practice, the number of clusters is often unknown.
D2.clusters <- kmeans(D2[,-1], centers = 2)
D2.clusters
## K-means clustering with 2 clusters of sizes 95, 105
##
## Cluster means:
## BOTTOM DIAGONAL
## 1 10.660000 139.4663
## 2 8.293333 141.4038
##
## Clustering vector:
## [1] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## [38] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## [75] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 1 2 1 1 1 1 1 1 1 2
## [112] 1 1 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [149] 1 1 1 1 1 1 1 1 1 1 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 1 1
## [186] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
##
## Within cluster sum of squares by cluster:
## [1] 122.30021 90.62381
## (between_SS / total_SS = 68.7 %)
##
## Available components:
##
## [1] "cluster" "centers" "totss" "withinss" "tot.withinss"
## [6] "betweenss" "size" "iter" "ifault"
# Cluster means: the centers used by the clustering algorithm.
# Clustering vector: the cluster number the corresponding observation is assigned to.
# In the two-dimensional case, the results are easy to illustrate
# with a scatter plot. (When there are more than two dimensions, one can
# use PCA for dimension reduction and plot, for example, the first two
# principal components.)
plot(D2[,-1],
pch = 16, col = D2.clusters$cluster)
title(main = 'Scatter plot -- colors represent the clusters\nLarge points represent the centers used by the algorithm', adj = 0)
# The display looks promising. Let's add the centers to the plot:
# (The cluster centers)
points(x = D2.clusters$centers[,1],
y = D2.clusters$centers[,2],
pch = 21, bg = c(1,2), cex = 2, col = c(2,1))
# Since we know the true classes, we can add the true labels to
# the plot (this makes the image busier, but it works well
# for demonstration purposes):
plot(D2[,-1],
pch = 16, col = D2.clusters$cluster)
points(x = D2.clusters$centers[,1],
y = D2.clusters$centers[,2],
pch = 21, bg = c(1,2), cex = 2)
text(D2[,-1],
labels = D2$CODE, pos = 3)
# Looks like the clustering algorithm works nicely as it separates
# the two classes quite well.
# Next we will count how many observations are clustered correctly.
b) How many observations are classified into the wrong category?
# A simple way to investigate this is to compute the confusion matrix:
table(D2.clusters$cluster, D2[,1])
##
## C0 C1
## 1 0 95
## 2 100 5
# The rows are the clusters; the columns are the true classes.
# Interpretation: class C0 is assigned to the 1st cluster 0 times
# and to the 2nd cluster 100 times. Class C1 is assigned to the
# 1st cluster 95 times and to the 2nd cluster 5 times.
# Conclusion: 5 observations are associated with the wrong cluster
# by the k-Means clustering algorithm, i.e., 2.5% of the
# observations.
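The same conclusion can be reached programmatically. Since the k-means cluster numbers are arbitrary, a cluster may correspond to either class; the hypothetical helper below takes the better of the two possible matchings:

```r
# Misclassification rate from a 2x2 confusion table, allowing for label swap
confusion_error <- function(tab) {
  min(sum(diag(tab)), sum(tab) - sum(diag(tab))) / sum(tab)
}

# The table from above: rows = clusters, columns = true classes C0, C1
tab <- matrix(c(0, 100, 95, 5), nrow = 2)
confusion_error(tab)  # 0.025, i.e. 5 of the 200 observations
```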
c) Change the seed number and see if it affects the results.
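Only the random initial centers depend on the seed, so with two well-separated clusters the final partition typically stays the same up to the arbitrary cluster numbering; only the labels 1 and 2 may swap. A self-contained sketch on simulated data (assumed toy values; the behavior with BANK.txt is analogous):

```r
# Simulate two clearly separated clusters
set.seed(1)
X <- rbind(matrix(rnorm(100, mean = 0), ncol = 2),
           matrix(rnorm(100, mean = 5), ncol = 2))

set.seed(101); c1 <- kmeans(X, centers = 2, nstart = 5)$cluster
set.seed(202); c2 <- kmeans(X, centers = 2, nstart = 5)$cluster

# The labels 1/2 may be swapped between runs, but the partition is the same:
max(mean(c1 == c2), mean(c1 != c2))  # 1 means identical partitions
```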
d) Apply the k-means algorithm to obtain 3 clusters. Does the seed number affect the results here?
# Clustering with three clusters (i.e., centers)
D2.3.clusters <- kmeans(D2[,-1], centers = 3)
# Scatter plot with colors corresponding to the cluster number:
plot(D2[,-1],
pch = 16, col = D2.3.clusters$cluster)
# The cluster centers:
points(x = D2.3.clusters$centers[,1],
y = D2.3.clusters$centers[,2],
pch = 21, bg = c(1,2,3), col = c(2,1,1), cex = 2)
# Proportion of the clusters:
table(D2.3.clusters$cluster) / 200
##
## 1 2 3
## 0.295 0.250 0.455
# Note that in clustering problems the labels are typically
# unknown. Therefore, it is essential to use expert validation and
# to analyze the obtained clusters carefully. As we can see here,
# the clustering algorithm will find as many clusters as we tell it
# to find!
x) Some analysis and details on clustering (this is extra)
# Moreover, in this case the proportions of the clusters most likely
# won't be as stable as in the two-cluster case in part a).
# Let's repeat the clustering multiple times to demonstrate this.
# Again, this is not the cleanest way to do this, but it works fine
# for demonstration purposes. (Note that this is not the best way to analyze
# the consistency of the clusters -- there are better ways, one of which is
# introduced below in the part 'Extra self study'.)
# Repeat the clustering 100 times (replicate returns a matrix)
D2.3.clusters.100.times <- replicate(100, kmeans(D2[,-1], centers = 3)$cluster)
# Calculate the cluster proportions:
apply(D2.3.clusters.100.times, 2, function(x) table(x) / 200)
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12] [,13]
## 1 0.495 0.495 0.495 0.495 0.270 0.235 0.420 0.455 0.075 0.270 0.495 0.495 0.235
## 2 0.270 0.270 0.235 0.270 0.495 0.270 0.505 0.250 0.420 0.495 0.270 0.235 0.270
## 3 0.235 0.235 0.270 0.235 0.235 0.495 0.075 0.295 0.505 0.235 0.235 0.270 0.495
## [,14] [,15] [,16] [,17] [,18] [,19] [,20] [,21] [,22] [,23] [,24] [,25] [,26]
## 1 0.270 0.235 0.235 0.495 0.455 0.235 0.235 0.505 0.495 0.270 0.270 0.270 0.395
## 2 0.235 0.495 0.495 0.270 0.295 0.270 0.495 0.420 0.235 0.495 0.235 0.235 0.110
## 3 0.495 0.270 0.270 0.235 0.250 0.495 0.270 0.075 0.270 0.235 0.495 0.495 0.495
## [,27] [,28] [,29] [,30] [,31] [,32] [,33] [,34] [,35] [,36] [,37] [,38] [,39]
## 1 0.270 0.270 0.235 0.495 0.295 0.455 0.270 0.270 0.495 0.235 0.250 0.495 0.295
## 2 0.235 0.235 0.270 0.270 0.455 0.250 0.495 0.235 0.270 0.270 0.295 0.270 0.455
## 3 0.495 0.495 0.495 0.235 0.250 0.295 0.235 0.495 0.235 0.495 0.455 0.235 0.250
## [,40] [,41] [,42] [,43] [,44] [,45] [,46] [,47] [,48] [,49] [,50] [,51] [,52]
## 1 0.455 0.505 0.235 0.270 0.455 0.295 0.235 0.505 0.270 0.270 0.295 0.270 0.270
## 2 0.250 0.075 0.495 0.495 0.295 0.250 0.495 0.075 0.495 0.495 0.455 0.495 0.235
## 3 0.295 0.420 0.270 0.235 0.250 0.455 0.270 0.420 0.235 0.235 0.250 0.235 0.495
## [,53] [,54] [,55] [,56] [,57] [,58] [,59] [,60] [,61] [,62] [,63] [,64] [,65]
## 1 0.235 0.270 0.495 0.270 0.455 0.250 0.270 0.495 0.250 0.495 0.250 0.235 0.235
## 2 0.270 0.235 0.235 0.235 0.295 0.455 0.235 0.270 0.295 0.235 0.455 0.495 0.270
## 3 0.495 0.495 0.270 0.495 0.250 0.295 0.495 0.235 0.455 0.270 0.295 0.270 0.495
## [,66] [,67] [,68] [,69] [,70] [,71] [,72] [,73] [,74] [,75] [,76] [,77] [,78]
## 1 0.235 0.235 0.270 0.1 0.495 0.505 0.270 0.270 0.270 0.270 0.495 0.235 0.270
## 2 0.270 0.270 0.495 0.5 0.235 0.075 0.495 0.495 0.235 0.495 0.235 0.495 0.495
## 3 0.495 0.495 0.235 0.4 0.270 0.420 0.235 0.235 0.495 0.235 0.270 0.270 0.235
## [,79] [,80] [,81] [,82] [,83] [,84] [,85] [,86] [,87] [,88] [,89] [,90] [,91]
## 1 0.1 0.495 0.270 0.250 0.4 0.235 0.235 0.235 0.270 0.495 0.075 0.295 0.4
## 2 0.5 0.395 0.495 0.295 0.1 0.270 0.495 0.270 0.495 0.270 0.505 0.455 0.1
## 3 0.4 0.110 0.235 0.455 0.5 0.495 0.270 0.495 0.235 0.235 0.420 0.250 0.5
## [,92] [,93] [,94] [,95] [,96] [,97] [,98] [,99] [,100]
## 1 0.420 0.075 0.235 0.235 0.235 0.455 0.250 0.250 0.250
## 2 0.505 0.420 0.495 0.270 0.270 0.250 0.455 0.295 0.455
## 3 0.075 0.505 0.270 0.495 0.495 0.295 0.295 0.455 0.295
# Note two things:
# 1. The order of the clusters is random. (This is normal; remember that the initial centers are randomly selected.)
# 2. The proportions of the clusters change over the 100 repetitions. (The proportions should be as stable as possible.)
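One simple way to make the repetitions comparable despite the random label order is to sort each repetition's cluster sizes; label switching then disappears and only genuine instability remains. A self-contained sketch on simulated, clearly separated data (assumed values), where the sorted sizes stay stable:

```r
set.seed(2)
X <- rbind(matrix(rnorm(60, mean = 0),  ncol = 2),
           matrix(rnorm(40, mean = 6),  ncol = 2),
           matrix(rnorm(80, mean = -6), ncol = 2))

# 20 repetitions; sort each run's cluster sizes to remove label switching
runs <- replicate(20, sort(kmeans(X, centers = 3, nstart = 10)$size))
apply(runs, 1, unique)  # each row collapses to a single value: sizes are stable
```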
x) Extra self study
# For those interested in cluster analysis, a further way to analyze / validate
# the consistency of the clusters, without an expert opinion, is the so-called
# silhouette plot. For reference, see
# Rousseeuw, P.J. (1987). "Silhouettes: A graphical aid to the interpretation and
# validation of cluster analysis". J. Comput. Appl. Math., 20, 53–65. (Google Scholar finds this.)
# As a starting point, using the same data as above:
library(cluster)
# Two clusters:
D2.silhouette <- silhouette(D2.clusters$cluster, dist = dist(D2[,-1]))
summary(D2.silhouette)
## Silhouette of 200 units in 2 clusters from silhouette.default(x = D2.clusters$cluster, dist = dist(D2[, -1])) :
## Cluster sizes and average silhouette widths:
## 95 105
## 0.5682153 0.6451334
## Individual silhouette widths:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -0.01382 0.58640 0.66394 0.60860 0.71390 0.76918
plot(D2.silhouette)
# The average silhouette width:
abline(v = 0.61, lty = 3, col = 2)
# Three clusters:
D2.3.silhouette <- silhouette(D2.3.clusters$cluster, dist = dist(D2[,-1]))
summary(D2.3.silhouette)
## Silhouette of 200 units in 3 clusters from silhouette.default(x = D2.3.clusters$cluster, dist = dist(D2[, -1])) :
## Cluster sizes and average silhouette widths:
## 59 50 91
## 0.2984230 0.3765736 0.5386047
## Individual silhouette widths:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -0.09822 0.32967 0.45618 0.42724 0.57316 0.67932
plot(D2.3.silhouette)
# The average silhouette width (about 0.43 in the summary above):
abline(v = 0.43, lty = 3, col = 2)
# Interpretation in a nutshell:
# The silhouette width ranges from -1 to 1. The higher the number, the better the point
# matches its own cluster and the worse it matches the other clusters. Values below zero
# suggest that the point would be better placed in some other cluster. Note the few
# outliers in the silhouette plot above (a few observations have negative widths).
# Please see the reference above for a more formal definition.
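For intuition, the silhouette width of a single observation, s(i) = (b(i) - a(i)) / max(a(i), b(i)), is easy to compute by hand: a(i) is the mean distance to the other points of its own cluster, and b(i) is the mean distance to the nearest other cluster (with two clusters, simply the other one). A self-contained sketch on toy points (assumed values):

```r
# Two toy clusters and a hand computation of s(i) for the first point
X   <- matrix(c(0, 0, 0, 1, 5, 5, 5, 6), ncol = 2, byrow = TRUE)
lab <- c(1, 1, 2, 2)
d   <- as.matrix(dist(X))

i <- 1
a <- mean(d[i, lab == lab[i] & seq_along(lab) != i])  # own-cluster mean distance
b <- mean(d[i, lab != lab[i]])  # mean distance to the (only) other cluster
(b - a) / max(a, b)             # silhouette width, about 0.87
```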