car <- read.table("CAR.txt",header=T,sep="\t")
str(car)
## 'data.frame': 24 obs. of 10 variables:
## $ Type : Factor w/ 22 levels "Audi","BMW","Citroen",..: 1 2 3 4 5 6 7 8 9 10 ...
## $ Model : Factor w/ 20 levels "","1.3","100",..: 3 8 10 1 19 13 1 1 17 7 ...
## $ Economy: num 3.9 4.8 3 5.3 2.1 2.3 2.5 4.6 3.2 2.6 ...
## $ Service: num 2.8 1.6 3.8 2.9 3.9 3.1 3.4 2.4 3.9 3.3 ...
## $ Value : num 2.2 1.9 3.8 2.2 4 3.4 3.2 1.6 4.3 3.7 ...
## $ Price : num 4.2 5 2.7 5.9 2.6 2.6 2.2 5.5 2 2.8 ...
## $ Design : num 3 2 4 1.7 4.5 3.2 3.3 1.3 4.3 3.7 ...
## $ Sport : num 3.1 2.5 4.4 1.1 4.4 3.3 3.3 1.6 4.5 3 ...
## $ Safety : num 2.4 1.6 4 3.3 4.4 3.6 3.3 2.8 4.7 3.7 ...
## $ Easy.h.: num 2.8 2.8 2.6 4.3 2.2 2.8 2.4 3.6 2.9 3.1 ...
summary(car)
## Type Model Economy Service Value
## Opel : 2 : 5 Min. :2.100 Min. :1.600 Min. :1.600
## VW : 2 1.3 : 1 1st Qu.:2.575 1st Qu.:2.400 1st Qu.:2.175
## Audi : 1 100 : 1 Median :3.100 Median :2.900 Median :3.200
## BMW : 1 19 : 1 Mean :3.254 Mean :3.021 Mean :3.104
## Citroen: 1 200 : 1 3rd Qu.:3.825 3rd Qu.:3.425 3rd Qu.:3.725
## Ferrari: 1 306 : 1 Max. :5.300 Max. :4.700 Max. :5.500
## (Other):16 (Other):14
## Price Design Sport Safety
## Min. :1.500 Min. :1.30 Min. :1.100 Min. :1.400
## 1st Qu.:2.600 1st Qu.:2.95 1st Qu.:3.075 1st Qu.:2.800
## Median :2.900 Median :3.20 Median :3.300 Median :3.200
## Mean :3.246 Mean :3.20 Mean :3.450 Mean :3.279
## 3rd Qu.:4.050 3rd Qu.:3.55 3rd Qu.:3.925 3rd Qu.:3.725
## Max. :5.900 Max. :4.80 Max. :5.800 Max. :5.900
##
## Easy.h.
## Min. :1.600
## 1st Qu.:2.400
## Median :2.650
## Mean :2.737
## 3rd Qu.:2.925
## Max. :4.300
##
# Convert X and Y to matrices for future operations
# (note that e.g. matrix multiplication not possible for data.frame objects)
X <- as.matrix(car[,c(6,5)]) # X = (Price, Value)
Y <- as.matrix(car[,c(3,4,7:10)]) # Y = (Economy, Service, Design, Sport, Safety, Easy.h)
# Combine X and Y
XY <- cbind(X,Y)
# You can give rownames to your matrix, might help later in the interpretation part:
rownames(XY) <- paste(car$Type, car$Model)
Let \(x\) be a \(p\)-variate random vector and let \(y\) be a \(q\)-variate random vector. In canonical correlation analysis the aim is to find the canonical vectors \(\alpha\) and \(\beta\) such that the linear combinations \(\alpha^T x\) and \(\beta^T y\) have the largest possible correlation. In other words, we want to maximize \(\rho_k\):
\[\begin{equation} \rho_k = |corr(u_k, v_k)| \end{equation}\]such that \(var(u_k) = var(v_k) = 1\) where \(u_k = \alpha_k^{T}x\) and \(v_k = \beta_k^{T}x\). The canonical vectors vectors \(\alpha_k\) and \(\beta_k\) are found by computing the eigenvectors of matrices \(M_1\) and \(M_2\), where:
\[\begin{equation} \left\{\begin{matrix} M_1 = \Sigma_{11}^{-1}\Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21} \\ M_2 = \Sigma_{22}^{-1}\Sigma_{21}\Sigma_{11}^{-1}\Sigma_{12} \end{matrix}\right. \end{equation}\]R <- cov(XY)
R11 <- R[1:2,1:2] # cov(X)
R22 <- R[3:8,3:8] # cov(Y)
R21 <- R[3:8,1:2] # cov(Y,X)
R12 <- R[1:2,3:8] # cov(X,Y)
R11.inv <- solve(R11)
R22.inv <- solve(R22)
# Matrix M_1
M1 <- R11.inv %*% R12 %*% R22.inv %*% R21
# Matrix M_2
M2 <- R22.inv %*% R21 %*% R11.inv %*% R12
Non-zero eigenvalues of M1 and M2 are the same
eigen(M1)$values
## [1] 0.9592139 0.8202120
eigen(M2)$values[1:2]
## [1] 0.9592139 0.8202120
The eigenvectors of \(M_1\) and \(M_2\) are computed using eigen. However, if we use these as our canonical vectors the variances of the resulting linear combinations will not satisfy the requirement \(var(u_k) = var(v_k) = 1\).
alpha_1 <- eigen(M1)$vectors[,1]
alpha_2 <- eigen(M1)$vectors[,2]
beta_1 <- eigen(M2)$vectors[,1]
beta_2 <- eigen(M2)$vectors[,2]
# Compute scores u_k and v_k
u_1 <- X %*% alpha_1
u_2 <- X %*% alpha_2
v_1 <- Y %*% beta_1
v_2 <- Y %*% beta_2
c(var(u_1), var(u_2), var(v_1), var(v_2))
## [1] 2.0691881 0.2496118 1.5235109 0.5523109
In order to achieve the unit variance, we need to scale the canonical vectors with their corresponding standard deviations. It can be shown that the variances of the transformed variables may be expressed using the covariance matrices:
\[\begin{align}
\left\{\begin{matrix}
var(u_1) &= \alpha_1^T \Sigma_{11} \alpha_1\\
var(u_2) &= \alpha_2^T \Sigma_{11} \alpha_2\\
var(v_1) &= \beta_1^T \Sigma_{22} \beta_1\\
var(v_2) &= \beta_2^T \Sigma_{22} \beta_2
\end{matrix}\right.
\end{align}\]
Here we rescale the canonical vectors
alpha_1 <- alpha_1 / c(sqrt(alpha_1 %*% R11 %*% alpha_1))
alpha_2 <- alpha_2 / c(sqrt(alpha_2 %*% R11 %*% alpha_2))
beta_1 <- beta_1 / c(sqrt(beta_1 %*% R22 %*% beta_1))
beta_2 <- beta_2 / c(sqrt(beta_2 %*% R22 %*% beta_2))
# Re-compute scores u_k and v_k
u_1 <- X %*% alpha_1
u_2 <- X %*% alpha_2
v_1 <- Y %*% beta_1
v_2 <- Y %*% beta_2
c(var(u_1), var(u_2), var(v_1), var(v_2))
## [1] 1 1 1 1
# Score vectors:
# (Original values weighted by the scaled eigen vectors.)
eta1 <- X %*% alpha_1
eta2 <- X %*% alpha_2
fii1 <- Y %*% beta_1
fii2 <- Y %*% beta_2
# Sample canonical vectors scaled such that the score vectors have unit variance (var = 1); the scaling makes them unique up to sign.
score_matrix <- cbind(eta1, eta2, fii1, fii2)
colnames(score_matrix) <- c('eta1', 'eta2', 'fii1', 'fii2')
apply(score_matrix, 2, var)
## eta1 eta2 fii1 fii2
## 1 1 1 1
# Matrix of the canonical correlations
round(cor(score_matrix), 2)
## eta1 eta2 fii1 fii2
## eta1 1.00 0.00 0.98 0.00
## eta2 0.00 1.00 0.00 0.91
## fii1 0.98 0.00 1.00 0.00
## fii2 0.00 0.91 0.00 1.00
# Canonical correlations
# r1 = 0.98
# r2 = |-0.91| = 0.91
# The relationship between both pairs seems to be quite strong.
# Note that the canonical correlations are also the square roots of the non-zero eigenvalues of M1,M2
round(sqrt(eigen(M1)$values),2)
## [1] 0.98 0.91
# Context.
# u1 and v1:
# We call u1 the value index of the car
# and v1 the quality index of the car.
paste0(round(alpha_1,2), colnames(X), collapse = ' ')
## [1] "0.32Price -0.62Value"
paste0(round(beta_1,2), colnames(Y), collapse = ' ')
## [1] "0.43Economy -0.21Service 0Design -0.47Sport -0.22Safety -0.4Easy.h."
# Variables Price and Value have opposite signs.
# --> Variables Economy, Service, Design, Sport, Safety and Easy Handling have the opposite relation to them.
plot(eta1, fii1,
xlab="'Value index' of the car",
ylab="'Quality' of the car",
pch = 16, col = 4, cex = 1.25,
xlim = c(-4,1), ylim = c(-5,-1))
text(eta1,fii1,labels=paste(car$Type,car$Model), pos = 2)
# In the plot, x-axis:
# The most left point would be a car with Value=6 (very bad) and Price = 1 (very good)
# --> On the left we have cheap cars that lose value fast (e.g. Wartburg, Trabant)
# The most right point would be a car with Value=1 (very good) and Price = 6 (very bad)
# --> On the right we have cars that are expensive but lose value slowly (e.g. BMW, Jaguar, Ferrari)
# We interpet u1 as the value index of the car, such that high scores on the x-axis indicate that
# the car is a "worthy" investment. Note also, that Value has almost twice as much weight in the scores
# here (when compared to Price).
# 6 = very bad, 1 = very good
# v1 <- 0.43Economy - 0.21 Service +0Design - 0.47Sport - 0.22Safety - 0.4 Easy.h
#y-axis:
# Here, Economy, Sport and Easy handling have the largest weight in this component,
# desing has negligible effect on the scores.
# The uppermost points have very bad Economy and very good Service, Sport, Safety and Easy.h
# --> The uppermost cars use a lot of fuel but the other qualities are good
# (e.g. BMW, Ferrari)
# The bottommost points have very good economy but the other qualities are very bad
# --> The bottommost cars are economical but other qualities are bad.
# v1 can be interpreted as a quality index of the car, such that you get a sporty car that is easy
# to drive in the expense of economy.
# Interpret the second component similarly, might be more difficult (but not impossible). For example, Ferrari and Wartburg have similar profiles.
paste0(round(alpha_2,2), colnames(X), collapse = ' ')
## [1] "-1.41Price -1.42Value"
paste0(round(beta_2,2), colnames(Y), collapse = ' ')
## [1] "-0.46Economy -0.7Service 0.06Design 0Sport 0.3Safety -1.01Easy.h."
# Note the similar wights on Price and Value.
plot(eta2, fii2,
xlab="u2",
ylab="v2",
pch = 16, col = 4, cex = 1.25,
xlim = c(-12,-6))
text(eta2,fii2,labels = paste(car$Type,car$Model), pos = 2)
# The x-axis might reflect the average of Value and Price:
sort(rowSums(XY[,1:2]) / 2)
## VW Golf VW Passat Hyundai Opel Corsa
## 2.30 2.65 2.70 2.80
## Opel Vectra Ford Fiesta Nissan Sunny Volvo
## 2.95 3.00 3.00 3.05
## Lada Samara Mercedes 200 Audi 100 Peugeot 306
## 3.15 3.20 3.20 3.20
## Renault 19 Toyota Corolla Citroen AX Mazda 323
## 3.20 3.20 3.25 3.25
## Fiat Uno Rover Mitsubishi Galant BMW 5 series
## 3.30 3.30 3.35 3.45
## Trabant 601 Jaguar Wartburg 1.3 Ferrari
## 3.50 3.55 3.60 4.05
# Looking at the original data might be helpful:
# i.e., why Wartburg and 'Fefe' display similarities?
car[c(24,4),]
## Type Model Economy Service Value Price Design Sport Safety Easy.h.
## 24 Wartburg 1.3 3.7 4.7 5.5 1.7 4.8 5.2 5.5 4.0
## 4 Ferrari 5.3 2.9 2.2 5.9 1.7 1.1 3.3 4.3
# Or why VW's are similar
car[c(22,23),]
## Type Model Economy Service Value Price Design Sport Safety Easy.h.
## 22 VW Golf 2.4 2.1 2.0 2.6 3.2 3.1 3.1 1.6
## 23 VW Passat 3.1 2.2 2.1 3.2 3.5 3.5 2.8 1.8