Install the package MASS and use the function lda() to perform Fisher’s linear discriminant analysis to the Fisher’s iris data set. The data set can be accessed with the command: data(iris). It contains measurements of sepal width, sepal length, pedal width and pedal length. In your analysis, consider the two species that are the most difficult to separate: “versicolor” and “virginica”, i.e. leave the first 50 observations out.
Use lda() to find the vector a. Verify that a is equal to the eigenvector of \(W^{−1}B\) that corresponds to the largest eigenvalue.
Suppose we have a new flower with the following measurements: (sepal length, sepal width, petal length, petal width) = (6,3,4,1). In which group will this flower be classified?
Use the leave-one-out method to determine the missclassification rate.
Import MASS library and Iris data
library(MASS)
data(iris)
head(iris)
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3.0 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
## 4 4.6 3.1 1.5 0.2 setosa
## 5 5.0 3.6 1.4 0.2 setosa
## 6 5.4 3.9 1.7 0.4 setosa
Drop first 50 observations
iris <- iris[51:150,]
Is it possible to linearly separate different species?
cols <- rep(NA,100)
cols[iris$Species == "versicolor"] <- "blue"
cols[iris$Species == "virginica"] <- "red"
pairs(iris[,1:4],col=cols,pch=16)
LDA
iris[,5] <- droplevels(iris[,5])
iris.lda <- lda(Species~. ,data = iris)
Verify that it is equal to the eigenvector of \(W^{−1}B\) The matrix \(W\) measures within group dispersions and the matrix \(B\) measures dispersion between groups. The general forms for these matrices are:
\[\begin{equation}
W = \sum_{i=1}^{g}(n_i-1)cov(X_i)
\end{equation}\]
\[\begin{equation}
B = \sum_{i=1}^{g}n_i(\bar{x_i} - \bar{x})(\bar{x_i} - \bar{x})^T
\end{equation}\]
when \(g=2\) the matrix \(B\) may be written as:
\[\begin{equation}
B = \frac{n_1 n_2}{n}d d^T
\end{equation}\]
where \(d = \bar{x_1} - \bar{x_2}\). The matrix a is then obtained computing
\[\begin{equation}
a = W^{-1}d
\end{equation}\]
vector a computed by R
a <- iris.lda$scaling
a
## LD1
## Sepal.Length -0.9431178
## Sepal.Width -1.4794287
## Petal.Length 1.8484510
## Petal.Width 3.2847304
Manual computation
# cov(X_1)
S1 <- cov(iris[1:50,1:4])
# cov(X_2)
S2 <- cov(iris[51:100,1:4])
d1 <- colMeans(iris[1:50,1:4])
d2 <- colMeans(iris[51:100,1:4])
d <- d1 - d2
d = as.matrix(d,ncol=1)
B <- (50*50)/100 * d %*% t(d)
W <- 49*(S1 + S2)
L <- solve(W) %*% B
# Eigenvectors of L
a2 <- eigen(L)$vectors[,1]
e <- eigen(L)$values[1]
# Note that eigenvectors are not unique and
norm(a,type="2") # not 1
## [1] 4.157452
norm(a2,type="2") # equal to 1
## [1] 1
# Scale a to have length 1:
a.scaled <- a/norm(a,type="2")
a.scaled
## LD1
## Sepal.Length -0.2268500
## Sepal.Width -0.3558499
## Petal.Length 0.4446115
## Petal.Width 0.7900826
a2
## [1] -0.2268500+0i -0.3558499+0i 0.4446115+0i 0.7900826+0i
Add a new data point and classify it using LDA
#new observation = (6, 3, 4, 1)
newdat <- data.frame(Sepal.Length=6,Sepal.Width=3,Petal.Length=4,
Petal.Width=1)
predict(iris.lda,newdata=newdat)$class # New flower classified as versicolor
## [1] versicolor
## Levels: versicolor virginica
Compute the misclassification rate
#(c)
D.cv <- lda(Species~. ,data=iris,CV=T) #Results for leave-one out cross-validation
result <- data.frame(Est=D.cv$class,Truth=iris[,5])
tab <- table(result) #truth table, here 2 versicolor are classified as virginica and
# 1 virginica is classified as versicolor
tab
## Truth
## Est versicolor virginica
## versicolor 48 1
## virginica 2 49
# Hereby, the missclassification rate is
3/100
## [1] 0.03