Problem 1: Linear Discriminant Analysis

Install the package MASS and use the function lda() to perform Fisher’s linear discriminant analysis to the Fisher’s iris data set. The data set can be accessed with the command: data(iris). It contains measurements of sepal width, sepal length, pedal width and pedal length. In your analysis, consider the two species that are the most difficult to separate: “versicolor” and “virginica”, i.e. leave the first 50 observations out.

Use lda() to find the vector a. Verify that a is equal to the eigenvector of \(W^{−1}B\) that corresponds to the largest eigenvalue.
Suppose we have a new flower with the following measurements: (sepal length, sepal width, petal length, petal width) = (6,3,4,1). In which group will this flower be classified?
Use the leave-one-out method to determine the missclassification rate.

Import MASS library and Iris data

library(MASS)

data(iris)

head(iris)

##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa

Drop first 50 observations

iris <- iris[51:150,]

Is it possible to linearly separate different species?

cols <- rep(NA,100)

cols[iris$Species == "versicolor"] <- "blue"
cols[iris$Species == "virginica"] <- "red"
pairs(iris[,1:4],col=cols,pch=16)

LDA

iris[,5] <- droplevels(iris[,5]) 
iris.lda <- lda(Species~. ,data = iris)

Verify that it is equal to the eigenvector of \(W^{−1}B\) The matrix \(W\) measures within group dispersions and the matrix \(B\) measures dispersion between groups. The general forms for these matrices are: \[\begin{equation} W = \sum_{i=1}^{g}(n_i-1)cov(X_i) \end{equation}\] \[\begin{equation} B = \sum_{i=1}^{g}n_i(\bar{x_i} - \bar{x})(\bar{x_i} - \bar{x})^T \end{equation}\] when \(g=2\) the matrix \(B\) may be written as: \[\begin{equation} B = \frac{n_1 n_2}{n}d d^T \end{equation}\] where \(d = \bar{x_1} - \bar{x_2}\). The matrix a is then obtained computing \[\begin{equation} a = W^{-1}d \end{equation}\]

vector a computed by R

a <- iris.lda$scaling
a

##                     LD1
## Sepal.Length -0.9431178
## Sepal.Width  -1.4794287
## Petal.Length  1.8484510
## Petal.Width   3.2847304

Manual computation

# cov(X_1)
S1 <- cov(iris[1:50,1:4])

# cov(X_2)
S2 <- cov(iris[51:100,1:4])


d1 <- colMeans(iris[1:50,1:4])
d2 <- colMeans(iris[51:100,1:4])

d <- d1 - d2
d = as.matrix(d,ncol=1)

B <- (50*50)/100 * d %*% t(d)
W <- 49*(S1 + S2)
L <- solve(W) %*% B

# Eigenvectors of L
a2 <- eigen(L)$vectors[,1]

e <- eigen(L)$values[1]

# Note that eigenvectors are not unique and
norm(a,type="2") # not 1

## [1] 4.157452

norm(a2,type="2") # equal to 1

## [1] 1

# Scale a to have length 1:
a.scaled <- a/norm(a,type="2")

a.scaled

##                     LD1
## Sepal.Length -0.2268500
## Sepal.Width  -0.3558499
## Petal.Length  0.4446115
## Petal.Width   0.7900826

a2

## [1] -0.2268500+0i -0.3558499+0i  0.4446115+0i  0.7900826+0i

Add a new data point and classify it using LDA

#new observation = (6, 3, 4, 1)

newdat <- data.frame(Sepal.Length=6,Sepal.Width=3,Petal.Length=4,
                     Petal.Width=1)

predict(iris.lda,newdata=newdat)$class # New flower classified as versicolor

## [1] versicolor
## Levels: versicolor virginica

Compute the misclassification rate

#(c)

D.cv <- lda(Species~. ,data=iris,CV=T) #Results for leave-one out cross-validation

result <- data.frame(Est=D.cv$class,Truth=iris[,5])

tab <- table(result) #truth table, here 2 versicolor are classified as virginica and
# 1 virginica is classified as versicolor
tab

##             Truth
## Est          versicolor virginica
##   versicolor         48         1
##   virginica           2        49

# Hereby, the missclassification rate is
3/100

## [1] 0.03

Juha Hellen

3/12/2020

Problem 1: Linear Discriminant Analysis