Problem 1: Robust PCA

The data set in this exercise is a modified version from Draper and Smith (1966) and it was used to determine the influence of anatomical factors on wood specific gravity, with five explanatory variables. The data is contaminated by replacing a few observations with outliers.


(a) Install the package rrcov. Upload the data “wood.txt” into your R-workspace.

#install.packages("rrcov")
library(rrcov)
D <- read.table("wood.txt", header=TRUE, sep="\t")



b) Plot the variables pairwise. Can you detect any outliers?

pairs(D, pch = 19, lower.panel = NULL, col = "blue")



c) Estimate the covariance matrix using the regular sample covariance matrix and using the Minimum Covariance Determinant (MCD) method. Are there differences between the estimates?

regular_covm <- cov(D)
regular_covm
##             X1          X2           X3           X4           X5
## X1  3.80227921 -0.90443869  0.078464066  0.060427001 -0.131939002
## X2 -0.90443869  9.06329355 -0.044136873 -0.112919979  0.180129369
## X3  0.07846407 -0.04413687  0.004243031  0.001595618 -0.003700091
## X4  0.06042700 -0.11291998  0.001595618  0.003984759 -0.002398752
## X5 -0.13193900  0.18012937 -0.003700091 -0.002398752  0.010638865
MCD_covm <- CovMcd(D,alpha=1/2) # see comments below
MCD_covm@cov
##             X1           X2           X3           X4           X5
## X1  6.49939325  8.256688895  0.097614189 -0.016889835 -0.075246561
## X2  8.25668889 14.441998803  0.270597883 -0.012891183  0.009801839
## X3  0.09761419  0.270597883  0.010140643 -0.001317422  0.000815357
## X4 -0.01688983 -0.012891183 -0.001317422  0.004319414  0.004491941
## X5 -0.07524656  0.009801839  0.000815357  0.004491941  0.010669458

In the code block above the “alpha” is a numeric parameter controlling the size of the subsets over which the determinant is minimized, i.e., alpha*n observations are used for computing the determinant. Allowed values are between 0.5 and 1 and the default is 0.5.


(d) Calculate the robust and regular Mahalanobis distances for the data set and try to identify the outliers.

Mahalanobis distance is defined as: \[ \begin{aligned} D_{M}(x)=\sqrt{(x-T(x))^{T}S^{-1}(x)(x-T(x))} \end{aligned} \] In the above \(T(x)\) is an estimator of location and \(S(·)\) is an estimator of scatter. By choosing \(T\) as the sample mean and \(S\) as the regular covariance matrix , we get the regular Mahalanobis distances. Likewise, by choosing the MCD as \(S\) and the corresponding estimator of location as \(T\), we get the robust version.

s.maha <- mahalanobis(D, center=colMeans(D), cov=regular_covm)
r.maha <- mahalanobis(D, center=MCD_covm@center, cov=MCD_covm@cov)

plot(c(1,nrow(D)),range(sqrt(c(s.maha,r.maha))),type="n",
     xlab="Observation",ylab="Mahal. distance")

points(1:nrow(D),sqrt(s.maha),col="green",pch=16)

points(1:nrow(D),sqrt(r.maha),col="blue",pch=16)


legend("topleft",col=c("green","blue"),cex=0.8,legend=c("Regular","MCD"),
       pch=c(16,16))

Clearly, when using MCD we see some potential outliers, however, using classical covariance the outliers are not visible.



(e) Assume that the original data is normally distributed. Perform the PCA transformation using the regular covariance matrix and MCD. Should we use covariance or correlation based PCA? Compare the loadings of the different approaches. Plot the components of the score matrix pairwise.

apply(D,2,range)
##            X1       X2        X3        X4       X5
## [1,] 20.32240 32.06244 0.6898171 0.6959214 1.122324
## [2,] 26.51415 41.07311 0.9179452 0.8983566 1.481234

It seems that the scales of the variables are different. Also, we do not know if the variables are of the same units. Hereby, we use the correlation based PCA. The standard PCA is computed using princomp and the MCD PCA using PcaCov (scale=TRUE means that we perform the correlation based PCA).

PCA1 <- princomp(D, cor=T)
PCA2 <- PcaCov(D, scale=TRUE, cov.control = CovControlMcd(alpha=1/2)) 
PCA1$loadings
## 
## Loadings:
##    Comp.1 Comp.2 Comp.3 Comp.4 Comp.5
## X1  0.468  0.455  0.168  0.540  0.504
## X2 -0.386  0.696  0.264  0.153 -0.523
## X3  0.439  0.428        -0.788       
## X4  0.436 -0.351  0.720        -0.404
## X5 -0.500         0.618 -0.243  0.555
## 
##                Comp.1 Comp.2 Comp.3 Comp.4 Comp.5
## SS loadings       1.0    1.0    1.0    1.0    1.0
## Proportion Var    0.2    0.2    0.2    0.2    0.2
## Cumulative Var    0.2    0.4    0.6    0.8    1.0
PCA2@loadings
##           PC1        PC2         PC3        PC4         PC5
## X1  0.5621432 0.04329589 -0.55516799  0.1494159 -0.59294506
## X2  0.6027304 0.25366576 -0.08777107  0.2034274  0.72338319
## X3  0.4864497 0.19502242  0.66077586 -0.4693703 -0.26153331
## X4 -0.2230827 0.64331581 -0.41850979 -0.5960536  0.07712665
## X5 -0.1852343 0.69418310  0.26890437  0.6005862 -0.22535466

We notice that the loadings are different implying different interpretations for the principal components.

pairs(PCA1$scores, pch = 19, lower.panel = NULL, col = "blue")

pairs(PCA2@scores, pch = 19, lower.panel = NULL, col = "blue")





Problem 2: Estimators of scatter

Simulate 200 observations from the following bivariate distributions:

  1. Bivariate standard normal distribution.

  2. Bivariate t-distribution where the degree of freedom is 5 and the scale matrix is an identity matrix.

  3. Bivariate distribution where the first component follows the Weibull distribution with parameters \(a\) = 1 and \(b\) = 2 and the second component follows the Gamma distribution with the parameters \(\alpha\) = 2 and \(\beta\) = 1.

Visualize the data.

Estimate the MCD and the regular sample covariance for the simulated data in (a) - (c). Compare the estimates.

library(mvtnorm)
set.seed(123)
n <- 200



Bivariate standard normal distribution

D1 <- rmvnorm(n,mean=c(0,0),sigma=diag(2))
plot(D1, pch = 19, col="blue")

cov1  <- cov(D1)
rcov1 <- CovMcd(D1,alpha=1/2)@cov
cov1
##            [,1]       [,2]
## [1,]  0.8861715 -0.1130047
## [2,] -0.1130047  0.9915206
rcov1
##            [,1]       [,2]
## [1,]  0.9196207 -0.1551592
## [2,] -0.1551592  0.9580028



Bivariate t-distribution

D2 <- rmvt(n,df=5,sigma=diag(2))
plot(D2, pch = 19, col="blue")

cov2 <- cov(D2)
rcov2 <- CovMcd(D2,alpha=1/2)@cov
cov2
##            [,1]       [,2]
## [1,]  1.7892456 -0.1774634
## [2,] -0.1774634  1.2873681
rcov2
##            [,1]       [,2]
## [1,]  1.4039926 -0.2008512
## [2,] -0.2008512  1.2986795



Bivariate Weibull-Gamma distribution

x1 <- rweibull(n,shape=1,scale=2)
x2 <- rgamma(n,shape=2,scale=1)
D3 <- cbind(x1,x2)
plot(D3, pch = 19, col="blue")

cov3 <- cov(D3)
rcov3 <- CovMcd(D3,alpha=1/2)@cov
cov3
##           x1        x2
## x1 3.3691638 0.3967598
## x2 0.3967598 1.9398810
rcov3
##            x1         x2
## x1 2.26313959 0.01216148
## x2 0.01216148 1.07616204