Problem 1.

“Correspondence analysis is a method of data analysis for representing tabular data graphically. Correspondence analysis is a generalization of a simple graphical concept with which we are all familiar, namely the scatterplot.” - Michael Greenacre, Correspondence Analysis in Practice
The aim of MCA is to produce a graphical display in a lower-dimensional space which represents, without losing too much information, the associations between the modalities through the attraction-repulsion index.
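For a single pair of categories, the attraction-repulsion index is the observed count divided by the count expected under independence: values above 1 signal attraction, values below 1 repulsion, and 1 independence. A minimal sketch with made-up counts (illustrative numbers, not the tea data):

```r
# Hypothetical 2x2 table of counts (made-up for illustration)
tab <- matrix(c(30, 10,
                20, 40), nrow = 2, byrow = TRUE,
              dimnames = list(Tea = c("black", "green"),
                              Sugar = c("sugar", "No.sugar")))
n <- sum(tab)
# Expected counts under independence: (row total * column total) / n
expected <- outer(rowSums(tab), colSums(tab)) / n
# Attraction-repulsion index: > 1 attraction, < 1 repulsion, = 1 independence
round(tab / expected, 2)
```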



Install the package ca. The data set TEA.txt contains the answers to a questionnaire on tea consumption for 300 individuals. The following questions were asked:
  1. What kind of tea do you drink? (black, green, flavored)
  2. How do you drink it? (alone, w/milk, w/lemon, other)
  3. What kind of presentation do you buy? (tea bags, loose tea, both)
  4. Do you add sugar? (yes, no)
  5. Where do you buy it? (supermarket, tea shops, both)
  6. Do you always drink tea? (always, not always)


1. Perform correspondence analysis to the data set and interpret the results.

The function mjca computes a multiple or joint correspondence analysis based on the eigenvalue decomposition of the Burt matrix. The lambda option selects the scaling variant used for reporting inertias. Setting lambda = "indicator" gives multiple correspondence analysis based on the correspondence analysis of the indicator matrix, with the corresponding inertias (eigenvalues) as presented in the lecture slides.

# Load the package and read the data (the file is assumed to be in the working directory):
library(ca)
tea <- read.table("TEA.txt", header = TRUE)
# Argument 'lambda' controls the scaling method; "indicator" is as presented in the lecture slides.
# Argument 'reti' controls whether the indicator matrix should be returned; set it to TRUE
tea.mca <- mjca(tea, lambda = "indicator", reti = TRUE)

2. Overview of the mjca object, and some key details

You can get an overview of the values stored in the object by calling names(). Explanations and additional details on these values can be obtained by calling help(mjca).

# Items saved into the mjca object:
names(tea.mca)
##  [1] "sv"         "lambda"     "inertia.e"  "inertia.t"  "inertia.et"
##  [6] "levelnames" "factors"    "levels.n"   "nd"         "nd.max"    
## [11] "rownames"   "rowmass"    "rowdist"    "rowinertia" "rowcoord"  
## [16] "rowpcoord"  "rowctr"     "rowcor"     "colnames"   "colmass"   
## [21] "coldist"    "colinertia" "colcoord"   "colpcoord"  "colctr"    
## [26] "colcor"     "colsup"     "subsetcol"  "Burt"       "Burt.upd"  
## [31] "subinertia" "JCA.iter"   "indmat"     "call"
# Variables and variable categories (factors, levels):
tea.mca$factors
##         factor   level                 
## Tea1    "Tea"    "black"               
## Tea2    "Tea"    "Earl Grey"           
## Tea3    "Tea"    "green"               
## How1    "How"    "alone"               
## How2    "How"    "lemon"               
## How3    "How"    "milk"                
## How4    "How"    "other"               
## how1    "how"    "tea bag"             
## how2    "how"    "tea bag+unpackaged"  
## how3    "how"    "unpackaged"          
## sugar1  "sugar"  "No.sugar"            
## sugar2  "sugar"  "sugar"               
## where1  "where"  "chain store"         
## where2  "where"  "chain store+tea shop"
## where3  "where"  "tea shop"            
## always1 "always" "always"              
## always2 "always" "Not.always"
# Squared singular values (eigenvalues, since lambda = 'indicator'):
tea.mca$sv^2
##  [1] 0.27976178 0.25774772 0.22013794 0.18792961 0.16876495 0.16368666
##  [7] 0.15288834 0.13838682 0.11569167 0.08612637 0.06221147
# Cumulative proportion of variation explained by k components:
cumsum(tea.mca$sv^2 / sum(tea.mca$sv^2))
##  [1] 0.1525973 0.2931870 0.4132622 0.5157693 0.6078229 0.6971065 0.7805002
##  [8] 0.8559839 0.9190885 0.9660665 1.0000000
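A side note, not needed for this exercise: the percentages from the indicator approach are known to understate the variation explained, and mjca supports alternative scalings through the same lambda argument. For comparison, Greenacre's adjusted inertias (the package default) could be requested as:

```r
# Alternative scaling of the inertias (not the lecture-slide convention);
# 'tea' is the data frame loaded earlier
tea.mca.adj <- mjca(tea, lambda = "adjusted")
summary(tea.mca.adj)
```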

3. The summary and analysis of the MCA results

# Mass: mass (weight) of the point
# QLT: quality of representation of the variable:category
# INR: The inertia of the point
# k=1: Principal coordinates of the first dimension
# COR: The relative contribution of the principal axis to the point inertia
# CTR: The absolute contribution of the point to the inertia of the axis
S <- summary(tea.mca)
barplot(data.frame(S$scree)[, 3], ylim = c(0, 20), names.arg = paste('PC', 1:11),
        main = 'Screeplot', las = 2, xlab = 'Component',
        ylab = '% of variation explained')
box()

S$columns
##                          name mass  qlt  inr   k=1 cor ctr   k=2 cor ctr
## 1                   Tea:black   41   72   66  -446  65  29   143   7   3
## 2               Tea:Earl Grey  107  135   32   250 113  24   111  22   5
## 3                   Tea:green   18  144   74  -464  27  14  -974 117  67
## 4                   How:alone  108  118   30    22   1   0  -251 117  27
## 5                   How:lemon   18   84   75  -682  58  31   464  27  15
## 6                    How:milk   35   43   64   331  29  14   229  14   7
## 7                   How:other    5  144   82  -289   3   1  2141 142  89
## 8                 how:tea bag   94  639   45   616 497 128  -329 142  40
## 9      how:tea bag+unpackaged   52  519   66  -371  63  26  1001 457 203
## 10             how:unpackaged   20  667   92 -1943 515 270 -1057 152  87
## 11             sugar:No.sugar   86   62   41  -238  60  17    40   2   1
## 12                sugar:sugar   81   62   44   254  60  19   -42   2   1
## 13          where:chain store  107  715   38   533 506 108  -343 209  49
## 14 where:chain store+tea shop   43  705   75  -481  81  36  1333 624 299
## 15             where:tea shop   17  699   95 -2164 520 279 -1269 179 104
## 16              always:always   57   13   53  -109   6   2   118   7   3
## 17          always:Not.always  109   13   28    57   6   1   -62   7   2

4. Visualization and interpretation of the results

Between categories:
  • Less than 90 degrees = attraction
  • More than 90 degrees = repulsion
  • 90 degrees = independent
Between individuals:
  • Less than 90 degrees = similar profile
  • More than 90 degrees = profile differs
plot(tea.mca, arrows = c(TRUE, TRUE))
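As a numerical companion to the angle rule above, the angle between two category points can be computed from their principal coordinates; the helper function and the pair of categories below are my own choices for illustration (assumes tea.mca from above):

```r
# Angle (in degrees) between two vectors, from the cosine formula
angle <- function(u, v) acos(sum(u * v) / sqrt(sum(u^2) * sum(v^2))) * 180 / pi
# Principal coordinates on the first two dimensions
pc <- sweep(tea.mca$colcoord[, 1:2], 2, tea.mca$sv[1:2], "*")
i <- which(tea.mca$levelnames == "sugar:sugar")
j <- which(tea.mca$levelnames == "How:milk")
# Less than 90 degrees would indicate attraction between the two categories
angle(pc[i, ], pc[j, ])
```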

Note that the figure is messy; it is typically better to save a high-resolution image, for example with the pdf function: pdf('filename.pdf', width = width_in_inches, height = height_in_inches), followed by the plotting commands and dev.off().
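A sketch of that workflow (the file name and dimensions here are arbitrary):

```r
# Open a PDF device (dimensions in inches), redraw the plot, and close
# the device with dev.off() so the file is actually written
pdf("tea_mca.pdf", width = 10, height = 8)
plot(tea.mca, arrows = c(TRUE, TRUE))  # the MCA plot from above
dev.off()
```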

5. Note how the MCA is essentially CA conducted using the indicator matrix

plot(ca(tea.mca$indmat),
     arrows = c(FALSE, TRUE), map = 'symmetric', what = c('none', 'all'))
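That equivalence can also be verified numerically: with lambda = "indicator", the squared singular values from a plain CA of the indicator matrix should reproduce those stored in the mjca object (a sketch assuming tea.mca from above; only the K - P = 11 non-trivial dimensions are compared):

```r
# Plain CA on the indicator matrix returned by mjca(..., reti = TRUE)
ca.ind <- ca(tea.mca$indmat)
# Compare the squared singular values with those reported by mjca
all.equal(ca.ind$sv[seq_along(tea.mca$sv)]^2, tea.mca$sv^2)
```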

6. The plot can be done manually as well to increase tuning possibilities

# Obtain the principal coordinates by scaling the standard coordinates with the corresponding singular values:
tea.covariates <- sweep(tea.mca$colcoord[, 1:2], 2, tea.mca$sv[1:2], '*')
# Make a normalization function, i.e., to scale values from 0 to 1 (for visualization purposes):
normalize <- function(x) {
  (x - min(x)) / (max(x) - min(x))
}
# Save the mjca summary object:
a <- summary(tea.mca)
# Generate the scatter plot. Note how the point size is now scaled according to quality of representation:
plot(tea.covariates, xlim = c(-2.5,1), ylim = c(-1.5,2.5), pch = 16, col = 2,
     cex = normalize(a$columns$` qlt`) + 1)
# Add arrows. Slight transparency is added to increase visibility.
arrows(rep(0,17), rep(0,17), tea.covariates[,1], tea.covariates[,2],
       length = 0.05, col = rgb(1,0,0,0.25))
# "Cross-hair" is added, i.e., dotted lines crossing x and y axis at 0.
abline(h = 0, v = 0, lty = 3)
# Add variable:category names to the plot.
text(tea.covariates, tea.mca$levelnames, pos = 2, cex = 0.75)


Problem 2.

\[ V = T^{\top}T, \quad \text{where} \quad T_{i,pl} = \frac{x_{i,pl} - n_{pl}/n}{\sqrt{P\,n_{pl}}} \]

# T matrix:
n_pl <- unlist(apply(tea,2,table))
x_ipl <- tea.mca$indmat
n <- nrow(tea)
P <- ncol(tea)
K <- length(n_pl)

T_mat <- (x_ipl - matrix(rep(n_pl, n), ncol = K, byrow = TRUE) / n) /
  matrix(rep(sqrt(P * n_pl), n), ncol = K, byrow = TRUE)
V_mat <- t(T_mat) %*% T_mat

# Trace(V) = K/P - 1 = sum(T^2)
sum(diag(V_mat))
## [1] 1.833333
K/P - 1
## [1] 1.833333
sum(T_mat^2)
## [1] 1.833333
# Non-zero eigenvalues of V are the same as the eigenvalues returned by the 'mjca' function:
eigen(V_mat)$values[1:11]
##  [1] 0.27976178 0.25774772 0.22013794 0.18792961 0.16876495 0.16368666
##  [7] 0.15288834 0.13838682 0.11569167 0.08612637 0.06221147
tea.mca$sv^2
##  [1] 0.27976178 0.25774772 0.22013794 0.18792961 0.16876495 0.16368666
##  [7] 0.15288834 0.13838682 0.11569167 0.08612637 0.06221147
# And the trace of V equals the sum of the eigenvalues of V:
sum(eigen(V_mat)$values[1:11])
## [1] 1.833333
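The trace identity checked above can also be derived by hand. Since each entry of the indicator matrix is 0 or 1, and each column \(pl\) contains exactly \(n_{pl}\) ones,

\[ \sum_i \left(x_{i,pl} - \frac{n_{pl}}{n}\right)^2 = n_{pl}\left(1 - \frac{n_{pl}}{n}\right)^2 + (n - n_{pl})\left(\frac{n_{pl}}{n}\right)^2 = n_{pl} - \frac{n_{pl}^2}{n}, \]

so each column contributes \((1 - n_{pl}/n)/P\) to the trace. Summing over all \(K\) columns and using \(\sum_{p,l} n_{pl} = nP\) (each of the \(P\) variables has category counts summing to \(n\)) gives

\[ \operatorname{tr}(V) = \sum_{p,l} \frac{n_{pl} - n_{pl}^2/n}{P\,n_{pl}} = \frac{K}{P} - 1. \]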