The data set SCIENCEDOCTORATES.txt contains the number of doctors grad- uated from different fields of science. The data is from USA between the years 1960-1975. Find the attraction repulsion matrix and interpret the results.
Solution:
Consider a sample of size \(n\) described by two variables, \(x\) with categories \(A_1,...,A_J\) and \(y\) with categories \(B_1,...,B_K\). Let \(n_{jk}\) denote the number of observations having categories \(A_j\) and \(B_k\). Let \(f_{jk} = \frac{n_{jk}}{n}\) denote the relative frequency corresponding to \(n_{jk}\). The elements of the attraction repulsion matrix D are then given by:
\[
\begin{aligned}
d_{jk}=\frac{n_{jk}}{n_{jk}^{*}} = \frac{f_{jk}}{f_{jk}^{*}}
\end{aligned}
\] where \[
\begin{aligned}
n_{jk}^{*}=\frac{n_{j.}n_{.k}}{n} \hspace{1cm} f_{jk}^{*}=f_{j.} \ f_{.k}
\end{aligned}
\] From the above we see that the elements of \(D\) may be computed using \((i)\) theoretical frequencies or \((ii)\) relative frequencies under the assumption of independence.
1. Contingency Table:
data <- read.table("SCIENCEDOCTORATES.txt", header=T, row.names=1)
s <- as.matrix(data[-13,-9])
s
## Y1960 Y1965 Y1970 Y1971 Y1972 Y1973 Y1974 Y1975
## Engineering 794 2073 3432 3495 3475 3338 3144 2959
## Mathematics 291 685 1222 1236 1281 1222 1196 1149
## Physics 530 1046 1655 1740 1635 1590 1334 1293
## Chemistry 1078 1444 2234 2204 2011 1849 1792 1762
## EarthSciences 253 375 511 550 580 577 570 556
## Biology 1245 1963 3360 3633 3580 3636 3473 3498
## Agriculture 414 576 803 900 855 853 830 904
## Psychology 772 954 1888 2116 2262 2444 2587 2749
## Sociology 162 239 504 583 638 599 645 680
## Economics 341 538 826 791 863 907 833 867
## Anthropology 69 82 217 240 260 324 381 385
## Others 314 502 1079 1392 1500 1609 1531 1550
2. Observed relative frequencies:
F <- prop.table(s)
round(F, 2)
## Y1960 Y1965 Y1970 Y1971 Y1972 Y1973 Y1974 Y1975
## Engineering 0.01 0.02 0.03 0.03 0.03 0.03 0.02 0.02
## Mathematics 0.00 0.01 0.01 0.01 0.01 0.01 0.01 0.01
## Physics 0.00 0.01 0.01 0.01 0.01 0.01 0.01 0.01
## Chemistry 0.01 0.01 0.02 0.02 0.02 0.01 0.01 0.01
## EarthSciences 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
## Biology 0.01 0.02 0.03 0.03 0.03 0.03 0.03 0.03
## Agriculture 0.00 0.00 0.01 0.01 0.01 0.01 0.01 0.01
## Psychology 0.01 0.01 0.01 0.02 0.02 0.02 0.02 0.02
## Sociology 0.00 0.00 0.00 0.00 0.00 0.00 0.01 0.01
## Economics 0.00 0.00 0.01 0.01 0.01 0.01 0.01 0.01
## Anthropology 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
## Others 0.00 0.00 0.01 0.01 0.01 0.01 0.01 0.01
3. Row profiles:
rowp <- prop.table(s, 1)
round(rowp, 2)
## Y1960 Y1965 Y1970 Y1971 Y1972 Y1973 Y1974 Y1975
## Engineering 0.03 0.09 0.15 0.15 0.15 0.15 0.14 0.13
## Mathematics 0.04 0.08 0.15 0.15 0.15 0.15 0.14 0.14
## Physics 0.05 0.10 0.15 0.16 0.15 0.15 0.12 0.12
## Chemistry 0.07 0.10 0.16 0.15 0.14 0.13 0.12 0.12
## EarthSciences 0.06 0.09 0.13 0.14 0.15 0.15 0.14 0.14
## Biology 0.05 0.08 0.14 0.15 0.15 0.15 0.14 0.14
## Agriculture 0.07 0.09 0.13 0.15 0.14 0.14 0.14 0.15
## Psychology 0.05 0.06 0.12 0.13 0.14 0.15 0.16 0.17
## Sociology 0.04 0.06 0.12 0.14 0.16 0.15 0.16 0.17
## Economics 0.06 0.09 0.14 0.13 0.14 0.15 0.14 0.15
## Anthropology 0.04 0.04 0.11 0.12 0.13 0.17 0.19 0.20
## Others 0.03 0.05 0.11 0.15 0.16 0.17 0.16 0.16
4. Column profiles:
colp <- prop.table(s, 2)
round(colp, 2)
## Y1960 Y1965 Y1970 Y1971 Y1972 Y1973 Y1974 Y1975
## Engineering 0.13 0.20 0.19 0.19 0.18 0.18 0.17 0.16
## Mathematics 0.05 0.07 0.07 0.07 0.07 0.06 0.07 0.06
## Physics 0.08 0.10 0.09 0.09 0.09 0.08 0.07 0.07
## Chemistry 0.17 0.14 0.13 0.12 0.11 0.10 0.10 0.10
## EarthSciences 0.04 0.04 0.03 0.03 0.03 0.03 0.03 0.03
## Biology 0.20 0.19 0.19 0.19 0.19 0.19 0.19 0.19
## Agriculture 0.07 0.05 0.05 0.05 0.05 0.05 0.05 0.05
## Psychology 0.12 0.09 0.11 0.11 0.12 0.13 0.14 0.15
## Sociology 0.03 0.02 0.03 0.03 0.03 0.03 0.04 0.04
## Economics 0.05 0.05 0.05 0.04 0.05 0.05 0.05 0.05
## Anthropology 0.01 0.01 0.01 0.01 0.01 0.02 0.02 0.02
## Others 0.05 0.05 0.06 0.07 0.08 0.08 0.08 0.08
6. Attraction Repulsion Matrix
v1 <- margin.table(s,1)
v2 <- margin.table(s,2)
V1 <- matrix(v1,ncol=1)
V2 <- matrix(v2,nrow=1)
# Expected number of observations under independence
E <- V1 %*% V2 /sum(s) # n_{jk}^{*} from above!
# We obtain attraction repulsion matrix D
# simply dividing each n_{jk} by n_{jk}^{*}
D <- s/E
# Values near 1: The year and science are independent
# Values < 1: The science is less frequent in that specific year
# Values > 1: The science is more frequent in that specific year
round(D, 2)
## Y1960 Y1965 Y1970 Y1971 Y1972 Y1973 Y1974 Y1975
## Engineering 0.71 1.11 1.09 1.04 1.03 0.99 0.97 0.91
## Mathematics 0.72 1.01 1.06 1.01 1.04 1.00 1.01 0.97
## Physics 1.00 1.18 1.10 1.09 1.02 0.99 0.86 0.83
## Chemistry 1.53 1.23 1.12 1.04 0.94 0.87 0.87 0.85
## EarthSciences 1.30 1.15 0.93 0.94 0.99 0.98 1.00 0.98
## Biology 1.04 0.98 0.99 1.01 0.99 1.01 0.99 1.00
## Agriculture 1.38 1.15 0.94 0.99 0.94 0.94 0.94 1.03
## Psychology 1.00 0.74 0.86 0.91 0.97 1.05 1.15 1.21
## Sociology 0.82 0.72 0.90 0.98 1.06 1.00 1.11 1.17
## Economics 1.17 1.10 1.00 0.90 0.98 1.03 0.98 1.01
## Anthropology 0.72 0.51 0.80 0.83 0.90 1.12 1.36 1.37
## Others 0.68 0.65 0.82 1.00 1.07 1.15 1.13 1.14
1. Contingency Table:
data <- read.table("SALARY.txt", header=T, row.names=1)
s <- as.matrix(data)
s
## B1 B2 B3
## A1 150 40 10
## A2 190 350 60
## A3 10 110 80
2. Observed relative frequencies:
F <- prop.table(s)
round(F, 2)
## B1 B2 B3
## A1 0.15 0.04 0.01
## A2 0.19 0.35 0.06
## A3 0.01 0.11 0.08
3. Row profiles:
rowp <- prop.table(s, 1)
round(rowp, 2)
## B1 B2 B3
## A1 0.75 0.20 0.05
## A2 0.32 0.58 0.10
## A3 0.05 0.55 0.40
4. Column profiles:
colp <- prop.table(s, 2)
round(colp, 2)
## B1 B2 B3
## A1 0.43 0.08 0.07
## A2 0.54 0.70 0.40
## A3 0.03 0.22 0.53
6. Attraction Repulsion Matrix
# | B1 B2 B3 | v1
# -------------------|-----
# A1 | 150 40 10 | 200
# A2 | 190 350 60 | 600
# A3 | 10 110 80 | 200
# -------------------------
# v2 | 350 500 150 | 1000
v1 <- margin.table(s,1)
v2 <- margin.table(s,2)
V1 <- matrix(v1,ncol=1)
V2 <- matrix(v2,nrow=1)
E <- V1 %*% V2 /sum(s) # n_{jk}^{*} from above!
# We obtain D simply dividing each n_{jk} by n_{jk}^{*}
D <- s/E
round(D, 2)
## B1 B2 B3
## A1 2.14 0.40 0.33
## A2 0.90 1.17 0.67
## A3 0.14 1.10 2.67
7. Attraction Repulsion Matrix – Visualization (PhD data)
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 3.5.3
library(reshape)
# ggplot version:
melted <- melt(D_phd)
head(melted)
## X1 X2 value
## 1 Engineering Y1960 0.7140280
## 2 Mathematics Y1960 0.7175789
## 3 Physics Y1960 1.0000924
## 4 Chemistry Y1960 1.5316270
## 5 EarthSciences Y1960 1.3008379
## 6 Biology Y1960 1.0425696
range(melted$value)
## [1] 0.5112797 1.5316270
ggplot(melted, aes(x=X1, y=X2, fill=value)) +
geom_tile() +
scale_fill_gradient2(low = "blue", high = "red", mid = "white",
midpoint = median(melted$value), limit = c(0.5, 1.55),
name="AR value") +
theme(axis.text.x = element_text(angle = 45, vjust = 1,
size = 9, hjust = 1)) +
coord_fixed() +
ggtitle(label = c('Heatmap representation of the Attraction Repulsion Matrix')) +
xlab('Science') +
ylab('Year')