Problem 1: Frequency Analysis

The data set SCIENCEDOCTORATES.txt contains the number of doctors grad- uated from different fields of science. The data is from USA between the years 1960-1975. Find the attraction repulsion matrix and interpret the results.

Solution:

Consider a sample of size \(n\) described by two variables, \(x\) with categories \(A_1,...,A_J\) and \(y\) with categories \(B_1,...,B_K\). Let \(n_{jk}\) denote the number of observations having categories \(A_j\) and \(B_k\). Let \(f_{jk} = \frac{n_{jk}}{n}\) denote the relative frequency corresponding to \(n_{jk}\). The elements of the attraction repulsion matrix D are then given by:

\[ \begin{aligned} d_{jk}=\frac{n_{jk}}{n_{jk}^{*}} = \frac{f_{jk}}{f_{jk}^{*}} \end{aligned} \] where \[ \begin{aligned} n_{jk}^{*}=\frac{n_{j.}n_{.k}}{n} \hspace{1cm} f_{jk}^{*}=f_{j.} \ f_{.k} \end{aligned} \] From the above we see that the elements of \(D\) may be computed using \((i)\) theoretical frequencies or \((ii)\) relative frequencies under the assumption of independence.


1. Contingency Table:

data <- read.table("SCIENCEDOCTORATES.txt", header=T, row.names=1)
s <- as.matrix(data[-13,-9])
s
##               Y1960 Y1965 Y1970 Y1971 Y1972 Y1973 Y1974 Y1975
## Engineering     794  2073  3432  3495  3475  3338  3144  2959
## Mathematics     291   685  1222  1236  1281  1222  1196  1149
## Physics         530  1046  1655  1740  1635  1590  1334  1293
## Chemistry      1078  1444  2234  2204  2011  1849  1792  1762
## EarthSciences   253   375   511   550   580   577   570   556
## Biology        1245  1963  3360  3633  3580  3636  3473  3498
## Agriculture     414   576   803   900   855   853   830   904
## Psychology      772   954  1888  2116  2262  2444  2587  2749
## Sociology       162   239   504   583   638   599   645   680
## Economics       341   538   826   791   863   907   833   867
## Anthropology     69    82   217   240   260   324   381   385
## Others          314   502  1079  1392  1500  1609  1531  1550

2. Observed relative frequencies:

F <- prop.table(s)
round(F, 2)
##               Y1960 Y1965 Y1970 Y1971 Y1972 Y1973 Y1974 Y1975
## Engineering    0.01  0.02  0.03  0.03  0.03  0.03  0.02  0.02
## Mathematics    0.00  0.01  0.01  0.01  0.01  0.01  0.01  0.01
## Physics        0.00  0.01  0.01  0.01  0.01  0.01  0.01  0.01
## Chemistry      0.01  0.01  0.02  0.02  0.02  0.01  0.01  0.01
## EarthSciences  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00
## Biology        0.01  0.02  0.03  0.03  0.03  0.03  0.03  0.03
## Agriculture    0.00  0.00  0.01  0.01  0.01  0.01  0.01  0.01
## Psychology     0.01  0.01  0.01  0.02  0.02  0.02  0.02  0.02
## Sociology      0.00  0.00  0.00  0.00  0.00  0.00  0.01  0.01
## Economics      0.00  0.00  0.01  0.01  0.01  0.01  0.01  0.01
## Anthropology   0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00
## Others         0.00  0.00  0.01  0.01  0.01  0.01  0.01  0.01

3. Row profiles:

rowp <- prop.table(s, 1)
round(rowp, 2)
##               Y1960 Y1965 Y1970 Y1971 Y1972 Y1973 Y1974 Y1975
## Engineering    0.03  0.09  0.15  0.15  0.15  0.15  0.14  0.13
## Mathematics    0.04  0.08  0.15  0.15  0.15  0.15  0.14  0.14
## Physics        0.05  0.10  0.15  0.16  0.15  0.15  0.12  0.12
## Chemistry      0.07  0.10  0.16  0.15  0.14  0.13  0.12  0.12
## EarthSciences  0.06  0.09  0.13  0.14  0.15  0.15  0.14  0.14
## Biology        0.05  0.08  0.14  0.15  0.15  0.15  0.14  0.14
## Agriculture    0.07  0.09  0.13  0.15  0.14  0.14  0.14  0.15
## Psychology     0.05  0.06  0.12  0.13  0.14  0.15  0.16  0.17
## Sociology      0.04  0.06  0.12  0.14  0.16  0.15  0.16  0.17
## Economics      0.06  0.09  0.14  0.13  0.14  0.15  0.14  0.15
## Anthropology   0.04  0.04  0.11  0.12  0.13  0.17  0.19  0.20
## Others         0.03  0.05  0.11  0.15  0.16  0.17  0.16  0.16

4. Column profiles:

colp <- prop.table(s, 2)
round(colp, 2)
##               Y1960 Y1965 Y1970 Y1971 Y1972 Y1973 Y1974 Y1975
## Engineering    0.13  0.20  0.19  0.19  0.18  0.18  0.17  0.16
## Mathematics    0.05  0.07  0.07  0.07  0.07  0.06  0.07  0.06
## Physics        0.08  0.10  0.09  0.09  0.09  0.08  0.07  0.07
## Chemistry      0.17  0.14  0.13  0.12  0.11  0.10  0.10  0.10
## EarthSciences  0.04  0.04  0.03  0.03  0.03  0.03  0.03  0.03
## Biology        0.20  0.19  0.19  0.19  0.19  0.19  0.19  0.19
## Agriculture    0.07  0.05  0.05  0.05  0.05  0.05  0.05  0.05
## Psychology     0.12  0.09  0.11  0.11  0.12  0.13  0.14  0.15
## Sociology      0.03  0.02  0.03  0.03  0.03  0.03  0.04  0.04
## Economics      0.05  0.05  0.05  0.04  0.05  0.05  0.05  0.05
## Anthropology   0.01  0.01  0.01  0.01  0.01  0.02  0.02  0.02
## Others         0.05  0.05  0.06  0.07  0.08  0.08  0.08  0.08

6. Attraction Repulsion Matrix

v1 <- margin.table(s,1)
v2 <- margin.table(s,2)

V1 <- matrix(v1,ncol=1) 
V2 <- matrix(v2,nrow=1) 

# Expected number of observations under independence
E <- V1 %*% V2 /sum(s) # n_{jk}^{*} from above!

# We obtain attraction repulsion matrix D
# simply dividing each n_{jk} by n_{jk}^{*}
D <- s/E

# Values near 1: The year and science are independent
# Values < 1: The science is less frequent in that specific year
# Values > 1: The science is more frequent in that specific year
round(D, 2)
##               Y1960 Y1965 Y1970 Y1971 Y1972 Y1973 Y1974 Y1975
## Engineering    0.71  1.11  1.09  1.04  1.03  0.99  0.97  0.91
## Mathematics    0.72  1.01  1.06  1.01  1.04  1.00  1.01  0.97
## Physics        1.00  1.18  1.10  1.09  1.02  0.99  0.86  0.83
## Chemistry      1.53  1.23  1.12  1.04  0.94  0.87  0.87  0.85
## EarthSciences  1.30  1.15  0.93  0.94  0.99  0.98  1.00  0.98
## Biology        1.04  0.98  0.99  1.01  0.99  1.01  0.99  1.00
## Agriculture    1.38  1.15  0.94  0.99  0.94  0.94  0.94  1.03
## Psychology     1.00  0.74  0.86  0.91  0.97  1.05  1.15  1.21
## Sociology      0.82  0.72  0.90  0.98  1.06  1.00  1.11  1.17
## Economics      1.17  1.10  1.00  0.90  0.98  1.03  0.98  1.01
## Anthropology   0.72  0.51  0.80  0.83  0.90  1.12  1.36  1.37
## Others         0.68  0.65  0.82  1.00  1.07  1.15  1.13  1.14

Lecture Salary Example

1. Contingency Table:

data <- read.table("SALARY.txt", header=T, row.names=1)
s <- as.matrix(data)
s
##     B1  B2 B3
## A1 150  40 10
## A2 190 350 60
## A3  10 110 80

2. Observed relative frequencies:

F <- prop.table(s)
round(F, 2)
##      B1   B2   B3
## A1 0.15 0.04 0.01
## A2 0.19 0.35 0.06
## A3 0.01 0.11 0.08

3. Row profiles:

rowp <- prop.table(s, 1)
round(rowp, 2)
##      B1   B2   B3
## A1 0.75 0.20 0.05
## A2 0.32 0.58 0.10
## A3 0.05 0.55 0.40

4. Column profiles:

colp <- prop.table(s, 2)
round(colp, 2)
##      B1   B2   B3
## A1 0.43 0.08 0.07
## A2 0.54 0.70 0.40
## A3 0.03 0.22 0.53

6. Attraction Repulsion Matrix

#    | B1    B2   B3 |  v1
# -------------------|-----
# A1 | 150   40   10 | 200
# A2 | 190  350   60 | 600  
# A3 |  10  110   80 | 200
# -------------------------
# v2 | 350  500  150 | 1000


v1 <- margin.table(s,1)
v2 <- margin.table(s,2)

V1 <- matrix(v1,ncol=1) 
V2 <- matrix(v2,nrow=1) 

E <- V1 %*% V2 /sum(s) # n_{jk}^{*} from above!

# We obtain D simply dividing each n_{jk} by n_{jk}^{*}
D <- s/E

round(D, 2)
##      B1   B2   B3
## A1 2.14 0.40 0.33
## A2 0.90 1.17 0.67
## A3 0.14 1.10 2.67

7. Attraction Repulsion Matrix – Visualization (PhD data)

library(ggplot2)
## Warning: package 'ggplot2' was built under R version 3.5.3
library(reshape)
# ggplot version:
melted <- melt(D_phd)
head(melted)
##              X1    X2     value
## 1   Engineering Y1960 0.7140280
## 2   Mathematics Y1960 0.7175789
## 3       Physics Y1960 1.0000924
## 4     Chemistry Y1960 1.5316270
## 5 EarthSciences Y1960 1.3008379
## 6       Biology Y1960 1.0425696
range(melted$value)
## [1] 0.5112797 1.5316270
ggplot(melted, aes(x=X1, y=X2, fill=value)) +
  geom_tile() +
  scale_fill_gradient2(low = "blue", high = "red", mid = "white",
                       midpoint = median(melted$value), limit = c(0.5, 1.55),
                       name="AR value") +
  theme(axis.text.x = element_text(angle = 45, vjust = 1, 
                                   size = 9, hjust = 1)) +
  coord_fixed() +
  ggtitle(label = c('Heatmap representation of the Attraction Repulsion Matrix')) +
  xlab('Science') +
  ylab('Year')