---
title: "Tutorial_stats_test_RMEP2024"
output: html_document
date: "2024-02-05"
---

# Introduction

In this tutorial, we will compare the usability of two interfaces using simulated data. We'll conduct a t-test to compare means, run a correlation analysis to explore relationships, and discuss the implications of p-hacking, HARKing, statistical power, and the importance of effect sizes.

# Simulating Data

First, we simulate data for two interfaces, A and B, measuring usability scores on a scale of 1-100.

```{r}
set.seed(123) # Ensure reproducibility
interfaceA <- rnorm(100, mean = 70, sd = 10) # Simulate scores for interface A
interfaceB <- rnorm(100, mean = 75, sd = 10) # Simulate scores for interface B
```

# Visualizing the Data

Before conducting our analysis, let's visualize the data to see how the two groups compare. The plot shows each interface's mean score with its standard error.

```{r}
library(ggplot2)   # Used for plotting
library(tidyverse) # Provides the %>% pipe used later (also loads ggplot2)

data <- data.frame(Score = c(interfaceA, interfaceB),
                   Interface = factor(rep(c("A", "B"), each = 100)))

# stat_summary() with no arguments draws the group mean +/- 1 standard error
ggplot(data, aes(x = Interface, y = Score, colour = Interface)) +
  stat_summary() +
  labs(title = "Usability Scores for Interfaces A and B",
       y = "Usability Score", x = "")
```

# Conducting a T-test

We'll use a t-test to compare the average usability scores of the two interfaces.

```{r}
# Two-sample t-test assuming equal variances (Student's t-test)
t.test(interfaceA, interfaceB, var.equal = TRUE)

# Pooled standard deviation for equal sample sizes
s_pooled_equal_var <- sqrt((sd(interfaceA)^2 + sd(interfaceB)^2) / 2)

# Calculate Cohen's d for equal sample sizes and equal variances
cohen_d_equal_var <- (mean(interfaceA) - mean(interfaceB)) / s_pooled_equal_var
cat("d is", cohen_d_equal_var, "\n")
```

## Explanation

A two-sample t-test, as shown in the output above, compares the means of two groups to determine whether the difference between them is statistically significant. The test assumes that the observations in the two groups are independent of one another. With `var.equal = TRUE`, it also assumes that both groups have the same population variance.
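
To make the numbers behind `t.test()` less of a black box, the following is a minimal sketch that recomputes the pooled-variance t-statistic, the degrees of freedom, and the two-sided p-value by hand. The names `pooled_var`, `se_diff`, `t_manual`, and `p_manual` are introduced here purely for illustration, and the data are re-simulated with the same seed so that the chunk runs on its own.

```{r}
# Re-create the simulated data so this chunk is self-contained
set.seed(123)
interfaceA <- rnorm(100, mean = 70, sd = 10)
interfaceB <- rnorm(100, mean = 75, sd = 10)

n_a <- length(interfaceA)
n_b <- length(interfaceB)

# Pooled variance: the two sample variances weighted by their degrees of freedom
pooled_var <- ((n_a - 1) * var(interfaceA) + (n_b - 1) * var(interfaceB)) /
  (n_a + n_b - 2)

# Standard error of the difference between the two means
se_diff <- sqrt(pooled_var * (1 / n_a + 1 / n_b))

# t-statistic: difference in means expressed in standard-error units
t_manual <- (mean(interfaceA) - mean(interfaceB)) / se_diff

# Degrees of freedom and two-sided p-value from the t-distribution
df_manual <- n_a + n_b - 2
p_manual <- 2 * pt(-abs(t_manual), df = df_manual)

cat("t =", t_manual, " df =", df_manual, " p =", p_manual, "\n")
```

These values should agree with what `t.test(interfaceA, interfaceB, var.equal = TRUE)` reports, since that is the calculation the function performs for the equal-variance case.
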
Here is a detailed explanation of each component of the `t.test()` output:

t = -2.2714: This is the calculated t-statistic, which measures the difference between the two sample means relative to the variation observed in the samples. The negative sign indicates that the mean of the first group (interfaceA) is lower than the mean of the second group (interfaceB). The magnitude of the t-statistic reflects how far apart the group means are, in units of the standard error of the difference between the means.

df = 198: This is the degrees of freedom for the t-test, calculated as the total number of observations across both groups minus the number of groups (here, 200 observations minus 2). The degrees of freedom determine which t-distribution the t-statistic is compared against to assess statistical significance.

p-value = 0.0242: The p-value is the probability of observing a difference at least as large as the one between the two sample means if there were really no difference between the population means (that is, if the null hypothesis were true). A p-value of 0.0242 means that, under the null hypothesis, a difference this large or larger would be expected in about 2.42% of samples. It is not the probability that the null hypothesis is true. Because the value is below the commonly used threshold of 0.05, the observed difference is considered statistically significant, and we reject the null hypothesis in favor of the alternative hypothesis.

alternative hypothesis: true difference in means is not equal to 0: This states that the alternative hypothesis is a nonzero difference between the population means of the two groups (a two-sided test). Since the test result is significant, the data are consistent with a true difference in means that is not equal to zero.

95 percent confidence interval: -5.6428083 -0.3981376: This interval gives a range of plausible values for the true difference in means between the two populations, at the 95% confidence level.
The fact that this range does not include zero supports the conclusion that there is a significant difference between the group means. The interval suggests that the mean of interfaceA is between approximately 0.4 and 5.6 points lower than the mean of interfaceB.

sample estimates: mean of x = 70.90406: This is the average score for interfaceA in the sample. mean of y = 73.92453: This is the average score for interfaceB in the sample. These sample means are the average usability scores obtained for each interface from the simulated data.

The t-test suggests that interfaceB has, on average, a higher usability score than interfaceA, and that this difference is statistically significant based on the p-value and the confidence interval.

Cohen's d is a standardized way of expressing the difference in means. Here d is approximately -0.32, a small to moderate effect. The negative sign follows from subtracting the mean of Interface B from the mean of Interface A, so it indicates that Interface B is rated more usable than Interface A, with the difference being just under one-third of a standard deviation.

An APA style sentence summarizing the results from the two-sample t-test might look like this:

"The analysis revealed a statistically significant difference in usability scores between Interface A (M = 70.90, SD = 9.13) and Interface B (M = 73.92, SD = 9.67), t(198) = -2.27, p = .024, d = -0.32, indicating that Interface B was rated as more usable than Interface A."

Note that the SDs reported here are the sample standard deviations (from `sd(interfaceA)` and `sd(interfaceB)`), not the value of 10 used to simulate the data, and that the sign of d matches the sign of t.

# Correlation Analysis

Suppose we also have a measure of user satisfaction for each interface. Let's examine the relationship between usability scores and satisfaction. We hypothesize that usability is positively correlated with user satisfaction.

```{r}
# Simulating user satisfaction scores: usability score plus random noise
satisfactionA <- interfaceA + rnorm(100, mean = 0, sd = 5)
satisfactionB <- interfaceB + rnorm(100, mean = 0, sd = 5)

# Combine data
combinedScores <- c(interfaceA, interfaceB)
combinedSatisfaction <- c(satisfactionA, satisfactionB)
data <- data.frame(combinedScores, combinedSatisfaction)

# Scatter plot of satisfaction against usability
data %>%
  ggplot(aes(x = combinedScores, y = combinedSatisfaction)) +
  geom_point()
```

```{r}
# Correlation analysis
cor.test(combinedScores, combinedSatisfaction)
```

## Explanation

Pearson's product-moment correlation: This statistic measures the linear correlation between two variables, in this case usability scores and user satisfaction. It ranges from -1 to 1, where 1 means a perfect positive linear relationship, -1 means a perfect negative linear relationship, and 0 indicates no linear relationship.

t = 25.912: This is the t-statistic for testing the hypothesis that the true correlation is zero. A value this large indicates very strong evidence against a zero correlation. The strength of the relationship itself is given by the correlation coefficient, not by t.

df = 198: This is the degrees of freedom for the test, calculated as the number of data points (200) minus 2, as in a simple linear regression where an intercept and a slope are estimated.

p-value < 2.2e-16: The p-value is the probability of observing a correlation at least as strong as the one calculated, assuming that there is no true correlation between the variables. A p-value this small (practically 0) provides strong evidence against the null hypothesis, so the observed correlation is significantly different from zero.

95 percent confidence interval: 0.8428058 to 0.9069505: This is the range of plausible values for the true correlation coefficient at the 95% confidence level. The fact that the entire interval is far from zero further supports a strong positive relationship between usability scores and satisfaction.

sample estimates: cor = 0.8787886: This is the Pearson correlation coefficient for the sample, indicating a very strong positive correlation between usability scores and user satisfaction: higher usability scores go together with higher satisfaction. This is expected here, because satisfaction was simulated as the usability score plus random noise.

APA Style Sentence Example

"In assessing the relationship between usability scores and user satisfaction, a Pearson's product-moment correlation analysis was conducted, revealing a strong positive correlation, r(198) = .88, p < .001, indicating that higher usability scores are associated with higher levels of user satisfaction."

# Power analysis

```{r}
# Load necessary library for power analysis
library(pwr)

# Assume we want to detect a small effect size (d = 0.2) with 80% power and alpha of 0.05
effect_size <- 0.2
power <- 0.8
alpha <- 0.05

# Calculate required sample size per group
sample_size <- pwr.t.test(d = effect_size, power = power, sig.level = alpha,
                          type = "two.sample", alternative = "two.sided")$n

# Round up since sample size must be an integer
sample_size <- ceiling(sample_size)
cat("Calculated sample size per group:", sample_size, "\n")

# Simulate data for each interface with the calculated sample size.
# The means differ by 2 points with sd = 10, so the true effect is 2 / 10 = 0.2.
# Note: this overwrites the interfaceA and interfaceB vectors created earlier.
set.seed(123) # Ensure reproducibility
interfaceA <- rnorm(sample_size, mean = 70, sd = 10)
interfaceB <- rnorm(sample_size, mean = 72, sd = 10)

# Perform the t-test (Welch's test, the t.test() default) with the predetermined sample sizes
test_result <- t.test(interfaceA, interfaceB)

# Output the t-test results
cat("P-value with predetermined sample size:", test_result$p.value, "\n")
```

Power analysis is crucial for designing studies that are capable of detecting meaningful effects, thereby avoiding Type II errors (failing to reject a false null hypothesis).
It guides researchers in determining the sample size needed to detect an effect, ensuring that the study can contribute valuable insights without unnecessarily depleting resources or subjecting participants to unwarranted procedures.

Power analysis is used to determine the sample size required to detect an effect of a given size with a given probability. Power is defined as the probability that a statistical test will correctly reject the null hypothesis when it is false, i.e., the likelihood of finding a statistically significant result if there is a true effect. Here's how power is illustrated and applied in the given example:

## Key Components of Power Analysis

### Effect Size (d = 0.2)

This is a measure of the magnitude of the difference or effect that the researcher expects to find between two groups. In this example, an effect size of 0.2 is considered small according to Cohen's conventions. It quantifies the expected difference in means in standard deviation units.

### Power (0.8 or 80%)

This is set by the researcher and indicates the probability of detecting an effect of the specified size (if it truly exists) given the sample size. A power of 0.8 means there is at least an 80% chance of finding a statistically significant difference between the groups if the actual effect size is at least as large as the one specified (d = 0.2).

### Significance Level (Alpha = 0.05)

This is the threshold for determining statistical significance, commonly set at 0.05. It represents a 5% risk of concluding that an effect exists when there is no actual effect (Type I error).

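
One way to see what 80% power means in practice is to simulate many experiments and count how often the t-test comes out significant. The following is a minimal sketch using only base R. It assumes 394 participants per group (the rounded-up figure that the `pwr.t.test()` call above should return for d = 0.2), and the seed, the number of simulations, and the names `n_per_group`, `n_sims`, `p_values`, and `estimated_power` are illustrative choices rather than part of the original analysis.

```{r}
# Monte Carlo check of the power calculation (illustrative sketch)
set.seed(2024)       # arbitrary seed, chosen here for reproducibility
n_per_group <- 394   # assumed per-group n for d = 0.2, 80% power, alpha = 0.05
n_sims <- 2000       # number of simulated experiments

# Run the same experiment many times and store each p-value
p_values <- replicate(n_sims, {
  a <- rnorm(n_per_group, mean = 70, sd = 10)
  b <- rnorm(n_per_group, mean = 72, sd = 10) # true effect: 2 / 10 = 0.2
  t.test(a, b, var.equal = TRUE)$p.value
})

# Power is estimated as the share of experiments that reach significance
estimated_power <- mean(p_values < 0.05)
cat("Estimated power from simulation:", estimated_power, "\n")
```

The estimate should land close to 0.8, with some simulation noise. Roughly one in five of these simulated studies fails to reach significance even though the effect is real, which is exactly the Type II error rate that the chosen power allows.
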
## Visualizing power

```{r}
# Define parameters for the power curve
effect_size <- 0.5 # Medium effect size
sig_level <- 0.05  # Significance level (alpha)

# Generate a sequence of per-group sample sizes from 10 to 200 to examine how power changes
sample_sizes <- seq(from = 10, to = 200, by = 5)

# Calculate power for each sample size
powers <- sapply(sample_sizes, function(n) {
  pwr.t.test(d = effect_size, n = n, sig.level = sig_level,
             type = "two.sample", alternative = "two.sided")$power
})

# Create a dataframe for plotting
power_df <- data.frame(SampleSize = sample_sizes, Power = powers)

# Generate the plot
ggplot(power_df, aes(x = SampleSize, y = Power)) +
  geom_line(color = "blue") +
  geom_point(color = "red") +
  theme_minimal() +
  labs(title = "Power Curve for T-test",
       subtitle = paste("Effect size:", effect_size, "Significance level:", sig_level),
       x = "Sample Size per Group", y = "Power") +
  geom_hline(yintercept = 0.8, linetype = "dashed", color = "green") +
  annotate("text", x = 150, y = 0.8, label = "80% Power",
           hjust = 0, vjust = 0, color = "green")
```
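
The point where the curve crosses the dashed 80% line can also be computed directly instead of being read off the plot. The sketch below uses base R's `power.t.test()` rather than the `pwr` package, purely so that it runs without extra dependencies. Setting `delta = 0.5` with `sd = 1` expresses the same standardized effect size of 0.5, and the names `result`, `n_required`, and `achieved_power` are illustrative.

```{r}
# Solve for the per-group sample size that gives 80% power at d = 0.5
result <- power.t.test(delta = 0.5, sd = 1, sig.level = 0.05, power = 0.8,
                       type = "two.sample", alternative = "two.sided")

# Round up, since a fractional participant is not possible
n_required <- ceiling(result$n)

# Confirm the power actually achieved with the rounded-up sample size
achieved_power <- power.t.test(n = n_required, delta = 0.5, sd = 1,
                               sig.level = 0.05, type = "two.sample",
                               alternative = "two.sided")$power

cat("Per-group sample size for 80% power:", n_required, "\n")
cat("Power achieved at that sample size:", achieved_power, "\n")
```

Because the sample size is rounded up, the achieved power is slightly above the 80% target. Comparing this with the sample size needed for d = 0.2 shows how quickly the required n grows as the expected effect gets smaller.
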