--- title: "Tutorial_stats_test_RMEP2024" output: html_document date: "2024-02-05" --- #Introduction In this tutorial, we will compare the usability of two interfaces using simulated data. We'll conduct a t-test to compare means, a correlation analysis to explore relationships, and discuss the implications of p-hacking, HARKing, power and the significance of effect sizes. #Simulating Data First, we simulate data for two interfaces, A and B, measuring usability scores on a scale of 1-100. ```{r} set.seed(123) # Ensure reproducibility interfaceA <- rnorm(100, mean=70, sd=10) # Simulate scores for interface A interfaceB <- rnorm(100, mean=75, sd=10) # Simulate scores for interface B ``` # Visualizing the Data Before conducting our analysis, let's visualize the data to understand its distribution. ```{r} library(ggplot2)# ths is used for plotting library(tidyverse) data <- data.frame(Score = c(interfaceA, interfaceB), Interface = factor(rep(c("A", "B"), each=100))) ggplot(data, aes(x=Interface, y=Score, fill=Interface)) + stat_summary() + labs(title="Usability Scores for Interfaces A and B", y="Usability Score", x="") ``` # Conducting a T-test We'll use a t-test to compare the average usability scores of the two interfaces. ```{r} t.test(interfaceA, interfaceB,var.equal = TRUE) s_pooled_equal_var = (((sd(interfaceA)^2 + sd(interfaceB) ^ 2) / 2)) ** 0.5 # Calculate Cohen's d for equal sample sizes and equal variances cohen_d_equal_var = (mean(interfaceA) - mean(interfaceB)) / s_pooled_equal_var print(c("d is",cohen_d_equal_var)) ``` ## Explanation When you perform a Two Sample t-test, as shown in your output, it is aimed at comparing the means from two different groups to determine if there is a statistically significant difference between them. This test assumes that the data from both groups are independent from one another. Here's a detailed explanation of the output components: t = -2.2714: This value is the calculated t-statistic, which is a measure of the difference between the two sample means relative to the variation observed in the samples. A negative sign here indicates that the mean of the first group (interfaceA) is lower than the mean of the second group (interfaceB). The magnitude of the t-statistic reflects how far apart the group means are, in terms of the standard error of the difference between the means. df = 198: This represents the degrees of freedom for the t-test, calculated as the total number of observations across both groups minus the number of groups (in this case, 200 observations minus 2). Degrees of freedom are used to determine the critical values from the t-distribution that the t-statistic can be compared against to assess statistical significance. p-value = 0.0242: The p-value tells us about the likelihood of observing a difference as large as (or larger than) the one observed between the two sample means if there really was no difference between the population means (null hypothesis is true). A p-value of 0.0242 means there's a 2.42% chance of seeing such a difference by random chance alone. Because this value is below the commonly used threshold of 0.05, it suggests that the observed difference between group means is statistically significant, and we might reject the null hypothesis in favor of the alternative hypothesis. alternative hypothesis: true difference in means is not equal to 0: This states that the hypothesis being tested is that there is a nonzero difference between the population means of the two groups. 
Since the test result is significant, it supports the idea that the true difference in means is indeed not equal to zero.

95 percent confidence interval: -5.6428083 -0.3981376: This interval provides a range of values that is likely to contain the true difference in means between the two populations, with 95% confidence. The fact that this range does not include zero supports the conclusion that there is a significant difference between the group means. The interval suggests that the mean of interfaceA is approximately 0.4 to 5.6 points lower than the mean of interfaceB.

sample estimates: mean of x = 70.90406 is the average score for interfaceA in the sample, and mean of y = 73.92453 is the average score for interfaceB in the sample.

These sample means show the average usability scores obtained for each interface from the sampled data. The t-test suggests that interfaceB, on average, has a higher usability score than interfaceA, and this difference is statistically significant based on the calculated p-value and confidence interval. Cohen's d represents a small to moderate negative effect size (a standardized way of expressing the difference in means), indicating that Interface B is rated as more usable than Interface A, with the difference between them being less than one-third of a standard deviation.

An APA style sentence summarizing the results from the Two Sample t-test might look like this: "The analysis revealed a statistically significant difference in usability scores between Interface A (M = 70.90, SD = 10) and Interface B (M = 73.92, SD = 10), t(198) = -2.27, p = .024, d = -0.32, indicating that Interface B was rated as more usable than Interface A."

# Correlation Analysis

Suppose we also have a measure of user satisfaction for each interface. Let's examine the relationship between usability scores and satisfaction. We hypothesize that usability is correlated with user satisfaction.

```{r}
# Simulating user satisfaction scores
satisfactionA <- interfaceA + rnorm(100, mean=0, sd=5)
satisfactionB <- interfaceB + rnorm(100, mean=0, sd=5)

# Combine data
combinedScores <- c(interfaceA, interfaceB)
combinedSatisfaction <- c(satisfactionA, satisfactionB)

data <- data.frame(combinedScores, combinedSatisfaction)
data %>% ggplot(aes(x=combinedScores, y=combinedSatisfaction)) + geom_point()
```

```{r}
# Correlation analysis
cor.test(combinedScores, combinedSatisfaction)
```

## Explanation

Pearson's product-moment correlation: This statistic measures the linear correlation between two variables, in this case usability scores and user satisfaction. It ranges from -1 to 1, where 1 means a perfect positive linear relationship, -1 means a perfect negative linear relationship, and 0 indicates no linear relationship.

t = 25.912: This is the t-statistic calculated as part of testing the hypothesis that the true correlation is zero. The magnitude of this value indicates a strong relationship between usability scores and satisfaction that is highly unlikely to be due to chance.

df = 198: This is the degrees of freedom for the test, calculated as the number of pairs of observations (200) minus 2.

p-value < 2.2e-16: The p-value indicates the probability of observing a correlation as strong as (or stronger than) the one calculated, assuming that there is no true correlation between the variables.
A p-value this small (practically zero) leads us to reject the null hypothesis, suggesting that the observed correlation is significantly different from zero.

95 percent confidence interval: 0.8428058 to 0.9069505: This range indicates with 95% confidence where the true correlation coefficient lies. The fact that the entire interval is far from zero further supports a strong positive relationship between usability scores and satisfaction.

sample estimates: cor = 0.8787886: This is the Pearson correlation coefficient based on the sample data, indicating a very strong positive correlation between usability scores and user satisfaction. It suggests that as usability scores increase, user satisfaction also increases, and vice versa.

## APA Style Sentence Example

"In assessing the relationship between usability scores and user satisfaction, a Pearson's product-moment correlation analysis was conducted, revealing a strong positive correlation, r(198) = 0.879, p < .001, indicating that higher usability scores are associated with higher levels of user satisfaction."

# Power analysis

```{r}
# Load necessary library for power analysis
library(pwr)

# Assume we want to detect a small effect size (d = 0.2) with 80% power and alpha of 0.05
effect_size <- 0.2
power <- 0.8
alpha <- 0.05

# Calculate required sample size
sample_size <- pwr.t.test(d = effect_size, power = power, sig.level = alpha,
                          type = "two.sample", alternative = "two.sided")$n

# Round up since sample size must be an integer
sample_size <- ceiling(sample_size)
cat("Calculated sample size per group:", sample_size, "\n")

# Assuming the mean and standard deviation from the initial scenario,
# simulate data for each interface with the calculated sample size
set.seed(123) # Ensure reproducibility
interfaceA <- rnorm(sample_size, mean=70, sd=10)
interfaceB <- rnorm(sample_size, mean=72, sd=10)

# Perform the t-test with the predetermined sample sizes
test_result <- t.test(interfaceA, interfaceB)

# Output the t-test results
cat("P-value with predetermined sample size:", test_result$p.value, "\n")
```

Power analysis is crucial for designing studies that are capable of detecting meaningful effects, thereby avoiding Type II errors (failing to reject a false null hypothesis). It guides researchers in determining the appropriate sample size needed to confidently detect an effect, ensuring that the study can contribute valuable insights without unnecessarily depleting resources or subjecting participants to unwarranted procedures.

Power analysis is used to determine the sample size required to detect an effect of a given size with a certain degree of confidence. Power is defined as the probability that a statistical test will correctly reject the null hypothesis when it is false, i.e., the likelihood of finding a statistically significant result if there is a true effect. Here's how power is illustrated and applied in the given example:

## Key Components of Power Analysis

### Effect Size (d = 0.2)

This is a measure of the magnitude of the difference or effect that the researcher expects to find between two groups. In this example, an effect size of 0.2 is considered small according to Cohen's conventions. It quantifies the expected difference in means in terms of standard deviation units.

### Power (0.8 or 80%)

This is set by the researcher and indicates the probability of detecting an effect of the specified size (if it truly exists) given the sample size.
A power of 0.8 means there's an 80% chance of finding a statistically significant difference between the groups if the actual effect size is at least as large as the one specified (d = 0.2).

### Significance Level (Alpha = 0.05)

This is the threshold for determining statistical significance, commonly set at 0.05. It represents a 5% risk of concluding that an effect exists when there is no actual effect (Type I error).

## Visualizing power

```{r}
# Define parameters for the power curve
effect_size <- 0.5 # Medium effect size
sig_level <- 0.05  # Significance level (alpha)

# Generate a sequence of sample sizes from 10 to 200 to examine how power changes
sample_sizes <- seq(from = 10, to = 200, by = 5)

# Calculate power for each sample size
powers <- sapply(sample_sizes, function(n) {
  pwr.t.test(d = effect_size, n = n, sig.level = sig_level,
             type = "two.sample", alternative = "two.sided")$power
})

# Create a dataframe for plotting
power_df <- data.frame(SampleSize = sample_sizes, Power = powers)

# Generate the plot
ggplot(power_df, aes(x = SampleSize, y = Power)) +
  geom_line(color = "blue") +
  geom_point(color = "red") +
  theme_minimal() +
  labs(title = "Power Curve for T-test",
       subtitle = paste("Effect size:", effect_size, "Significance level:", sig_level),
       x = "Sample Size", y = "Power") +
  geom_hline(yintercept = 0.8, linetype = "dashed", color = "green") +
  annotate("text", x = 150, y = 0.8, label = "80% Power",
           hjust = 0, vjust = 0, color = "green")
```

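To connect this back to the tutorial's opening example, the short sketch below uses the same `pwr.t.test()` function to estimate the power of the original design with 100 participants per group. The effect sizes plugged in are assumptions for illustration: d = 0.5 corresponds to the parameters used to simulate the first dataset (means of 70 and 75 with a standard deviation of 10), and d = 0.3 is roughly the size of the Cohen's d observed in the t-test section.

```{r}
# Power of the original design (100 participants per group) under two
# assumed effect sizes: the simulated true effect (d = 0.5) and an effect
# about as large as the observed Cohen's d (d = 0.3).
library(pwr)

pwr.t.test(n = 100, d = 0.5, sig.level = 0.05,
           type = "two.sample", alternative = "two.sided")$power

pwr.t.test(n = 100, d = 0.3, sig.level = 0.05,
           type = "two.sample", alternative = "two.sided")$power
```

With 100 participants per group the design is well powered for an effect of d = 0.5, but power drops substantially if the true effect is only around d = 0.3, which is why the assumed effect size should be justified before data collection rather than after.
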
__Let's now consider bad practices that harm the research process. DO NOT DO THIS! THIS WILL GET YOU AND YOUR RESEARCH TEAM FIRED!__

# HARKing

To illustrate HARKing (Hypothesizing After the Results are Known), we'll first simulate a scenario where user satisfaction does not correlate with usability scores anymore, perhaps due to poor service quality affecting satisfaction irrespective of usability. We'll then adjust the hypothesis to fit the new findings, exemplifying HARKing.

## Simulating a New Scenario

Let's assume that due to deteriorating service quality, user satisfaction no longer correlates with usability scores. We simulate user satisfaction scores that are independent of the usability scores:

```{r}
interfaceA <- rnorm(100, mean=70, sd=10) # Simulate scores for interface A
interfaceB <- rnorm(100, mean=75, sd=10) # Simulate scores for interface B
satisfactionA_indep <- rnorm(100, mean=50, sd=15) # Independent satisfaction scores for A
satisfactionB_indep <- rnorm(100, mean=50, sd=15) # Independent satisfaction scores for B
```

## Combine data

```{r}
combinedScores_indep <- c(interfaceA, interfaceB)
combinedSatisfaction_indep <- c(satisfactionA_indep, satisfactionB_indep)

data <- data.frame(combinedScores_indep, combinedSatisfaction_indep)
data %>% ggplot(aes(x=combinedScores_indep, y=combinedSatisfaction_indep)) + geom_point()
```

## Correlation analysis

```{r}
cor.test(combinedScores_indep, combinedSatisfaction_indep)
```

Assuming this analysis shows no significant correlation (or a substantially lower correlation than initially observed), we then "discover" this lack of correlation and formulate a new hypothesis that aligns with these findings.

## Original Hypothesis

Originally, we hypothesized a strong positive correlation between usability scores and user satisfaction, and this hypothesis was supported by the first analysis.

## Adjusted Hypothesis for HARKing

"After analyzing the data, we hypothesize that user satisfaction is independent of interface usability scores, potentially due to external factors such as overall service quality overshadowing the impact of usability on satisfaction."

This change in hypothesis, made after observing the data, is a clear example of HARKing. It involves creating a hypothesis that perfectly fits the observed results rather than formulating a hypothesis before conducting the analysis.

## Explanation of Adjusted Findings

If the correlation analysis with the new satisfaction scores (which were generated independently of the usability scores) shows no significant correlation, it would suggest that factors other than usability substantially influence user satisfaction. This could lead to valuable insights about the broader context in which the interfaces operate, but it also illustrates the problematic nature of HARKing: formulating hypotheses after the results are known can lead to biased research practices and conclusions that may not replicate or that overfit a specific dataset.

## APA Style Sentence for Adjusted Findings

"In light of new findings, our analysis revealed no significant correlation between usability scores and user satisfaction, r(198) = [insert new correlation coefficient], p = [insert new p-value], suggesting that factors beyond interface usability, such as overall service quality, may play a pivotal role in determining user satisfaction."

This post hoc narrative shift, explaining the data as if the outcome had been expected all along, highlights the critical issue with HARKing: __it undermines the predictive power of scientific hypotheses by conforming them to observed outcomes rather than testing pre-existing theories__.

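HARKing often goes hand in hand with exploring many untested relationships and then reporting only the ones that happen to reach significance. The sketch below is an illustrative addition (the 20 noise "survey items", their distribution, and the seed are made-up assumptions, not variables from the scenario above): it correlates each of 20 pure-noise items with the combined usability scores and counts how many come out "significant" by chance alone.

```{r}
# Sketch: exploring many post-hoc relationships in pure noise.
# Each of the 20 made-up items has NO real relationship with usability,
# yet some may correlate "significantly" at p < .05 just by chance.
set.seed(456)

n_items <- 20
noise_items <- replicate(n_items,
                         rnorm(length(combinedScores_indep), mean = 50, sd = 15))

p_values <- apply(noise_items, 2, function(item) {
  cor.test(combinedScores_indep, item)$p.value
})

cat("Items 'significantly' correlated with usability at p < .05:",
    sum(p_values < 0.05), "out of", n_items, "\n")
```

With 20 independent tests at alpha = .05 we expect about one spurious hit on average. If only the items that happened to come out significant are reported, and the hypotheses are written up as if those relationships had been predicted in advance, the resulting findings are unlikely to replicate.
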
# R Code for P-hacking by Excluding Participants

## Simulate initial data

```{r}
set.seed(123) # Ensure reproducibility
interfaceA <- rnorm(100, mean=70, sd=10)
interfaceB <- rnorm(100, mean=70, sd=12)
```

## While-loop removing data

```{r}
# Initialize p-value and counter for exclusions
p_value <- 1
exclusions <- 0

while(p_value > 0.05 && length(interfaceA) > 2) { # Ensure at least 3 data points remain
  exclusions <- exclusions + 1

  # Exclude the participant with the lowest score from interfaceA
  # (randomly removing participants would also work, but it takes longer)
  interfaceA <- interfaceA[-which.min(interfaceA)]

  # Perform the t-test
  test_result <- t.test(interfaceA, interfaceB)
  p_value <- test_result$p.value

  cat("Hacking round", exclusions, ": p is", p_value, "\n")
}
```

## Excluding Participants

This approach can lead to findings that are not representative of the true population effect. By selectively removing data, researchers can manipulate results to appear significant, potentially leading to false conclusions about the effectiveness or impact of an intervention or treatment.

# P-hacking by increasing the sample size

```{r}
# Simulate initial data for two interfaces
interfaceA <- rnorm(30, mean=70, sd=10) # Initial sample size for interface A
interfaceB <- rnorm(30, mean=71, sd=10) # Initial sample size for interface B

t.test(interfaceA, interfaceB)
```

No difference is detected with the initial samples.

```{r}
# Initialize variables for the while loop
p_value <- 1 # Starting with a p-value greater than 0.05
additional_participants <- 0 # Counter for additional participants

# Keep adding participants until the p-value drops below 0.05
while(p_value > 0.05) {
  # Add one participant to each group
  interfaceA <- c(interfaceA, rnorm(1, mean=70, sd=10))
  interfaceB <- c(interfaceB, rnorm(1, mean=70.5, sd=10))

  # Update the counter
  additional_participants <- additional_participants + 2

  # Perform the t-test with the updated groups
  test_result <- t.test(interfaceA, interfaceB)
  p_value <- test_result$p.value
}

# Output the results
cat("Total additional participants added to reach significance:", additional_participants, "\n")
cat("New sample size for each group:", length(interfaceA), "\n")
cat("P-value after adding participants:", p_value, "\n")
```

# Conclusion

This tutorial provided an overview of statistical analysis techniques, including t-tests, correlation analysis, and power analysis, as well as the pitfalls of HARKing and p-hacking. Researchers are encouraged to conduct rigorous and ethical research by avoiding data manipulation, formulating hypotheses before analyzing the data, and performing a power analysis when designing the study, so that the research process remains robust.

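# Appendix: How Much Does Optional Stopping Inflate False Positives?

The "keep adding participants until p < .05" strategy shown above does not just feel questionable; it measurably inflates the Type I error rate. The simulation sketch below quantifies this under assumed settings that are illustrative choices rather than values from the tutorial: both groups are drawn from the same distribution (mean 70, SD 10, so the null hypothesis is true), each run starts with 30 participants per group, one participant per group is added at a time up to a cap of 100, and the whole procedure is repeated 1,000 times.

```{r}
# Estimate the false positive rate of optional stopping when there is truly
# no difference between the groups. All parameters here are illustrative.
set.seed(789)

simulate_optional_stopping <- function(n_start = 30, n_max = 100) {
  a <- rnorm(n_start, mean = 70, sd = 10)
  b <- rnorm(n_start, mean = 70, sd = 10) # same mean: the null hypothesis is true

  repeat {
    p <- t.test(a, b)$p.value
    if (p < 0.05) return(TRUE)            # stopped early on a "significant" result
    if (length(a) >= n_max) return(FALSE) # gave up at the maximum sample size
    a <- c(a, rnorm(1, mean = 70, sd = 10))
    b <- c(b, rnorm(1, mean = 70, sd = 10))
  }
}

false_positive <- replicate(1000, simulate_optional_stopping())
cat("False positive rate with optional stopping:", mean(false_positive), "\n")
```

Because the test is re-checked after every added pair of participants, the proportion of runs that end in a "significant" result is noticeably higher than the nominal 5%, even though no real difference exists. A fixed, pre-specified sample size, as computed in the power analysis section, keeps the Type I error rate at its nominal level.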