Power simulation in R: The repeated measures ANOVA

In this post I conduct a simulation analysis in R to estimate statistical power: the probability that a statistical test will reject the null hypothesis when it is false. Rather than simulating the data and obtaining p-values several times in a for loop (the old-school way to run a simulation), I'll use the split-apply-combine strategy instead, trying out the new-ish R package dplyr, a much faster version of its predecessor plyr.
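As a minimal, generic sketch of that pattern (using the built-in mtcars data rather than the simulation below), split-apply-combine with dplyr's group_by() and do() looks like this:

require(dplyr)

# split mtcars by number of cylinders, apply a regression to each piece,
# and combine the fitted models into a single data frame with a list column
fits <- mtcars %>%
  group_by(cyl) %>%
  do(model = lm(mpg ~ wt, data = .))

The same group_by() %>% do() idiom carries the ANOVA simulation later in this post.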

While the simulation approach (and the code) could be extended to other statistical tests and frameworks, I focus here on a widespread analysis in the natural and social sciences: the 2 x 2 repeated measures (or within-subjects) analysis of variance (ANOVA). In the free and excellent G*Power program, analytical power analyses can be computed for repeated measures ANOVA, where the user inputs the population effect size directly or estimates it from sums of squares. The reason I made this simulation script, however, is that I wanted to estimate power in repeated measures ANOVA directly from the effect (mu), variance (sigma), and population correlation between measures (rho), expressed in whatever units I'm measuring (response time, neural activity, etc.).

Statistical power is a function of sample size (N), the alpha level, and the population effect size. There are several different kinds of power analyses depending on which parameter is allowed to vary and which ones are set by the experimenter. Here I'll conduct a post hoc power analysis, which is not the same as a retrospective power analysis, for the following reason. Let's say you just finished collecting data from a new experiment with an exciting new paradigm. Your approach is so unique that there is no precedent in the literature, and you haven't conducted any pilot studies. A retrospective power analysis in this case would be to estimate power from your sample size, alpha level, and obtained effect size. The problem is, you have no idea how close the obtained effect is to the true population effect (assuming there is one), because sample effect sizes are noisy and can even be biased estimators of true effects. If, however, a population effect size can be estimated a priori by consulting the literature or previous studies in your lab, then a post hoc power analysis, despite its name, makes sense to do even before you run the study.

Imagine that I'm planning to conduct a semantic priming study, capitalizing on the well-known finding that a word is recognized more quickly when it's preceded by a word with a related versus unrelated meaning. Let's assume I'd already conducted a pilot study with new stimuli, where the prime words referred to tasty foods. From this pilot data, and by consulting the results of other priming studies, I know that the baseline response time should be about 700 milliseconds, and the priming effect should be a 30 ms reduction in response time. I also estimate the correlation between repeated measures (r = .75) and the standard deviation of the measures (SD = 150 ms). I'll also introduce a second within-subjects variable: time of test. Participants complete the experiment once on a full stomach, and then again one week later after two days of complete calorie deprivation. The prediction is for a standard priming effect (like before) at the first test, but that the tasty prime words (e.g., “fries”, “hamburger”) will so distract the famished participants that any effect of priming will be swamped by a distraction effect: they'll actually be slower to respond to the target words. In other words, I predict no main effects but a significant interaction effect (the coveted double dissociation). But before I begin this study, here's what I want to know: how many subjects will I need to run before I have a reasonably good shot at detecting this interaction? Let's find out.

Define the simulation parameters and build the design matrix

knitr::opts_chunk$set(echo=TRUE, warning=FALSE, message=FALSE)

# define the parameters
mu = c(700, 670, 670, 700) # true population means for the 4 cells (in this case, a double dissociation)
sigma = 150  # population standard deviation
rho = 0.75 # correlation between repeated measures
nsubs = 25 # how many subjects?
nsims = 5000 # how many simulation replicates?

# create 2 factors representing the 2 independent variables
cond = data.frame(
  X1 = rep(factor(letters[1:2]), nsubs * 2),
  X2 = rep(factor(letters[1:2]), nsubs, each=2))

# create a subjects factor
subject = factor(sort(rep(1:nsubs, 4)))

# combine above into the design matrix
dm = data.frame(subject, cond)
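As a quick look at the result (illustrative only, not part of the original script), the first eight rows of the design matrix cover the four within-subject cells for subjects 1 and 2:

# inspect the design matrix: 4 rows (one per condition cell) for each subject
head(dm, 8)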

Build Sigma: the population variance-covariance matrix

I have the population standard deviation (sigma) and population correlation (rho) in hand, and I'll assume compound symmetry of the variance-covariance matrix (as we often do; compound symmetry also satisfies the sphericity assumption). Therefore the variance-covariance matrix will contain constant covariances on the off-diagonals and constant variances on the diagonal. The variance is just sigma squared, and the covariance is rho times sigma squared.

# create k x k matrix populated with sigma
sigma.mat <- rep(sigma, 4)
S <- matrix(sigma.mat, ncol=length(sigma.mat), nrow=length(sigma.mat))

# compute covariance between measures
Sigma <- t(S) * S * rho  

# put the variances on the diagonal 
diag(Sigma) <- sigma^2  
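As a quick sanity check (illustrative only, not part of the original script), with sigma = 150 and rho = 0.75 the diagonal of Sigma should equal 150^2 = 22500 and every off-diagonal element should equal 0.75 * 22500 = 16875:

# illustrative check of the population variance-covariance matrix
all(diag(Sigma) == sigma^2)                    # should be TRUE
all(Sigma[upper.tri(Sigma)] == rho * sigma^2)  # should be TRUE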

Run the simulation

Now we're ready to initiate the simulation. The empirical power will be the percentage of simulations in which the true effect is detected. I'm using stats::aov to obtain the p-values associated with each of the three effects. Although I usually use ez::ezANOVA for ANOVA analyses in R, a direct call to aov proved about 90% faster in testing.

# stack 'nsims' individual data frames into one large data frame
df = dm[rep(seq_len(nrow(dm)), nsims), ]

# add an index column to track the simulation run
df$simID = sort(rep(seq_len(nsims), nrow(dm)))

# sample the observed data from a multivariate normal distribution
# using MASS::mvrnorm with the parameters mu and Sigma created earlier
# and bind to the existing df

require(MASS)
make.y = expression(as.vector(t(mvrnorm(nsubs, mu, Sigma))))
df$y = as.vector(replicate(nsims, eval(make.y)))             
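# illustrative check (not in the original script): the grand cell means across
# all simulated data sets should be close to mu, i.e. roughly 700 in the a/a
# and b/b cells and 670 in the other two
with(df, tapply(y, list(X1, X2), mean))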

# use do(), the general purpose complement to the specialized data 
# manipulation functions available in dplyr, to run the ANOVA on
# each section of the grouped data frame created by group_by

require(dplyr)
require(car)
require(broom)

mods <- df %>% 
  group_by(simID) %>% 
    do(model = aov(y ~ X1 * X2 + Error(subject / (X1*X2)), qr=FALSE, data = .)) 

# extract p-values for each effect and store in a data frame
p = data.frame(
  mods %>% do(as.data.frame(tidy(.$model[[3]])$p.value[1])),
  mods %>% do(as.data.frame(tidy(.$model[[4]])$p.value[1])),
  mods %>% do(as.data.frame(tidy(.$model[[5]])$p.value[1])))
colnames(p) = c('X1','X2','Interaction')

Calculate and print the empirical power

The empirical power is easy to compute: it's just the percentage of simulation runs where p < .05.

power = apply(as.matrix(p), 2, 
  function(x) round(mean(x < .05) * 100, 0))
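Equivalently (an aside, not part of the original script), because comparing the p-value data frame against .05 yields a logical matrix, the same percentages can be obtained in one line with colMeans:

# equivalent one-liner: percentage of runs with p < .05 for each effect
round(colMeans(p < .05) * 100)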

With 25 subjects and 5000 simulation runs, the estimated power for each effect is:
Main effect X1: 5%
Main effect X2: 5%
Interaction: 49%

The Type I error rates for the nonexistent main effects sit right at the nominal 5%. But the power to detect the interaction is pretty low. Let's look at the distributions of p-values.

Visualize the distributions of p-values

The left plot contains the known effects (mu) with error bars denoting standard errors of the mean. (SEM is not particularly useful for comparing repeated measures, but it'll provide an idea of the variability). The right plot contains three histograms showing the empirical distributions of p-values for the two main effects and the interaction.

# plot the known effects
require(ggplot2)
require(gridExtra)

means = data.frame(cond[1:4, ], mu, SE = sigma / sqrt(nsubs))
plt1 = ggplot(means, aes(y = mu, x = X1, fill=X2)) +
  geom_bar(position = position_dodge(), stat="identity") +
  geom_errorbar(aes(ymin = mu-SE, ymax = mu+SE), 
    position = position_dodge(width=0.9), size=.6, width=.3) +
  coord_cartesian(ylim=c((.7*min(mu)), 1.2*max(mu))) +
  theme_bw()

# melt the data into a ggplot friendly 'long' format
require(reshape2)
plotData <- melt(p, value.name = 'p')

# plot each of the p-value distributions on a log scale
options(scipen = 999) # 'turn off' scientific notation
plt2 = ggplot(plotData, aes(x = p)) +
    scale_x_log10(breaks=c(1, 0.05, 0.001), 
                  labels=c(1, 0.05, 0.001)) +
    geom_histogram(colour = "darkblue", fill = "white") +
    geom_vline(xintercept = 0.05, colour='red') +
    facet_grid(variable ~ .) +
    labs(x = expression(Log[10]~P)) +
    theme(axis.text.x = element_text(color='black', size=7))

# arrange plots side by side and print
grid.arrange(plt1, plt2, nrow=1)

[Figure: population means with SEM error bars (left) and log-scaled p-value distributions for the two main effects and the interaction (right), 25 subjects]

The vertical red lines on the p-value histograms are centered at p = 0.05. If there's an effect in the population, then the power is the percentage of simulation runs in which the null hypothesis was correctly rejected. Of course, if there's no effect in the population, as exemplified by the two nonexistent main effects in this example, then I expect the “power” to be 5%: the nominal Type I error rate.

Our estimated power here is only 49%, which is far too low, and which is actually a common state of affairs in psychological research (Maxwell, 2004). So we clearly need a larger N to achieve 80% power, an arbitrary but nevertheless commonly used lower bound. Let's run the simulation again with twice as many subjects; the only change is sketched below.
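A minimal sketch (every other block above is re-run exactly as written):

# double the sample size and re-run the design, Sigma, simulation,
# and power blocks above
nsubs = 50

Here are the results: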

[Figure: population means with SEM error bars (left) and log-scaled p-value distributions (right), 50 subjects]

This time around, with 50 subjects the estimated power is:
Main effect X1: 5%
Main effect X2: 5%
Interaction: 80%

Doubling the number of subjects brings us to 80% empirical power, which is about what statisticians recommend as a lower limit when planning a study. In a future study with this sample size we would have a reasonably good chance of rejecting the null hypothesis that the two factors do not interact (assuming, of course, that we estimated the population parameters accurately). But even with relatively high statistical power we face another, lesser-known issue.

Sample-to-sample P-value variability

A 2015 Nature Methods article draws attention to a common misunderstanding about the relationship between p-values and statistical power. There can be large variability in the p-value, despite high statistical power:

“Increasing sample size increases statistical power, and thus a progressively greater proportion of P values < 0.05 are obtained. However, we still face substantial variation in the magnitude of the P values returned. Although studies are often planned to have (an estimated) 80% power, when statistical power is indeed 80%, we still obtain a bewildering range of P values (pg. 181).”

Looking at the histogram for the interaction effect above, there really is a “bewildering” range of p-values. And the range of p-values for the underpowered study simulated earlier is even more bewildering!
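To put a number on that spread (an illustrative aside using the p-value data frame created above, not part of the original script), we can look at a few quantiles of the interaction p-values across simulation runs:

# illustrative: spread of the interaction p-values across the simulation runs
quantile(p$Interaction, probs = c(.1, .25, .5, .75, .9))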

The power analysis shown in this post can help the experimenter make an a priori decision about the necessary sample size to achieve high statistical power. But unless statistical power is extremely high (which it rarely is, given the myriad practical constraints we face), an identical replication of your study is likely to return a substantially different p-value than the one obtained the first time around. In light of this large sample-to-sample variability, the practice of coming to categorically different conclusions from an effect with p = 0.04 versus another with p = 0.06 seems even more ridiculous than it already did.


Here's the full code for the simulations:

# define the parameters
mu = NULL     # vector of population means, one per cell (e.g., c(700, 670, 670, 700))
sigma = NULL  # population standard deviation (e.g., 150)
rho = NULL    # correlation between repeated measures (e.g., 0.75)
nsubs = NULL  # number of subjects (e.g., 25)
nsims = NULL  # number of simulation replicates (e.g., 5000)

# create 2 factors representing the 2 independent variables
cond = data.frame(
  X1 = rep(factor(letters[1:2]), nsubs * 2),
  X2 = rep(factor(letters[1:2]), nsubs, each=2))

# create a subjects factor
subject = factor(sort(rep(1:nsubs, 4)))

# combine above into the design matrix
dm = data.frame(subject, cond)

# create k x k matrix populated with sigma
sigma.mat <- rep(sigma, 4)
S <- matrix(sigma.mat, ncol=length(sigma.mat), nrow=length(sigma.mat))

# compute covariance between measures
Sigma <- t(S) * S * rho  

# put the variances on the diagonal 
diag(Sigma) <- sigma^2  

# stack 'nsims' individual data frames into one large data frame
df = dm[rep(seq_len(nrow(dm)), nsims), ]

# add an index column to track the simulation run
df$simID = sort(rep(seq_len(nsims), nrow(dm)))

# sample the observed data from a multivariate normal distribution
# using MASS::mvrnorm with the parameters mu and Sigma created earlier
# and bind to the existing df

require(MASS)
make.y = expression(as.vector(t(mvrnorm(nsubs, mu, Sigma))))
df$y = as.vector(replicate(nsims, eval(make.y)))             

# use do(), the general purpose complement to the specialized data 
# manipulation functions available in dplyr, to run the ANOVA on
# each section of the grouped data frame created by group_by

require(dplyr)
require(car)
require(broom)

mods <- df %>% 
  group_by(simID) %>% 
    do(model = aov(y ~ X1 * X2 + Error(subject / (X1*X2)), qr=FALSE, data = .)) 

# extract p-values for each effect and store in a data frame
p = data.frame(
  mods %>% do(as.data.frame(tidy(.$model[[3]])$p.value[1])),
  mods %>% do(as.data.frame(tidy(.$model[[4]])$p.value[1])),
  mods %>% do(as.data.frame(tidy(.$model[[5]])$p.value[1])))
colnames(p) = c('X1','X2','Interaction')

power = apply(as.matrix(p), 2, 
  function(x) round(mean(x < .05) * 100, 0))

# plot the known effects
require(ggplot2)
require(gridExtra)

means = data.frame(cond[1:4, ], mu, SE = sigma / sqrt(nsubs))
plt1 = ggplot(means, aes(y = mu, x = X1, fill=X2)) +
  geom_bar(position = position_dodge(), stat="identity") +
  geom_errorbar(aes(ymin = mu-SE, ymax = mu+SE), 
    position = position_dodge(width=0.9), size=.6, width=.3) +
  coord_cartesian(ylim=c((.7*min(mu)), 1.2*max(mu))) +
  theme_bw()

# melt the data into a ggplot friendly 'long' format
require(reshape2)
plotData <- melt(p, value.name = 'p')

# plot each of the p-value distributions on a log scale
options(scipen = 999) # 'turn off' scientific notation
plt2 = ggplot(plotData, aes(x = p)) +
    scale_x_log10(breaks=c(1, 0.05, 0.001), 
                  labels=c(1, 0.05, 0.001)) +
    geom_histogram(colour = "darkblue", fill = "white") +
    geom_vline(xintercept = 0.05, colour='red') +
    facet_grid(variable ~ .) +
    labs(x = expression(Log[10]~P)) +
    theme(axis.text.x = element_text(color='black', size=7))

# arrange plots side by side and print
grid.arrange(plt1, plt2, nrow=1)