In this notebook we’ll go through a quick example of setting up a power analysis, using data from an existing, highly-powered study to make credible parameter estimates. The code for setting up a simulation is inspired by/shamelessly stolen from a great tutorial about this topic by DeBruine & Barr (2021) and Lisa DeBruine’s appendix on its application for sensitivity analysis (DeBruine & Barr, 2020). The aim of this tutorial is two-fold:
Before we do anything, let’s import all the packages we will need:
using DataFrames
using DataFramesMeta
using CSV
using MixedModels
using Gadfly
using Statistics
using Distributions
using Random
Random.seed!(90059)
TaskLocalRNG()
In this example, we will estimate the number of participants we need to replicate a simple and well-established experimental finding: the capture of attention by a colour singleton during visual search for a unique shape singleton. We are fortunate in that there are many studies of this effect on which to base our parameter estimates. One recent example is a highly powered study by Kirsten Adam from the Serences lab, purpose-built for exactly this kind of sensitivity analysis. First, let’s import the data for our specific case from the Adam et al. (2021) study, which is freely available in an OSF repository, and take a look.
Note that when previous data doesn’t exist (or when it does, but you don’t trust it enough to base your effect estimates on), there are alternative ways of determining such parameters, including formally specifying a smallest effect size of interest (Lakens et al., 2018).
The data we chose are from Experiment 1c: variable colour singleton search. We are interested in the raw trial data, not the summary data (we are fitting a mixed model, after all, not an ANOVA), so we have to grab all the raw files and concatenate them.
# read every subject's raw data file and stack them into one DataFrame
all_files = readdir("Experiment_1c", join = true)
dfs = CSV.read.(all_files, DataFrame)
df = vcat(dfs...)
38400×50 DataFrame, one row per trial. Alongside identifiers (`experiment`, `subject`, `uniqueID`, `block`, `trial`, `trial_cumulative`) and demographics (`age`, `gender`), the columns record the design (`color_cond_blocking`, `color_history`, `shape_type`, `set_size`, `distractor`), the behaviour (`rt` in seconds, `acc`, `target_orient`, `response_key`, `response_made`, `iti`), and per-item stimulus details (target and distractor identities plus the X/Y positions, shapes, and line orientations of items 1 to 6).
Once it’s imported, we can take a look at the data, e.g., at subject means in the two distractor conditions:
@chain df begin
    @subset(
        :acc .== 1,       # correct trials only
        :set_size .== 4)  # a single set size, as in the original analysis
    @transform(
        :rt = :rt .* 1000,            # convert RT from seconds to ms
        :subject = string.(:subject)  # treat subject as categorical
    )
    groupby([:subject, :distractor])
    @combine(
        :rt = mean(:rt))
    plot(
        layer(
            x=:distractor,
            y=:rt,
            color=:subject,
            Geom.point
        ),
        layer(
            x=:distractor,
            y=:rt,
            color=:subject,
            Geom.line
        ),
        Theme(key_position=:none)
    )
end
We can clearly see typical attentional capture effects in the data. Now let’s model it:
# keep correct trials at set size 4, convert RT to ms, subject to string
d = @chain df begin
    @subset(
        :acc .== 1,
        :set_size .== 4)
    @transform(
        :rt = :rt .* 1000,
        :subject = string.(:subject))
end
# Our model is simple: RT is dependent on distractor presence, with a random slope and intercept for each subject. More complex models are left as an exercise to the reader.
formula = @formula(rt ~ 1 + distractor + (1 + distractor | subject))
fm1 = fit(MixedModel, formula, d, REML=true)
| | Est. | SE | z | p | σ_subject |
|---|---|---|---|---|---|
| (Intercept) | 651.5000 | 16.6627 | 39.10 | <1e-99 | 80.6316 |
| distractor: present | 30.8566 | 4.7527 | 6.49 | <1e-10 | 14.7696 |
| Residual | | | | | 175.8140 |
| | Column | Variance | Std.Dev | Corr. |
|---|---|---|---|---|
| subject | (Intercept) | 6501.4475 | 80.6316 | |
| | distractor: present | 218.1408 | 14.7696 | +0.35 |
| Residual | | 30910.5789 | 175.8140 | |
We can also check whether the distractor term is even useful in describing the data: does it explain significantly more variance than the intercept-only model?
# refit with ML (the default) rather than REML, since likelihood ratio
# tests are only valid for ML fits
fm1 = fit(MixedModel, formula, d)
formula_null = @formula(rt ~ 1 + (1 + distractor | subject))
fm1_null = fit(MixedModel, formula_null, d)
lrt = MixedModels.likelihoodratiotest(fm1, fm1_null)
| model | model-dof | deviance | χ² | χ²-dof | P(>χ²) |
|---|---|---|---|---|---|
| rt ~ 1 + (1 + distractor \| subject) | 5 | 120892 | | | |
| rt ~ 1 + distractor + (1 + distractor \| subject) | 6 | 120867 | 25 | 1 | <1e-06 |
The above model, `rt ~ 1 + distractor + (1 + distractor | subject)`, is our putative data-generating process: it specifies the parameters that we believe underlie the generation of the observed dependent variable, and the relationships between those parameters. The table shown above gives us parameter estimates for all fixed and random effects in the model. Now let’s plug those parameters into a simulation!
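Spelled out, this is the data-generating process we will simulate from below (the symbols match the parameter names defined in the next code block):

$$
\mathrm{RT}_{si} = \beta_0 + T_{0s} + (\beta_1 + T_{1s})\,X_i + e_{si}
$$

$$
\begin{pmatrix} T_{0s} \\ T_{1s} \end{pmatrix} \sim \mathcal{N}\!\left( \begin{pmatrix} 0 \\ 0 \end{pmatrix},\; \begin{pmatrix} \tau_0^2 & \rho\,\tau_0\,\tau_1 \\ \rho\,\tau_0\,\tau_1 & \tau_1^2 \end{pmatrix} \right), \qquad e_{si} \sim \mathcal{N}(0, \sigma^2)
$$

where $X_i$ codes distractor presence on trial $i$ (0 = absent, 1 = present), and $T_{0s}$ and $T_{1s}$ are subject $s$’s random intercept and slope.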
n_subj = 10 # number of subjects
n_present = 200 # number of distractor present trials
n_absent = 200 # number of distractor absent trials
beta_0 = 650 # intercept
beta_1 = 30 # effect of category (distractor presence)
tau_0 = 80 # by-subject random intercept sd
tau_1 = 15 # by-subject random slope sd
rho = 0.35 # correlation between intercept and slope
sigma = 175 # residual noise sd
175
Generate trials with their fixed effects:
# simulate a sample of items (trials)
# total number of items = n_absent + n_present
items = @chain DataFrame(
    :item_id => range(1, n_absent + n_present),
    :category => [repeat(["absent"], n_absent)..., repeat(["present"], n_present)...]
) begin
    # numeric contrast: 0 = distractor absent, 1 = present
    @rtransform :X_i = :category == "present" ? 1 : 0
end
Row | item_id | category | X_i |
---|---|---|---|
| Int64 | String | Int64 |
1 | 1 | absent | 0 |
2 | 2 | absent | 0 |
3 | 3 | absent | 0 |
4 | 4 | absent | 0 |
5 | 5 | absent | 0 |
6 | 6 | absent | 0 |
7 | 7 | absent | 0 |
8 | 8 | absent | 0 |
9 | 9 | absent | 0 |
10 | 10 | absent | 0 |
11 | 11 | absent | 0 |
12 | 12 | absent | 0 |
13 | 13 | absent | 0 |
⋮ | ⋮ | ⋮ | ⋮ |
389 | 389 | present | 1 |
390 | 390 | present | 1 |
391 | 391 | present | 1 |
392 | 392 | present | 1 |
393 | 393 | present | 1 |
394 | 394 | present | 1 |
395 | 395 | present | 1 |
396 | 396 | present | 1 |
397 | 397 | present | 1 |
398 | 398 | present | 1 |
399 | 399 | present | 1 |
400 | 400 | present | 1 |
And generate participants with their random intercepts and slopes:
# simulate a sample of subjects
# calculate the random intercept / random slope covariance
covar = rho * tau_0 * tau_1
# put the values into a variance-covariance matrix
cov_mx = [tau_0^2 covar; covar tau_1^2]
# generate the by-subject random effects
dist = MvNormal([0, 0], cov_mx)
subject_rfx = rand(dist, n_subj)
# combine with subject IDs
subjects = DataFrame(
    :subj_id => string.(range(1, n_subj)),
    :T_0s => subject_rfx[1, :],
    :T_1s => subject_rfx[2, :]
)
Row | subj_id | T_0s | T_1s |
---|---|---|---|
| String | Float64 | Float64 |
1 | 1 | -67.9967 | -4.29486 |
2 | 2 | 105.907 | -1.89717 |
3 | 3 | 73.7466 | 2.99368 |
4 | 4 | 156.603 | 14.0334 |
5 | 5 | 40.9051 | -2.73222 |
6 | 6 | 34.1562 | 11.3813 |
7 | 7 | -74.5877 | -5.30155 |
8 | 8 | -80.6553 | 21.5753 |
9 | 9 | 126.907 | 34.0272 |
10 | 10 | 57.3057 | 17.3225 |
Now combine and add residual noise to create a complete dataframe:
# cross items and subjects, add noise
dat_sim = @chain crossjoin(subjects, items) begin
    @rtransform @astable begin
        # per-trial residual noise
        :e_si = rand(Normal(0, sigma))
        # calculate the response variable
        :RT = beta_0 + :T_0s + (beta_1 + :T_1s) * :X_i + :e_si
    end
    @select(:subj_id, :item_id, :category, :X_i, :RT)
end
Row | subj_id | item_id | category | X_i | RT |
---|---|---|---|---|---|
| String | Int64 | String | Int64 | Float64 |
1 | 1 | 1 | absent | 0 | 516.535 |
2 | 1 | 2 | absent | 0 | 757.313 |
3 | 1 | 3 | absent | 0 | 544.622 |
4 | 1 | 4 | absent | 0 | 655.981 |
5 | 1 | 5 | absent | 0 | 691.763 |
6 | 1 | 6 | absent | 0 | 523.838 |
7 | 1 | 7 | absent | 0 | 510.876 |
8 | 1 | 8 | absent | 0 | 690.3 |
9 | 1 | 9 | absent | 0 | 792.548 |
10 | 1 | 10 | absent | 0 | 713.432 |
11 | 1 | 11 | absent | 0 | 825.375 |
12 | 1 | 12 | absent | 0 | 627.622 |
13 | 1 | 13 | absent | 0 | 667.263 |
⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ |
3989 | 10 | 389 | present | 1 | 531.323 |
3990 | 10 | 390 | present | 1 | 800.412 |
3991 | 10 | 391 | present | 1 | 527.266 |
3992 | 10 | 392 | present | 1 | 787.222 |
3993 | 10 | 393 | present | 1 | 664.524 |
3994 | 10 | 394 | present | 1 | 701.211 |
3995 | 10 | 395 | present | 1 | 562.528 |
3996 | 10 | 396 | present | 1 | 677.546 |
3997 | 10 | 397 | present | 1 | 527.31 |
3998 | 10 | 398 | present | 1 | 686.81 |
3999 | 10 | 399 | present | 1 | 736.427 |
4000 | 10 | 400 | present | 1 | 667.098 |
Data generated! Does it look like we’d expect?
@chain dat_sim begin
    groupby([:subj_id, :category])
    @combine(
        :RT = mean(:RT))
    plot(
        layer(
            x=:category,
            y=:RT,
            color=:subj_id,
            Geom.point
        ),
        layer(
            x=:category,
            y=:RT,
            color=:subj_id,
            Geom.line
        ),
        Theme(key_position=:none)
    )
end
Looks comparable to the original data! Now let’s fit a model to see if we recover the parameters:
simformula = @formula(RT ~ 1 + category + (1 + category | subj_id))
fm2 = fit(MixedModel, simformula, dat_sim, REML=true)
| | Est. | SE | z | p | σ_subj_id |
|---|---|---|---|---|---|
| (Intercept) | 689.6254 | 25.3951 | 27.16 | <1e-99 | 79.3484 |
| category: present | 37.5555 | 7.1469 | 5.25 | <1e-06 | 14.3157 |
| Residual | | | | | 174.8844 |
| | Column | Variance | Std.Dev | Corr. |
|---|---|---|---|---|
| subj_id | (Intercept) | 6296.1711 | 79.3484 | |
| | category: present | 204.9394 | 14.3157 | +0.81 |
| Residual | | 30584.5494 | 174.8844 | |
And as a quick test: does the full model explain the data significantly better than an intercept-only model?
fm2 = fit(MixedModel, simformula, dat_sim)
simformula_null = @formula(RT ~ 1 + (category | subj_id))
fm2_null = fit(MixedModel, simformula_null, dat_sim)
lrt_sim = MixedModels.likelihoodratiotest(fm2,fm2_null)
| model | model-dof | deviance | χ² | χ²-dof | P(>χ²) |
|---|---|---|---|---|---|
| RT ~ 1 + (1 + category \| subj_id) | 5 | 52724 | | | |
| RT ~ 1 + category + (1 + category \| subj_id) | 6 | 52710 | 14 | 1 | 0.0002 |
Great, our simulation works - we recover our ground truth for the different coefficients (allowing for differences due to noise and the limited sample size). For a power analysis, we now put the above into functions and run the code many times for a given combination of parameters. See below:
function my_sim_data(;
    n_subj = 5, # number of subjects
    n_present = 200, # number of distractor present trials
    n_absent = 200, # number of distractor absent trials
    beta_0 = 650, # intercept
    beta_1 = 30, # effect of category
    tau_0 = 80, # by-subject random intercept sd
    tau_1 = 15, # by-subject random slope sd
    rho = 0.35, # correlation between intercept and slope
    sigma = 175 # residual noise sd
)
    # simulate a sample of items
    # total number of items = n_absent + n_present
    items = @chain DataFrame(
        :item_id => range(1, n_absent + n_present),
        :category => [repeat(["absent"], n_absent)..., repeat(["present"], n_present)...]
    ) begin
        @rtransform :X_i = :category == "present" ? 1 : 0
    end

    # simulate a sample of subjects
    # calculate the random intercept / random slope covariance
    covar = rho * tau_0 * tau_1
    # put the values into a variance-covariance matrix
    cov_mx = [tau_0^2 covar; covar tau_1^2]
    # generate the by-subject random effects
    dist = MvNormal([0, 0], cov_mx)
    subject_rfx = rand(dist, n_subj)
    # combine with subject IDs
    subjects = DataFrame(
        :subj_id => string.(range(1, n_subj)),
        :T_0s => subject_rfx[1, :],
        :T_1s => subject_rfx[2, :]
    )

    # cross items and subjects, add residual noise
    dat_sim = @chain crossjoin(subjects, items) begin
        @rtransform @astable begin
            :e_si = rand(Normal(0, sigma))
            :RT = beta_0 + :T_0s + (beta_1 + :T_1s) * :X_i + :e_si
        end
        @select(:subj_id, :item_id, :category, :X_i, :RT)
    end
    dat_sim
end
my_sim_data (generic function with 1 method)
The above function simulates data. The function below wraps it together with a model fit, giving us a single routine that we can call repeatedly during our power analysis.
function single_run(filename = nothing, args...; kwargs...)
    dat_sim = my_sim_data(; kwargs...)
    simformula = @formula(RT ~ 1 + category + (1 + category | subj_id))
    simformula_null = @formula(RT ~ 1 + (category | subj_id))
    fm = fit(MixedModel, simformula, dat_sim)
    fm_null = fit(MixedModel, simformula_null, dat_sim)
    lrt = MixedModels.likelihoodratiotest(fm, fm_null)
    sim_results = DataFrame(coeftable(fm))
    # the Wald p-values calculated by MixedModels are uncorrected and can
    # inflate the Type I error rate - we use the best (reasonable) alternative
    # available to us, the likelihood ratio test vs the intercept-only model,
    # to find a better estimate (the intercept's p-value is left as-is,
    # since it isn't of interest here)
    sim_results[!, "Pr(>|z|)"][2] = lrt.pvalues[1]
    if length([kwargs...]) != 0
        sim_results = crossjoin(sim_results, DataFrame(kwargs...))
    end
    if !isnothing(filename)
        # append to the results file if it already exists
        append = isfile(filename)
        CSV.write(filename, sim_results, append = append)
    end
    sim_results
end
single_run (generic function with 2 methods)
Now let’s run our sensitivity analysis: we will run our simulation many times for each combination of parameters (1,000+ runs, which Julia’s speed makes feasible here) and record how often the fixed-effect estimates reach significance:
nreps = 1000
params = allcombinations(
    DataFrame,
    :rep => 1:nreps,   # number of runs
    :n_subj => 8,      # number of subjects
    :n_present => 150, # number of distractor present trials
    :n_absent => 150,  # number of distractor absent trials
    :beta_0 => 650,    # intercept
    :beta_1 => 30,     # effect of category
    :tau_0 => 80,      # by-subject random intercept sd
    :tau_1 => 15,      # by-subject random slope sd
    :rho => 0.35,      # correlation between intercept and slope
    :sigma => 175      # residual (standard deviation)
)
# drop the rep counter: each remaining row holds the kwargs for one run
select!(params, Not("rep"))
alpha = 0.05
sims = vcat([single_run(; r...) for r in Tables.rowtable(params)]...)
@chain sims begin
    @select(:Name, $"Pr(>|z|)")
    @transform :p = $"Pr(>|z|)" .< alpha
    groupby(:Name)
    @combine :power = mean(:p)
end
Row | Name | power |
---|---|---|
| String | Float64 |
1 | (Intercept) | 1.0 |
2 | category: present | 0.874 |
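As a quick aside (not strictly needed, but cheap to check): a power estimate from a finite number of simulation runs carries Monte Carlo error of roughly sqrt(p̂(1 − p̂)/nreps), so we can see how precise the estimate above is:

# Monte Carlo standard error of the power estimate above
p_hat = 0.874
mc_se = sqrt(p_hat * (1 - p_hat) / nreps) # ≈ 0.0105

With 1,000 runs, the estimate of 0.874 is good to about ±0.02, which is plenty for our purposes.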
If we want to run our sensitivity analysis across a given parameter space, we’ll have to map the function `single_run` over that space, for example over a varying number of participants:
filename1 = "sens_jl.csv"
nreps = 1000
params = allcombinations(
    DataFrame,
    :rep => 1:nreps,   # number of runs
    :n_subj => 2:15,   # number of subjects
    :n_present => 150, # number of distractor present trials
    :n_absent => 150,  # number of distractor absent trials
    :beta_0 => 650,    # intercept
    :beta_1 => 30,     # effect of category
    :tau_0 => 80,      # by-subject random intercept sd
    :tau_1 => 15,      # by-subject random slope sd
    :rho => 0.35,      # correlation between intercept and slope
    :sigma => 175      # residual (standard deviation)
)
select!(params, Not("rep"))
alpha = 0.05
# this is slow, so only rerun the simulations if no results file exists yet
if !isfile(filename1)
    sims = vcat([single_run(filename1; r...) for r in Tables.rowtable(params)]...)
end
Note that the above could obviously also be run over other dimensions of our parameter space, e.g. for different estimates of the fixed effects, the amount of noise, or the number of trials, simply by changing the `params` table; a sketch follows below.
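For example, a hypothetical grid that crosses sample size with several candidate effect sizes (the extra `beta_1` values here are illustrative, not estimates from the Adam et al. data):

params2 = allcombinations(
    DataFrame,
    :rep => 1:nreps,          # number of runs
    :n_subj => 2:15,          # number of subjects
    :n_present => 150,        # number of distractor present trials
    :n_absent => 150,         # number of distractor absent trials
    :beta_0 => 650,           # intercept
    :beta_1 => [15, 30, 45],  # candidate effect sizes (illustrative)
    :tau_0 => 80,             # by-subject random intercept sd
    :tau_1 => 15,             # by-subject random slope sd
    :rho => 0.35,             # correlation between intercept and slope
    :sigma => 175             # residual sd
)
select!(params2, Not("rep"))

Back to our run over the number of participants: how did we do? Let’s take a look at our power curve.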
sims1 = CSV.read("sens_jl.csv", DataFrame)
power1 = @chain sims1 begin
    @subset :Name .== "category: present"
    @transform :p = $"Pr(>|z|)" .< alpha
    groupby([:Name, :n_subj])
    @combine begin
        :mean_estimate = mean($"Coef.")
        :mean_se = mean($"Std. Error")
        :power = mean(:p)
    end
end
Row | Name | n_subj | mean_estimate | mean_se | power |
---|---|---|---|---|---|
| String31 | Int64 | Float64 | Float64 | Float64 |
1 | category: present | 2 | 30.2635 | 17.9024 | 0.205 |
2 | category: present | 3 | 29.526 | 14.6008 | 0.356 |
3 | category: present | 4 | 29.9583 | 12.6264 | 0.5 |
4 | category: present | 5 | 30.1647 | 11.1628 | 0.647 |
5 | category: present | 6 | 29.9987 | 10.2055 | 0.743 |
6 | category: present | 7 | 30.1957 | 9.33679 | 0.823 |
7 | category: present | 8 | 29.9215 | 8.80461 | 0.874 |
8 | category: present | 9 | 30.7667 | 8.2574 | 0.924 |
9 | category: present | 10 | 30.071 | 7.75013 | 0.956 |
10 | category: present | 11 | 29.4491 | 7.43248 | 0.945 |
11 | category: present | 12 | 30.256 | 7.10302 | 0.971 |
12 | category: present | 13 | 29.7838 | 6.819 | 0.983 |
13 | category: present | 14 | 29.8746 | 6.62874 | 0.983 |
14 | category: present | 15 | 29.8489 | 6.42804 | 0.997 |
plot(
    power1,
    layer(
        x=:n_subj,
        y=:power,
        Geom.smooth,
        yintercept=[0.8],
        Geom.hline
    ),
    layer(
        x=:n_subj,
        y=:power,
        Geom.point
    )
)
Our power analysis suggests that, with the parameters established above, we need roughly 8 or more participants to detect the effect with at least 80% power!
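If you’d rather read that number off programmatically than eyeball the plot, a one-liner against the `power1` table above (a sketch, not part of the original analysis) does the trick:

# smallest simulated sample size whose estimated power reaches 0.8
min_n = minimum(@subset(power1, :power .>= 0.8).n_subj)

For the table above this returns 7 (power 0.823), but with estimates this close to the threshold, building in a margin of safety - hence the ~8 above - is sensible.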
The code used above is specific to power analysis for mixed models, but the approach generalises to other methods too, of course! The above code can easily be wrangled to handle different model types (simply change the model definition in `single_run` and make sure to capture the right parameters), and even Bayesian approaches. (For a thorough example of doing power analysis with Bayesian methods and the awesome Bayesian regression package `brms`, see Kurz, 2021.)
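As a minimal sketch of such a swap (hypothetical code, not used in the analysis above): to run the same machinery on a binary outcome such as accuracy instead of RT, you would fit a generalised linear mixed model by passing a response distribution to `fit`:

# hypothetical single_run variant for a binary outcome
# (assumes a dat_sim with a 0/1 :acc column)
function single_run_glmm(dat_sim)
    f = @formula(acc ~ 1 + category + (1 + category | subj_id))
    f_null = @formula(acc ~ 1 + (1 + category | subj_id))
    # Bernoulli() makes this a logistic mixed model
    fm = fit(MixedModel, f, dat_sim, Bernoulli())
    fm_null = fit(MixedModel, f_null, dat_sim, Bernoulli())
    MixedModels.likelihoodratiotest(fm, fm_null)
end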
Even if the above code is spaghetti to you (perhaps you prefer R? Or Python?), I hope you will take away a few things from this tutorial: