How we validate MCPower's custom C++ mixed-effects solver
against R's lme4 package across 75+ scenarios
MCPower is a Monte Carlo power analysis library for complex statistical models. It estimates the probability (power) that a study will detect a true effect at a given sample size, using simulation rather than closed-form formulas.
Its LME (Linear Mixed-Effects) module handles clustered/hierarchical data —
students nested in schools, patients nested in hospitals, repeated measures within subjects.
Instead of relying on R or Python's statsmodels, MCPower uses a
custom C++ solver based on the profiled-deviance approach from
Bates et al. (2015), achieving roughly 200× speedup over pure Python.
| Model Type | R Formula | Solver |
|---|---|---|
| Random intercept | `y ~ x + (1|group)` | Brent's 1D optimization, block-diagonal structure |
| Random slopes | `y ~ x + (1 + x|group)` | L-BFGS-B with G-matrix parameterization |
| Nested effects | `y ~ x + (1|school/class)` | Two-level Woodbury decomposition |
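As a rough illustration of the profiled-deviance idea (not MCPower's actual C++ code), here is a NumPy sketch for the random-intercept case: the ML deviance is profiled down to a single variance ratio θ = τ²/σ², which a 1-D bounded (Brent-style) search can minimize. Function and variable names are illustrative.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def profiled_deviance(theta, y, X, groups):
    """ML deviance profiled over beta and sigma^2; theta = tau^2 / sigma^2."""
    n = len(y)
    Z = (groups[:, None] == np.unique(groups)[None, :]).astype(float)
    V = np.eye(n) + theta * Z @ Z.T          # marginal covariance up to sigma^2
    Vinv = np.linalg.inv(V)
    beta = np.linalg.solve(X.T @ Vinv @ X, X.T @ Vinv @ y)   # GLS estimate
    r = y - X @ beta
    sigma2 = (r @ Vinv @ r) / n              # profiled residual variance
    _, logdetV = np.linalg.slogdet(V)
    return n * np.log(2 * np.pi * sigma2) + logdetV + n

# toy data: 20 clusters of 25, true beta = 0.5, tau2 = 0.25, sigma2 = 1
rng = np.random.default_rng(0)
g = np.repeat(np.arange(20), 25)
x = rng.standard_normal(500)
y = 0.5 * x + rng.normal(0, 0.5, 20)[g] + rng.standard_normal(500)
X = np.column_stack([np.ones(500), x])

# 1-D bounded (Brent-style) search over the single variance ratio
res = minimize_scalar(profiled_deviance, bounds=(1e-8, 10.0),
                      args=(y, X, g), method="bounded")
print("theta_hat =", round(res.x, 3))
```

The production solver avoids the dense `n × n` inverse by exploiting the block-diagonal structure noted in the table; the sketch only shows why the problem reduces to one dimension.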
Statistical inference uses Likelihood Ratio (LR) tests for overall significance
and Wald z-tests (with REML) for individual fixed effects —
the same testing framework that lmerTest provides in R.
R's lme4 package (Bates, Mächler, Bolker & Walker, 2015) is the
gold standard for fitting linear mixed-effects models.
It is used in tens of thousands of published studies across psychology, education,
ecology, and medicine. By validating MCPower's solver against lme4, we establish
that our custom implementation produces statistically equivalent results.
The framework uses four independent, complementary strategies that together test the solver, the data-generating process (DGP), and end-to-end power estimation:

**Strategy 1.** Generate data externally with NumPy (independent of both MCPower and R). Feed the same datasets to both solvers. Compare significance decisions.
*Tests:* solver correctness in isolation.

**Strategy 2.** Generate data via MCPower's own pipeline. Fit with both MCPower and lme4. Compare significance decisions.
*Tests:* MCPower's data-generating process + solver together.

**Strategy 3.** Both MCPower and R independently generate data and compute power (the proportion of significant results across N simulations). Compare power estimates.
*Tests:* end-to-end power accuracy (DGP + solver + aggregation).

**Strategy 4.** Runs independently: it reuses S3 power estimates when available, but runs its own parallel power simulations for scenarios not covered by S3. Applies a two-proportion z-test to each scenario, then Benjamini–Hochberg FDR correction across all scenarios, replacing the fixed 5 pp threshold with a proper statistical test.
*Tests:* whether power differences are statistically significant (not just sampling noise).

Note: Sensitivity scenarios (~50) use only S4; S1/S2/S3 are reserved for core scenarios.
This strategy isolates the solver from everything else. Data is generated by a third party (NumPy) so neither MCPower nor R influences the data-generating process. Both solvers receive identical datasets and we compare their conclusions.
100 datasets per scenario: 50 with true effect (β ≠ 0) and 50 null (β = 0).
Predictors are i.i.d. N(0, 1). Random intercepts drawn from N(0, τ²).
All written to a single CSV file with a dataset_id column.
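The external generation step can be sketched in a few lines of NumPy/pandas (a simplified stand-in for the actual script; `make_datasets` and its defaults are illustrative):

```python
import numpy as np
import pandas as pd

def make_datasets(n_datasets=100, n_clusters=20, cluster_size=50,
                  beta=0.5, tau2=0.25, seed=42):
    """Stack simulated random-intercept datasets into one long table.

    First half: true effect (beta != 0); second half: null (beta = 0).
    """
    rng = np.random.default_rng(seed)
    rows = []
    for ds in range(n_datasets):
        b_eff = beta if ds < n_datasets // 2 else 0.0   # effect vs. null half
        g = np.repeat(np.arange(n_clusters), cluster_size)
        x = rng.standard_normal(len(g))                 # predictor ~ N(0, 1)
        u = rng.normal(0.0, np.sqrt(tau2), n_clusters)  # random intercepts
        y = b_eff * x + u[g] + rng.standard_normal(len(g))
        rows.append(pd.DataFrame({"dataset_id": ds, "g": g, "x1": x, "y": y}))
    return pd.concat(rows, ignore_index=True)

df = make_datasets()
df.to_csv("external_datasets.csv", index=False)
```

Because the generator depends only on NumPy, neither solver's own sampling code can bias this comparison.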
Parse CSV, group by dataset_id, call _lme_analysis_wrapper()
for each dataset. Extract: converged, significant (LR test), individual significance (Wald).
Call Rscript fit_lme4.R with the same CSV. R fits
lmer() with REML (via lmerTest for p-values),
then performs an LR test (full vs. intercept-only null, ML refit).
Results returned as JSON on stdout.
Significance agreement: proportion of datasets where both solvers
agree on reject/fail-to-reject. Threshold: ≥ 95%.
Cohen's κ: agreement corrected for chance from the 2×2 table.
Parameter recovery: Pearson correlations of β, SE, τ², σ² estimates.
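Both agreement metrics are cheap to compute from the paired decision vectors. A sketch (function name illustrative):

```python
import numpy as np

def agreement_and_kappa(sig_a, sig_b):
    """Raw agreement and Cohen's kappa from two boolean decision vectors."""
    a = np.asarray(sig_a, bool)
    b = np.asarray(sig_b, bool)
    p_o = np.mean(a == b)                      # observed agreement
    # chance agreement from the marginals of the 2x2 table
    p_e = a.mean() * b.mean() + (1 - a.mean()) * (1 - b.mean())
    kappa = (p_o - p_e) / (1 - p_e) if p_e < 1 else 1.0
    return p_o, kappa

mc = [True, True, False, False, True, False, True, True]
r_ = [True, True, False, True,  True, False, True, True]
p_o, kappa = agreement_and_kappa(mc, r_)
print(p_o, kappa)
```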
Variance decomposition follows from the ICC definition. Since predictors are standardized N(0, 1), the variance explained by the fixed effects equals Σβk². The within-group variance is

σ²within = σ²ε + Σβk² = 1 + Σβk²

and solving ICC = τ² / (τ² + σ²within) for the random-intercept variance gives τ² = ICC / (1 − ICC) · σ²within.
For random-slope models, yij = β0 + β1·xij + b0j + b1j·xij + εij, where the random-effects vector [b0j, b1j] follows a bivariate normal distribution:

[b0j, b1j]ᵀ ~ N(0, G),  G = [[τ0², ρ·τ0·τ1], [ρ·τ0·τ1, τ1²]]
Two independent random effects: school-level (ICCschool) and classroom-within-school (ICCclassroom), each with their own variance component derived from the respective ICC.
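The single-level ICC-to-variance mapping can be written directly (a sketch mirroring the formulas above; the helper name is illustrative):

```python
import numpy as np

def variance_components(icc, effect_sizes):
    """Derive random-intercept variance from a target ICC.

    sigma2_within = residual variance (1.0) + variance explained by
    standardized predictors (sum of squared betas); tau2 follows from
    ICC = tau2 / (tau2 + sigma2_within).
    """
    sigma2_within = 1.0 + float(np.sum(np.square(effect_sizes)))
    tau2 = icc / (1.0 - icc) * sigma2_within
    return {"sigma2_within": sigma2_within, "tau2": tau2}

vc = variance_components(0.2, [0.5])
print(vc)   # sigma2_within = 1.25, tau2 = 0.3125
```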
```r
# Fit REML model (lmerTest provides Satterthwaite p-values)
fit <- lmer(as.formula(formula_str), data = d, REML = TRUE,
            control = lmerControl(optimizer = "bobyqa",
                                  optCtrl = list(maxfun = 50000)))

# Extract fixed effects (skip intercept)
coefs <- summary(fit)$coefficients
beta <- coefs[-1, "Estimate"]
se <- coefs[-1, "Std. Error"]
pvalues <- coefs[-1, "Pr(>|t|)"]

# Variance components
vc <- as.data.frame(VarCorr(fit))
tau2 <- vc$vcov[vc$grp == "g"]
sigma2 <- sigma(fit)^2

# Likelihood Ratio Test: full vs null (intercept-only + RE)
null_fit <- lmer(y ~ 1 + (1 | g), data = d, REML = FALSE,
                 control = lmerControl(optimizer = "bobyqa",
                                       optCtrl = list(maxfun = 50000)))
full_ml <- update(fit, REML = FALSE)
lr_test <- anova(null_fit, full_ml)
lr_pvalue <- lr_test$`Pr(>Chisq)`[2]
significant <- !is.na(lr_pvalue) && lr_pvalue < 0.05
```
bobyqa optimizer (robust for mixed models), max 50,000 function evaluations.
Strategy 1 tests the solver in isolation using third-party data. Strategy 2 adds MCPower's data-generating process to the test. Data is generated through MCPower's full pipeline (the same code path used during power analysis), then both MCPower and R fit models on these datasets.
Uses MCPower's internal backend: predictor generation, factor handling, random effect sampling, and response generation. Writes 100 datasets to CSV in the same stacked format.
MCPower already fitted each dataset during generation.
R fits the same datasets via lme4.
Same agreement metric as Strategy 1. If Strategy 1 passes but Strategy 2 fails, it points to a problem in MCPower's data generation, not the solver.
The ultimate test: do MCPower and R produce the same power estimates when each generates data and fits models completely independently?
Both systems use the same statistical model specification (ICC, effect sizes, sample size, number of clusters) but use independent RNG streams and independent DGP implementations.
Calls model.find_power(sample_size, n_sims=500) internally.
Returns power as proportion of significant results.
simulate_power_lme4.R implements the same DGP model in R:
generates data, fits lmer(), counts rejections.
Uses a different seed (seed + 1,000,000) than MCPower.
Compute |PowerMCPower − PowerR|. Pass if difference ≤ 5 percentage points (accounting for Monte Carlo sampling noise).
```r
# Compute variance components from ICC (same formula as MCPower)
sigma2_fixed <- sum(effect_values^2)
sigma2_within <- 1.0 + sigma2_fixed
tau2 <- icc / (1 - icc) * sigma2_within

for (sim in 1:n_sims) {
  # Generate cluster assignments
  g <- rep(1:n_clusters, each = cluster_size)
  # Generate predictors ~ N(0, 1)
  x1 <- rnorm(n_obs)
  # Random intercepts ~ N(0, tau2)
  b <- rnorm(n_clusters, 0, sqrt(tau2))
  # Response: y = X*beta + b[group] + error
  y <- beta * x1 + b[g] + rnorm(n_obs)
  # Fit & test via LRT
  fit <- lmer(y ~ x1 + (1 | g), REML = FALSE)
  null <- lmer(y ~ 1 + (1 | g), REML = FALSE)
  lr_p <- anova(null, fit)$`Pr(>Chisq)`[2]
  if (!is.na(lr_p) && lr_p < 0.05) n_significant <- n_significant + 1
}
power <- n_significant / n_converged
```
Strategy 3 compares MCPower and R power estimates using a fixed threshold (|diff| ≤ 5 percentage points). With 500 Monte Carlo simulations, sampling noise alone can cause ~2–3pp deviations, leading to occasional false failures in sensitivity scenarios near the threshold boundary. Strategy 4 replaces this fixed threshold with a proper statistical test.
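The ~2–3 pp figure is just the binomial standard error of a proportion estimated from 500 simulations:

```python
import math

def mc_se(p, n_sims=500):
    """Standard error of a Monte Carlo power estimate (binomial proportion)."""
    return math.sqrt(p * (1 - p) / n_sims)

# Worst case is p = 0.5; noise shrinks toward the extremes
for p in (0.5, 0.8, 0.95):
    print(f"p={p:.2f}: SE = {mc_se(p) * 100:.2f} pp")
```

At p = 0.5 the standard error is about 2.2 pp, so two independent estimates can easily drift 3–4 pp apart by chance alone.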
For each scenario, we test H0: PowerMCPower = PowerR using the standard two-proportion z-test:

z = (p̂MC − p̂R) / SEpool

where the pooled standard error under H0 is

SEpool = √( p̂pool (1 − p̂pool) (1/nMC + 1/nR) )

and p̂pool = (p̂MC · nMC + p̂R · nR) / (nMC + nR) is the pooled proportion. The two-sided p-value is computed via the complementary error function: p = erfc(|z| / √2).
Edge case: When both tools agree on extreme power (0% or 100%), the pooled SE is zero. In this case z = 0, p = 1.0, and the scenario trivially passes.
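The test, including the zero-SE edge case, fits in a few lines (a sketch; the function name is illustrative):

```python
import math

def two_proportion_z(p_mc, n_mc, p_r, n_r):
    """Two-sided two-proportion z-test; p-value via erfc(|z| / sqrt(2))."""
    p_pool = (p_mc * n_mc + p_r * n_r) / (n_mc + n_r)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_mc + 1 / n_r))
    if se == 0.0:                 # both at 0% or 100% power: trivially equal
        return 0.0, 1.0
    z = (p_mc - p_r) / se
    return z, math.erfc(abs(z) / math.sqrt(2))

# e.g. MCPower power 0.847 vs. R power 0.832, 500 sims each
z, p = two_proportion_z(0.847, 500, 0.832, 500)
print(z, p)
```

A 1.5 pp gap at 500 simulations per side is nowhere near significance, which is exactly why the fixed 5 pp threshold is replaced by this test.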
With ~50 scenarios tested simultaneously, multiple testing inflation would produce false positives. The BH procedure controls the False Discovery Rate at level α = 0.05:
Sort all m p-values in ascending order: p(1) ≤ p(2) ≤ … ≤ p(m)
For each rank k: padj(k) = min( p(k) · m / k, 1.0 )
Working from the largest rank downward, ensure padj(k) ≤ padj(k+1)
PASS if adjusted p-value > 0.05 (cannot reject H0: powers are equal)
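The three BH steps above can be sketched as (helper name illustrative):

```python
import numpy as np

def bh_adjust(pvals):
    """Benjamini-Hochberg adjusted p-values (step-up, monotone-enforced)."""
    p = np.asarray(pvals, float)
    m = len(p)
    order = np.argsort(p)                          # ascending p-values
    ranked = p[order] * m / np.arange(1, m + 1)    # p(k) * m / k
    # enforce monotonicity from the largest rank downward
    ranked = np.minimum.accumulate(ranked[::-1])[::-1]
    adj = np.empty(m)
    adj[order] = np.minimum(ranked, 1.0)
    return adj

pv = [0.001, 0.008, 0.039, 0.041, 0.20, 0.74]
adj = bh_adjust(pv)
print(adj)
```

Note how the raw p = 0.039 ends up with a larger adjusted value than its naive `p·m/k` because of the downward monotonicity pass.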
| Result | Meaning |
|---|---|
| PASS (adj. p > 0.05) | Cannot reject that MCPower and R produce equal power. The observed difference is consistent with Monte Carlo sampling noise. |
| FAIL (adj. p ≤ 0.05) | The power difference is statistically significant even after FDR correction, suggesting a systematic discrepancy between the two implementations. |
The validation uses two scenario categories: ~27 core scenarios (tested with all four strategies S1–S4) and ~50 sensitivity scenarios (tested with S4 only). Core scenarios verify solver correctness with well-powered designs. Sensitivity scenarios span the full 5%–90% power range to validate power curve accuracy.
| Parameter | Values |
|---|---|
| Formula | y ~ x1 + (1|g) and y ~ x1 + x2 + (1|g) |
| ICC | 0.1, 0.2, 0.5 |
| Clusters | 10, 20, 50 |
| Total N | 500, 1000 |
| Effect sizes | Small (β = 0.3), Medium (β = 0.5), Mixed two-predictor |
Example name: intercept_icc0.2_cl20_n1000_medium
| Parameter | Values |
|---|---|
| Formula | y ~ x1 + (1 + x1|g) |
| ICC | 0.2 |
| Clusters × N | 20×1000, 30×1500, 20×2000, 50×2000 |
| Slope variance | 0.1 |
| Slope-intercept correlation | 0.3 |
| Effect size | β = 0.5 |
| Parameter | Values |
|---|---|
| Formula | y ~ x1 + (1|school/classroom) |
| ICCschool | 0.15 |
| ICCclassroom | 0.10 |
| Schools × classrooms/school | 10×3, 15×4, 20×3 |
| Total N | 1500, 2400, 1800 |
| Effect size | β = 0.5 |
Sensitivity scenarios systematically sweep effect sizes, sample sizes, cluster counts, and ICCs to cover the full power spectrum. They use only Strategy 4 (statistical z-test + FDR) because S1/S2/S3's fixed thresholds produce false failures at borderline power levels.
| Category | Count | Power Range | Key Variations |
|---|---|---|---|
| Intercept sweep | 24 | ~5%–90% | N: 50–500, K: 5–20, β: 0.05–0.30, ICC: 0.2–0.4 |
| Two-predictor | 8 | ~10%–75% | N: 100–400, K: 10–15, two effect sizes varied |
| Random slopes | 10 | ~5%–80% | N: 150–500, K: 10–20, β: 0.10–0.25 |
| Nested effects | 8 | ~5%–70% | N: 100–600, K: 5–10, npp: 2–5 |
Each scenario gets a PASS or FAIL verdict based on the following thresholds:
| Metric | Threshold | Strategies | Rationale |
|---|---|---|---|
| Significance agreement | ≥ 95% | S1, S2 | Allows up to 5% disagreement due to borderline cases near α = 0.05 |
| β estimate correlation | r ≥ 0.98 | S1 | Fixed-effect estimates should be nearly identical |
| SE estimate correlation | r ≥ 0.95 | S1 | Relaxed slightly — SEs are more sensitive to optimizer differences |
| τ² estimate correlation | r ≥ 0.95 | S1 | Random-effect variance estimates should correlate highly |
| Power difference | ≤ 5 pp | S3 | Accounts for Monte Carlo sampling variability (~2–3 pp at 500 sims) |
| Adjusted p-value (BH FDR) | > 0.05 | S4 | Two-proportion z-test with FDR correction — cannot reject equal power |
| Type I error rate | 3%–7% | S1 | Null datasets should reject at approximately α = 5% |
For R users who want to understand exactly what lme4 configuration is being compared against.
| Package | Purpose |
|---|---|
| `lme4` | Core mixed-effects model fitting via `lmer()` |
| `lmerTest` | Extends the `lmer` summary to include Satterthwaite p-values for fixed effects |
| `jsonlite` | JSON serialization for R↔Python data exchange |
Optimizer settings and convergence handling:

- Optimizer: `bobyqa` (bound optimization by quadratic approximation)
- Function evaluation budget: `optCtrl = list(maxfun = 50000)`
- Failed fits are caught with `tryCatch()`, marked as unconverged, and excluded from comparisons

```r
# Null model: intercept + random effects only
null_fit <- lmer(y ~ 1 + (1 | g), data = d, REML = FALSE)
full_ml <- update(fit, REML = FALSE)
lr_test <- anova(null_fit, full_ml)  # chi-squared test with df = number of fixed effects
```
The null model retains the random-effects structure but drops all fixed-effect predictors. Both models are refit with ML (not REML) because REML likelihoods are not comparable across models with different fixed effects.
Via lmerTest::summary() which adds Satterthwaite-approximated
degrees of freedom and p-values to the standard lmer coefficient table.
The Wald statistic is t = β̂ / SE(β̂) with Satterthwaite df.
R scripts are invoked via subprocess.run(["Rscript", "--vanilla", script_path, ...])
with a 600-second timeout. Data passes as CSV files; results return as JSON on stdout.
Stderr is reserved for R warnings/messages and discarded unless an error occurs.
A smart caching layer hashes the scenario parameters + R script contents (SHA-256) so unchanged scenarios skip re-running R, which is critical since a full validation run with all strategies can take 30+ minutes.
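The cache key can be sketched as follows (a simplified stand-in; the real implementation may differ in exactly what it hashes):

```python
import hashlib
import json

def cache_key(scenario_params: dict, r_script_source: bytes) -> str:
    """SHA-256 over canonicalized scenario parameters + R script source.

    Editing the R script changes the digest, so stale cache entries
    are invalidated automatically.
    """
    payload = json.dumps(scenario_params, sort_keys=True).encode()
    return hashlib.sha256(payload + r_script_source).hexdigest()

params = {"formula": "y ~ x1 + (1|g)", "icc": 0.2, "n": 1000}
src_v1 = b'fit <- lmer(y ~ x1 + (1 | g), data = d)\n'
src_v2 = src_v1 + b'# tweaked\n'
k1 = cache_key(params, src_v1)
k2 = cache_key(params, src_v2)
print(k1 != k2)   # editing the script invalidates the cached entry
```

Sorting the JSON keys makes the digest independent of dict insertion order, so logically identical scenarios always hit the same cache entry.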
Prerequisites:

- MCPower installed in editable mode (`pip install -e MCPower/`)
- R packages: `lme4`, `lmerTest`, `jsonlite`
- `Rscript` available on `$PATH`

```bash
# Run all strategies on all scenarios (full validation)
python LME4-validation/run_validation.py

# Run only Strategy 1 (solver comparison)
python LME4-validation/run_validation.py --strategy 1

# Run specific scenarios by name prefix
python LME4-validation/run_validation.py --scenarios intercept_icc0.2

# Run with more datasets (default: 100) or simulations (default: 500)
python LME4-validation/run_validation.py --n-datasets 200 --n-sims 1000

# Clear cached R results and re-run from scratch
python LME4-validation/run_validation.py --clear-cache

# Disable caching entirely
python LME4-validation/run_validation.py --no-cache

# Set custom random seed (default: 42)
python LME4-validation/run_validation.py --seed 123
```
```
Running validation: 27 scenarios, strategies [1, 2, 3, 4]
n_datasets=100, n_sims_s3=500, seed=42
============================================================
Strategy 1: External Data Agreement
============================================================
[S1] intercept_icc0.2_cl20_n1000_small  ... PASS (agree=100.0%, 12.4s)
[S1] intercept_icc0.2_cl20_n1000_medium ... PASS (agree=100.0%, 11.8s)
...
============================================================
Strategy 3: Parallel Power Simulation
============================================================
[S3] intercept_icc0.2_cl20_n1000_medium ... PASS (MC=0.847, R=0.832, 45.2s)
...
============================================================
Strategy 4: Statistical Power Comparison (Z-test + FDR)
============================================================
[S4] intercept_icc0.2_cl20_n1000_medium: PASS (z=0.672, adj_p=0.8921)
...
============================================================
DONE: 27/27 scenarios passed (892.3s total)
Report: LME4-validation/reports/validation_report_20260222_143201.html
============================================================
```
After each run, a self-contained HTML report is generated in
LME4-validation/reports/.
The report includes:
Reports are fully self-contained (inline CSS, no external dependencies) and can be shared directly. Each cell is color-coded green (pass) or red (fail) for quick scanning.
Re-running with the same --seed produces identical results (assuming the R and MCPower versions haven't changed).
The caching system's SHA-256 keys incorporate the R script source, so updating an R script
automatically invalidates the relevant cache entries.