MCPower LME Validation Process

How we validate MCPower's custom C++ mixed-effects solver against R's lme4 package across 75+ scenarios

lme4 / lmerTest · Monte Carlo Simulation · 4 Independent Strategies

1. What is MCPower?

MCPower is a Monte Carlo power analysis library for complex statistical models. It estimates the probability (power) that a study will detect a true effect at a given sample size, using simulation rather than closed-form formulas.

Its LME (Linear Mixed-Effects) module handles clustered/hierarchical data — students nested in schools, patients nested in hospitals, repeated measures within subjects. Instead of relying on R or Python's statsmodels, MCPower uses a custom C++ solver based on the profiled-deviance approach from Bates et al. (2015), achieving roughly 200× speedup over pure Python.

Three solver paths
All implemented in C++ via pybind11 + Eigen
Model Type | R Formula | Solver
Random intercept | y ~ x + (1|group) | Brent's 1D optimization, block-diagonal structure
Random slopes | y ~ x + (1 + x|group) | L-BFGS-B with G-matrix parameterization
Nested effects | y ~ x + (1|school/class) | Two-level Woodbury decomposition

Statistical inference uses Likelihood Ratio (LR) tests for overall significance and Wald z-tests (with REML) for individual fixed effects — the same testing framework that lmerTest provides in R.

2. Why validate against lme4?

R's lme4 package (Bates, Mächler, Bolker & Walker, 2015) is the gold standard for fitting linear mixed-effects models. It is used in tens of thousands of published studies across psychology, education, ecology, and medicine. By validating MCPower's solver against lme4, we establish that our custom implementation produces statistically equivalent results.

What we're testing: fixed-effect estimates (β), standard errors, significance decisions, variance components (τ², σ²), and power curves. We do not expect bit-identical results — different optimizers may converge to slightly different parameter values — but the conclusions should match.

3. Validation framework overview

The framework uses four independent, complementary strategies that together test the solver, the data-generating process (DGP), and end-to-end power estimation:

Strategy 1 — External Data Agreement

Generate data externally with NumPy (independent of both MCPower and R). Feed the same datasets to both solvers. Compare significance decisions.

Tests: solver correctness in isolation

Strategy 2 — MCPower DGP Validation

Generate data via MCPower's own pipeline. Fit with both MCPower and lme4. Compare significance decisions.

Tests: MCPower's data-generating process + solver together

Strategy 3 — Parallel Power Simulation

Both MCPower and R independently generate data and compute power (proportion of significant results across N simulations). Compare power estimates.

Tests: end-to-end power accuracy (DGP + solver + aggregation)

Strategy 4 — Statistical Power Comparison (Z-test + FDR)

Runs independently — reuses S3 power estimates when available, but runs its own parallel power simulations for scenarios not covered by S3. Applies a two-proportion z-test to each scenario, then Benjamini-Hochberg FDR correction across all scenarios. Replaces the fixed 5pp threshold with a proper statistical test.

Tests: whether power differences are statistically significant (not just sampling noise)

Note: Sensitivity scenarios (~50) use only S4 — S1/S2/S3 are reserved for core scenarios.

[Diagram: Core Validation Architecture (Strategies 1–3). Strategy 1: NumPy generates data → MCPower and lme4 (R) both fit it → compare significance and parameters (same data → same answer?). Strategy 2: MCPower generates data → MCPower and lme4 (R) both fit it → compare significance (DGP pipeline correct?). Strategy 3: MCPower and R each run DGP + fitting independently → Power_MC vs Power_R, pass if |diff| ≤ 5 pp (end-to-end power match?).]

4. Strategy 1 — External Data Agreement

This strategy isolates the solver from everything else. Data is generated by a third party (NumPy) so neither MCPower nor R influences the data-generating process. Both solvers receive identical datasets and we compare their conclusions.

1. Generate datasets with NumPy

100 datasets per scenario: 50 with true effect (β ≠ 0) and 50 null (β = 0). Predictors are i.i.d. N(0, 1). Random intercepts drawn from N(0, τ²). All written to a single CSV file with a dataset_id column.

2. Fit with MCPower's C++ solver

Parse CSV, group by dataset_id, call _lme_analysis_wrapper() for each dataset. Extract: converged, significant (LR test), individual significance (Wald).

3. Fit with R lme4 (via subprocess)

Call Rscript fit_lme4.R with the same CSV. R fits lmer() with REML (via lmerTest for p-values), then performs an LR test (full vs. intercept-only null, ML refit). Results returned as JSON on stdout.

4. Compare results

  • Significance agreement — proportion of datasets where both solvers agree on reject/fail-to-reject. Threshold: ≥ 95%.
  • Cohen's κ — agreement corrected for chance, from the 2×2 table of decisions.
  • Parameter recovery — Pearson correlations of the β, SE, τ², and σ² estimates.
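The two agreement metrics are easy to compute from the paired decision vectors. A minimal Python sketch (the function name and inputs are illustrative, not MCPower's actual API):

```python
import numpy as np

def agreement_and_kappa(sig_mcpower, sig_lme4):
    """Raw significance agreement and Cohen's kappa from two boolean
    decision vectors (one entry per dataset)."""
    a = np.asarray(sig_mcpower, dtype=bool)
    b = np.asarray(sig_lme4, dtype=bool)
    p_obs = float(np.mean(a == b))              # observed agreement
    # Chance agreement from the marginals of the 2x2 table
    p_chance = a.mean() * b.mean() + (1 - a.mean()) * (1 - b.mean())
    if p_chance == 1.0:                         # degenerate marginals
        return p_obs, 1.0
    kappa = (p_obs - p_chance) / (1 - p_chance)
    return p_obs, float(kappa)

# Toy example: 10 datasets, one disagreement
mc = [True] * 5 + [False] * 5
r  = [True] * 5 + [False] * 4 + [True]
agree, kappa = agreement_and_kappa(mc, r)   # agree = 0.9, kappa = 0.8
```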

Data-Generating Process (DGP) — mathematical detail

Random Intercept Model

y_ij = X_ij β + b_j + ε_ij   where   b_j ~ N(0, τ²),   ε_ij ~ N(0, 1)

Variance decomposition follows from the ICC definition. Since predictors are standardized N(0, 1), the variance explained by the fixed effects equals Σ β_k². The within-group variance is:

σ²_within = 1 + Σ β_k²   ⇒   τ² = ICC / (1 − ICC) × σ²_within
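A quick numeric check of this decomposition, using hypothetical values (ICC = 0.2, a single standardized predictor with β = 0.5):

```python
icc, betas = 0.2, [0.5]

sigma2_fixed  = sum(b ** 2 for b in betas)   # variance explained by fixed effects
sigma2_within = 1.0 + sigma2_fixed           # residual variance (1) + fixed part
tau2 = icc / (1 - icc) * sigma2_within       # between-cluster variance: 0.3125

# Plugging tau2 back into the ICC definition recovers the target ICC
icc_check = tau2 / (tau2 + sigma2_within)    # 0.2
```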

Random Slopes Model

y_ij = X_ij β + b_0j + b_1j x_1ij + ε_ij

where the random-effects vector [b_0j, b_1j] follows:

[b_0j, b_1j]ᵀ ~ N(0, G)   with   G = [ τ²_int, ρ τ_int τ_slope ;  ρ τ_int τ_slope, τ²_slope ]
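Sampling correlated intercept and slope deviations from G is a two-line job in NumPy. A sketch with hypothetical parameter values (τ_int = 0.5, τ_slope = 0.3, ρ = 0.3):

```python
import numpy as np

tau_int, tau_slope, rho = 0.5, 0.3, 0.3      # hypothetical SDs and correlation
n_clusters = 20

# G exactly as defined above: variances on the diagonal,
# covariance rho * tau_int * tau_slope off the diagonal
cov = rho * tau_int * tau_slope
G = np.array([[tau_int ** 2, cov],
              [cov, tau_slope ** 2]])

rng = np.random.default_rng(42)
b = rng.multivariate_normal([0.0, 0.0], G, size=n_clusters)
b0, b1 = b[:, 0], b[:, 1]                    # per-cluster intercept/slope deviations
```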

Nested Model

y_ijk = X_ijk β + b_j^school + b_jk^classroom + ε_ijk

Two independent random effects: school-level (ICC_school) and classroom-within-school (ICC_classroom), each with its own variance component derived from the respective ICC.

R code used for fitting (fit_lme4.R)
# Fit REML model (lmerTest provides Satterthwaite p-values)
fit <- lmer(as.formula(formula_str), data = d, REML = TRUE,
            control = lmerControl(optimizer = "bobyqa",
                                    optCtrl = list(maxfun = 50000)))

# Extract fixed effects (skip intercept)
coefs <- summary(fit)$coefficients
beta    <- coefs[-1, "Estimate"]
se      <- coefs[-1, "Std. Error"]
pvalues <- coefs[-1, "Pr(>|t|)"]

# Variance components
vc    <- as.data.frame(VarCorr(fit))
tau2  <- vc$vcov[vc$grp == "g"]
sigma2 <- sigma(fit)^2

# Likelihood Ratio Test: full vs null (intercept-only + RE)
null_fit <- lmer(y ~ 1 + (1 | g), data = d, REML = FALSE,
                 control = lmerControl(optimizer = "bobyqa",
                                         optCtrl = list(maxfun = 50000)))
full_ml <- update(fit, REML = FALSE)
lr_test <- anova(null_fit, full_ml)
lr_pvalue <- lr_test$`Pr(>Chisq)`[2]
significant <- !is.na(lr_pvalue) && lr_pvalue < 0.05

Key choices: REML for parameter estimation (better small-sample properties), ML refit for the LR test (required for comparing models with different fixed effects), the bobyqa optimizer (robust for mixed models), and a cap of 50,000 function evaluations.

5. Strategy 2 — MCPower DGP Validation

Strategy 1 tests the solver in isolation using third-party data. Strategy 2 adds MCPower's data-generating process to the test. Data is generated through MCPower's full pipeline (the same code path used during power analysis), then both MCPower and R fit models on these datasets.

1. MCPower generates datasets

Uses MCPower's internal backend: predictor generation, factor handling, random effect sampling, and response generation. Writes 100 datasets to CSV in the same stacked format.

2. Both solvers fit the same data

MCPower already fitted each dataset during generation. R fits the same datasets via lme4.

3. Compare significance decisions

Same agreement metric as Strategy 1. If Strategy 1 passes but Strategy 2 fails, it points to a problem in MCPower's data generation, not the solver.

Diagnostic power: By comparing Strategy 1 and Strategy 2 results, you can isolate whether a disagreement comes from the solver (would show up in both) or the DGP (would only show up in Strategy 2).

6. Strategy 3 — Parallel Power Simulation

The ultimate test: do MCPower and R produce the same power estimates when each generates data and fits models completely independently?

Both systems use the same statistical model specification (ICC, effect sizes, sample size, number of clusters) but use independent RNG streams and independent DGP implementations.

1. MCPower: generate & fit 500 simulations

Calls model.find_power(sample_size, n_sims=500) internally. Returns power as proportion of significant results.

2. R: generate & fit 500 simulations

simulate_power_lme4.R implements the same DGP model in R: generates data, fits lmer(), counts rejections. Uses a different seed (seed + 1,000,000) than MCPower.

3. Compare power estimates

Compute |Power_MCPower − Power_R|. Pass if the difference is ≤ 5 percentage points (accounting for Monte Carlo sampling noise).
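The 5 pp tolerance follows from the Monte Carlo standard error of a proportion. A back-of-envelope check in Python (n_sims = 500, worst case p = 0.5):

```python
from math import sqrt

n_sims = 500
se_power = sqrt(0.5 * 0.5 / n_sims)   # SE of one power estimate: ~0.0224 (2.2 pp)
se_diff  = sqrt(2) * se_power         # SE of the difference between two
                                      # independent estimates: ~0.0316 (3.2 pp)
```

So the 5 pp threshold sits at roughly 1.6 standard errors of the difference: loose enough to absorb typical sampling noise, tight enough to catch systematic discrepancies.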

R power simulation code (simulate_power_lme4.R)
# Compute variance components from ICC (same formula as MCPower)
sigma2_fixed  <- sum(effect_values^2)
sigma2_within <- 1.0 + sigma2_fixed
tau2 <- icc / (1 - icc) * sigma2_within

n_significant <- 0
n_converged   <- 0

for (sim in 1:n_sims) {
  # Cluster assignments and predictors ~ N(0, 1)
  g  <- rep(1:n_clusters, each = cluster_size)
  x1 <- rnorm(n_obs)

  # Random intercepts ~ N(0, tau2)
  b <- rnorm(n_clusters, 0, sqrt(tau2))

  # Response: y = X*beta + b[group] + error
  d <- data.frame(g = g, x1 = x1)
  d$y <- beta * d$x1 + b[d$g] + rnorm(n_obs)

  # Fit & test via LRT (ML fits: the models differ in fixed effects)
  fit  <- lmer(y ~ x1 + (1 | g), data = d, REML = FALSE)
  null <- lmer(y ~  1 + (1 | g), data = d, REML = FALSE)
  n_converged <- n_converged + 1

  lr_p <- anova(null, fit)$`Pr(>Chisq)`[2]
  if (!is.na(lr_p) && lr_p < 0.05) n_significant <- n_significant + 1
}
power <- n_significant / n_converged

7. Strategy 4 — Statistical Power Comparison (Z-test + FDR)

Strategy 3 compares MCPower and R power estimates using a fixed threshold (|diff| ≤ 5 percentage points). With 500 Monte Carlo simulations, sampling noise alone can cause ~2–3pp deviations, leading to occasional false failures in sensitivity scenarios near the threshold boundary. Strategy 4 replaces this fixed threshold with a proper statistical test.

Independence: Strategy 4 operates independently of S1/S2/S3. It reuses S3 power estimates when available, but runs its own parallel power simulations for scenarios not covered by S3. This makes S4 the sole strategy for sensitivity scenarios (~50 scenarios spanning 5%–90% power), where S1/S2/S3's fixed thresholds are too rigid for borderline-power conditions.

Two-Proportion Z-test

For each scenario, we test H0: Power_MCPower = Power_R using the standard two-proportion z-test:

z = (p̂_MC − p̂_R) / SE_diff

where the pooled standard error under H0 is:

SE_diff = √( p̂_pool (1 − p̂_pool) (1/n_MC + 1/n_R) )

and p̂_pool = (p̂_MC · n_MC + p̂_R · n_R) / (n_MC + n_R) is the pooled proportion. The two-sided p-value is computed via the complementary error function: p = erfc(|z| / √2).

Edge case: When both tools agree on extreme power (0% or 100%), the pooled SE is zero. In this case z = 0, p = 1.0, and the scenario trivially passes.
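The test above needs only the standard library. A minimal sketch, with the edge case handled (the function name is illustrative, not the framework's actual API):

```python
from math import erfc, sqrt

def two_proportion_ztest(p_mc, n_mc, p_r, n_r):
    """Two-sided two-proportion z-test of H0: equal power. Returns (z, p)."""
    p_pool = (p_mc * n_mc + p_r * n_r) / (n_mc + n_r)
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_mc + 1 / n_r))
    if se == 0.0:                 # both at 0% or 100% power: trivially equal
        return 0.0, 1.0
    z = (p_mc - p_r) / se
    return z, erfc(abs(z) / sqrt(2))

# Power estimates of 0.847 vs 0.832 from 500 sims each: well within noise
z, p = two_proportion_ztest(0.847, 500, 0.832, 500)
```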

Benjamini-Hochberg FDR Correction

With ~50 scenarios tested simultaneously, multiple testing inflation would produce false positives. The BH procedure controls the False Discovery Rate at level α = 0.05:

1. Rank p-values

Sort all m p-values in ascending order: p(1) ≤ p(2) ≤ … ≤ p(m)

2. Compute adjusted p-values

For each rank k: p_adj(k) = min( p(k) · m / k,   1.0 )

3. Enforce monotonicity

Working from the largest rank downward, ensure p_adj(k) ≤ p_adj(k+1)

4. Decision

PASS if adjusted p-value > 0.05 (cannot reject H0: powers are equal)
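The procedure fits in a short function. A sketch (function name illustrative):

```python
def bh_adjust(pvalues):
    """Benjamini-Hochberg adjusted p-values, following steps 1-3 above."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])    # step 1: rank
    adj = [0.0] * m
    running_min = 1.0
    # steps 2-3: walk from the largest rank down, enforcing monotonicity
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adj[i] = min(running_min, 1.0)
    return adj

adj = bh_adjust([0.01, 0.04, 0.20, 0.50])
# adjusted: [0.04, 0.08, 0.2667, 0.50]; the first scenario would
# still fail under step 4 (adjusted p <= 0.05)
```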

Interpretation

Result | Meaning
PASS (adj. p > 0.05) | Cannot reject that MCPower and R produce equal power; the observed difference is consistent with Monte Carlo sampling noise.
FAIL (adj. p ≤ 0.05) | The power difference is statistically significant even after FDR correction, suggesting a systematic discrepancy between the two implementations.

Relationship to Strategy 3: For core scenarios that run both S3 and S4, the two are complementary: S3 flags large absolute differences (> 5pp) regardless of significance, while S4 tests whether any difference is statistically significant. For sensitivity scenarios, only S4 runs — the fixed 5pp threshold of S3 produces too many false failures at borderline power levels.

8. Scenario Matrix

The validation uses two scenario categories: ~27 core scenarios (tested with all four strategies S1–S4) and ~50 sensitivity scenarios (tested with S4 only). Core scenarios verify solver correctness with well-powered designs. Sensitivity scenarios span the full 5%–90% power range to validate power curve accuracy.

Random Intercept Scenarios (~20)

Parameter | Values
Formula | y ~ x1 + (1|g)   and   y ~ x1 + x2 + (1|g)
ICC | 0.1, 0.2, 0.5
Clusters | 10, 20, 50
Total N | 500, 1000
Effect sizes | Small (β = 0.3), Medium (β = 0.5), Mixed two-predictor

Example name: intercept_icc0.2_cl20_n1000_medium

Random Slopes Scenarios (~4)

Parameter | Values
Formula | y ~ x1 + (1 + x1|g)
ICC | 0.2
Clusters × N | 20×1000, 30×1500, 20×2000, 50×2000
Slope variance | 0.1
Slope-intercept correlation | 0.3
Effect size | β = 0.5

Nested Random Effects Scenarios (~3)

Parameter | Values
Formula | y ~ x1 + (1|school/classroom)
ICC_school | 0.15
ICC_classroom | 0.10
Schools × classrooms/school | 10×3, 15×4, 20×3
Total N | 1500, 2400, 1800
Effect size | β = 0.5

Sensitivity Scenarios (~50, S4 only)

Sensitivity scenarios systematically sweep effect sizes, sample sizes, cluster counts, and ICCs to cover the full power spectrum. They use only Strategy 4 (statistical z-test + FDR) because S1/S2/S3's fixed thresholds produce false failures at borderline power levels.

Category | Count | Power Range | Key Variations
Intercept sweep | 24 | ~5%–90% | N: 50–500, K: 5–20, β: 0.05–0.30, ICC: 0.2–0.4
Two-predictor | 8 | ~10%–75% | N: 100–400, K: 10–15, two effect sizes varied
Random slopes | 10 | ~5%–80% | N: 150–500, K: 10–20, β: 0.10–0.25
Nested effects | 8 | ~5%–70% | N: 100–600, K: 5–10, npp: 2–5

Design rationale: The core scenario matrix covers conditions researchers commonly encounter — from small cluster counts (k=10, where the design effect is most impactful) to large ones (k=50), from low ICC (0.1, typical in multisite trials) to high ICC (0.5, common in educational data), and from small to medium effect sizes. The sensitivity scenarios extend validation to the full power spectrum, ensuring MCPower's power estimates are accurate even at borderline power levels where sampling noise is highest.

9. Pass/Fail Thresholds

Each scenario gets a PASS or FAIL verdict based on the following thresholds:

Metric | Threshold | Strategies | Rationale
Significance agreement | ≥ 95% | S1, S2 | Allows up to 5% disagreement due to borderline cases near α = 0.05
β estimate correlation | r ≥ 0.98 | S1 | Fixed-effect estimates should be nearly identical
SE estimate correlation | r ≥ 0.95 | S1 | Relaxed slightly — SEs are more sensitive to optimizer differences
τ² estimate correlation | r ≥ 0.95 | S1 | Random-effect variance estimates should correlate highly
Power difference | ≤ 5 pp | S3 | Accounts for Monte Carlo sampling variability (~2–3 pp at 500 sims)
Adjusted p-value (BH FDR) | > 0.05 | S4 | Two-proportion z-test with FDR correction — cannot reject equal power
Type I error rate | 3%–7% | S1 | Null datasets should reject at approximately α = 5%

10. R Implementation Details

For R users who want to understand exactly what lme4 configuration is being compared against.

R Packages

Package | Purpose
lme4 | Core mixed-effects model fitting via lmer()
lmerTest | Extends the lmer summary with Satterthwaite p-values for fixed effects
jsonlite | JSON serialization for R↔Python data exchange

Model Fitting Configuration

  • Estimation: REML = TRUE for parameter estimates; ML refits for likelihood-ratio tests
  • Optimizer: bobyqa with optCtrl = list(maxfun = 50000)
  • Significance level: α = 0.05

Statistical Tests

Overall significance: Likelihood Ratio (LR) test
Tests whether any fixed effect is significantly different from zero
# Null model: intercept + random effects only
null_fit <- lmer(y ~ 1 + (1 | g), data = d, REML = FALSE)
full_ml  <- update(fit, REML = FALSE)
lr_test  <- anova(null_fit, full_ml)
# Chi-squared test with df = number of fixed effects

The null model retains the random-effects structure but drops all fixed-effect predictors. Both models are refit with ML (not REML) because REML likelihoods are not comparable across models with different fixed effects.

Individual effects: Wald tests
Tests each fixed-effect coefficient individually

Via the summary() method that lmerTest provides, which adds Satterthwaite-approximated degrees of freedom and p-values to the standard lmer coefficient table. The Wald statistic is t = β̂ / SE(β̂) with Satterthwaite df.

R↔Python Communication

R scripts are invoked via subprocess.run(["Rscript", "--vanilla", script_path, ...]) with a 600-second timeout. Data passes as CSV files; results return as JSON on stdout. Stderr is reserved for R warnings/messages and discarded unless an error occurs.
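The invocation pattern described above can be sketched in Python (helper name and error handling are illustrative, not the framework's actual code):

```python
import json
import subprocess

def run_r_script(script_path, args, timeout=600):
    """Run an R script via Rscript --vanilla and parse JSON from its stdout."""
    proc = subprocess.run(
        ["Rscript", "--vanilla", script_path, *args],
        capture_output=True, text=True, timeout=timeout,
    )
    if proc.returncode != 0:
        # stderr carries R warnings/messages; surface it only on failure
        raise RuntimeError(f"Rscript failed ({proc.returncode}):\n{proc.stderr}")
    return json.loads(proc.stdout)
```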

A smart caching layer hashes the scenario parameters + R script contents (SHA-256) so unchanged scenarios skip re-running R, which is critical since a full validation run with all strategies can take 30+ minutes.
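Such a cache key might look like the following sketch (function name illustrative; the real implementation may hash additional fields):

```python
import hashlib
import json

def cache_key(scenario_params, r_script_path):
    """SHA-256 digest over scenario parameters + R script source, so
    editing either one invalidates the cached result."""
    h = hashlib.sha256()
    # Canonical JSON so dict key ordering cannot change the digest
    h.update(json.dumps(scenario_params, sort_keys=True).encode())
    with open(r_script_path, "rb") as f:
        h.update(f.read())
    return h.hexdigest()
```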

11. Running the Validation

Prerequisites

  • R with the lme4, lmerTest, and jsonlite packages installed (Rscript must be on the PATH)
  • Python with MCPower installed

Commands

# Run all strategies on all scenarios (full validation)
python LME4-validation/run_validation.py

# Run only Strategy 1 (solver comparison)
python LME4-validation/run_validation.py --strategy 1

# Run specific scenarios by name prefix
python LME4-validation/run_validation.py --scenarios intercept_icc0.2

# Run with more datasets (default: 100) or simulations (default: 500)
python LME4-validation/run_validation.py --n-datasets 200 --n-sims 1000

# Clear cached R results and re-run from scratch
python LME4-validation/run_validation.py --clear-cache

# Disable caching entirely
python LME4-validation/run_validation.py --no-cache

# Set custom random seed (default: 42)
python LME4-validation/run_validation.py --seed 123

Example output

Running validation: 27 scenarios, strategies [1, 2, 3, 4]
  n_datasets=100, n_sims_s3=500, seed=42

============================================================
Strategy 1: External Data Agreement
============================================================
  [S1] intercept_icc0.2_cl20_n1000_small ... PASS (agree=100.0%, 12.4s)
  [S1] intercept_icc0.2_cl20_n1000_medium ... PASS (agree=100.0%, 11.8s)
  ...

============================================================
Strategy 3: Parallel Power Simulation
============================================================
  [S3] intercept_icc0.2_cl20_n1000_medium ... PASS (MC=0.847, R=0.832, 45.2s)
  ...

============================================================
Strategy 4: Statistical Power Comparison (Z-test + FDR)
============================================================
  [S4] intercept_icc0.2_cl20_n1000_medium: PASS (z=0.672, adj_p=0.8921)
  ...

============================================================
DONE: 27/27 scenarios passed (892.3s total)
Report: LME4-validation/reports/validation_report_20260222_143201.html
============================================================

12. Reading the Reports

After each run, a self-contained HTML report is generated in LME4-validation/reports/. The report includes:

  • Summary banner — overall PASS/FAIL status with scenario counts per strategy
  • Strategy 1 table — agreement rate, Cohen's κ, number of datasets compared, color-coded result per scenario
  • Strategy 2 table — agreement rate and sample count per scenario
  • Strategy 3 table — MCPower power, R power, absolute difference per scenario
  • Strategy 4 table + scatter plot — z-scores, raw p-values, BH-adjusted p-values, and pass/fail for each scenario; summary line showing number of rejections; inline SVG scatter plot of MCPower vs R power (points on the diagonal = perfect agreement)

Reports are fully self-contained (inline CSS, no external dependencies) and can be shared directly. Each cell is color-coded green (pass) or red (fail) for quick scanning.

Reproducibility: Reports include the timestamp. Re-running with the same --seed produces identical results (assuming R and MCPower versions haven't changed). The caching system's SHA-256 keys incorporate the R script source, so updating an R script automatically invalidates the relevant cache entries.