How It Is Validated

MCPower’s statistical engine is validated through two complementary systems: an internal test suite covering OLS and mixed-effects accuracy, and an external cross-validation framework comparing MCPower’s LME solver against R’s lme4 package.

Internal Test Suite

MCPower includes ~11,000 lines of tests organized into specs (accuracy/validation), integration, unit, and mixed-model tests. The specs tests are the core statistical validation.

Power Accuracy Tests

Monte Carlo power estimates are compared against exact analytical power from non-central t and F distributions.

OLS models (test_power_accuracy.py):

  • Single predictor: 5 parametrized cases varying β and N

  • Two uncorrelated predictors: 3 cases with Σ = I

  • Two correlated predictors: 4 cases with VIF correction (ρ = 0.3, 0.5, 0.7)

Acceptance criterion: MC estimate within 3.5 × √[p(1−p)/5000] × 100 + 1pp of the analytical value. This is a Bonferroni-safe margin (~2-3 percentage points at typical power levels) using 5,000 simulations.
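The acceptance criterion can be written out as a small helper. This is a minimal sketch of the formula above, not MCPower's test code; the function names `mc_margin_pp` and `within_margin` are hypothetical.

```python
import math

def mc_margin_pp(p, n_sims=5000, z=3.5, bias_pp=1.0):
    """Acceptance margin in percentage points around an analytical power p (0..1)."""
    se = math.sqrt(p * (1.0 - p) / n_sims)  # binomial standard error of the MC estimate
    return z * se * 100.0 + bias_pp

def within_margin(mc_power_pct, analytical_power, n_sims=5000):
    """True if the MC estimate (in %) lies within the margin of the analytical power (0..1)."""
    margin = mc_margin_pp(analytical_power, n_sims)
    return abs(mc_power_pct - analytical_power * 100.0) <= margin

print(round(mc_margin_pp(0.80), 2))  # 2.98
```

At 80% analytical power the margin is about 3 percentage points, so an MC estimate of 81.5% would pass and 84% would fail.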

LME models (test_power_accuracy_lme.py):

  • Single predictor z-test: 7 parametrized cases

  • Single predictor likelihood-ratio test: 3 cases

  • Two uncorrelated predictors: 2 cases

  • Two correlated predictors: 2 cases (ρ = 0.3, 0.5)

All LME accuracy tests use m = 50 observations per cluster, where the within-cluster design effect is small (~1.02–1.06), allowing comparison against analytical formulas.

Type I Error Control

Under the null hypothesis (all effects = 0), the rejection rate must match the nominal α to within Monte Carlo error.

OLS (test_type1_error.py):

  • Single predictor null (F-test and t-test)

  • Two predictors null (each rejects at ~α)

  • Large-sample null (catches bugs where power inflates with N)

  • Alpha calibration at α ∈ {0.01, 0.05, 0.10}

LME (test_type1_error_lme.py):

  • Same structure with K = 20–50 clusters, ICC = 0.2

  • Alpha calibration across standard levels

Criterion: |observed rejection rate (in %) − α × 100| < MC margin
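The idea behind a null-calibration check can be illustrated with a toy simulation. The sketch below uses a one-sample z-test with known variance rather than MCPower's OLS or LME models, and it assumes the MC margin has the same form as the power-accuracy margin (3.5 × SE + 1pp); both choices are assumptions made for illustration.

```python
import math
import random
from statistics import NormalDist

def null_rejection_rate_pct(alpha=0.05, n=30, n_sims=5000, seed=2137):
    """Rejection rate (%) of a two-sided z-test when the true mean is 0."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    rejections = 0
    for _ in range(n_sims):
        # Data generated under H0: true mean 0, known sigma = 1
        mean = sum(rng.gauss(0.0, 1.0) for _ in range(n)) / n
        z = mean * math.sqrt(n)
        if abs(z) > z_crit:
            rejections += 1
    return 100.0 * rejections / n_sims

def type1_margin_pp(alpha, n_sims=5000, z=3.5, bias_pp=1.0):
    """Assumed margin: same form as the power-accuracy margin, evaluated at p = alpha."""
    return z * math.sqrt(alpha * (1.0 - alpha) / n_sims) * 100.0 + bias_pp

rate = null_rejection_rate_pct()
print(abs(rate - 5.0) < type1_margin_pp(0.05))
```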

Monotonicity Tests

Power must strictly increase with:

  • Effect size (larger β → more power)

  • Sample size (larger N → more power)

  • Significance level (larger α → more power)

Tested for both OLS (test_monotonicity.py) and LME (test_monotonicity_lme.py) models. These tests catch subtle implementation bugs that wouldn’t violate accuracy bounds but would produce nonsensical results.
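A monotonicity check reduces to comparing consecutive power values along one swept parameter. The sketch below uses a normal-approximation power function as a stand-in for a real power estimate; `approx_power` and `strictly_increasing` are hypothetical names, not part of MCPower.

```python
from statistics import NormalDist

def approx_power(beta, n, alpha=0.05, sigma=1.0):
    """Two-sided z-approximation to the power of a single-predictor slope test."""
    nd = NormalDist()
    ncp = beta * n ** 0.5 / sigma
    z_crit = nd.inv_cdf(1.0 - alpha / 2.0)
    return nd.cdf(ncp - z_crit) + nd.cdf(-ncp - z_crit)

def strictly_increasing(values):
    """True if every value is larger than the one before it."""
    return all(a < b for a, b in zip(values, values[1:]))

by_effect = [approx_power(b, 100) for b in (0.1, 0.2, 0.3)]
by_n = [approx_power(0.2, n) for n in (50, 100, 200)]
by_alpha = [approx_power(0.2, 100, alpha=a) for a in (0.01, 0.05, 0.10)]
print(all(strictly_increasing(v) for v in (by_effect, by_n, by_alpha)))  # True
```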

Multiple Comparison Corrections

Correction conservativeness (test_corrections.py):

  • Corrected power ≤ uncorrected power under H₀

  • Bonferroni more conservative than FDR

  • FWER ≤ α for Bonferroni and Holm

Extended alpha validation (test_alpha_levels.py):

  • 9 tests validating Bonferroni/Holm/FDR at non-default α ∈ {0.01, 0.10}

  • Multi-predictor null calibration with corrections
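The conservativeness ordering can be seen on a small set of p-values. The sketch below gives textbook implementations of the three procedures in plain Python; it is not MCPower's implementation, and the p-values are made up for illustration.

```python
def bonferroni(pvals, alpha=0.05):
    """Reject p-values at or below alpha / m."""
    m = len(pvals)
    return [p <= alpha / m for p in pvals]

def holm(pvals, alpha=0.05):
    """Step-down Holm: stop at the first sorted p-value that fails."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    reject = [False] * m
    for rank, i in enumerate(order):
        if pvals[i] <= alpha / (m - rank):
            reject[i] = True
        else:
            break
    return reject

def benjamini_hochberg(pvals, alpha=0.05):
    """Step-up BH: reject the k smallest, k = largest rank with p_(k) <= k * alpha / m."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    k_max = 0
    for rank, i in enumerate(order, start=1):
        if pvals[i] <= rank * alpha / m:
            k_max = rank
    reject = [False] * m
    for i in order[:k_max]:
        reject[i] = True
    return reject

pvals = [0.001, 0.012, 0.02, 0.035, 0.3]  # illustrative values only
print(sum(bonferroni(pvals)), sum(holm(pvals)), sum(benjamini_hochberg(pvals)))  # 1 2 4
```

On this example Bonferroni rejects the fewest hypotheses and FDR (Benjamini-Hochberg) the most, which is the ordering the tests above check.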

LME Accuracy Tests

The analytical formulas used as benchmarks for LME tests:

Design effect (within-cluster): $D_{\text{eff}} = \frac{1 + (m-1) \times \text{ICC}}{1 + (m-2) \times \text{ICC}}$

This is much milder than the between-cluster design effect for iid predictors — typically 1.02–1.06 for m = 50.

z-test non-centrality parameter: $\text{NCP} = \frac{\beta \sqrt{n_{\text{eff}}}}{\sigma \sqrt{\text{VIF} \times D_{\text{eff}}}}$

Likelihood-ratio test NCP: $\text{NCP} = \frac{n \cdot \boldsymbol{\beta}' \Sigma \boldsymbol{\beta}}{\sigma^2 \times D_{\text{eff}}}$
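The design-effect and z-test formulas translate directly into code. The sketch below is an illustration under stated assumptions: `n_eff` is taken as a plain input because its definition is not given here, the two-sided normal power formula is the standard one for a z-test, and all function names are hypothetical.

```python
import math
from statistics import NormalDist

def design_effect(m, icc):
    """Within-cluster design effect for m observations per cluster."""
    return (1.0 + (m - 1) * icc) / (1.0 + (m - 2) * icc)

def z_test_ncp(beta, n_eff, sigma=1.0, vif=1.0, d_eff=1.0):
    """Non-centrality parameter of the z-test, as in the formula above."""
    return beta * math.sqrt(n_eff) / (sigma * math.sqrt(vif * d_eff))

def z_test_power(ncp, alpha=0.05):
    """Two-sided power of a z-test with the given non-centrality parameter."""
    nd = NormalDist()
    z_crit = nd.inv_cdf(1.0 - alpha / 2.0)
    return nd.cdf(ncp - z_crit) + nd.cdf(-ncp - z_crit)

d_eff = design_effect(m=50, icc=0.2)
ncp = z_test_ncp(beta=0.1, n_eff=1000, d_eff=d_eff)  # illustrative inputs
print(round(d_eff, 4), round(z_test_power(ncp), 3))
```

With a zero effect the NCP is 0 and the power function returns α, which is a quick sanity check on the benchmark itself.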


External Cross-Validation (LME4)

MCPower’s C++ LME solver is cross-validated against R’s lme4 package using the MCPower-LME4-validation framework. This is a separate repository with its own test harness.

Four Validation Strategies

| Strategy | What It Tests | How |
|---|---|---|
| 1. External Data Agreement | Do MCPower and lme4 reach the same significance decision on identical data? | Generate data with numpy, fit both solvers, compare significance decisions. Target: ≥95% agreement rate. |
| 2. MCPower Pipeline Validation | Does MCPower’s full pipeline (data generation → fitting) produce results consistent with lme4? | Extract raw data from MCPower’s simulations, re-fit with lme4, compare significance decisions. |
| 3. Parallel Power Simulation | Do independent power simulations produce the same power estimate? | Both MCPower and R independently generate data and estimate power. Target: |
| 4. Statistical z-Test | Is the power difference statistically significant? | Two-proportion z-test on the power estimates from Strategy 3, with Benjamini-Hochberg FDR correction across all scenarios. |

Strategy 1 validates the solver in isolation. Strategy 2 validates the full pipeline (including data generation). Strategy 3 validates end-to-end power estimates. Strategy 4 provides statistical rigor for the power comparison.
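The test used in Strategy 4 can be sketched in a few lines. This is the standard pooled two-proportion z-test written with the Python standard library, not the validation framework's own code; the function name and the example counts are hypothetical.

```python
import math
from statistics import NormalDist

def two_proportion_z_test(successes_a, n_a, successes_b, n_b):
    """Pooled two-proportion z-test; returns (z, two-sided p-value)."""
    p_a, p_b = successes_a / n_a, successes_b / n_b
    pooled = (successes_a + successes_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1.0 - pooled) * (1.0 / n_a + 1.0 / n_b))
    if se == 0.0:
        # Both samples are all-success or all-failure: no evidence of a difference
        return 0.0, 1.0
    z = (p_a - p_b) / se
    p_value = 2.0 * (1.0 - NormalDist().cdf(abs(z)))
    return z, p_value

# Example: 80.0% vs 80.8% power, each from 5,000 simulations
z, p = two_proportion_z_test(4000, 5000, 4040, 5000)
print(p > 0.05)  # True
```

Each scenario yields one such p-value; the Benjamini-Hochberg correction is then applied across scenarios.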

Scenario Coverage

95 unique scenarios across three model types:

| Model Type | Core | Sensitivity | Total |
|---|---|---|---|
| Random intercepts (1 predictor) | 36 | 24 | 60 |
| Random intercepts (2 predictors) | 2 | 8 | 10 |
| Random slopes | 4 | 10 | 14 |
| Nested effects | 3 | 8 | 11 |
| Total | 45 | 50 | 95 |

Core scenarios run all 4 strategies (45 × 4 = 180 tests). Sensitivity scenarios run Strategy 4 only (50 × 1 = 50 tests). Total: 230 scenario-strategy combinations.

Core scenarios vary: ICC ∈ {0.1, 0.2, 0.3}, clusters ∈ {10, 20, 50}, N ∈ {500, 1000}, effects ∈ {small, medium}.
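A full factorial over these four parameters gives 3 × 3 × 2 × 2 = 36 combinations, which matches the core count for single-predictor random-intercept models. The sketch below assumes that is how those 36 scenarios arise; the dictionary keys are hypothetical and not taken from the validation framework.

```python
from itertools import product

iccs = [0.1, 0.2, 0.3]
cluster_counts = [10, 20, 50]
sample_sizes = [500, 1000]
effect_sizes = ["small", "medium"]

# One dict per scenario; key names are illustrative only
core_grid = [
    {"icc": icc, "n_clusters": k, "n": n, "effect": effect}
    for icc, k, n, effect in product(iccs, cluster_counts, sample_sizes, effect_sizes)
]

print(len(core_grid))  # 36
```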

Sensitivity scenarios systematically sweep one parameter while holding others fixed, producing power curves for visual and statistical comparison.

Pass/Fail Thresholds

| Metric | Threshold | Strategy |
|---|---|---|
| Significance agreement rate | ≥ 95% | 1, 2 |
| Beta estimate correlation | ≥ 0.98 | 1, 2 |
| SE estimate correlation | ≥ 0.95 | 1, 2 |
| τ² estimate correlation | ≥ 0.95 | 1, 2 |
| Power difference (absolute) | ≤ 5 pp | 3 |
| Type I error rate | 3%–7% (at α = 0.05) | 3 |
| z-test (FDR-corrected) | p > 0.05 | 4 |
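These thresholds could be encoded as a small pass/fail checker. The sketch below is one possible encoding with hypothetical names and metric keys; the validation repository may structure its checks differently.

```python
# Minimum values for Strategies 1 and 2 (metric keys are illustrative)
MIN_THRESHOLDS = {
    "agreement_rate": 0.95,
    "beta_corr": 0.98,
    "se_corr": 0.95,
    "tau2_corr": 0.95,
}

def passes_strategy_1_2(metrics):
    """True if every metric meets its minimum threshold."""
    return all(metrics[name] >= minimum for name, minimum in MIN_THRESHOLDS.items())

def passes_strategy_3(power_diff_pp, type1_rate_pct=None):
    """Power difference within 5 pp; Type I rate within 3%-7% when supplied."""
    ok = abs(power_diff_pp) <= 5.0
    if type1_rate_pct is not None:
        ok = ok and 3.0 <= type1_rate_pct <= 7.0
    return ok

def passes_strategy_4(fdr_adjusted_p):
    """No significant power difference after FDR correction."""
    return fdr_adjusted_p > 0.05

example = {"agreement_rate": 0.97, "beta_corr": 0.995, "se_corr": 0.96, "tau2_corr": 0.96}
print(passes_strategy_1_2(example), passes_strategy_3(1.8), passes_strategy_4(0.41))
```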

Latest Results

Result: 230/230 PASS

The validation report is published at: https://freestylerscientist.pl/reports/lme4-validation-report.html


How to Run

Internal test suite

```bash
# OLS tests only (fast, ~30s)
python -m pytest MCPower/tests/ -v -m "not lme"

# All tests including LME (~6 min)
python -m pytest MCPower/tests/ -v

# Accuracy tests only
python -m pytest MCPower/tests/specs/ -v
```

External LME4 validation

The LME4 cross-validation lives in a separate repository, MCPower-LME4-validation. Clone it and follow its README for setup instructions and usage.


Validation Methodology

Why Monte Carlo margins?

Monte Carlo power estimates are inherently noisy: each estimate is a binomial proportion (the fraction of simulations in which the p-value falls below α), with standard error √[p(1−p)/n_sims], where p is the true power. MCPower uses 5,000 simulations for accuracy tests, which keeps the SE below 1 percentage point (about 0.57 pp at 80% power and at most 0.71 pp, reached at 50% power).

The acceptance margin 3.5 × SE + 1pp uses z = 3.5 (Bonferroni correction for ~100 simultaneous tests) plus 1 percentage point for finite-sample approximation bias.
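The choice of z = 3.5 can be checked against the normal quantile. The sketch below assumes a two-sided family-wise level of 0.05 split across 100 tests; the family-wise level is an assumption, since only the number of tests is stated above.

```python
from statistics import NormalDist

def bonferroni_z(alpha=0.05, n_tests=100):
    """Two-sided critical z for a per-test level of alpha / n_tests."""
    return NormalDist().inv_cdf(1.0 - alpha / (2.0 * n_tests))

print(round(bonferroni_z(), 2))  # 3.48
```

The result is approximately 3.48, consistent with the rounded multiplier of 3.5.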

Why cross-validate against lme4?

For OLS models, exact analytical power formulas exist (non-central t and F distributions), so MCPower can be validated against theory. For mixed-effects models, no closed-form power formulas exist in general. The gold standard is R’s lme4 package (Bates et al., 2015), which MCPower’s C++ solver reimplements using the same profiled-deviance algorithm.

Cross-validation against lme4 verifies that:

  1. MCPower’s C++ solver produces the same parameter estimates

  2. MCPower’s data generation produces valid clustered data

  3. MCPower’s power estimates match R’s independent estimates

Reproducibility

All tests use fixed random seeds (default: 2137 for MCPower tests, 42 for LME4 validation). Results are deterministic given the same seed and platform.
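The effect of a fixed seed can be illustrated with Python's standard-library generator. This is only a sketch of the principle; it does not show which random number generator MCPower uses internally.

```python
import random

def draw(seed, n=5):
    """Draw n standard-normal values from a generator seeded with `seed`."""
    rng = random.Random(seed)
    return [rng.gauss(0.0, 1.0) for _ in range(n)]

# Same seed, same stream; different seed, different stream
print(draw(2137) == draw(2137), draw(2137) == draw(42))  # True False
```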

