How It Is Validated¶
MCPower’s statistical engine is validated through two complementary systems: an internal test suite covering OLS and mixed-effects accuracy, and an external cross-validation framework comparing MCPower’s LME solver against R’s lme4 package.
Internal Test Suite¶
MCPower includes ~11,000 lines of tests organized into specs (accuracy/validation), integration, unit, and mixed-model tests. The specs tests are the core statistical validation.
Power Accuracy Tests¶
Monte Carlo power estimates are compared against exact analytical power from non-central t and F distributions.
OLS models (test_power_accuracy.py):
Single predictor: 5 parametrized cases varying β and N
Two uncorrelated predictors: 3 cases with Σ = I
Two correlated predictors: 4 cases with VIF correction (ρ = 0.3, 0.5, 0.7)
Acceptance criterion: the MC estimate must fall within 3.5 × √[p(1−p)/5000] × 100 + 1 pp of the analytical value, where p is the power expressed as a proportion. With 5,000 simulations this Bonferroni-safe margin works out to roughly 2 to 3.5 percentage points for power between 50% and 95%.
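The acceptance criterion above can be sketched as a small helper. This is an illustration rather than MCPower's own test code: the function names are hypothetical, and it assumes p is taken from the analytical power, which the criterion does not state explicitly.

```python
import math

def mc_margin_pp(power, n_sims=5000, z=3.5, bias_pp=1.0):
    """Acceptance margin in percentage points: z * SE + bias allowance."""
    # Binomial standard error of a proportion, converted to percentage points.
    se_pp = math.sqrt(power * (1.0 - power) / n_sims) * 100.0
    return z * se_pp + bias_pp

def within_margin(mc_power_pct, analytical_power_pct):
    """True if the MC estimate (%) is close enough to the analytical value (%)."""
    # Assumption: the margin is evaluated at the analytical power.
    margin = mc_margin_pp(analytical_power_pct / 100.0)
    return abs(mc_power_pct - analytical_power_pct) <= margin

for p in (0.5, 0.8, 0.95):
    print(f"power={p:.2f}  margin={mc_margin_pp(p):.2f} pp")
```

The margin is widest at 50% power, where the binomial variance peaks, and narrows toward the extremes.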
LME models (test_power_accuracy_lme.py):
Single predictor z-test: 7 parametrized cases
Single predictor likelihood-ratio test: 3 cases
Two uncorrelated predictors: 2 cases
Two correlated predictors: 2 cases (ρ = 0.3, 0.5)
All LME accuracy tests use m = 50 observations per cluster, where the within-cluster design effect is small (~1.02–1.06), allowing comparison against analytical formulas.
Type I Error Control¶
Under the null hypothesis (all effects = 0), the rejection rate must match the nominal α to within the Monte Carlo margin.
OLS (test_type1_error.py):
Single predictor null (F-test and t-test)
Two predictors null (each rejects at ~α)
Large-sample null (catches bugs where power inflates with N)
Alpha calibration at α ∈ {0.01, 0.05, 0.10}
LME (test_type1_error_lme.py):
Same structure with K = 20–50 clusters, ICC = 0.2
Alpha calibration across standard levels
Criterion: |observed rejection rate (%) − 100 × α| < MC margin
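A minimal sketch of this criterion follows. It makes two assumptions not confirmed by the text: the MC margin has the same form as the power-accuracy margin (3.5 × SE + 1 pp), and uniform random p-values stand in for a correctly calibrated test under H₀. The function names are hypothetical.

```python
import math
import random

def type1_margin_pp(alpha, n_sims, z=3.5, bias_pp=1.0):
    """MC margin in percentage points (assumed form: z * SE + bias)."""
    se_pp = math.sqrt(alpha * (1.0 - alpha) / n_sims) * 100.0
    return z * se_pp + bias_pp

def null_rejection_rate_pct(alpha, n_sims, seed):
    """Rejection rate (%) when p-values are uniform, as they are under H0."""
    rng = random.Random(seed)
    rejections = sum(rng.random() < alpha for _ in range(n_sims))
    return 100.0 * rejections / n_sims

alpha, n_sims = 0.05, 5000
rate = null_rejection_rate_pct(alpha, n_sims, seed=2137)
margin = type1_margin_pp(alpha, n_sims)
passed = abs(rate - 100.0 * alpha) < margin
print(f"rejection rate={rate:.2f}%  margin={margin:.2f} pp  pass={passed}")
```

A miscalibrated test, for example one that rejects 10% of the time at α = 0.05, would miss this margin by a wide distance.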
Monotonicity Tests¶
Power must strictly increase with:
Effect size (larger β → more power)
Sample size (larger N → more power)
Significance level (larger α → more power)
Tested for both OLS (test_monotonicity.py) and LME (test_monotonicity_lme.py) models. These tests catch subtle implementation bugs that wouldn’t violate accuracy bounds but would produce nonsensical results.
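The kind of check involved can be sketched as follows. The normal-approximation power function is only a stand-in for MCPower's simulated estimates, and both function names are hypothetical.

```python
from statistics import NormalDist

def approx_power(beta, n, alpha=0.05, sigma=1.0):
    """Two-sided z-approximation to power for one standardized predictor."""
    norm = NormalDist()
    ncp = beta * n ** 0.5 / sigma
    z_crit = norm.inv_cdf(1.0 - alpha / 2.0)
    return norm.cdf(ncp - z_crit) + norm.cdf(-ncp - z_crit)

def strictly_increasing(values):
    """True if every value is larger than the one before it."""
    return all(a < b for a, b in zip(values, values[1:]))

by_effect = [approx_power(b, n=100) for b in (0.1, 0.2, 0.3, 0.4)]
by_n = [approx_power(0.2, n) for n in (50, 100, 200, 400)]
by_alpha = [approx_power(0.2, 100, alpha=a) for a in (0.01, 0.05, 0.10)]

print(strictly_increasing(by_effect), strictly_increasing(by_n), strictly_increasing(by_alpha))
```

With simulated estimates the grid points need to be spaced far enough apart that Monte Carlo noise cannot reverse the ordering.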
Multiple Comparison Corrections¶
Correction conservativeness (test_corrections.py):
Corrected power ≤ uncorrected power under H₀
Bonferroni more conservative than FDR
FWER ≤ α for Bonferroni and Holm
Extended alpha validation (test_alpha_levels.py):
9 tests validating Bonferroni/Holm/FDR at non-default α ∈ {0.01, 0.10}
Multi-predictor null calibration with corrections
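The ordering these tests rely on can be seen in a self-contained sketch of the three procedures. It assumes "FDR" refers to the Benjamini-Hochberg procedure, and the p-values are made up for illustration. MCPower's own implementation may differ.

```python
def bonferroni(pvals, alpha=0.05):
    """Reject p_i when p_i <= alpha / m."""
    m = len(pvals)
    return [p <= alpha / m for p in pvals]

def holm(pvals, alpha=0.05):
    """Step-down Holm: compare sorted p-values to alpha / (m - rank), stop at first failure."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    reject = [False] * m
    for rank, i in enumerate(order):
        if pvals[i] <= alpha / (m - rank):
            reject[i] = True
        else:
            break
    return reject

def benjamini_hochberg(pvals, alpha=0.05):
    """Step-up BH: reject the k smallest p-values, k = largest rank with p <= alpha * rank / m."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    k_max = 0
    for rank, i in enumerate(order, start=1):
        if pvals[i] <= alpha * rank / m:
            k_max = rank
    reject = [False] * m
    for i in order[:k_max]:
        reject[i] = True
    return reject

pvals = [0.001, 0.012, 0.020, 0.035, 0.300]  # illustrative values only
for name, fn in (("bonferroni", bonferroni), ("holm", holm), ("fdr_bh", benjamini_hochberg)):
    print(name, sum(fn(pvals)))
```

On any input, every hypothesis rejected by Bonferroni is also rejected by Holm, and every Holm rejection is also a Benjamini-Hochberg rejection, which is why corrected power can never exceed uncorrected power.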
LME Accuracy Tests¶
The analytical formulas used as benchmarks for LME tests:
Design effect (within-cluster): $$D_{\text{eff}} = \frac{1 + (m-1) \times \text{ICC}}{1 + (m-2) \times \text{ICC}}$$
For iid predictors this is much milder than the between-cluster design effect, typically 1.02–1.06 for m = 50.
z-test non-centrality parameter: $$\text{NCP} = \frac{\beta \sqrt{n_{\text{eff}}}}{\sigma \sqrt{\text{VIF} \times D_{\text{eff}}}}$$
Likelihood-ratio test NCP: $$\text{NCP} = \frac{n \cdot \boldsymbol{\beta}' \Sigma \boldsymbol{\beta}}{\sigma^2 \times D_{\text{eff}}}$$
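As a worked illustration, the design effect and z-test NCP can be turned into an approximate power value. This sketch treats n_eff as a given input (its definition is not stated here), uses the standard two-predictor VIF of 1/(1 − ρ²), and converts NCP to power with a two-sided normal approximation. All function names are hypothetical.

```python
from statistics import NormalDist

def design_effect(m, icc):
    """Within-cluster design effect for m observations per cluster."""
    return (1.0 + (m - 1) * icc) / (1.0 + (m - 2) * icc)

def vif_two_predictors(rho):
    """Variance inflation factor for two predictors with correlation rho."""
    return 1.0 / (1.0 - rho ** 2)

def z_test_ncp(beta, n_eff, sigma=1.0, vif=1.0, d_eff=1.0):
    """Non-centrality parameter of the z-test, as given in the formula above."""
    return beta * n_eff ** 0.5 / (sigma * (vif * d_eff) ** 0.5)

def z_test_power(ncp, alpha=0.05):
    """Two-sided power under a normal approximation."""
    norm = NormalDist()
    z_crit = norm.inv_cdf(1.0 - alpha / 2.0)
    return norm.cdf(ncp - z_crit) + norm.cdf(-ncp - z_crit)

d_eff = design_effect(m=50, icc=0.2)
ncp = z_test_ncp(beta=0.1, n_eff=1000, d_eff=d_eff)  # n_eff chosen for illustration
print(f"D_eff={d_eff:.4f}  NCP={ncp:.3f}  power={z_test_power(ncp):.3f}")
```

With the design-effect formula as written, m = 50 gives values of about 1.02 for ICC between 0.1 and 0.3, so the correction to the NCP is on the order of 1%.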
External Cross-Validation (LME4)¶
MCPower’s C++ LME solver is cross-validated against R’s lme4 package using the MCPower-LME4-validation framework. This is a separate repository with its own test harness.
Four Validation Strategies¶
| Strategy | What It Tests | How |
|---|---|---|
| 1. External Data Agreement | Do MCPower and lme4 reach the same significance decision on identical data? | Generate data with numpy, fit both solvers, compare significance decisions. Target: ≥95% agreement rate. |
| 2. MCPower Pipeline Validation | Does MCPower’s full pipeline (data generation → fitting) produce results consistent with lme4? | Extract raw data from MCPower’s simulations, re-fit with lme4, compare significance decisions. |
| 3. Parallel Power Simulation | Do independent power simulations produce the same power estimate? | Both MCPower and R independently generate data and estimate power. Target: absolute power difference ≤ 5 pp (see Pass/Fail Thresholds). |
| 4. Statistical z-Test | Is the power difference statistically significant? | Two-proportion z-test on the power estimates from Strategy 3, with Benjamini-Hochberg FDR correction across all scenarios. |
Strategy 1 validates the solver in isolation. Strategy 2 validates the full pipeline (including data generation). Strategy 3 validates end-to-end power estimates. Strategy 4 provides statistical rigor for the power comparison.
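The test behind Strategy 4 can be sketched with the standard pooled two-proportion z-test. The counts below are invented for illustration, and the validation framework's own implementation may differ in detail. The function name is hypothetical.

```python
from statistics import NormalDist

def two_proportion_z_test(hits_a, n_a, hits_b, n_b):
    """Pooled two-proportion z-test; returns (z, two-sided p-value)."""
    p_a, p_b = hits_a / n_a, hits_b / n_b
    pooled = (hits_a + hits_b) / (n_a + n_b)
    se = (pooled * (1.0 - pooled) * (1.0 / n_a + 1.0 / n_b)) ** 0.5
    if se == 0.0:
        # Both estimates are exactly 0% or 100%: no evidence of a difference.
        return 0.0, 1.0
    z = (p_a - p_b) / se
    p_value = 2.0 * (1.0 - NormalDist().cdf(abs(z)))
    return z, p_value

# Illustrative counts: 800 of 1000 significant runs in one tool, 815 of 1000 in the other.
z, p = two_proportion_z_test(800, 1000, 815, 1000)
print(f"z={z:.3f}  p={p:.3f}")
```

In the framework, the resulting p-values from all scenarios would then go through the Benjamini-Hochberg correction before being compared with the threshold.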
Scenario Coverage¶
95 unique scenarios across three model types:
| Model Type | Core | Sensitivity | Total |
|---|---|---|---|
| Random intercepts (1 predictor) | 36 | 24 | 60 |
| Random intercepts (2 predictors) | 2 | 8 | 10 |
| Random slopes | 4 | 10 | 14 |
| Nested effects | 3 | 8 | 11 |
| Total | 45 | 50 | 95 |
Core scenarios run all 4 strategies (45 × 4 = 180 tests). Sensitivity scenarios run Strategy 4 only (50 × 1 = 50 tests). Total: 230 scenario-strategy combinations.
Core scenarios vary: ICC ∈ {0.1, 0.2, 0.3}, clusters ∈ {10, 20, 50}, N ∈ {500, 1000}, effects ∈ {small, medium}.
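A full factorial over these values can be enumerated as below. The dictionary keys are hypothetical names, and whether the framework builds its scenarios this way is an assumption. The resulting count of 36 does match the core count for random intercepts with one predictor.

```python
from itertools import product

icc_values = (0.1, 0.2, 0.3)
cluster_counts = (10, 20, 50)
sample_sizes = (500, 1000)
effect_sizes = ("small", "medium")

# One dictionary per combination of the four factors.
grid = [
    {"icc": icc, "n_clusters": k, "n": n, "effect": effect}
    for icc, k, n, effect in product(icc_values, cluster_counts, sample_sizes, effect_sizes)
]

print(len(grid))
print(grid[0])
```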
Sensitivity scenarios systematically sweep one parameter while holding others fixed, producing power curves for visual and statistical comparison.
Pass/Fail Thresholds¶
| Metric | Threshold | Strategy |
|---|---|---|
| Significance agreement rate | ≥ 95% | 1, 2 |
| Beta estimate correlation | ≥ 0.98 | 1, 2 |
| SE estimate correlation | ≥ 0.95 | 1, 2 |
| τ² estimate correlation | ≥ 0.95 | 1, 2 |
| Power difference (absolute) | ≤ 5 pp | 3 |
| Type I error rate | 3%–7% (at α = 0.05) | 3 |
| z-test (FDR-corrected) | p > 0.05 | 4 |
Latest Results¶
Result: 230/230 PASS
The validation report is published at: https://freestylerscientist.pl/reports/lme4-validation-report.html
How to Run¶
Internal test suite¶
    # OLS tests only (fast, ~30s)
    python -m pytest MCPower/tests/ -v -m "not lme"

    # All tests including LME (~6 min)
    python -m pytest MCPower/tests/ -v

    # Accuracy tests only
    python -m pytest MCPower/tests/specs/ -v
External LME4 validation¶
The LME4 cross-validation lives in a separate repository, MCPower-LME4-validation. Clone it and follow its README for setup instructions and usage.
Validation Methodology¶
Why Monte Carlo margins?¶
Monte Carlo power estimates are inherently noisy: each estimate is a binomial proportion (the fraction of simulations in which the p-value falls below α), so its standard error is √[p(1−p)/n_sims], where p is the true power. MCPower uses 5,000 simulations for accuracy tests, giving SE ≈ 0.6–0.7 percentage points at typical power levels (at most 0.71 pp, reached at 50% power).
The acceptance margin 3.5 × SE + 1pp uses z = 3.5 (Bonferroni correction for ~100 simultaneous tests) plus 1 percentage point for finite-sample approximation bias.
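The choice of z = 3.5 can be checked with a short calculation. The sketch assumes a two-sided family-wise level of 0.05, which is not stated explicitly above, and the function name is hypothetical.

```python
from statistics import NormalDist

def bonferroni_z(n_tests, family_alpha=0.05):
    """Two-sided critical z after a Bonferroni split of family_alpha across n_tests."""
    per_test_alpha = family_alpha / n_tests
    return NormalDist().inv_cdf(1.0 - per_test_alpha / 2.0)

z_100 = bonferroni_z(100)
print(f"critical z for 100 tests: {z_100:.3f}")
```

Under that assumption the exact critical value comes out just below 3.5, so rounding up to 3.5 is slightly conservative.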
Why cross-validate against lme4?¶
For OLS models, exact analytical power formulas exist (non-central t and F distributions), so MCPower can be validated against theory. For mixed-effects models, no closed-form power formulas exist in general. The gold standard is R’s lme4 package (Bates et al., 2015), which MCPower’s C++ solver reimplements using the same profiled-deviance algorithm.
Cross-validation against lme4 verifies that:
MCPower’s C++ solver produces the same parameter estimates
MCPower’s data generation produces valid clustered data
MCPower’s power estimates match R’s independent estimates
Reproducibility¶
All tests use fixed random seeds (default: 2137 for MCPower tests, 42 for LME4 validation). Results are deterministic given the same seed and platform.
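The principle can be illustrated with Python's standard library generator. This is a generic sketch, not MCPower's seeding mechanism, and the function name is hypothetical.

```python
import random

def simulate_rejections(seed, n_sims=5000, alpha=0.05):
    """Count rejections from uniform p-values drawn from a privately seeded generator."""
    rng = random.Random(seed)  # private generator, so no global state is touched
    return sum(rng.random() < alpha for _ in range(n_sims))

first = simulate_rejections(2137)
second = simulate_rejections(2137)
print(first == second)
```

Using a dedicated generator per run, rather than the module-level one, keeps the result independent of whatever other code has drawn random numbers beforehand.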
Learn More¶
LME Validation Details — detailed strategy descriptions and scenario configuration
Performance & Backends — C++ backend details and simulation precision
MCPower-LME4-validation — external validation repository