Tutorial: Controlling for Multiple Testing¶

Goal¶

You are testing multiple hypotheses and need to control the false positive rate across them. Your power analysis should therefore apply the same correction you will use in your actual study.

Full Working Example¶

from mcpower import MCPower

# ── Medical study with 5 biomarkers ──────────────────────────────
# Some biomarkers have real effects, others are null (effect = 0)
model = MCPower("outcome = biomarker1 + biomarker2 + biomarker3 + biomarker4 + biomarker5")
model.set_simulations(400)

model.set_effects(
    "biomarker1=0.40, "    # large real effect
    "biomarker2=0.25, "    # medium real effect
    "biomarker3=0.00, "    # null — no true effect
    "biomarker4=0.10, "    # small real effect
    "biomarker5=0.00"      # null — no true effect
)

# ── Check power with Bonferroni correction ────────────────────────
model.find_power(
    sample_size=200,
    target_test="biomarker1, biomarker2, biomarker3, biomarker4, biomarker5",
    correction="bonferroni",
)

Step-by-Step Walkthrough¶

1. Define the model¶

model = MCPower("outcome = biomarker1 + biomarker2 + biomarker3 + biomarker4 + biomarker5")
model.set_simulations(400)

A study measuring five biomarkers as potential predictors of a health outcome.

2. Set effect sizes, including nulls¶

model.set_effects(
    "biomarker1=0.40, "
    "biomarker2=0.25, "
    "biomarker3=0.00, "
    "biomarker4=0.10, "
    "biomarker5=0.00"
)

Setting an effect to 0.00 models a predictor that has no real relationship with the outcome. For such a predictor, the reported "power" is its false positive rate, so null effects let you check that the correction behaves as intended: with a correction, they should rarely reach significance.
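The link between a null effect and the false positive rate can be sketched with a small standalone simulation. This is not MCPower code: it assumes the test statistic of a true-null predictor is approximately standard normal, and the number of simulated studies is an arbitrary choice for illustration.

```python
import random
from statistics import NormalDist

random.seed(0)

ALPHA = 0.05
N_TESTS = 5        # family size used in this tutorial
N_SIMS = 20_000    # hypothetical number of simulated studies

# Two-sided critical values for a z statistic
z_plain = NormalDist().inv_cdf(1 - ALPHA / 2)
z_bonf = NormalDist().inv_cdf(1 - ALPHA / (2 * N_TESTS))

# Under a true null, the test statistic is (approximately) standard normal
draws = [random.gauss(0.0, 1.0) for _ in range(N_SIMS)]

rate_plain = sum(abs(z) > z_plain for z in draws) / N_SIMS
rate_bonf = sum(abs(z) > z_bonf for z in draws) / N_SIMS

print(f"uncorrected rejection rate: {rate_plain:.3f}")  # close to ALPHA
print(f"Bonferroni rejection rate:  {rate_bonf:.3f}")   # close to ALPHA / N_TESTS
```

The uncorrected rate lands near 5% and the Bonferroni rate near 1%, which is the pattern to look for in the null rows of the MCPower output.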

3. Apply a correction¶

model.find_power(
    sample_size=200,
    target_test="biomarker1, biomarker2, biomarker3, biomarker4, biomarker5",
    correction="bonferroni",
)

The correction parameter adjusts significance thresholds to account for testing five effects simultaneously. Without correction, if all five effects were null and the tests independent, the chance of at least one false positive at alpha = 0.05 would be 1 - 0.95^5, or about 23%.
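The 23% figure follows from a one-line formula. The sketch below assumes independent tests that are all true nulls; the function name is made up for illustration and is not part of MCPower.

```python
ALPHA = 0.05

def familywise_error_rate(alpha: float, m: int) -> float:
    """P(at least one false positive) across m independent true-null tests."""
    return 1 - (1 - alpha) ** m

for m in (1, 2, 5, 10):
    print(m, round(familywise_error_rate(ALPHA, m), 3))

# Bonferroni tests each hypothesis at alpha / m, which pulls the
# family-wise rate back under alpha.
m = 5
fwer_bonferroni = familywise_error_rate(ALPHA / m, m)
print(f"with Bonferroni: {fwer_bonferroni:.3f}")
```

For m = 5 the uncorrected rate is about 0.226, while testing each hypothesis at 0.05/5 keeps the family-wise rate just under 0.05.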

Output Interpretation¶

================================================================================
MONTE CARLO POWER ANALYSIS RESULTS
================================================================================
Multiple comparison correction: bonferroni

Power Analysis Results (N=200):
Test                                     Power    Target   Status
-------------------------------------------------------------------
biomarker1                               99.9     80       ✓
biomarker2                               92.9     80       ✓
biomarker3                               5.1      80       ✗
biomarker4                               29.1     80       ✗
biomarker5                               5.2      80       ✗

With bonferroni correction:
Test                                     Power    Target   Status
-------------------------------------------------------------------
biomarker1                               99.9     80       ✓
biomarker2                               81.2     80       ✓
biomarker3                               0.4      80       ✗
biomarker4                               13.4     80       ✗
biomarker5                               1.1      80       ✗

Result: 2/5 tests achieved target power

The output shows two tables: the first gives uncorrected power, the second gives power after Bonferroni adjustment. Corrected power is always lower than or equal to uncorrected power.
Status – ✓ if power meets the target, ✗ if it falls short. The final result line is based on the corrected table.
biomarker3 and biomarker5 (null effects) – for a null effect, "power" is the false positive rate. Uncorrected, it sits near 5% (5.1% and 5.2%); corrected, it drops to 0.4% and 1.1%, in line with the Bonferroni per-test threshold of 0.05/5 = 0.01. The correction is working.
biomarker2 – a medium effect drops from 92.9% to 81.2% and only narrowly clears the 80% target after Bonferroni correction. With a smaller sample it would fall short.
biomarker4 – a small effect has very low power after correction (13.4%, down from 29.1%); N=200 is far too small to detect it.
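Because these figures are Monte Carlo estimates, each carries sampling noise. As a rough sketch, assuming an estimate comes from 400 independent simulations (the value passed to set_simulations in this tutorial) and using a normal approximation to the binomial, the uncertainty around a power estimate can be gauged as follows. The helper is hypothetical and not part of MCPower.

```python
from math import sqrt

def power_interval(power_pct: float, n_sims: int, z: float = 1.96):
    """Approximate 95% interval (in percent) for a simulated power estimate."""
    p = power_pct / 100
    se = sqrt(p * (1 - p) / n_sims)          # binomial standard error
    low = max(0.0, p - z * se)
    high = min(1.0, p + z * se)
    return 100 * low, 100 * high

low, high = power_interval(81.2, n_sims=400)
print(f"biomarker2 corrected power: 81.2% (approx. {low:.1f}% to {high:.1f}%)")
```

Under these assumptions the interval for biomarker2 straddles the 80% target, so an estimate this close to the target is worth confirming with more simulations before relying on it.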

Common Variations¶

Correction methods compared¶

MCPower supports four correction methods (Bonferroni, Holm, FDR/Benjamini-Hochberg, and Tukey):

# Bonferroni — most conservative FWER control
model.find_power(sample_size=200, target_test="all", correction="bonferroni")

# Holm — step-down FWER control (always >= Bonferroni power)
model.find_power(sample_size=200, target_test="all", correction="holm")

# FDR / Benjamini-Hochberg — controls false discovery rate (least conservative)
model.find_power(sample_size=200, target_test="all", correction="fdr")
# Aliases: "benjamini-hochberg" or "bh" also work

# Tukey HSD — for post-hoc pairwise factor comparisons only
model.find_power(
    sample_size=200,
    target_test="group[1] vs group[2], group[1] vs group[3], group[2] vs group[3]",
    correction="tukey",
)
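For a factor with k levels there are k(k-1)/2 pairwise contrasts, so the target_test string grows quickly. A small helper can generate it. This is a hypothetical convenience function, not part of MCPower; it simply reproduces the "factor[i] vs factor[j]" format used in the example above.

```python
from itertools import combinations

def pairwise_contrasts(factor: str, n_levels: int) -> str:
    """Build a comma-separated list of 'factor[i] vs factor[j]' contrasts."""
    pairs = combinations(range(1, n_levels + 1), 2)
    return ", ".join(f"{factor}[{i}] vs {factor}[{j}]" for i, j in pairs)

target = pairwise_contrasts("group", 3)
print(target)
# group[1] vs group[2], group[1] vs group[3], group[2] vs group[3]
```

The resulting string could then be passed as target_test, assuming the factor has been declared with the matching number of levels.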

When to use which correction¶

Situation	Recommended	Reasoning
Pre-registered confirmatory study	Holm or Bonferroni	Strict FWER control expected by reviewers
Few planned comparisons (2–3)	Holm	Controls FWER; more powerful than Bonferroni
Many comparisons (5+) in exploratory study	FDR	Less conservative; allows more discoveries
Strict error control required	Bonferroni	Simplest and most conservative
All pairwise comparisons within a factor	Tukey	Purpose-built for this case
Mixed regular + post-hoc tests	Holm	Corrects all tests together uniformly

General recommendation: If unsure, use Holm. It controls the family-wise error rate and is always at least as powerful as Bonferroni.
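To see why Holm is never less powerful than Bonferroni, it helps to apply both rules to the same p-values. The sketch below is a plain-Python illustration of the two textbook procedures, not MCPower's internal implementation, and the p-values are made up.

```python
def bonferroni_reject(pvals, alpha=0.05):
    """Reject H_i when p_i <= alpha / m."""
    m = len(pvals)
    return [p <= alpha / m for p in pvals]

def holm_reject(pvals, alpha=0.05):
    """Holm step-down: compare the k-th smallest p-value to alpha / (m - k + 1)."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    reject = [False] * m
    for rank, i in enumerate(order):      # rank = 0 for the smallest p-value
        if pvals[i] <= alpha / (m - rank):
            reject[i] = True
        else:
            break                         # stop at the first non-rejection
    return reject

pvals = [0.001, 0.012, 0.030, 0.020, 0.600]   # made-up p-values for five tests
print(bonferroni_reject(pvals))   # [True, False, False, False, False]
print(holm_reject(pvals))         # [True, True, False, False, False]
```

Holm's first threshold equals Bonferroni's (alpha/m) and every later threshold is looser, so anything Bonferroni rejects, Holm rejects too.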

Focused testing strategy¶

Instead of testing all 5 biomarkers, focus on the ones you care most about. Fewer tests means less power loss from correction:

# Only test the biomarkers you have specific hypotheses about
model.find_power(
    sample_size=200,
    target_test="biomarker1, biomarker2",
    correction="bonferroni",
)
# Family size = 2 instead of 5 → much less power loss

This is a legitimate strategy when you have pre-registered hypotheses for specific predictors. The other predictors are still in the model as covariates, but you only correct for the tests you report.
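A back-of-the-envelope calculation shows how much the family size matters. The sketch below uses a normal approximation rather than MCPower's simulation: it backs out an approximate noncentrality from biomarker2's uncorrected power of 92.9% reported earlier, then recomputes power at the Bonferroni threshold for families of 5 and 2. All names are illustrative.

```python
from statistics import NormalDist

norm = NormalDist()
ALPHA = 0.05

# Back out an approximate noncentrality from biomarker2's uncorrected power (92.9%)
delta = norm.inv_cdf(1 - ALPHA / 2) + norm.inv_cdf(0.929)

def approx_power(delta: float, alpha: float, family_size: int) -> float:
    """Normal-approximation power of a two-sided test at alpha / family_size."""
    z_crit = norm.inv_cdf(1 - alpha / (2 * family_size))
    return norm.cdf(delta - z_crit)   # ignores the negligible opposite tail

power_family5 = approx_power(delta, ALPHA, 5)
power_family2 = approx_power(delta, ALPHA, 2)
print(f"family of 5: {power_family5:.2f}")
print(f"family of 2: {power_family2:.2f}")
```

The approximation gives roughly 80% for a family of five, close to the simulated 81.2%, and roughly 88% for a family of two. Treat these as rough guides only; the simulated values from find_power are the ones to plan with.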

Exploratory vs. confirmatory workflow¶

A common strategy is to run two analyses: an exploratory sweep with FDR to identify promising effects, then a confirmatory analysis with stricter correction:

# ── Step 1: Exploratory — which biomarkers are worth pursuing? ────
model.find_power(
    sample_size=200,
    target_test="biomarker1, biomarker2, biomarker3, biomarker4, biomarker5",
    correction="fdr",
)

# ── Step 2: Confirmatory — plan a study targeting the best candidates ──
model.find_sample_size(
    target_test="biomarker1, biomarker2",   # only the promising ones
    from_size=100,
    to_size=500,
    by=45,
    correction="holm",
)

The exploratory phase uses FDR (more lenient) to cast a wide net. The confirmatory phase uses Holm (strict FWER) on a focused set of pre-registered hypotheses.

Finding required sample size with correction¶

Search for the sample size that achieves 80% corrected power:

model.find_sample_size(
    target_test="biomarker1, biomarker2, biomarker4",
    from_size=50,
    to_size=600,
    by=62,
    correction="holm",
)

The result accounts for the correction – you need a larger sample than without correction.
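For a rough sense of how much larger, a normal approximation says the required sample size scales with (z_{1-alpha/2} + z_{power})^2. The sketch below applies that rule with a Bonferroni-adjusted alpha. It is an approximation, not MCPower's search, and it serves as an upper bound for Holm, since Holm is never less powerful than Bonferroni.

```python
from statistics import NormalDist

def relative_sample_size(alpha_corrected: float, alpha: float = 0.05,
                         power: float = 0.80) -> float:
    """Ratio of required N at the corrected alpha to required N at plain alpha."""
    z = NormalDist().inv_cdf
    base = (z(1 - alpha / 2) + z(power)) ** 2
    corrected = (z(1 - alpha_corrected / 2) + z(power)) ** 2
    return corrected / base

for m in (2, 3, 5):
    ratio = relative_sample_size(0.05 / m)
    print(f"{m} tests: about {ratio:.2f}x the uncorrected sample size")
```

With three tests, as in the find_sample_size call above, the approximation suggests planning for roughly a third more participants than an uncorrected analysis would need.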

Combining corrections with scenario analysis¶

Test robustness under realistic assumptions:

model.find_sample_size(
    target_test="biomarker1, biomarker2",
    from_size=100,
    to_size=500,
    by=45,
    correction="holm",
    scenarios=True,
)

The scenario analysis (Optimistic / Realistic / Doomer) is applied on top of the correction, giving you the most conservative planning estimate.

Corrections with post-hoc comparisons¶

When combining standard tests and post-hoc comparisons, the correction method determines what gets corrected:

model = MCPower("outcome = treatment + covariate")
model.set_simulations(400)
model.set_variable_type("treatment=(factor,3)")
model.set_effects("treatment[2]=0.50, treatment[3]=0.80, covariate=0.25")

# Bonferroni: ALL tests form one correction family
model.find_power(
    sample_size=200,
    target_test="covariate, treatment[1] vs treatment[2], treatment[1] vs treatment[3]",
    correction="bonferroni",
)
# Family size = 3 → corrected threshold = 0.05/3

# Tukey: ONLY post-hoc contrasts are corrected
model.find_power(
    sample_size=200,
    target_test="covariate, treatment[1] vs treatment[2], treatment[1] vs treatment[3]",
    correction="tukey",
)
# covariate → corrected shows "-" (not applicable)
# post-hoc tests → Tukey-corrected

How corrections affect the overall F-test¶

The overall F-test is never included in the correction family for Bonferroni/Holm/FDR corrections:

model.find_power(
    sample_size=200,
    target_test="all",
    correction="bonferroni",
)
# "overall" F-test power is NOT reduced by the correction
# Only the individual t-tests are corrected

Correction Methods: Technical Summary¶

Method	Type	How It Works	Power
Bonferroni	FWER	Divides alpha by number of tests: alpha/m	Lowest
Holm	FWER	Step-down: tests ordered by p-value, thresholds get progressively less strict	>= Bonferroni
FDR	FDR	Step-up: controls the expected proportion of false discoveries	Highest
Tukey	FWER	Uses Studentized Range distribution for pairwise factor comparisons	Depends on factor levels

FWER (Family-Wise Error Rate) – controls the probability of making at least one false positive across the whole family of tests. Use when even one false positive is unacceptable.
FDR (False Discovery Rate) – controls the expected proportion of false positives among the significant results. Use when some false positives are tolerable.
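The practical difference between the two error rates shows up when the Benjamini-Hochberg step-up rule is applied next to a Bonferroni cut-off. The sketch below implements the textbook BH procedure in plain Python on made-up p-values; it is not MCPower's internal code.

```python
def benjamini_hochberg_reject(pvals, q=0.05):
    """BH step-up: find the largest k with p_(k) <= k * q / m, reject the k smallest."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    cutoff = 0
    for rank, i in enumerate(order, start=1):
        if pvals[i] <= rank * q / m:
            cutoff = rank                 # keep the largest rank that passes
    reject = [False] * m
    for i in order[:cutoff]:
        reject[i] = True
    return reject

pvals = [0.001, 0.012, 0.030, 0.020, 0.600]   # made-up p-values for five tests

bonferroni = [p <= 0.05 / len(pvals) for p in pvals]
bh = benjamini_hochberg_reject(pvals)
print(bonferroni)   # [True, False, False, False, False]
print(bh)           # [True, True, True, True, False]
```

On these p-values Bonferroni keeps one finding and BH keeps four, which is the trade-off in the table: more discoveries, in exchange for tolerating a controlled share of false ones.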

MCPower precomputes correction-adjusted critical values before running simulations, so there is no runtime overhead from using a correction.

Next Steps¶

Tutorial: ANOVA & Post-Hoc Comparisons – using Tukey correction with factor variables
Scenario Analysis – combining corrections with robustness testing
API Reference – full find_power() and find_sample_size() parameter documentation

References¶

Dunn, O. J. (1961). Multiple comparisons among means. Journal of the American Statistical Association, 56(293), 52–64.
Holm, S. (1979). A simple sequentially rejective multiple test procedure. Scandinavian Journal of Statistics, 6(2), 65–70.
Benjamini, Y., & Hochberg, Y. (1995). Controlling the false discovery rate: A practical and powerful approach to multiple testing. Journal of the Royal Statistical Society: Series B, 57(1), 289–300.
Tukey, J. W. (1953). The problem of multiple comparisons. Unpublished manuscript, Princeton University.