Tutorial: Controlling for Multiple Testing

Goal

You are testing multiple hypotheses and need to control the false positive rate so your power analysis reflects the correction you will use in your actual study.


Full Working Example

from mcpower import MCPower

# ── Medical study with 5 biomarkers ──────────────────────────────
# Some biomarkers have real effects, others are null (effect = 0)
model = MCPower("outcome = biomarker1 + biomarker2 + biomarker3 + biomarker4 + biomarker5")
model.set_simulations(400)

model.set_effects(
    "biomarker1=0.40, "    # large real effect
    "biomarker2=0.25, "    # medium real effect
    "biomarker3=0.00, "    # null — no true effect
    "biomarker4=0.10, "    # small real effect
    "biomarker5=0.00"      # null — no true effect
)

# ── Check power with Bonferroni correction ────────────────────────
model.find_power(
    sample_size=200,
    target_test="biomarker1, biomarker2, biomarker3, biomarker4, biomarker5",
    correction="bonferroni",
)

Step-by-Step Walkthrough

1. Define the model

model = MCPower("outcome = biomarker1 + biomarker2 + biomarker3 + biomarker4 + biomarker5")
model.set_simulations(400)

A study measuring five biomarkers as potential predictors of a health outcome.

2. Set effect sizes, including nulls

model.set_effects(
    "biomarker1=0.40, "
    "biomarker2=0.25, "
    "biomarker3=0.00, "
    "biomarker4=0.10, "
    "biomarker5=0.00"
)

Setting an effect to 0.00 models a predictor that has no real relationship with the outcome. This is critical for understanding false positive rates – with a correction, you expect these null effects to rarely reach significance.

3. Apply a correction

model.find_power(
    sample_size=200,
    target_test="biomarker1, biomarker2, biomarker3, biomarker4, biomarker5",
    correction="bonferroni",
)

The correction parameter adjusts significance thresholds to account for testing five effects simultaneously. Without correction, the chance of at least one false positive across 5 tests is about 23%.
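The 23% figure is just the standard familywise error calculation; a quick stdlib-only check (this arithmetic is independent of MCPower):

```python
# Familywise error rate (FWER): the chance of at least one false
# positive across m independent tests, each run at level alpha.
alpha, m = 0.05, 5

fwer_uncorrected = 1 - (1 - alpha) ** m
print(round(fwer_uncorrected, 3))   # 0.226 → "about 23%"

# Bonferroni tests each hypothesis at alpha/m instead.
fwer_bonferroni = 1 - (1 - alpha / m) ** m
print(round(fwer_bonferroni, 3))    # 0.049 → back under 0.05
```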


Output Interpretation

================================================================================
MONTE CARLO POWER ANALYSIS RESULTS
================================================================================
Multiple comparison correction: bonferroni

Power Analysis Results (N=200):
Test                                     Power    Target   Status
-------------------------------------------------------------------
biomarker1                               99.9     80       ✓
biomarker2                               92.9     80       ✓
biomarker3                               5.1      80       ✗
biomarker4                               29.1     80       ✗
biomarker5                               5.2      80       ✗

With bonferroni correction:
Test                                     Power    Target   Status
-------------------------------------------------------------------
biomarker1                               99.9     80       ✓
biomarker2                               81.2     80       ✓
biomarker3                               0.4      80       ✗
biomarker4                               13.4     80       ✗
biomarker5                               1.1      80       ✗

Result: 2/5 tests achieved target power
  • The output now shows two separate tables: the first shows uncorrected power, the second shows power after Bonferroni adjustment. Corrected power is always lower than or equal to uncorrected power.

  • Status shows ✓ if power meets the target and ✗ if it falls short. The final 2/5 result is based on the corrected table.

  • biomarker3 and biomarker5 (null effects) – the corrected false positive rates are 0.4% and 1.1%, well below the 5% threshold. The correction is working.

  • biomarker2 – the medium effect just clears the target after correction (81.2%, down from 92.9% uncorrected); with a smaller sample it would fall short.

  • biomarker4 – a small effect has very low power after correction.


Common Variations

Correction methods compared

MCPower supports four correction methods (Bonferroni, Holm, FDR/Benjamini-Hochberg, and Tukey):

# Bonferroni — most conservative FWER control
model.find_power(sample_size=200, target_test="all", correction="bonferroni")

# Holm — step-down FWER control (always >= Bonferroni power)
model.find_power(sample_size=200, target_test="all", correction="holm")

# FDR / Benjamini-Hochberg — controls false discovery rate (least conservative)
model.find_power(sample_size=200, target_test="all", correction="fdr")
# Aliases: "benjamini-hochberg" or "bh" also work

# Tukey HSD — for post-hoc pairwise factor comparisons only
model.find_power(
    sample_size=200,
    target_test="group[1] vs group[2], group[1] vs group[3], group[2] vs group[3]",
    correction="tukey",
)

When to use which correction

Situation                                    Recommended          Reasoning
---------------------------------------------------------------------------------------------
Pre-registered confirmatory study            Holm or Bonferroni   Strict FWER control expected by reviewers
Few planned comparisons (2–3)                Holm                 Controls FWER; more powerful than Bonferroni
Many comparisons (5+) in exploratory study   FDR                  Less conservative; allows more discoveries
Strict error control required                Bonferroni           Simplest and most conservative
All pairwise comparisons within a factor     Tukey                Purpose-built for this case
Mixed regular + post-hoc tests               Holm                 Corrects all tests together uniformly

General recommendation: If unsure, use Holm. It controls the family-wise error rate and is always at least as powerful as Bonferroni.
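Holm's step-down rule is simple enough to sketch in plain Python (an illustration of the procedure, not MCPower's internal code):

```python
def holm_reject(p_values, alpha=0.05):
    """Holm step-down: compare the i-th smallest p-value to
    alpha / (m - i) and stop at the first failure."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for step, idx in enumerate(order):
        if p_values[idx] <= alpha / (m - step):
            reject[idx] = True
        else:
            break  # every larger p-value fails automatically
    return reject

# Thresholds here are 0.05/3, 0.05/2, 0.05/1 in turn, so all three
# are rejected; plain Bonferroni (0.05/3 ≈ 0.0167) would reject only the first.
print(holm_reject([0.010, 0.020, 0.030]))  # [True, True, True]
```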

Focused testing strategy

Instead of testing all 5 biomarkers, focus on the ones you care most about. Fewer tests means less power loss from correction:

# Only test the biomarkers you have specific hypotheses about
model.find_power(
    sample_size=200,
    target_test="biomarker1, biomarker2",
    correction="bonferroni",
)
# Family size = 2 instead of 5 → much less power loss

This is a legitimate strategy when you have pre-registered hypotheses for specific predictors. The other predictors are still in the model as covariates, but you only correct for the tests you report.
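A back-of-envelope normal approximation shows the size of the gain (stdlib only; treat MCPower's simulated estimates as authoritative, since they account for the full model):

```python
from math import sqrt
from statistics import NormalDist

def approx_power(beta, n, alpha_per_test):
    """Two-sided z-test power under the rough approximation that the
    test statistic for a standardized effect beta is ~ Normal(beta*sqrt(n), 1)."""
    z = NormalDist()
    z_crit = z.inv_cdf(1 - alpha_per_test / 2)
    shift = beta * sqrt(n)
    return (1 - z.cdf(z_crit - shift)) + z.cdf(-z_crit - shift)

beta, n = 0.25, 200          # a biomarker2-sized effect
for m in (5, 2):             # family of 5 vs a focused family of 2
    print(m, round(approx_power(beta, n, 0.05 / m), 2))
# Roughly 0.83 at m=5 vs 0.90 at m=2 under this approximation
```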

Exploratory vs. confirmatory workflow

A common strategy is to run two analyses: an exploratory sweep with FDR to identify promising effects, then a confirmatory analysis with stricter correction:

# ── Step 1: Exploratory — which biomarkers are worth pursuing? ────
model.find_power(
    sample_size=200,
    target_test="biomarker1, biomarker2, biomarker3, biomarker4, biomarker5",
    correction="fdr",
)

# ── Step 2: Confirmatory — plan a study targeting the best candidates ──
model.find_sample_size(
    target_test="biomarker1, biomarker2",   # only the promising ones
    from_size=100,
    to_size=500,
    by=45,
    correction="holm",
)

The exploratory phase uses FDR (more lenient) to cast a wide net. The confirmatory phase uses Holm (strict FWER) on a focused set of pre-registered hypotheses.

Finding required sample size with correction

Search for the sample size that achieves 80% corrected power:

model.find_sample_size(
    target_test="biomarker1, biomarker2, biomarker4",
    from_size=50,
    to_size=600,
    by=62,
    correction="holm",
)

The result accounts for the correction – you need a larger sample than without correction.
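As a sanity check on the search range, the same normal approximation can be inverted to ballpark the required n at a Bonferroni-style per-test threshold (stdlib sketch; the smallest effect in the family drives the requirement):

```python
from math import ceil
from statistics import NormalDist

def approx_n(beta, alpha_per_test, power=0.80):
    """Invert the two-sided z-test power formula:
    n ≈ ((z_{1 - alpha/2} + z_{power}) / beta) ** 2."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha_per_test / 2)
    z_power = z.inv_cdf(power)
    return ceil(((z_alpha + z_power) / beta) ** 2)

# With three targets, the small biomarker4 effect (0.10) dominates:
print(approx_n(0.10, 0.05 / 3))   # on the order of 1000+ participants
print(approx_n(0.25, 0.05 / 3))   # the medium effect needs far fewer
```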

Combining corrections with scenario analysis

Test robustness under realistic assumptions:

model.find_sample_size(
    target_test="biomarker1, biomarker2",
    from_size=100,
    to_size=500,
    by=45,
    correction="holm",
    scenarios=True,
)

The scenario analysis (Optimistic / Realistic / Doomer) is applied on top of the correction, giving you the most conservative planning estimate.

Corrections with post-hoc comparisons

When combining standard tests and post-hoc comparisons, the correction method determines what gets corrected:

model = MCPower("outcome = treatment + covariate")
model.set_simulations(400)
model.set_variable_type("treatment=(factor,3)")
model.set_effects("treatment[2]=0.50, treatment[3]=0.80, covariate=0.25")

# Bonferroni: ALL tests form one correction family
model.find_power(
    sample_size=200,
    target_test="covariate, treatment[1] vs treatment[2], treatment[1] vs treatment[3]",
    correction="bonferroni",
)
# Family size = 3 → corrected threshold = 0.05/3

# Tukey: ONLY post-hoc contrasts are corrected
model.find_power(
    sample_size=200,
    target_test="covariate, treatment[1] vs treatment[2], treatment[1] vs treatment[3]",
    correction="tukey",
)
# covariate → corrected shows "-" (not applicable)
# post-hoc tests → Tukey-corrected

How corrections affect the overall F-test

The overall F-test is never included in the correction family for Bonferroni/Holm/FDR corrections:

model.find_power(
    sample_size=200,
    target_test="all",
    correction="bonferroni",
)
# "overall" F-test power is NOT reduced by the correction
# Only the individual t-tests are corrected

Correction Methods: Technical Summary

Method       Type   How It Works                                                                    Power
---------------------------------------------------------------------------------------------
Bonferroni   FWER   Divides alpha by the number of tests: alpha/m                                   Lowest
Holm         FWER   Step-down: tests ordered by p-value, thresholds get progressively less strict   >= Bonferroni
FDR          FDR    Step-up: controls the expected proportion of false discoveries                  Highest
Tukey        FWER   Studentized range distribution for pairwise factor comparisons                  Depends on factor levels

  • FWER (Family-Wise Error Rate) – controls the probability of making any false positive. Use when even one false positive is unacceptable.

  • FDR (False Discovery Rate) – controls the proportion of false positives among significant results. Use when some false positives are tolerable.
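Like Holm, the BH step-up rule fits in a few lines of plain Python (illustration only, not MCPower's internals):

```python
def bh_reject(p_values, q=0.05):
    """Benjamini-Hochberg step-up: find the LARGEST k such that the
    k-th smallest p-value is <= (k/m)*q, then reject the k smallest."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    k_max = 0
    for k, idx in enumerate(order, start=1):
        if p_values[idx] <= (k / m) * q:
            k_max = k  # keep scanning: later thresholds are larger
    reject = [False] * m
    for idx in order[:k_max]:
        reject[idx] = True
    return reject

# Thresholds are 0.0125, 0.025, 0.0375, 0.05; all four pass, while
# Bonferroni (0.05/4 = 0.0125) would reject only the first.
print(bh_reject([0.005, 0.020, 0.030, 0.040]))  # [True, True, True, True]
```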

MCPower precomputes correction-adjusted critical values before running simulations, so there is no runtime overhead from using a correction.


References

  • Dunn, O. J. (1961). Multiple comparisons among means. Journal of the American Statistical Association, 56(293), 52–64.

  • Holm, S. (1979). A simple sequentially rejective multiple test procedure. Scandinavian Journal of Statistics, 6(2), 65–70.

  • Benjamini, Y., & Hochberg, Y. (1995). Controlling the false discovery rate: A practical and powerful approach to multiple testing. Journal of the Royal Statistical Society: Series B, 57(1), 289–300.

  • Tukey, J. W. (1953). The problem of multiple comparisons. Unpublished manuscript, Princeton University.