--- jupytext: text_representation: format_name: myst kernelspec: name: python3 --- # Tutorial: Controlling for Multiple Testing ```{code-cell} ipython3 :tags: [remove-input, remove-output] import numpy as np np.random.seed(42) import warnings warnings.filterwarnings("ignore", message="Low simulation") ``` ## Goal You are testing multiple hypotheses and need to control the false positive rate so your power analysis reflects the correction you will use in your actual study. --- ## Full Working Example ```{code-cell} ipython3 :tags: [remove-output, remove-stderr] from mcpower import MCPower # ── Medical study with 5 biomarkers ────────────────────────────── # Some biomarkers have real effects, others are null (effect = 0) model = MCPower("outcome = biomarker1 + biomarker2 + biomarker3 + biomarker4 + biomarker5") model.set_simulations(400) model.set_effects( "biomarker1=0.40, " # large real effect "biomarker2=0.25, " # medium real effect "biomarker3=0.00, " # null — no true effect "biomarker4=0.10, " # small real effect "biomarker5=0.00" # null — no true effect ) # ── Check power with Bonferroni correction ──────────────────────── model.find_power( sample_size=200, target_test="biomarker1, biomarker2, biomarker3, biomarker4, biomarker5", correction="bonferroni", ) ``` --- ## Step-by-Step Walkthrough ### 1. Define the model ```{code-cell} ipython3 :tags: [remove-output] model = MCPower("outcome = biomarker1 + biomarker2 + biomarker3 + biomarker4 + biomarker5") model.set_simulations(400) ``` A study measuring five biomarkers as potential predictors of a health outcome. ### 2. Set effect sizes, including nulls ```{code-cell} ipython3 :tags: [remove-output] model.set_effects( "biomarker1=0.40, " "biomarker2=0.25, " "biomarker3=0.00, " "biomarker4=0.10, " "biomarker5=0.00" ) ``` Setting an effect to `0.00` models a predictor that has **no real relationship** with the outcome. This is critical for understanding false positive rates -- with a correction, you expect these null effects to rarely reach significance. ### 3. Apply a correction ```{code-cell} ipython3 :tags: [remove-output, remove-stderr] model.find_power( sample_size=200, target_test="biomarker1, biomarker2, biomarker3, biomarker4, biomarker5", correction="bonferroni", ) ``` The `correction` parameter adjusts significance thresholds to account for testing five effects simultaneously. Without correction, the chance of at least one false positive across 5 tests is about 23%. --- ## Output Interpretation ``` ================================================================================ MONTE CARLO POWER ANALYSIS RESULTS ================================================================================ Multiple comparison correction: bonferroni Power Analysis Results (N=200): Test Power Target Status ------------------------------------------------------------------- biomarker1 99.9 80 ✓ biomarker2 92.9 80 ✓ biomarker3 5.1 80 ✗ biomarker4 29.1 80 ✗ biomarker5 5.2 80 ✗ With bonferroni correction: Test Power Target Status ------------------------------------------------------------------- biomarker1 99.9 80 ✓ biomarker2 81.2 80 ✓ biomarker3 0.4 80 ✗ biomarker4 13.4 80 ✗ biomarker5 1.1 80 ✗ Result: 2/5 tests achieved target power ``` - The output now shows **two separate tables**: the first shows uncorrected power, the second shows power after Bonferroni adjustment. Corrected power is always lower than or equal to uncorrected power. - **Status** -- `✓` if power meets the target, `✗` if it falls short. The final result is based on the **corrected** table. - **biomarker3 and biomarker5** (null effects) -- the corrected false positive rates are 0.4% and 1.1%, well below the 5% threshold. The correction is working. - **biomarker2** -- a medium effect needs a larger sample to survive Bonferroni correction. - **biomarker4** -- a small effect has very low power after correction. --- ## Common Variations ### Correction methods compared MCPower supports four correction methods (Bonferroni, Holm, FDR/Benjamini-Hochberg, and Tukey): ```python # Bonferroni — most conservative FWER control model.find_power(sample_size=200, target_test="all", correction="bonferroni") # Holm — step-down FWER control (always >= Bonferroni power) model.find_power(sample_size=200, target_test="all", correction="holm") # FDR / Benjamini-Hochberg — controls false discovery rate (least conservative) model.find_power(sample_size=200, target_test="all", correction="fdr") # Aliases: "benjamini-hochberg" or "bh" also work # Tukey HSD — for post-hoc pairwise factor comparisons only model.find_power( sample_size=200, target_test="group[1] vs group[2], group[1] vs group[3], group[2] vs group[3]", correction="tukey", ) ``` ### When to use which correction | Situation | Recommended | Reasoning | |---|---|---| | Pre-registered confirmatory study | **Holm** or **Bonferroni** | Strict FWER control expected by reviewers | | Few planned comparisons (2--3) | **Holm** | Controls FWER; more powerful than Bonferroni | | Many comparisons (5+) in exploratory study | **FDR** | Less conservative; allows more discoveries | | Strict error control required | **Bonferroni** | Simplest and most conservative | | All pairwise comparisons within a factor | **Tukey** | Purpose-built for this case | | Mixed regular + post-hoc tests | **Holm** | Corrects all tests together uniformly | **General recommendation:** If unsure, use **Holm**. It controls the family-wise error rate and is always at least as powerful as Bonferroni. ### Focused testing strategy Instead of testing all 5 biomarkers, focus on the ones you care most about. Fewer tests means less power loss from correction: ```{code-cell} ipython3 :tags: [remove-output, remove-stderr] # Only test the biomarkers you have specific hypotheses about model.find_power( sample_size=200, target_test="biomarker1, biomarker2", correction="bonferroni", ) # Family size = 2 instead of 5 → much less power loss ``` This is a legitimate strategy when you have **pre-registered hypotheses** for specific predictors. The other predictors are still in the model as covariates, but you only correct for the tests you report. ### Exploratory vs. confirmatory workflow A common strategy is to run two analyses: an exploratory sweep with FDR to identify promising effects, then a confirmatory analysis with stricter correction: ```{code-cell} ipython3 :tags: [remove-output, remove-stderr] # ── Step 1: Exploratory — which biomarkers are worth pursuing? ──── model.find_power( sample_size=200, target_test="biomarker1, biomarker2, biomarker3, biomarker4, biomarker5", correction="fdr", ) # ── Step 2: Confirmatory — plan a study targeting the best candidates ── model.find_sample_size( target_test="biomarker1, biomarker2", # only the promising ones from_size=100, to_size=500, by=45, correction="holm", ) ``` The exploratory phase uses FDR (more lenient) to cast a wide net. The confirmatory phase uses Holm (strict FWER) on a focused set of pre-registered hypotheses. ### Finding required sample size with correction Search for the sample size that achieves 80% **corrected** power: ```{code-cell} ipython3 :tags: [remove-output, remove-stderr] model.find_sample_size( target_test="biomarker1, biomarker2, biomarker4", from_size=50, to_size=600, by=62, correction="holm", ) ``` The result accounts for the correction -- you need a larger sample than without correction. ### Combining corrections with scenario analysis Test robustness under realistic assumptions: ```{code-cell} ipython3 :tags: [remove-output, remove-stderr] model.find_sample_size( target_test="biomarker1, biomarker2", from_size=100, to_size=500, by=45, correction="holm", scenarios=True, ) ``` The scenario analysis (Optimistic / Realistic / Doomer) is applied **on top of** the correction, giving you the most conservative planning estimate. ### Corrections with post-hoc comparisons When combining standard tests and post-hoc comparisons, the correction method determines what gets corrected: ```{code-cell} ipython3 :tags: [remove-output, remove-stderr] model = MCPower("outcome = treatment + covariate") model.set_simulations(400) model.set_variable_type("treatment=(factor,3)") model.set_effects("treatment[2]=0.50, treatment[3]=0.80, covariate=0.25") # Bonferroni: ALL tests form one correction family model.find_power( sample_size=200, target_test="covariate, treatment[1] vs treatment[2], treatment[1] vs treatment[3]", correction="bonferroni", ) # Family size = 3 → corrected threshold = 0.05/3 # Tukey: ONLY post-hoc contrasts are corrected model.find_power( sample_size=200, target_test="covariate, treatment[1] vs treatment[2], treatment[1] vs treatment[3]", correction="tukey", ) # covariate → corrected shows "-" (not applicable) # post-hoc tests → Tukey-corrected ``` ### How corrections affect the overall F-test The overall F-test is never included in the correction family for Bonferroni/Holm/FDR corrections: ```{code-cell} ipython3 :tags: [remove-output, remove-stderr] model.find_power( sample_size=200, target_test="all", correction="bonferroni", ) # "overall" F-test power is NOT reduced by the correction # Only the individual t-tests are corrected ``` --- ## Correction Methods: Technical Summary | Method | Type | How It Works | Power | |---|---|---|---| | **Bonferroni** | FWER | Divides alpha by number of tests: alpha/m | Lowest | | **Holm** | FWER | Step-down: tests ordered by p-value, thresholds get progressively less strict | >= Bonferroni | | **FDR** | FDR | Step-up: controls the expected proportion of false discoveries | Highest | | **Tukey** | FWER | Uses Studentized Range distribution for pairwise factor comparisons | Depends on factor levels | - **FWER** (Family-Wise Error Rate) -- controls the probability of making **any** false positive. Use when even one false positive is unacceptable. - **FDR** (False Discovery Rate) -- controls the **proportion** of false positives among significant results. Use when some false positives are tolerable. MCPower precomputes correction-adjusted critical values **before** running simulations, so there is no runtime overhead from using a correction. --- ## Next Steps - **[Tutorial: ANOVA & Post-Hoc Comparisons](anova-posthoc.md)** -- using Tukey correction with factor variables - **[Scenario Analysis](../concepts/scenario-analysis.md)** -- combining corrections with robustness testing - **[API Reference](../api/index.md)** -- full `find_power()` and `find_sample_size()` parameter documentation --- ## References - Dunn, O. J. (1961). Multiple comparisons among means. *Journal of the American Statistical Association*, *56*(293), 52--64. - Holm, S. (1979). A simple sequentially rejective multiple test procedure. *Scandinavian Journal of Statistics*, *6*(2), 65--70. - Benjamini, Y., & Hochberg, Y. (1995). Controlling the false discovery rate: A practical and powerful approach to multiple testing. *Journal of the Royal Statistical Society: Series B*, *57*(1), 289--300. - Tukey, J. W. (1953). *The problem of multiple comparisons*. Unpublished manuscript, Princeton University.