--- jupytext: text_representation: format_name: myst kernelspec: name: python3 --- # Tutorial: Power Analysis with Your Own Data ```{code-cell} ipython3 :tags: [remove-input, remove-output] import numpy as np np.random.seed(42) import warnings warnings.filterwarnings("ignore", message="Low simulation") ``` ## Goal You have pilot data or a related dataset and want your power analysis to reflect real-world distributions instead of synthetic ones. --- ## Full Working Example ```python import pandas as pd from mcpower import MCPower # ── Load your CSV ────────────────────────────────────────────────── data = pd.read_csv("cars.csv") # ── Define the model ─────────────────────────────────────────────── model = MCPower("mpg = hp + wt + cyl") # ── Upload predictor columns ────────────────────────────────────── model.upload_data( data[["hp", "wt", "cyl"]], preserve_correlation="strict", # default — bootstrap whole rows preserve_factor_level_names=True, # default — use original values as level names ) # ── Set effect sizes (still required after upload) ──────────────── # cyl has values [4, 6, 8] → auto-detected as factor # Reference level: 4 (first sorted value) # Dummies: cyl[6] and cyl[8] model.set_effects("hp=0.25, wt=0.40, cyl[6]=0.50, cyl[8]=0.80") # ── Run the analysis ────────────────────────────────────────────── model.find_power(sample_size=100, target_test="all") ``` --- ## Step-by-Step Walkthrough ### 1. Load the data ```python data = pd.read_csv("cars.csv") ``` Read your CSV into a pandas DataFrame. MCPower also accepts a plain dict of lists (see [Alternatives Without pandas](#alternatives-without-pandas) below). ### 2. Define the model ```python model = MCPower("mpg = hp + wt + cyl") ``` Write the formula with the outcome on the left and predictors on the right, separated by `=` or `~`. Only **predictor columns** are uploaded; the outcome is always simulated. ### 3. Upload predictor data ```python model.upload_data( data[["hp", "wt", "cyl"]], preserve_correlation="strict", preserve_factor_level_names=True, ) ``` Key points: - **`upload_data()` returns `self`** -- it can be chained with other method calls. - Pass only the predictor columns that appear in your formula. - MCPower **auto-detects variable types** based on unique value counts: | Unique Values | Detected Type | |---|---| | 1 | Dropped (constant) | | 2 | Binary | | 3--6 | Factor | | 7+ | Continuous | - **String columns** with 2--20 unique values are automatically detected as factors. In this example, `hp` and `wt` have many unique values (continuous), while `cyl` has three unique values `[4, 6, 8]` (factor). ### 4. Set effect sizes ```python model.set_effects("hp=0.25, wt=0.40, cyl[6]=0.50, cyl[8]=0.80") ``` Uploading data provides distributions, **not** effect sizes. You must still specify how strongly each predictor relates to the outcome. With `preserve_factor_level_names=True` (the default), factor dummies use the **original data values** as level names: - `cyl[6]` -- comparing cylinders=6 to the reference (cylinders=4) - `cyl[8]` -- comparing cylinders=8 to the reference (cylinders=4) The reference level is the first sorted unique value (here, 4). ### 5. Run the analysis ```python model.find_power(sample_size=100, target_test="all") ``` `target_test="all"` reports power for every individual predictor plus the overall F-test. --- ## Output Interpretation ``` ================================================================================ MONTE CARLO POWER ANALYSIS RESULTS ================================================================================ Power Analysis Results (N=100): Test Power Target Status ------------------------------------------------------------------- overall 100.0 80 ✓ hp 24.7 80 ✗ wt 64.8 80 ✗ cyl[6] 32.8 80 ✗ cyl[8] 34.0 80 ✗ Result: 1/5 tests achieved target power ``` - **Power** -- percentage of 1,600 simulations where the test reached significance. - **Target** -- your target power level (default: 80%). - **Status** -- `✓` if power meets the target, `✗` if it falls short. - In this example, `hp`, `wt`, `cyl[6]`, and `cyl[8]` all need a larger sample size to reach 80% power. --- ## Common Variations ### Alternatives Without pandas Pass a dict of lists instead of a DataFrame: ```python from mcpower import MCPower data = { "hp": [110, 93, 175, 105, 245], "wt": [2.62, 2.32, 3.21, 3.15, 3.44], "cyl": [6, 4, 8, 6, 8], } model = MCPower("mpg = hp + wt + cyl") model.upload_data(data) model.set_effects("hp=0.25, wt=0.40, cyl[6]=0.50, cyl[8]=0.80") model.find_power(sample_size=100) ``` ### String Columns as Factors String columns are automatically detected as factors when they have 2--20 unique values: ```python data = pd.read_csv("cars_with_origin.csv") model = MCPower("mpg = origin + hp") model.upload_data(data[["origin", "hp"]]) # origin has values ["Europe", "Japan", "USA"] → factor # Reference: "Europe" (first alphabetically) # Dummies: origin[Japan], origin[USA] model.set_effects("origin[Japan]=0.20, origin[USA]=0.50, hp=0.25") model.find_power(sample_size=120) ``` ### Correlation Preservation Modes The `preserve_correlation` parameter controls how MCPower handles relationships between uploaded variables: ```python # "strict" (default) — bootstrap whole rows, preserving exact relationships model.upload_data(data, preserve_correlation="strict") # "partial" — compute correlations from data, allow manual overrides model.upload_data(data, preserve_correlation="partial") model.set_correlations("corr(hp, wt)=0.6") # override one pair # "no" — ignore correlations from data entirely model.upload_data(data, preserve_correlation="no") model.set_correlations("corr(hp, wt)=0.3") # set all manually ``` | Mode | Correlation Source | Best For | |---|---|---| | `"strict"` (default) | Bootstrapped rows | Most realistic simulation | | `"partial"` | Data + manual overrides | Empirical baseline with adjustments | | `"no"` | Manual only | Full manual control | ### Overriding Auto-Detection with `data_types` If auto-detection classifies a variable incorrectly, override it: ```python # "rating" has 5 unique values → auto-detected as factor # Override to treat as continuous model.upload_data( data[["group", "score", "rating"]], data_types={"rating": "continuous"}, ) ``` You can also select the reference level for a factor: ```python # Numeric reference level model.upload_data( data[["hp", "wt", "cyl"]], data_types={"cyl": ("factor", 8)}, # cyl=8 becomes reference ) # Dummies are now: cyl[4], cyl[6] # String reference level model.upload_data( data[["origin", "hp"]], data_types={"origin": ("factor", "USA")}, # USA becomes reference ) # Dummies are now: origin[Europe], origin[Japan] ``` ### Named Factor Levels Without Data If you do not have data but want meaningful level names instead of integer indices, use `set_factor_levels()`: ```{code-cell} ipython3 :tags: [remove-output, remove-stderr] from mcpower import MCPower model = MCPower("outcome = group + age") model.set_simulations(400) model.set_variable_type("group=(factor,3)") model.set_factor_levels("group=placebo,low_dose,high_dose") # Now effects use named levels: model.set_effects("group[low_dose]=0.50, group[high_dose]=0.80, age=0.25") model.find_power(sample_size=150) ``` The first listed level (`placebo`) becomes the reference. This is purely a labeling feature -- it does not change the statistical computation. ### Mixing Uploaded and Synthetic Variables Variables in the formula that are **not** in the uploaded data are generated synthetically: ```python model = MCPower("outcome = hp + wt + treatment") model.upload_data(data[["hp", "wt"]]) # empirical distributions model.set_variable_type("treatment=binary") # synthetic variable model.set_effects("hp=0.25, wt=0.40, treatment=0.50") model.find_power(sample_size=100) ``` **Note:** In `"strict"` mode, cross-correlations between uploaded and non-uploaded variables are set to zero with a warning. ### Integer-Indexed Dummies Set `preserve_factor_level_names=False` to use integer-indexed dummies instead of data values: ```python model.upload_data(data[["hp", "wt", "cyl"]], preserve_factor_level_names=False) # cyl dummies are now: cyl[2], cyl[3] instead of cyl[6], cyl[8] model.set_effects("hp=0.25, wt=0.40, cyl[2]=0.50, cyl[3]=0.80") ``` The default (`True`) is recommended — it produces clearer, more readable output. --- ## Next Steps - **[Tutorial: CSV Preparation](csv-preparation.md)** -- formatting your CSV file correctly - **[Effect Sizes](../concepts/effect-sizes.md)** -- choosing appropriate effect sizes - **[Variable Types](../concepts/variable-types.md)** -- all available variable types - **[Correlations](../concepts/correlations.md)** -- setting predictor correlations - **[API Reference](../api/index.md)** -- full `upload_data()` parameter documentation