Tutorial: Power Analysis with Your Own Data
Goal
You have pilot data or a related dataset and want your power analysis to reflect real-world distributions instead of synthetic ones.
Full Working Example
import pandas as pd
from mcpower import MCPower
# ── Load your CSV ──────────────────────────────────────────────────
data = pd.read_csv("cars.csv")
# ── Define the model ───────────────────────────────────────────────
model = MCPower("mpg = hp + wt + cyl")
# ── Upload predictor columns ──────────────────────────────────────
model.upload_data(
    data[["hp", "wt", "cyl"]],
    preserve_correlation="strict",      # default: bootstrap whole rows
    preserve_factor_level_names=True,   # default: use original values as level names
)
# ── Set effect sizes (still required after upload) ────────────────
# cyl has values [4, 6, 8] → auto-detected as factor
# Reference level: 4 (first sorted value)
# Dummies: cyl[6] and cyl[8]
model.set_effects("hp=0.25, wt=0.40, cyl[6]=0.50, cyl[8]=0.80")
# ── Run the analysis ──────────────────────────────────────────────
model.find_power(sample_size=100, target_test="all")
Step-by-Step Walkthrough
1. Load the data
data = pd.read_csv("cars.csv")
Read your CSV into a pandas DataFrame. MCPower also accepts a plain dict of lists (see Alternatives Without pandas below).
2. Define the model
model = MCPower("mpg = hp + wt + cyl")
Write the formula with the outcome on the left and predictors on the right, separated by = or ~. Only predictor columns are uploaded; the outcome is always simulated.
3. Upload predictor data
model.upload_data(
    data[["hp", "wt", "cyl"]],
    preserve_correlation="strict",
    preserve_factor_level_names=True,
)
Key points:
upload_data() returns self, so it can be chained with other method calls.
Pass only the predictor columns that appear in your formula.
MCPower auto-detects variable types based on unique value counts:
| Unique Values | Detected Type |
|---|---|
| 1 | Dropped (constant) |
| 2 | Binary |
| 3–6 | Factor |
| 7+ | Continuous |
String columns with 2–20 unique values are automatically detected as factors.
In this example, hp and wt have many unique values (continuous), while cyl has three unique values [4, 6, 8] (factor).
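The thresholds above can be written out as a small classifier. This is only a sketch of the documented rule, not MCPower's internal code: `detect_type` is a hypothetical helper, and two cases are assumptions because the text does not cover them (a constant string column is treated as dropped, and a string column with more than 20 unique values is reported as "undetermined").

```python
def detect_type(values):
    """Classify a column using the unique-value thresholds listed above.

    Hypothetical helper for illustration; not part of the MCPower API.
    """
    unique = set(values)
    n_unique = len(unique)
    if n_unique == 0:
        raise ValueError("column is empty")
    if n_unique == 1:
        return "dropped"  # constant column
    if any(isinstance(v, str) for v in unique):
        # Documented rule: string columns with 2-20 unique values are factors.
        # Behavior above 20 levels is not documented here, so flag it.
        return "factor" if n_unique <= 20 else "undetermined"
    if n_unique == 2:
        return "binary"
    if n_unique <= 6:
        return "factor"
    return "continuous"


print(detect_type([6, 4, 8, 6, 8]))                     # factor
print(detect_type([110, 93, 175, 105, 245, 150, 66]))   # continuous
print(detect_type(["Europe", "Japan", "USA"]))          # factor
```

Running a check like this on your own columns before uploading can show which ones are likely to need a data_types override.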
4. Set effect sizes
model.set_effects("hp=0.25, wt=0.40, cyl[6]=0.50, cyl[8]=0.80")
Uploading data provides distributions, not effect sizes. You must still specify how strongly each predictor relates to the outcome.
With preserve_factor_level_names=True (the default), factor dummies use the original data values as level names:
cyl[6] – comparing cylinders=6 to the reference (cylinders=4)
cyl[8] – comparing cylinders=8 to the reference (cylinders=4)
The reference level is the first sorted unique value (here, 4).
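To preview which dummy names set_effects() will expect, the naming rule can be reproduced in a few lines. `dummy_names` is a hypothetical helper that mirrors the rule as described here (sort the unique values, take the first as reference); it is not an MCPower function.

```python
def dummy_names(variable, values):
    """Return (reference_level, dummy_names) for a factor column.

    Hypothetical helper mirroring the naming rule described above.
    """
    levels = sorted(set(values))
    reference, others = levels[0], levels[1:]
    return reference, [f"{variable}[{level}]" for level in others]


reference, dummies = dummy_names("cyl", [6, 4, 8, 6, 8])
print(reference, dummies)  # 4 ['cyl[6]', 'cyl[8]']
```

The same sorting applies to string levels, so ["USA", "Europe", "Japan"] would give "Europe" as the reference.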
5. Run the analysis
model.find_power(sample_size=100, target_test="all")
target_test="all" reports power for every individual predictor plus the overall F-test.
Output Interpretation
================================================================================
MONTE CARLO POWER ANALYSIS RESULTS
================================================================================
Power Analysis Results (N=100):
Test       Power    Target    Status
-------------------------------------------------------------------
overall    100.0    80        ✓
hp          24.7    80        ✗
wt          64.8    80        ✗
cyl[6]      32.8    80        ✗
cyl[8]      34.0    80        ✗
Result: 1/5 tests achieved target power
Power – percentage of 1,600 simulations where the test reached significance.
Target – your target power level (default: 80%).
Status – ✓ if power meets the target, ✗ if it falls short.
In this example, hp, wt, cyl[6], and cyl[8] all need a larger sample size to reach 80% power.
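Because each power value is a proportion of simulations, it carries Monte Carlo noise. The sketch below re-derives the "1/5 tests" summary from the reported values and adds a rough binomial standard error for each estimate. The standard error is a textbook approximation that assumes independent simulations; it is not something the output above reports, and `monte_carlo_se` is a hypothetical helper.

```python
import math

# Power values (percent) copied from the output above.
results = {"overall": 100.0, "hp": 24.7, "wt": 64.8, "cyl[6]": 32.8, "cyl[8]": 34.0}
TARGET = 80.0
N_SIMS = 1600  # number of simulations quoted above


def monte_carlo_se(power_pct, n_sims):
    """Binomial standard error of a simulated power estimate, in percentage points."""
    p = power_pct / 100.0
    return 100.0 * math.sqrt(p * (1.0 - p) / n_sims)


achieved = [name for name, power in results.items() if power >= TARGET]
print(f"{len(achieved)}/{len(results)} tests achieved target power")

for name, power in results.items():
    print(f"{name:8s} {power:5.1f} +/- {monte_carlo_se(power, N_SIMS):.2f}")
```

For wt, the estimate of 64.8% comes with a standard error of roughly 1.2 percentage points, so small differences between runs are expected.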
Common Variations
Alternatives Without pandas
Pass a dict of lists instead of a DataFrame:
from mcpower import MCPower
data = {
    "hp": [110, 93, 175, 105, 245],
    "wt": [2.62, 2.32, 3.21, 3.15, 3.44],
    "cyl": [6, 4, 8, 6, 8],
}
model = MCPower("mpg = hp + wt + cyl")
model.upload_data(data)
model.set_effects("hp=0.25, wt=0.40, cyl[6]=0.50, cyl[8]=0.80")
model.find_power(sample_size=100)
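If the data lives in a CSV file, the dict of lists can be built with the standard library alone. The sketch below is one possible way to do it: `read_columns` and `to_number` are hypothetical helpers, and the inline sample stands in for a real file. It only handles numeric columns, so string factor columns would need separate handling.

```python
import csv
import io


def to_number(text):
    """Parse a CSV cell as int when possible, otherwise as float."""
    try:
        return int(text)
    except ValueError:
        return float(text)


def read_columns(file_obj, columns):
    """Read the selected numeric columns from a CSV into a dict of lists."""
    data = {name: [] for name in columns}
    for row in csv.DictReader(file_obj):
        for name in columns:
            data[name].append(to_number(row[name]))
    return data


# With a real file: open("cars.csv", newline="") and pass the handle instead.
sample = io.StringIO("mpg,hp,wt,cyl\n21.0,110,2.62,6\n22.8,93,2.32,4\n")
data = read_columns(sample, ["hp", "wt", "cyl"])
print(data)  # {'hp': [110, 93], 'wt': [2.62, 2.32], 'cyl': [6, 4]}
```

Keeping integer cells as int rather than float means a column such as cyl retains values like 6 and 8 rather than 6.0 and 8.0.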
String Columns as Factors
String columns are automatically detected as factors when they have 2–20 unique values:
data = pd.read_csv("cars_with_origin.csv")
model = MCPower("mpg = origin + hp")
model.upload_data(data[["origin", "hp"]])
# origin has values ["Europe", "Japan", "USA"] → factor
# Reference: "Europe" (first alphabetically)
# Dummies: origin[Japan], origin[USA]
model.set_effects("origin[Japan]=0.20, origin[USA]=0.50, hp=0.25")
model.find_power(sample_size=120)
Correlation Preservation Modes
The preserve_correlation parameter controls how MCPower handles relationships between uploaded variables:
# "strict" (default) — bootstrap whole rows, preserving exact relationships
model.upload_data(data, preserve_correlation="strict")
# "partial" — compute correlations from data, allow manual overrides
model.upload_data(data, preserve_correlation="partial")
model.set_correlations("corr(hp, wt)=0.6") # override one pair
# "no" — ignore correlations from data entirely
model.upload_data(data, preserve_correlation="no")
model.set_correlations("corr(hp, wt)=0.3") # set all manually
| Mode | Correlation Source | Best For |
|---|---|---|
| "strict" | Bootstrapped rows | Most realistic simulation |
| "partial" | Data + manual overrides | Empirical baseline with adjustments |
| "no" | Manual only | Full manual control |
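To see why bootstrapping whole rows preserves relationships exactly, consider a generic row bootstrap. This is a sketch of the general technique, not MCPower's implementation, and `bootstrap_rows` is a hypothetical helper: each draw picks one row index and copies every column at that index, so values that occurred together in the data stay together in the sample.

```python
import random


def bootstrap_rows(data, n, seed=None):
    """Resample n whole rows with replacement from a dict of equal-length lists."""
    rng = random.Random(seed)
    names = list(data)
    n_rows = len(data[names[0]])
    picks = [rng.randrange(n_rows) for _ in range(n)]
    return {name: [data[name][i] for i in picks] for name in names}


data = {
    "hp": [110, 93, 175, 105, 245],
    "wt": [2.62, 2.32, 3.21, 3.15, 3.44],
}
sample = bootstrap_rows(data, n=100, seed=0)

original_rows = set(zip(data["hp"], data["wt"]))
sampled_rows = set(zip(sample["hp"], sample["wt"]))
print(sampled_rows <= original_rows)  # True: every sampled (hp, wt) pair exists in the data
```

Resampling each column independently would instead break these pairings, so the correlations would then have to be supplied some other way.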
Overriding Auto-Detection with data_types
If auto-detection classifies a variable incorrectly, override it:
# "rating" has 5 unique values → auto-detected as factor
# Override to treat as continuous
model.upload_data(
    data[["group", "score", "rating"]],
    data_types={"rating": "continuous"},
)
You can also select the reference level for a factor:
# Numeric reference level
model.upload_data(
    data[["hp", "wt", "cyl"]],
    data_types={"cyl": ("factor", 8)},  # cyl=8 becomes reference
)
# Dummies are now: cyl[4], cyl[6]
# String reference level
model.upload_data(
    data[["origin", "hp"]],
    data_types={"origin": ("factor", "USA")},  # USA becomes reference
)
# Dummies are now: origin[Europe], origin[Japan]
Named Factor Levels Without Data
If you do not have data but want meaningful level names instead of integer indices, use set_factor_levels():
from mcpower import MCPower
model = MCPower("outcome = group + age")
model.set_simulations(400)
model.set_variable_type("group=(factor,3)")
model.set_factor_levels("group=placebo,low_dose,high_dose")
# Now effects use named levels:
model.set_effects("group[low_dose]=0.50, group[high_dose]=0.80, age=0.25")
model.find_power(sample_size=150)
The first listed level (placebo) becomes the reference. This is purely a labeling feature – it does not change the statistical computation.
Mixing Uploaded and Synthetic Variables
Variables in the formula that are not in the uploaded data are generated synthetically:
model = MCPower("outcome = hp + wt + treatment")
model.upload_data(data[["hp", "wt"]]) # empirical distributions
model.set_variable_type("treatment=binary") # synthetic variable
model.set_effects("hp=0.25, wt=0.40, treatment=0.50")
model.find_power(sample_size=100)
Note: In "strict" mode, cross-correlations between uploaded and non-uploaded variables are set to zero with a warning.
Integer-Indexed Dummies
Set preserve_factor_level_names=False to use integer-indexed dummies instead of data values:
model.upload_data(data[["hp", "wt", "cyl"]], preserve_factor_level_names=False)
# cyl dummies are now: cyl[2], cyl[3] instead of cyl[6], cyl[8]
model.set_effects("hp=0.25, wt=0.40, cyl[2]=0.50, cyl[3]=0.80")
The default (True) is recommended because it produces clearer, more readable output.
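When moving between the two naming schemes, it can help to tabulate how integer-indexed names line up with value-based names. The sketch below assumes the index is the 1-based position in the sorted levels, which is consistent with the cyl[2], cyl[3] example above but is not confirmed beyond it; `index_to_level` is a hypothetical helper.

```python
def index_to_level(variable, values):
    """Map integer-indexed dummy names to value-based dummy names.

    Assumes indices are 1-based positions in the sorted unique levels.
    """
    levels = sorted(set(values))
    return {
        f"{variable}[{i}]": f"{variable}[{level}]"
        for i, level in enumerate(levels, start=1)
        if i > 1  # index 1 is the reference level and gets no dummy
    }


print(index_to_level("cyl", [6, 4, 8, 6, 8]))
# {'cyl[2]': 'cyl[6]', 'cyl[3]': 'cyl[8]'}
```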
Next Steps
Tutorial: CSV Preparation – formatting your CSV file correctly
Effect Sizes – choosing appropriate effect sizes
Variable Types – all available variable types
Correlations – setting predictor correlations
API Reference – full upload_data() parameter documentation