Tutorial: Power Analysis with Your Own Data

Goal

You have pilot data or a related dataset and want your power analysis to reflect real-world distributions instead of synthetic ones.


Full Working Example

import pandas as pd
from mcpower import MCPower

# ── Load your CSV ──────────────────────────────────────────────────
data = pd.read_csv("cars.csv")

# ── Define the model ───────────────────────────────────────────────
model = MCPower("mpg = hp + wt + cyl")

# ── Upload predictor columns ──────────────────────────────────────
model.upload_data(
    data[["hp", "wt", "cyl"]],
    preserve_correlation="strict",          # default — bootstrap whole rows
    preserve_factor_level_names=True,       # default — use original values as level names
)

# ── Set effect sizes (still required after upload) ────────────────
# cyl has values [4, 6, 8] → auto-detected as factor
# Reference level: 4 (first sorted value)
# Dummies: cyl[6] and cyl[8]
model.set_effects("hp=0.25, wt=0.40, cyl[6]=0.50, cyl[8]=0.80")

# ── Run the analysis ──────────────────────────────────────────────
model.find_power(sample_size=100, target_test="all")

Step-by-Step Walkthrough

1. Load the data

data = pd.read_csv("cars.csv")

Read your CSV into a pandas DataFrame. MCPower also accepts a plain dict of lists (see Alternatives Without pandas below).

2. Define the model

model = MCPower("mpg = hp + wt + cyl")

Write the formula with the outcome on the left and predictors on the right, separated by = or ~. Only predictor columns are uploaded; the outcome is always simulated.

3. Upload predictor data

model.upload_data(
    data[["hp", "wt", "cyl"]],
    preserve_correlation="strict",
    preserve_factor_level_names=True,
)

Key points:

  • upload_data() returns self – it can be chained with other method calls.

  • Pass only the predictor columns that appear in your formula.

  • MCPower auto-detects variable types based on unique value counts:

Unique Values

Detected Type

1

Dropped (constant)

2

Binary

3–6

Factor

7+

Continuous

  • String columns with 2–20 unique values are automatically detected as factors.

In this example, hp and wt have many unique values (continuous), while cyl has three unique values [4, 6, 8] (factor).

4. Set effect sizes

model.set_effects("hp=0.25, wt=0.40, cyl[6]=0.50, cyl[8]=0.80")

Uploading data provides distributions, not effect sizes. You must still specify how strongly each predictor relates to the outcome.

With preserve_factor_level_names=True (the default), factor dummies use the original data values as level names:

  • cyl[6] – comparing cylinders=6 to the reference (cylinders=4)

  • cyl[8] – comparing cylinders=8 to the reference (cylinders=4)

The reference level is the first sorted unique value (here, 4).

5. Run the analysis

model.find_power(sample_size=100, target_test="all")

target_test="all" reports power for every individual predictor plus the overall F-test.


Output Interpretation

================================================================================
MONTE CARLO POWER ANALYSIS RESULTS
================================================================================

Power Analysis Results (N=100):
Test                                     Power    Target   Status
-------------------------------------------------------------------
overall                                  100.0    80       ✓
hp                                       24.7     80       ✗
wt                                       64.8     80       ✗
cyl[6]                                   32.8     80       ✗
cyl[8]                                   34.0     80       ✗

Result: 1/5 tests achieved target power
  • Power – percentage of 1,600 simulations where the test reached significance.

  • Target – your target power level (default: 80%).

  • Status if power meets the target, if it falls short.

  • In this example, hp, wt, cyl[6], and cyl[8] all need a larger sample size to reach 80% power.


Common Variations

Alternatives Without pandas

Pass a dict of lists instead of a DataFrame:

from mcpower import MCPower

data = {
    "hp": [110, 93, 175, 105, 245],
    "wt": [2.62, 2.32, 3.21, 3.15, 3.44],
    "cyl": [6, 4, 8, 6, 8],
}

model = MCPower("mpg = hp + wt + cyl")
model.upload_data(data)
model.set_effects("hp=0.25, wt=0.40, cyl[6]=0.50, cyl[8]=0.80")
model.find_power(sample_size=100)

String Columns as Factors

String columns are automatically detected as factors when they have 2–20 unique values:

data = pd.read_csv("cars_with_origin.csv")

model = MCPower("mpg = origin + hp")
model.upload_data(data[["origin", "hp"]])
# origin has values ["Europe", "Japan", "USA"] → factor
# Reference: "Europe" (first alphabetically)
# Dummies: origin[Japan], origin[USA]
model.set_effects("origin[Japan]=0.20, origin[USA]=0.50, hp=0.25")
model.find_power(sample_size=120)

Correlation Preservation Modes

The preserve_correlation parameter controls how MCPower handles relationships between uploaded variables:

# "strict" (default) — bootstrap whole rows, preserving exact relationships
model.upload_data(data, preserve_correlation="strict")

# "partial" — compute correlations from data, allow manual overrides
model.upload_data(data, preserve_correlation="partial")
model.set_correlations("corr(hp, wt)=0.6")  # override one pair

# "no" — ignore correlations from data entirely
model.upload_data(data, preserve_correlation="no")
model.set_correlations("corr(hp, wt)=0.3")  # set all manually

Mode

Correlation Source

Best For

"strict" (default)

Bootstrapped rows

Most realistic simulation

"partial"

Data + manual overrides

Empirical baseline with adjustments

"no"

Manual only

Full manual control

Overriding Auto-Detection with data_types

If auto-detection classifies a variable incorrectly, override it:

# "rating" has 5 unique values → auto-detected as factor
# Override to treat as continuous
model.upload_data(
    data[["group", "score", "rating"]],
    data_types={"rating": "continuous"},
)

You can also select the reference level for a factor:

# Numeric reference level
model.upload_data(
    data[["hp", "wt", "cyl"]],
    data_types={"cyl": ("factor", 8)},  # cyl=8 becomes reference
)
# Dummies are now: cyl[4], cyl[6]

# String reference level
model.upload_data(
    data[["origin", "hp"]],
    data_types={"origin": ("factor", "USA")},  # USA becomes reference
)
# Dummies are now: origin[Europe], origin[Japan]

Named Factor Levels Without Data

If you do not have data but want meaningful level names instead of integer indices, use set_factor_levels():

from mcpower import MCPower

model = MCPower("outcome = group + age")
model.set_simulations(400)
model.set_variable_type("group=(factor,3)")
model.set_factor_levels("group=placebo,low_dose,high_dose")

# Now effects use named levels:
model.set_effects("group[low_dose]=0.50, group[high_dose]=0.80, age=0.25")
model.find_power(sample_size=150)

The first listed level (placebo) becomes the reference. This is purely a labeling feature – it does not change the statistical computation.

Mixing Uploaded and Synthetic Variables

Variables in the formula that are not in the uploaded data are generated synthetically:

model = MCPower("outcome = hp + wt + treatment")
model.upload_data(data[["hp", "wt"]])               # empirical distributions
model.set_variable_type("treatment=binary")           # synthetic variable
model.set_effects("hp=0.25, wt=0.40, treatment=0.50")
model.find_power(sample_size=100)

Note: In "strict" mode, cross-correlations between uploaded and non-uploaded variables are set to zero with a warning.

Integer-Indexed Dummies

Set preserve_factor_level_names=False to use integer-indexed dummies instead of data values:

model.upload_data(data[["hp", "wt", "cyl"]], preserve_factor_level_names=False)
# cyl dummies are now: cyl[2], cyl[3] instead of cyl[6], cyl[8]
model.set_effects("hp=0.25, wt=0.40, cyl[2]=0.50, cyl[3]=0.80")

The default (True) is recommended — it produces clearer, more readable output.


Next Steps