Data & Clustering
Methods for uploading empirical data, defining factor levels, and configuring mixed-effects clustering.
upload_data()
- MCPower.upload_data(data, columns=None, preserve_correlation='strict', data_types=None, preserve_factor_level_names=True)
Upload empirical data to preserve distribution shapes (deferred until apply()).
The distribution of the uploaded data is used to generate simulated data. Variable types are auto-detected from unique value counts:
  - 1 unique value: dropped (constant)
  - 2 unique values: binary
  - 3-6 unique values: factor
  - 7+ unique values: continuous
- Parameters:
data – Empirical data, given as one of:
  - dict of {var_name: array}: keys are variable names
  - numpy array of shape (n_samples, n_vars), or (n_samples,) for a single variable
  - list (1D or 2D)
  - pandas DataFrame: column names are used automatically (requires pandas)
  When columns is not provided for array/list input, variables are auto-named column_1, column_2, etc.
columns (List[str] | None) – Variable names for numpy/list columns (optional; auto-generated if omitted)
preserve_correlation (str) – How to handle correlations between uploaded variables:
  - 'no': No correlation preservation (default generation)
  - 'partial': Compute correlations from data, merge with user correlations
  - 'strict': Bootstrap whole rows to preserve exact relationships (default)
data_types (Dict[str, str] | None) – Override auto-detection for specific variables.
  - Simple: {"hp": "continuous", "cyl": "factor"}
  - With reference level: {"cyl": ("factor", 8)} or {"origin": ("factor", "USA")}
  Valid types: "binary", "factor", "continuous"
preserve_factor_level_names (bool) – When True (default), factor dummy variables use original data values as level names (e.g., cyl[6], cyl[8]). When False, uses integer indices (e.g., cyl[2], cyl[3]).
Example
>>> # Using dict (no extra dependencies needed)
>>> import csv
>>> with open("my_data.csv") as f:
...     reader = csv.DictReader(f)
...     raw = list(reader)
>>> data = {col: [float(r[col]) for r in raw] for col in ["x1", "x2"]}
>>> model.upload_data(data)
>>> # Using numpy array with strict correlation preservation
>>> import numpy as np
>>> model.upload_data(
...     np.array([[1, 2], [3, 4], [5, 6]]),
...     columns=["x1", "x2"],
...     preserve_correlation="strict"
... )
>>> # Using pandas DataFrame (requires: pip install pandas)
>>> import pandas as pd
>>> df = pd.read_csv("my_data.csv")
>>> model.upload_data(df[["x1", "x2"]])
>>> # Override auto-detection
>>> model.upload_data(data, data_types={"cyl": "factor", "hp": "continuous"})
Accepted Formats
| Format | How Names Are Determined |
|---|---|
| pandas DataFrame | Column names used automatically |
| dict | Dictionary keys used as variable names |
| numpy ndarray (2D) | Use the columns parameter |
| numpy ndarray (1D) | Single variable; use the columns parameter |
| list (1D or 2D) | Same as numpy array |
Auto-Detection Rules
Variable types are inferred from the number of unique values in each column:
| Unique Values | Detected Type | Notes |
|---|---|---|
| 1 | Dropped | Constant column, warned and removed |
| 2 | Binary | Mapped to 0/1 |
| 3–6 | Factor | Expanded into dummy variables |
| 7+ | Continuous | Normalized to mean=0, sd=1 |
| String column, 2–20 unique | Factor | String values become level names |
| String column, >20 unique | Error | Too many levels for a factor |
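The thresholds in the table can be expressed as a small classifier. This is a minimal sketch rather than MCPower's implementation: `detect_type` is a hypothetical helper name, and it assumes a column counts as a string column if any of its values is a `str`.

```python
from typing import Sequence


def detect_type(values: Sequence) -> str:
    """Classify one column using the documented unique-value thresholds.

    Hypothetical helper for illustration; not part of the MCPower API.
    """
    unique = set(values)
    if not unique:
        raise ValueError("column is empty")
    if len(unique) == 1:
        return "dropped"  # constant column
    if any(isinstance(v, str) for v in unique):
        # Assumption: any string value makes this a string column
        if len(unique) > 20:
            raise ValueError("too many levels for a factor")
        return "factor"
    if len(unique) == 2:
        return "binary"
    if len(unique) <= 6:
        return "factor"
    return "continuous"


print(detect_type([4, 6, 8, 4, 6, 8]))           # three unique values
print(detect_type([0, 1, 1, 0]))                 # two unique values
print(detect_type([x / 10 for x in range(30)]))  # thirty unique values
```

When the detected type is not the one you want, for example a 7-level rating scale that should be treated as a factor, the data_types parameter overrides the result.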
Correlation Modes
"strict" (default): Bootstraps whole rows from the uploaded data to preserve the exact multivariate relationships. Uploaded variables bypass the normal Cholesky generation pipeline entirely.
"partial": Computes a correlation matrix from the uploaded continuous variables and merges it with any user-specified correlations (via set_correlations()). User-specified correlations take precedence.
"no": No correlation information is extracted. Binary and factor columns use detected proportions. Continuous columns use lookup-table-based generation.
Type Overrides
The data_types parameter accepts a dictionary to override auto-detection:
# Simple override
model.upload_data(df, data_types={"hp": "continuous", "cyl": "factor"})
# Override with reference level
model.upload_data(df, data_types={"cyl": ("factor", 8)})
# cyl has values [4, 6, 8] -> reference is 8, dummies: cyl[4], cyl[6]
# String reference level
model.upload_data(df, data_types={"origin": ("factor", "USA")})
# origin has values ["Europe", "Japan", "USA"] -> reference is "USA"
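The effect of the reference level on dummy names can be sketched as follows. `dummy_names` is a hypothetical helper, not an MCPower function; it assumes levels are ordered by sorting the observed values and that the first sorted level is the default reference.

```python
def dummy_names(var, values, reference=None):
    """List dummy-variable names for a factor, skipping the reference level.

    Hypothetical helper; assumes levels are ordered by sorting the values.
    """
    levels = sorted(set(values))
    if reference is None:
        reference = levels[0]  # assumed default: first level in sorted order
    if reference not in levels:
        raise ValueError(f"reference level {reference!r} not found in {var}")
    return [f"{var}[{level}]" for level in levels if level != reference]


print(dummy_names("cyl", [4, 6, 8, 4, 8], reference=8))
print(dummy_names("origin", ["Europe", "Japan", "USA"], reference="USA"))
```

The names returned here are the ones that would then be used in set_effects(), so changing the reference level also changes which effects need to be specified.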
Examples
import pandas as pd
from mcpower import MCPower
df = pd.read_csv("study_data.csv")
model = MCPower("mpg = hp + wt + cyl")
model.upload_data(df[["hp", "wt", "cyl"]])
# Auto-detected: hp=continuous, wt=continuous, cyl=factor (3 unique values)
# cyl dummies use original values: cyl[6], cyl[8] (reference: cyl[4])
model.set_effects("hp=0.3, wt=0.4, cyl[6]=0.2, cyl[8]=0.5")
model.find_power(sample_size=100)
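Because this example relies on the default preserve_correlation="strict", each simulated observation is a whole row resampled from the uploaded data. The idea can be sketched with the standard library alone. `bootstrap_rows` and the sample values are hypothetical, and MCPower's own resampling may differ in detail.

```python
import random


def bootstrap_rows(data, sample_size, seed=0):
    """Resample whole rows with replacement (illustrative, not MCPower's code)."""
    n_rows = len(next(iter(data.values())))
    rng = random.Random(seed)
    # One shared index per simulated observation keeps each row intact,
    # so relationships between columns are carried over unchanged.
    idx = [rng.randrange(n_rows) for _ in range(sample_size)]
    return {name: [col[i] for i in idx] for name, col in data.items()}


# Hypothetical uploaded data: four rows, two variables
uploaded = {"hp": [110, 93, 175, 105], "wt": [2.62, 2.32, 3.44, 3.46]}
simulated = bootstrap_rows(uploaded, sample_size=10)
print(list(zip(simulated["hp"], simulated["wt"])))
```

Every (hp, wt) pair in the output already exists in the uploaded data, which is why strict mode preserves the multivariate structure but cannot produce combinations that were never observed.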
Notes
Minimum data size: The uploaded data must have at least 25 observations. A warning is issued for fewer than 30.
Large sample warning: If the requested sample_size exceeds 3x the uploaded data count, a warning is printed about potential extrapolation.
Column matching: Uploaded columns are matched to formula predictors by name. Columns not in the formula are ignored. Formula predictors not in the data use standard generation.
String columns: Columns containing string values are auto-detected as factors (if 2–20 unique values). The first value in sorted order becomes the reference level by default.
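The size rules in the first two notes amount to a few comparisons. The sketch below is a hypothetical helper that returns messages instead of emitting real warnings; the function name and message wording are assumptions, and only the thresholds come from the notes.

```python
def check_upload_size(n_uploaded, sample_size=None):
    """Return warning messages for an upload; raise if it is too small.

    Hypothetical helper; thresholds are taken from the notes above.
    """
    if n_uploaded < 25:
        raise ValueError("uploaded data needs at least 25 observations")
    messages = []
    if n_uploaded < 30:
        messages.append("fewer than 30 observations uploaded")
    if sample_size is not None and sample_size > 3 * n_uploaded:
        messages.append("sample_size exceeds 3x the uploaded data count")
    return messages


print(check_upload_size(32, sample_size=100))  # 100 > 3 * 32, so one message
print(check_upload_size(50, sample_size=150))  # exactly 3x, so no message
```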
See Also
Tutorial: Using Your Own Data – Step-by-step walkthrough with real data
set_variable_type() – Manual type specification (alternative to auto-detection)
set_factor_levels() – Define named levels without data
set_correlations() – Manual correlation specification
set_factor_levels()
- MCPower.set_factor_levels(spec)
Define named factor levels without uploaded data.
The first listed level becomes the reference level.
- Parameters:
spec (str) – Factor definitions. Format: "var=level1,level2,level3". Separate multiple factors with a semicolon: "group=control,drug_a; dose=low,medium,high"
- Returns:
For method chaining.
- Return type:
self
- Raises:
TypeError – If spec is not a string.
ValueError – If the variable is not in the formula, or if fewer than 2 levels are given (checked at apply time).
String Format
Single factor – the first listed level is the reference (omitted from dummies):
model.set_factor_levels("group=control,drug_a,drug_b")
# Creates dummies: group[drug_a], group[drug_b]
# Reference level: control
Multiple factors – separate with ;:
model.set_factor_levels("group=control,drug_a,drug_b; dose=low,medium,high")
# group dummies: group[drug_a], group[drug_b] (reference: control)
# dose dummies: dose[medium], dose[high] (reference: low)
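The string format above is simple enough to parse by hand, which can help when checking a spec before passing it to the model. The following is a minimal sketch under the documented rules (semicolon between factors, comma between levels, first level is the reference). `parse_factor_levels` is a hypothetical name, not the parser MCPower uses.

```python
def parse_factor_levels(spec):
    """Parse "var=a,b,c; other=x,y" into reference levels and dummy names.

    Hypothetical helper for illustration; not part of the MCPower API.
    """
    if not isinstance(spec, str):
        raise TypeError("spec must be a string")
    parsed = {}
    for part in spec.split(";"):
        part = part.strip()
        if not part:
            continue
        name, sep, levels_str = part.partition("=")
        if not sep:
            raise ValueError(f"missing '=' in {part!r}")
        name = name.strip()
        levels = [lvl.strip() for lvl in levels_str.split(",") if lvl.strip()]
        if len(levels) < 2:
            raise ValueError(f"{name} needs at least 2 levels")
        parsed[name] = {
            "reference": levels[0],  # first listed level is the reference
            "dummies": [f"{name}[{lvl}]" for lvl in levels[1:]],
        }
    return parsed


result = parse_factor_levels("group=control,drug_a,drug_b; dose=low,medium,high")
print(result["group"])
print(result["dose"])
```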
Examples
from mcpower import MCPower
model = MCPower("y = group + age")
model.set_simulations(400)
model.set_variable_type("group=(factor,3)")
model.set_factor_levels("group=placebo,low_dose,high_dose")
model.set_effects("group[low_dose]=0.3, group[high_dose]=0.6, age=0.2")
model.find_power(sample_size=120)
Notes
The variable must already exist in the formula.
The variable should also be declared as a factor (via set_variable_type()). This can happen before or after calling set_factor_levels(), because MCPower applies settings in the correct order regardless of call sequence.
If you upload data with upload_data() and preserve_factor_level_names=True (the default), level names are extracted automatically. In that case, set_factor_levels() is typically unnecessary.
Level names cannot contain commas, semicolons, or equals signs.
See Also
set_variable_type() – Declare factor variables
set_effects() – Set effects using named level bracket notation
upload_data() – Automatic named levels from empirical data
ANOVA & Post-Hoc Tests – Factor comparisons and post-hoc tests
Variable Types – Factor variable concepts
set_cluster()
- MCPower.set_cluster(grouping_var, ICC=None, n_clusters=None, cluster_size=None, random_slopes=None, slope_variance=0.0, slope_intercept_corr=0.0, n_per_parent=None)
Configure a cluster/grouping variable for random effects.
Sets up the clustering structure for a linear mixed-effects model. The grouping variable must correspond to a random-effect term in the formula. Specify either n_clusters or cluster_size — the other is derived from the sample size at analysis time.
This setting is deferred until apply() is called.
- Parameters:
grouping_var (str) – Name of the grouping variable (must match a random-effect term in the formula).
ICC (float | None) – Intraclass correlation coefficient (0 <= ICC < 1). Determines the proportion of total variance attributable to between-cluster differences. Required for non-nested terms; for nested child terms, specifies the child-level ICC.
n_clusters (int | None) – Number of clusters. Mutually exclusive with cluster_size. Not required for nested child terms (derived from parent).
cluster_size (int | None) – Number of observations per cluster. Mutually exclusive with n_clusters.
random_slopes (List[str] | None) – List of predictor names with random slopes. Requires a (1 + x|group) term in the formula.
slope_variance (float) – Between-cluster variance of the random slope. Only meaningful when random_slopes is set.
slope_intercept_corr (float) – Correlation between random intercept and random slope. Must be in [-1, 1].
n_per_parent (int | None) – Number of sub-groups per parent group (required for nested effects when the formula has (1|A/B)).
- Returns:
For method chaining.
- Return type:
self
- Raises:
ValueError – If grouping_var is not in the formula, both or neither of n_clusters/cluster_size are given, ICC is out of range, or slope parameters are invalid.
Example
>>> # Random intercept only (backward compatible)
>>> model = MCPower("y ~ x1 + x2 + (1|school)")
>>> model.set_cluster("school", ICC=0.2, n_clusters=20)
>>> # Random slopes with correlation
>>> model = MCPower("y ~ x1 + (1 + x1|school)")
>>> model.set_cluster("school", ICC=0.2, n_clusters=20,
...                   random_slopes=["x1"], slope_variance=0.1,
...                   slope_intercept_corr=0.3)
>>> # Nested: formula has (1|school/classroom)
>>> model = MCPower("y ~ treatment + (1|school/classroom)")
>>> model.set_cluster("school", ICC=0.15, n_clusters=10)
>>> model.set_cluster("classroom", ICC=0.10, n_per_parent=3)
Usage Patterns
Random intercept only – the simplest mixed-effects structure:
model = MCPower("y ~ treatment + (1|school)")
model.set_cluster("school", ICC=0.2, n_clusters=20)
Or specify cluster size instead of count:
model.set_cluster("school", ICC=0.2, cluster_size=25)
Random slopes – each cluster has its own intercept and slope:
model = MCPower("y ~ x1 + (1 + x1|school)")
model.set_cluster("school", ICC=0.2, n_clusters=20,
random_slopes=["x1"],
slope_variance=0.1,
slope_intercept_corr=0.3)
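In a standard random-slopes model, the slope variance and the intercept-slope correlation combine with the random-intercept variance into a 2x2 covariance matrix for the cluster-level effects. The sketch below shows that arithmetic only. `random_effects_cov` is a hypothetical helper, and the intercept variance of 0.25 is an illustrative value, not one taken from MCPower.

```python
import math


def random_effects_cov(intercept_var, slope_var, corr):
    """Build the 2x2 covariance matrix of (random intercept, random slope).

    Hypothetical helper; shows the textbook relation
    cov = corr * sd_intercept * sd_slope.
    """
    if intercept_var < 0 or slope_var < 0:
        raise ValueError("variances must be non-negative")
    if not -1 <= corr <= 1:
        raise ValueError("corr must be in [-1, 1]")
    cov = corr * math.sqrt(intercept_var * slope_var)
    return [[intercept_var, cov], [cov, slope_var]]


# Illustrative values: intercept variance 0.25 is an assumption
matrix = random_effects_cov(intercept_var=0.25, slope_var=0.1, corr=0.3)
for row in matrix:
    print([round(v, 4) for v in row])
```

With slope_variance left at its default of 0.0, the off-diagonal term is zero regardless of the correlation, which matches the note that the slope parameters only matter when random_slopes is set.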
Nested random effects – call set_cluster() twice, parent first:
model = MCPower("y ~ treatment + (1|school/classroom)")
# Parent level
model.set_cluster("school", ICC=0.15, n_clusters=10)
# Child level (3 classrooms per school = 30 total classrooms)
model.set_cluster("classroom", ICC=0.10, n_per_parent=3)
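The relationship between sample size, n_clusters, cluster_size, and n_per_parent in these patterns is plain arithmetic. A minimal sketch, assuming the missing quantity is obtained by integer division (how MCPower handles sample sizes that do not divide evenly is not stated here); both function names are hypothetical.

```python
def cluster_layout(sample_size, n_clusters=None, cluster_size=None):
    """Derive the missing one of n_clusters / cluster_size.

    Hypothetical helper; assumes integer division when the sample size
    does not divide evenly.
    """
    if (n_clusters is None) == (cluster_size is None):
        raise ValueError("specify exactly one of n_clusters or cluster_size")
    if n_clusters is None:
        n_clusters = sample_size // cluster_size
    else:
        cluster_size = sample_size // n_clusters
    return n_clusters, cluster_size


def nested_child_count(n_parent_clusters, n_per_parent):
    """Total number of child groups in a nested design."""
    return n_parent_clusters * n_per_parent


print(cluster_layout(600, n_clusters=20))    # 20 clusters of 30 observations
print(cluster_layout(600, cluster_size=25))  # 24 clusters of 25 observations
print(nested_child_count(10, 3))             # 30 classrooms in total
```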
Examples
from mcpower import MCPower
model = MCPower("satisfaction ~ treatment + motivation + (1|school)")
model.set_simulations(400)
model.set_cluster("school", ICC=0.2, n_clusters=20)
model.set_effects("treatment=0.5, motivation=0.3")
model.find_power(sample_size=600) # 600 / 20 = 30 obs per cluster
Notes
ICC range: 0 (no clustering) or 0.1–0.9 for numerical stability. Values outside 0.1–0.9 (except 0) are rejected because extreme ICCs cause convergence issues in mixed models.
Minimum cluster size: At least 5 observations per cluster (enforced). A warning is issued if cluster size falls below 10.
Design effect: Clustering reduces effective sample size by a factor of 1 + (cluster_size - 1) * ICC. Higher ICC or larger clusters require more total observations for the same power.
Convergence failures: Complex cluster structures may cause some simulations to fail. Use model.set_max_failed_simulations(0.10) to allow up to 10% failures (default is 3%).
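To get a feel for the design effect before running a simulation, the formula from the notes can be evaluated directly. This is a rough planning aid, not part of MCPower; the function names are hypothetical, and the numbers reuse the example above (600 observations, 20 clusters of 30, ICC = 0.2).

```python
def design_effect(cluster_size, icc):
    """Design effect for equal-sized clusters: 1 + (m - 1) * ICC."""
    return 1 + (cluster_size - 1) * icc


def effective_sample_size(sample_size, cluster_size, icc):
    """Approximate number of independent observations after clustering."""
    return sample_size / design_effect(cluster_size, icc)


deff = design_effect(cluster_size=30, icc=0.2)
n_eff = effective_sample_size(600, cluster_size=30, icc=0.2)
print(f"design effect: {deff:.2f}, effective n: {n_eff:.1f}")
# prints: design effect: 6.80, effective n: 88.2
```

In this configuration, 600 clustered observations carry roughly as much information as 88 independent ones, which is why adding clusters usually helps power more than enlarging existing clusters.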
See Also
Mixed-Effects Models – Full conceptual guide to clustering, ICC, and design effects
Tutorial: Mixed-Effects Models – Step-by-step mixed model analysis
MCPower() Constructor – Random-effect formula syntax
find_power() – Running the analysis after configuring clusters