Data & Clustering

Methods for uploading empirical data, defining factor levels, and configuring mixed-effects clustering.


upload_data()

MCPower.upload_data(data, columns=None, preserve_correlation='strict', data_types=None, preserve_factor_level_names=True)

Upload empirical data to preserve distribution shapes (deferred until apply()).

The uploaded data’s distribution is used to generate simulated data. Variable types are auto-detected from unique value counts:
  • 1 unique value: dropped (constant)
  • 2 unique values: binary
  • 3-6 unique values: factor
  • 7+ unique values: continuous

Parameters:
  • data – Empirical data, given as one of:
      - dict of {var_name: array}: keys are variable names
      - numpy array of shape (n_samples, n_vars), or (n_samples,) for a single variable
      - list (1D or 2D)
      - pandas DataFrame: column names are used automatically (requires pandas)
    When columns is not provided for array/list input, variables are auto-named column_1, column_2, etc.

  • columns (List[str] | None) – Variable names for numpy/list columns (optional; auto-generated if omitted)

  • preserve_correlation (str) – How to handle correlations between uploaded variables:
      - "no": No correlation preservation (default generation)
      - "partial": Compute correlations from data, merge with user correlations
      - "strict": Bootstrap whole rows to preserve exact relationships (default)

  • data_types (Dict[str, str] | None) – Override auto-detection for specific variables. Valid types: "binary", "factor", "continuous".
      - Simple: {"hp": "continuous", "cyl": "factor"}
      - With reference level: {"cyl": ("factor", 8)} or {"origin": ("factor", "USA")}

  • preserve_factor_level_names (bool) – When True (default), factor dummy variables use original data values as level names (e.g., cyl[6], cyl[8]). When False, uses integer indices (e.g., cyl[2], cyl[3]).

Example

>>> from mcpower import MCPower
>>> model = MCPower("y = x1 + x2")
>>> # Using dict (no extra dependencies needed)
>>> import csv
>>> with open("my_data.csv") as f:
...     reader = csv.DictReader(f)
...     raw = list(reader)
>>> data = {col: [float(r[col]) for r in raw] for col in ["x1", "x2"]}
>>> model.upload_data(data)
>>> # Using numpy array with strict correlation preservation
>>> import numpy as np
>>> arr = np.random.default_rng(0).normal(size=(50, 2))  # at least 25 rows are required
>>> model.upload_data(arr, columns=["x1", "x2"], preserve_correlation="strict")
>>> # Using pandas DataFrame (requires: pip install pandas)
>>> import pandas as pd
>>> df = pd.read_csv("my_data.csv")
>>> model.upload_data(df[["x1", "x2"]])
>>> # Override auto-detection for the uploaded columns
>>> model.upload_data(data, data_types={"x1": "continuous", "x2": "factor"})

Accepted Formats

Format                  How Names Are Determined
----------------------  ----------------------------------------------------------
pandas DataFrame        Column names used automatically
dict {str: array-like}  Dictionary keys used as variable names
numpy ndarray (2D)      Use columns parameter, or auto-named column_1, column_2, …
numpy ndarray (1D)      Single variable; use columns for its name
list (1D or 2D)         Same as numpy array

Auto-Detection Rules

Variable types are inferred from the number of unique values in each column:

Unique Values               Detected Type  Notes
--------------------------  -------------  -----------------------------------
1                           Dropped        Constant column, warned and removed
2                           Binary         Mapped to 0/1
3–6                         Factor         Expanded into dummy variables
7+                          Continuous     Normalized to mean=0, sd=1
String column, 2–20 unique  Factor         String values become level names
String column, >20 unique   Error          Too many levels for a factor
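
The thresholds above can be sketched in plain Python. This is an illustration of the documented rules only, not MCPower's actual implementation: detect_type is a hypothetical helper, and treating a column as a string column whenever any value is a str is an assumption.

```python
def detect_type(values):
    """Classify one column using the documented unique-value thresholds."""
    unique = set(values)
    n_unique = len(unique)
    if n_unique <= 1:
        return "dropped"  # constant column
    # Assumption: any str value makes this a string column.
    if any(isinstance(v, str) for v in unique):
        if n_unique > 20:
            raise ValueError("too many levels for a factor")
        return "factor"  # 2-20 unique strings
    if n_unique == 2:
        return "binary"
    if n_unique <= 6:
        return "factor"
    return "continuous"


print(detect_type([0, 1, 1, 0]))      # binary
print(detect_type([4, 6, 8, 4, 6]))   # factor
print(detect_type(list(range(10))))   # continuous
print(detect_type(["USA", "Japan"]))  # factor
```

When a numeric column with few distinct values should be treated as continuous (or the reverse), the data_types parameter overrides this inference.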

Correlation Modes

"strict" (default): Bootstraps whole rows from the uploaded data to preserve the exact multivariate relationships. Uploaded variables bypass the normal Cholesky generation pipeline entirely.

"partial": Computes a correlation matrix from the uploaded continuous variables and merges it with any user-specified correlations (via set_correlations()). User-specified correlations take precedence.

"no": No correlation information is extracted. Binary and factor columns use detected proportions. Continuous columns use lookup-table-based generation.

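The idea behind "strict" mode, resampling whole rows so that every simulated observation keeps the joint values of a real one, can be sketched with the standard library. bootstrap_rows is a hypothetical helper and the rows are illustrative values; MCPower's internal sampler may differ.

```python
import random


def bootstrap_rows(rows, sample_size, seed=None):
    """Draw sample_size rows with replacement, keeping each row intact."""
    rng = random.Random(seed)
    return [rng.choice(rows) for _ in range(sample_size)]


# Illustrative (hp, wt, cyl) rows; each tuple is one observation.
rows = [(110, 2.62, 6), (93, 2.32, 4), (175, 3.44, 8), (105, 3.46, 6)]
sample = bootstrap_rows(rows, sample_size=10, seed=1)

# Every sampled row is an unmodified original row, so relationships
# between columns are preserved exactly.
print(all(row in rows for row in sample))  # True
```

Because rows are never split, no correlation matrix has to be estimated; the trade-off is that simulated values are limited to combinations that occur in the uploaded data.
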
Type Overrides

The data_types parameter accepts a dictionary to override auto-detection:

# Simple override
model.upload_data(df, data_types={"hp": "continuous", "cyl": "factor"})

# Override with reference level
model.upload_data(df, data_types={"cyl": ("factor", 8)})
# cyl has values [4, 6, 8] -> reference is 8, dummies: cyl[4], cyl[6]

# String reference level
model.upload_data(df, data_types={"origin": ("factor", "USA")})
# origin has values ["Europe", "Japan", "USA"] -> reference is "USA"
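
A minimal sketch of how an override entry and its reference level map to dummy names, following the comments above. split_override and dummy_names are hypothetical helpers written for illustration; the default of taking the first sorted value as the reference follows the Notes in this section.

```python
def split_override(override):
    """Return (type, reference) for either "factor" or ("factor", ref)."""
    if isinstance(override, tuple):
        var_type, reference = override
        return var_type, reference
    return override, None


def dummy_names(var, values, reference=None):
    """List dummy names for a factor, omitting the reference level."""
    levels = sorted(set(values))
    if reference is None:
        reference = levels[0]  # first value in sorted order
    if reference not in levels:
        raise ValueError(f"reference {reference!r} not found in {var}")
    return [f"{var}[{level}]" for level in levels if level != reference]


var_type, ref = split_override(("factor", 8))
print(dummy_names("cyl", [4, 6, 8, 4], reference=ref))  # ['cyl[4]', 'cyl[6]']
print(dummy_names("cyl", [4, 6, 8, 4]))                 # ['cyl[6]', 'cyl[8]']
print(dummy_names("origin", ["USA", "Japan", "Europe"], reference="USA"))
# ['origin[Europe]', 'origin[Japan]']
```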

Examples

import pandas as pd
from mcpower import MCPower

df = pd.read_csv("study_data.csv")

model = MCPower("mpg = hp + wt + cyl")
model.upload_data(df[["hp", "wt", "cyl"]])
# Auto-detected: hp=continuous, wt=continuous, cyl=factor (3 unique values)
# cyl dummies use original values: cyl[6], cyl[8] (reference: cyl[4])

model.set_effects("hp=0.3, wt=0.4, cyl[6]=0.2, cyl[8]=0.5")
model.find_power(sample_size=100)

Notes

  • Minimum data size: The uploaded data must have at least 25 observations. A warning is issued for fewer than 30.

  • Large sample warning: If the requested sample_size exceeds 3x the uploaded data count, a warning is printed about potential extrapolation.

  • Column matching: Uploaded columns are matched to formula predictors by name. Columns not in the formula are ignored. Formula predictors not in the data use standard generation.

  • String columns: Columns containing string values are auto-detected as factors (if 2–20 unique values). The first value in sorted order becomes the reference level by default.
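
The size thresholds in the notes above can be checked before uploading. The helper below is a hypothetical pre-flight check, not part of MCPower; the message wording is made up, and only the numeric thresholds (25, 30, and 3x) come from the notes.

```python
def check_upload_size(n_rows, sample_size):
    """Return warning messages; raise if the data is too small to upload."""
    if n_rows < 25:
        raise ValueError("uploaded data needs at least 25 observations")
    messages = []
    if n_rows < 30:
        messages.append("fewer than 30 observations")
    if sample_size > 3 * n_rows:
        messages.append("sample_size exceeds 3x the uploaded row count")
    return messages


print(check_upload_size(50, 100))  # []
print(check_upload_size(27, 100))
# ['fewer than 30 observations', 'sample_size exceeds 3x the uploaded row count']
```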



set_factor_levels()

MCPower.set_factor_levels(spec)

Define named factor levels without uploaded data.

The first listed level becomes the reference level.

Parameters:

spec (str) – Factor definitions. Format: "var=level1,level2,level3". Multiple factors separated by ;: "group=control,drug_a; dose=low,medium,high"

Returns:

For method chaining.

Return type:

self

Raises:
  • TypeError – If spec is not a string.

  • ValueError – If the variable is not in the formula, or if fewer than 2 levels are given (checked at apply time).

String Format

Single factor – the first listed level is the reference (omitted from dummies):

model.set_factor_levels("group=control,drug_a,drug_b")
# Creates dummies: group[drug_a], group[drug_b]
# Reference level: control

Multiple factors – separate with ;:

model.set_factor_levels("group=control,drug_a,drug_b; dose=low,medium,high")
# group dummies: group[drug_a], group[drug_b]   (reference: control)
# dose dummies:  dose[medium], dose[high]        (reference: low)
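
To make the string format concrete, here is a small parser sketch that turns a spec into level lists and dummy names. parse_factor_spec and dummies_for are hypothetical helpers, not MCPower functions, and MCPower's own validation may be stricter or run at a different time.

```python
def parse_factor_spec(spec):
    """Parse "var=a,b,c; other=x,y" into {var: [levels...]}."""
    if not isinstance(spec, str):
        raise TypeError("spec must be a string")
    factors = {}
    for part in spec.split(";"):
        part = part.strip()
        if not part:
            continue
        name, _, level_text = part.partition("=")
        levels = [lvl.strip() for lvl in level_text.split(",") if lvl.strip()]
        if len(levels) < 2:
            raise ValueError(f"{name.strip()!r} needs at least 2 levels")
        factors[name.strip()] = levels
    return factors


def dummies_for(factors):
    """The first listed level is the reference, so it gets no dummy."""
    return {
        name: [f"{name}[{lvl}]" for lvl in levels[1:]]
        for name, levels in factors.items()
    }


parsed = parse_factor_spec("group=control,drug_a,drug_b; dose=low,medium,high")
print(dummies_for(parsed))
# {'group': ['group[drug_a]', 'group[drug_b]'], 'dose': ['dose[medium]', 'dose[high]']}
```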

Examples

from mcpower import MCPower

model = MCPower("y = group + age")
model.set_simulations(400)
model.set_variable_type("group=(factor,3)")
model.set_factor_levels("group=placebo,low_dose,high_dose")
model.set_effects("group[low_dose]=0.3, group[high_dose]=0.6, age=0.2")
model.find_power(sample_size=120)

Notes

  • The variable must already exist in the formula.

  • The variable should be declared as a factor (via set_variable_type()) before or after calling set_factor_levels(). MCPower applies settings in the correct order regardless of call sequence.

  • If you upload data with upload_data() and preserve_factor_level_names=True (the default), level names are extracted automatically. In that case, set_factor_levels() is typically unnecessary.

  • Level names cannot contain commas, semicolons, or equals signs.



set_cluster()

MCPower.set_cluster(grouping_var, ICC=None, n_clusters=None, cluster_size=None, random_slopes=None, slope_variance=0.0, slope_intercept_corr=0.0, n_per_parent=None)

Configure a cluster/grouping variable for random effects.

Sets up the clustering structure for a linear mixed-effects model. The grouping variable must correspond to a random-effect term in the formula. Specify either n_clusters or cluster_size — the other is derived from the sample size at analysis time.

This setting is deferred until apply() is called.

Parameters:
  • grouping_var (str) – Name of the grouping variable (must match a random-effect term in the formula).

  • ICC (float | None) – Intraclass correlation coefficient (0 <= ICC < 1; see Notes for the narrower range accepted in practice). Determines the proportion of total variance attributable to between-cluster differences. Required for non-nested terms; for nested child terms, specifies the child-level ICC.

  • n_clusters (int | None) – Number of clusters. Mutually exclusive with cluster_size. Not required for nested child terms (derived from parent).

  • cluster_size (int | None) – Number of observations per cluster. Mutually exclusive with n_clusters.

  • random_slopes (List[str] | None) – List of predictor names with random slopes. Requires a (1 + x|group) term in the formula.

  • slope_variance (float) – Between-cluster variance of the random slope. Only meaningful when random_slopes is set.

  • slope_intercept_corr (float) – Correlation between random intercept and random slope. Must be in [-1, 1].

  • n_per_parent (int | None) – Number of sub-groups per parent group (required for nested effects when the formula has (1|A/B)).

Returns:

For method chaining.

Return type:

self

Raises:

ValueError – If grouping_var is not in the formula, both or neither of n_clusters/cluster_size are given, ICC is out of range, or slope parameters are invalid.

Example

>>> # Random intercept only (backward compatible)
>>> model = MCPower("y ~ x1 + x2 + (1|school)")
>>> model.set_cluster("school", ICC=0.2, n_clusters=20)
>>> # Random slopes with correlation
>>> model = MCPower("y ~ x1 + (1 + x1|school)")
>>> model.set_cluster("school", ICC=0.2, n_clusters=20,
...     random_slopes=["x1"], slope_variance=0.1,
...     slope_intercept_corr=0.3)
>>> # Nested: formula has (1|school/classroom)
>>> model = MCPower("y ~ treatment + (1|school/classroom)")
>>> model.set_cluster("school", ICC=0.15, n_clusters=10)
>>> model.set_cluster("classroom", ICC=0.10, n_per_parent=3)

Usage Patterns

Random intercept only – the simplest mixed-effects structure:

model = MCPower("y ~ treatment + (1|school)")
model.set_cluster("school", ICC=0.2, n_clusters=20)

Or specify cluster size instead of count:

model.set_cluster("school", ICC=0.2, cluster_size=25)
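
Whichever of the two is given, the other follows from the sample size. The sketch below shows the arithmetic with a hypothetical helper; using integer division is an assumption, since how MCPower treats sample sizes that do not divide evenly is not stated here.

```python
def derive_layout(sample_size, n_clusters=None, cluster_size=None):
    """Return (n_clusters, cluster_size) given exactly one of the two."""
    if (n_clusters is None) == (cluster_size is None):
        raise ValueError("specify exactly one of n_clusters or cluster_size")
    if n_clusters is not None:
        cluster_size = sample_size // n_clusters  # assumption: floor division
    else:
        n_clusters = sample_size // cluster_size  # assumption: floor division
    return n_clusters, cluster_size


print(derive_layout(600, n_clusters=20))    # (20, 30)
print(derive_layout(600, cluster_size=25))  # (24, 25)
```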

Random slopes – each cluster has its own intercept and slope:

model = MCPower("y ~ x1 + (1 + x1|school)")
model.set_cluster("school", ICC=0.2, n_clusters=20,
                   random_slopes=["x1"],
                   slope_variance=0.1,
                   slope_intercept_corr=0.3)
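
As a rough sketch of what these three numbers describe, the random intercept and slope can be summarized by a 2x2 covariance matrix. The helper is hypothetical and rests on two assumptions that this page does not confirm for MCPower: the residual variance is 1, and the ICC is defined from the intercept variance alone, so that intercept variance = ICC / (1 - ICC).

```python
import math


def random_effects_cov(icc, slope_variance, slope_intercept_corr,
                       residual_variance=1.0):
    """Covariance matrix of (random intercept, random slope)."""
    if not 0 <= icc < 1:
        raise ValueError("ICC must satisfy 0 <= ICC < 1")
    if not -1 <= slope_intercept_corr <= 1:
        raise ValueError("correlation must be in [-1, 1]")
    # ICC = tau2 / (tau2 + sigma2)  =>  tau2 = ICC / (1 - ICC) * sigma2
    intercept_variance = icc / (1 - icc) * residual_variance
    covariance = slope_intercept_corr * math.sqrt(
        intercept_variance * slope_variance
    )
    return [[intercept_variance, covariance],
            [covariance, slope_variance]]


cov = random_effects_cov(icc=0.2, slope_variance=0.1, slope_intercept_corr=0.3)
print([[round(v, 4) for v in row] for row in cov])
# [[0.25, 0.0474], [0.0474, 0.1]]
```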

Nested random effects – call set_cluster() twice, parent first:

model = MCPower("y ~ treatment + (1|school/classroom)")

# Parent level
model.set_cluster("school", ICC=0.15, n_clusters=10)

# Child level (3 classrooms per school = 30 total classrooms)
model.set_cluster("classroom", ICC=0.10, n_per_parent=3)

Examples

from mcpower import MCPower

model = MCPower("satisfaction ~ treatment + motivation + (1|school)")
model.set_simulations(400)
model.set_cluster("school", ICC=0.2, n_clusters=20)
model.set_effects("treatment=0.5, motivation=0.3")
model.find_power(sample_size=600)  # 600 / 20 = 30 obs per cluster

Notes

  • ICC range: 0 (no clustering) or 0.1–0.9 for numerical stability. Values outside 0.1–0.9 (except 0) are rejected because extreme ICCs cause convergence issues in mixed models.

  • Minimum cluster size: At least 5 observations per cluster (enforced). A warning is issued if cluster size falls below 10.

  • Design effect: Clustering divides the effective sample size by the design effect, 1 + (cluster_size - 1) * ICC. Higher ICC or larger clusters require more total observations for the same power.

  • Convergence failures: Complex cluster structures may cause some simulations to fail. Use model.set_max_failed_simulations(0.10) to allow up to 10% failures (default is 3%).
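
The design effect formula in the notes above is easy to evaluate by hand when planning a study. The two functions below are illustrative helpers, not MCPower API; they apply the formula to a design with 600 observations in 20 clusters of 30.

```python
def design_effect(cluster_size, icc):
    """Design effect: 1 + (cluster_size - 1) * ICC."""
    return 1 + (cluster_size - 1) * icc


def effective_sample_size(sample_size, cluster_size, icc):
    """Total observations divided by the design effect."""
    return sample_size / design_effect(cluster_size, icc)


print(round(design_effect(30, 0.2), 2))               # 6.8
print(round(effective_sample_size(600, 30, 0.2), 1))  # 88.2
```

With ICC=0.2 and 30 observations per cluster, 600 clustered observations carry about as much information as 88 independent ones, which is why adding clusters usually helps power more than enlarging existing clusters.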
