---
jupytext:
  text_representation:
    format_name: myst
kernelspec:
  name: python3
---

# Tutorial: Random Intercepts for Clustered Data

## Goal

You have clustered data (e.g., students in schools) and need to account for the clustering in your power analysis.

```{code-cell} ipython3
:tags: [remove-input, remove-output]
import numpy as np
np.random.seed(42)
import warnings
warnings.filterwarnings("ignore", message="Low simulation")
```

---

## Table of Contents

- [Full Working Example](#full-working-example)
- [Step-by-Step Walkthrough](#step-by-step-walkthrough)
- [Statistical Model](#statistical-model)
- [Design Recommendations](#design-recommendations)
- [Common Variations](#common-variations)
- [Next Steps](#next-steps)

---

## Full Working Example

```{code-cell} ipython3
:tags: [remove-stderr]
from mcpower import MCPower

# 1. Define the model with a random intercept for school
model = MCPower("satisfaction ~ treatment + motivation + (1|school)")
model.set_simulations(400)

# 2. Configure the clustering structure
model.set_cluster("school", ICC=0.2, n_clusters=20)

# 3. Set the expected effect sizes (standardized coefficients)
model.set_effects("treatment=0.5, motivation=0.3")

# 4. Allow up to 10% convergence failures
model.set_max_failed_simulations(0.10)

# 5. Run the power analysis
model.find_power(sample_size=1000)
```

Copy this script into a Python file and run it. With 1000 observations split across 20 schools (50 students per school), you should see power well above 80% for both effects.

---

## Step-by-Step Walkthrough

### Step 1: Define the model formula

```{code-cell} ipython3
:tags: [remove-output]
model = MCPower("satisfaction ~ treatment + motivation + (1|school)")
model.set_simulations(400)
```

The formula uses R-style syntax. The key part is `(1|school)`, which tells MCPower that observations are clustered within schools and each school has its own random intercept. The `1` means "intercept" -- every school gets a baseline shift up or down from the grand mean.

Both `~` and `=` are valid as the formula separator.

### Step 2: Configure the clustering

```{code-cell} ipython3
:tags: [remove-output]
model.set_cluster("school", ICC=0.2, n_clusters=20)
```

- **`"school"`** -- matches the grouping variable name in the formula.
- **`ICC=0.2`** -- the intraclass correlation coefficient. An ICC of 0.2 means that 20% of the total variance in satisfaction is due to between-school differences. This is a typical value for educational research.
- **`n_clusters=20`** -- the number of schools in the study. With `sample_size=1000` and 20 clusters, each school will have 50 students.

You must specify either `n_clusters` or `cluster_size`, but not both:

```{code-cell} ipython3
:tags: [remove-output]
# These two are equivalent when sample_size=1000:
model.set_cluster("school", ICC=0.2, n_clusters=20)    # 1000/20 = 50 per school
model.set_cluster("school", ICC=0.2, cluster_size=50)   # 1000/50 = 20 schools
```

### Step 3: Set effect sizes

```{code-cell} ipython3
:tags: [remove-output]
model.set_effects("treatment=0.5, motivation=0.3")
```

These are standardized regression coefficients (beta weights):
- **`treatment=0.5`** -- a medium-large effect. Receiving treatment increases satisfaction by 0.5 standard deviations.
- **`motivation=0.3`** -- a small-to-medium effect. Each 1 SD increase in motivation predicts a 0.3 SD increase in satisfaction.

### Step 4: Set the convergence failure threshold

```{code-cell} ipython3
:tags: [remove-output]
model.set_max_failed_simulations(0.10)
```

Mixed-effects models occasionally fail to converge during individual simulations. The default threshold is 3% (`0.03`). For random-intercept models, 10% is a safe tolerance -- most simulations will converge. If the failure rate exceeds this threshold, MCPower raises an error, signaling that your design may have too few clusters or too few observations per cluster.

### Step 5: Run the analysis

```{code-cell} ipython3
:tags: [remove-stderr]
model.find_power(sample_size=1000)
```

MCPower runs 800 Monte Carlo simulations (the default for mixed models). In each simulation, it:
1. Generates synthetic clustered data with the specified ICC and effects
2. Fits a linear mixed-effects model using the C++ native solver
3. Tests each fixed effect for significance at alpha = 0.05

Power is the proportion of simulations where each effect is significant.

---

## Statistical Model

The random intercept model is:

$$y_{ij} = \mathbf{x}_{ij}'\boldsymbol{\beta} + b_j + e_{ij}$$

$$b_j \sim \mathcal{N}(0, \tau^2), \quad e_{ij} \sim \mathcal{N}(0, \sigma^2)$$

Where:
- $y_{ij}$ is the outcome for observation $i$ in cluster $j$
- $\mathbf{x}_{ij}'\boldsymbol{\beta}$ is the fixed-effects component (treatment, motivation)
- $b_j$ is the random intercept for cluster $j$ -- the school-level deviation from the grand mean
- $e_{ij}$ is the residual error for observation $i$ in cluster $j$

The between-cluster variance $\tau^2$ is derived from the ICC:

$$\tau^2 = \frac{\text{ICC}}{1 - \text{ICC}} \cdot \sigma^2_{\text{within}}$$

where $\sigma^2_{\text{within}}$ includes both the residual variance and the variance explained by fixed effects. Since MCPower standardizes continuous predictors to variance 1, this simplifies to:

$$\sigma^2_{\text{within}} = 1 + \sum_k \beta_k^2 \cdot \text{Var}(X_k)$$

MCPower automatically adjusts $\tau^2$ upward so that the actual ICC in the generated data matches the value you specify, even after accounting for the variance explained by fixed effects.

---

## Design Recommendations

MCPower assigns treatment at the individual level within each cluster. Because the treatment contrast is estimated within each cluster, the cluster random intercept largely cancels out. As a result, **ICC and cluster allocation have minimal impact on fixed-effect power** -- power depends primarily on total N.

> **Note:** This applies because MCPower generates treatment at the individual level within each cluster. In cluster-randomized trials where entire clusters receive the same treatment, the design effect (`1 + (m-1) × ICC`) would substantially reduce power.

The number of clusters matters for convergence and estimation stability, not for power:

- **Minimum:** 5 observations per cluster (enforced by MCPower; warning below 10).
- **Recommended:** 10--20 clusters for stable random effect estimation and convergence.

---

## Common Variations

### Using cluster_size instead of n_clusters

If you know the cluster size rather than the number of clusters:

```{code-cell} ipython3
:tags: [remove-stderr]
model = MCPower("satisfaction ~ treatment + motivation + (1|school)")
model.set_simulations(400)
model.set_cluster("school", ICC=0.2, cluster_size=50)
model.set_effects("treatment=0.5, motivation=0.3")
model.set_max_failed_simulations(0.10)
model.find_power(sample_size=1000)  # 1000/50 = 20 schools
```

### Finding the required sample size

Search for the minimum sample size that achieves 80% power:

```{code-cell} ipython3
:tags: [remove-stderr]
model = MCPower("satisfaction ~ treatment + motivation + (1|school)")
model.set_simulations(400)
model.set_cluster("school", ICC=0.2, n_clusters=20)
model.set_effects("treatment=0.5, motivation=0.3")
model.set_max_failed_simulations(0.10)
model.find_sample_size(
    from_size=200,
    to_size=2000,
    by=200,
    target_test="treatment",
)
```

Note: as the total sample size changes, the cluster size changes too (sample_size / n_clusters). MCPower handles this automatically.

### Binary treatment variable

If treatment is a binary variable (0/1):

```{code-cell} ipython3
:tags: [remove-stderr]
model = MCPower("satisfaction ~ treatment + motivation + (1|school)")
model.set_simulations(400)
model.set_cluster("school", ICC=0.2, n_clusters=20)
model.set_variable_type("treatment=binary")
model.set_effects("treatment=0.5, motivation=0.3")
model.set_max_failed_simulations(0.10)
model.find_power(sample_size=1000)
```

### Multiple fixed effects

Add as many fixed effects as your design requires:

```{code-cell} ipython3
:tags: [remove-stderr]
model = MCPower("satisfaction ~ treatment + motivation + age + gender + (1|school)")
model.set_simulations(400)
model.set_cluster("school", ICC=0.2, n_clusters=30)
model.set_variable_type("treatment=binary, gender=binary")
model.set_effects("treatment=0.5, motivation=0.3, age=0.1, gender=0.2")
model.set_max_failed_simulations(0.10)
model.find_power(sample_size=1500)
```

---

## Next Steps

- **[Tutorial: Random Slopes](mixed-slopes.md)** -- When the treatment effect varies across clusters
- **[Tutorial: Nested Random Effects](mixed-nested.md)** -- Students in classrooms in schools
- **[Mixed-Effects Models](../concepts/mixed-effects.md)** -- Overview of all mixed-effects features
- **[Effect Sizes](../concepts/effect-sizes.md)** -- Guidelines for choosing standardized effect sizes
- **[API Reference](../api/index.md)** -- Full parameter documentation for `set_cluster()` and `find_power()`