Correlations
What It Is
In real research, predictors are rarely independent. Income and education are correlated. Age and work experience move together. When predictors share variance, each one explains less unique variance in the outcome, which increases standard errors and reduces statistical power.
MCPower generates correlated predictor data using Cholesky decomposition, then transforms each variable to its target distribution while preserving the correlation structure. The effect of correlations on power depends on what you are testing:
Main effects: Correlation between predictors reduces power (shared variance → less unique variance per predictor → larger standard errors).
Interaction effects: Correlation between predictors increases power (correlated predictors produce a more variable interaction term → the interaction effect is easier to detect).
One important constraint: factor variables (categorical with 3+ levels) cannot be correlated through the standard correlation interface. If you need correlated categorical and continuous variables, upload empirical data with preserve_correlation="strict", which bootstraps whole rows to preserve the exact relationships in your dataset.
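The Cholesky step described above can be illustrated with plain NumPy. This is a minimal sketch of the general technique for two normally distributed predictors, not MCPower's internal code. The variable names and the sample size are assumptions chosen for illustration, and the later transformation to non-normal target distributions is not shown.

```python
import numpy as np

# Target correlation matrix for two predictors (r = 0.5).
target = np.array([[1.0, 0.5],
                   [0.5, 1.0]])

rng = np.random.default_rng(42)
n = 100_000  # illustrative sample size, large enough for a stable estimate

# Step 1: draw independent standard-normal columns.
z = rng.standard_normal((n, 2))

# Step 2: multiply by the Cholesky factor L (target = L @ L.T)
# so that the columns of x have the target correlation.
L = np.linalg.cholesky(target)
x = z @ L.T

# The empirical correlation should land close to the 0.5 target.
empirical_r = np.corrcoef(x, rowvar=False)[0, 1]
print(f"target r = 0.5, empirical r = {empirical_r:.3f}")
```

Because the Cholesky factor only exists for a valid correlation matrix, this approach also explains why the matrix passed to the simulation has to be internally consistent.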
How It Works in MCPower
Set correlations with set_correlations():
model.set_correlations("corr(income, education)=0.5, corr(income, age)=0.3")
Or provide a full correlation matrix as a numpy array:
model.set_correlations(np.array([[1.0, 0.4], [0.4, 1.0]]))
Unspecified pairs default to r = 0 (independent). The matrix must be positive semi-definite (PSD).
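When more than two predictors are involved, writing the full matrix by hand is error-prone. The sketch below builds a symmetric matrix from a few named pairs and leaves every unspecified pair at 0, mirroring the default described above. `build_corr_matrix` is a hypothetical helper written for this example, not part of MCPower, and the variable order is an assumption that must match your formula.

```python
import numpy as np

def build_corr_matrix(names, pairs):
    """Hypothetical helper: symmetric correlation matrix, unspecified pairs stay 0."""
    index = {name: i for i, name in enumerate(names)}
    matrix = np.eye(len(names))  # 1.0 on the diagonal, 0.0 elsewhere
    for (a, b), r in pairs.items():
        i, j = index[a], index[b]
        matrix[i, j] = r
        matrix[j, i] = r  # correlations are symmetric
    return matrix

names = ["income", "education", "age"]  # assumed formula order
corr = build_corr_matrix(
    names,
    {("income", "education"): 0.5, ("income", "age"): 0.3},
)
print(corr)
```

The resulting array has the same shape as the NumPy example above and could be passed to model.set_correlations() in the same way.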
Guidelines
Correlation Magnitudes (Main Effects)
| Range | Label | Sample Size Impact | Action |
|---|---|---|---|
| 0.00–0.20 | Negligible | ~1.00x (no increase needed) | Safe to ignore |
| 0.20–0.40 | Small | 1.07–1.20x | Include if known |
| 0.40–0.60 | Moderate | 1.20–1.53x | Always include |
| 0.60–0.70 | Large | 1.53–1.87x | Include; consider dropping one predictor |
| 0.70+ | Very large | >1.87x; multicollinearity risk | Check if both predictors are needed |
Positive Semi-Definite (PSD) Requirement
A correlation matrix is PSD when all the correlations are mutually consistent – i.e., they could actually occur together in real data. MCPower checks this automatically and raises an error if not.
Example of an invalid (non-PSD) combination: If A and B are strongly correlated (r=0.9) and A and C are strongly correlated (r=0.9), then B and C must also be positively correlated – they can’t be negatively correlated (r=-0.9) because that contradicts the first two relationships.
If you get a PSD error, reduce the most extreme correlations or check that the signs are logically consistent.
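You can check a matrix yourself before handing it to MCPower. The sketch below uses the standard eigenvalue criterion (a symmetric matrix is PSD when none of its eigenvalues is negative) and applies it to the A, B, C example above. `is_psd` and its tolerance are assumptions made for this illustration; MCPower's own check may be implemented differently.

```python
import numpy as np

def is_psd(matrix, tol=1e-10):
    """Return True if a symmetric matrix has no eigenvalue below -tol."""
    eigenvalues = np.linalg.eigvalsh(matrix)
    return bool(eigenvalues.min() >= -tol)

# Variable order: A, B, C.
# A-B = 0.9 and A-C = 0.9, but B-C = -0.9: mutually inconsistent.
inconsistent = np.array([[1.0,  0.9,  0.9],
                         [0.9,  1.0, -0.9],
                         [0.9, -0.9,  1.0]])

# Same matrix with B-C flipped to +0.9: consistent.
consistent = np.array([[1.0, 0.9, 0.9],
                       [0.9, 1.0, 0.9],
                       [0.9, 0.9, 1.0]])

for label, matrix in [("inconsistent", inconsistent), ("consistent", consistent)]:
    smallest = np.linalg.eigvalsh(matrix).min()
    print(f"{label}: PSD={is_psd(matrix)}, smallest eigenvalue={smallest:.2f}")
```

The inconsistent matrix has a smallest eigenvalue of -0.8, which is why no dataset can produce those three correlations together.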
Constraints
Factor variables cannot be correlated through set_correlations().
Correlations are symmetric: corr(x1, x2)=0.3 and corr(x2, x1)=0.3 are identical.
Matrix dimensions must match the number of non-factor variables, in formula order.
For correlated factors, use upload_data() with preserve_correlation="strict".
Common Patterns
Correlation Preservation Modes with Uploaded Data
| Mode | Source | Best For |
|---|---|---|
| | Manual only | Full manual control |
| | Computed from data + manual overrides | Empirical baseline with adjustments |
| "strict" | Bootstrapped rows | Most realistic simulation from pilot data |
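To see why bootstrapping whole rows preserves relationships that a correlation matrix cannot express, consider the following NumPy sketch. It illustrates the general idea only, using made-up pilot data and a hypothetical `bootstrap_rows` helper; it is not MCPower's implementation.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical pilot data: a categorical column and a continuous column
# whose level depends on the category.
group = np.array(["a", "a", "b", "b", "c", "c"])
score = np.array([1.0, 1.2, 2.9, 3.1, 5.0, 5.2])

def bootstrap_rows(n_rows, sample_size, rng):
    """Draw row indices with replacement; each index selects a whole row."""
    return rng.integers(0, n_rows, size=sample_size)

idx = bootstrap_rows(len(group), sample_size=10, rng=rng)

# Indexing both columns with the same indices keeps each row intact.
sim_group, sim_score = group[idx], score[idx]

for g, s in zip(sim_group, sim_score):
    print(g, s)
```

Every simulated pair is a pair that occurred in the pilot data, so the link between the factor and the continuous variable carries over without ever being expressed as a correlation coefficient.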
Typical Correlations by Domain
| Domain | Predictor Pair | Typical r |
|---|---|---|
| Education | SES and test scores | 0.30–0.50 |
| Psychology | Anxiety and depression | 0.40–0.70 |
| Medicine | Age and blood pressure | 0.20–0.40 |
| Social science | Income and education | 0.40–0.60 |
| Marketing | Ad spend and brand awareness | 0.20–0.40 |
Impact on Required Sample Size (Main Effects)
| Correlation (r) | Required N | Multiplier |
|---|---|---|
| 0.00 | 375 | 1.00x |
| 0.10 | 375 | 1.00x |
| 0.20 | 375 | 1.00x |
| 0.30 | 400 | 1.07x |
| 0.40 | 450 | 1.20x |
| 0.50 | 500 | 1.33x |
| 0.60 | 575 | 1.53x |
| 0.70 | 700 | 1.87x |
Simulation: y ~ x1 + x2, both effects=0.15, 1600 simulations, seed=42.
Correlations up to 0.20 leave the required sample size unchanged in this simulation, and r = 0.30 adds about 7%. From 0.50 upward the increase becomes substantial: 1.33x at r = 0.50 and 1.87x at r = 0.70.
# Try it yourself: correlation impact on required sample size
from mcpower import MCPower

for r in [0.0, 0.10, 0.30, 0.50, 0.70]:
    model = MCPower("y ~ x1 + x2")
    model.set_effects("x1=0.15, x2=0.15")
    if r > 0:  # unspecified pairs already default to r = 0
        model.set_correlations(f"corr(x1, x2)={r}")
    model.set_seed(42)
    model.set_simulations(1600)
    model.find_sample_size(
        from_size=50, to_size=2000, by=25,
        target_test="x1",
    )
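As a rough cross-check on the simulated multipliers, the standard variance inflation factor (VIF) for two predictors is 1 / (1 - r²). It describes how much the sampling variance of a coefficient grows when the predictors are correlated. The sketch below compares it with the multipliers from the table above. It assumes a linear model with fixed residual variance, so the agreement is only approximate, also because the simulation searches sample sizes in steps of 25.

```python
def vif_two_predictors(r):
    """Variance inflation factor for one of two predictors correlated at r."""
    return 1.0 / (1.0 - r ** 2)

# Simulated multipliers copied from the table above.
simulated = {0.0: 1.00, 0.3: 1.07, 0.4: 1.20, 0.5: 1.33, 0.6: 1.53, 0.7: 1.87}

for r, multiplier in simulated.items():
    print(f"r={r:.2f}  VIF={vif_two_predictors(r):.2f}  simulated={multiplier:.2f}")
```

The formula gives a quick first estimate of the sample size penalty before running a full simulation.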
Impact on Interaction Power
For interactions, the effect is reversed – correlation helps:
| Correlation (\|r\|) | Required N for interaction | Multiplier |
|---|---|---|
| 0.00 | 850 | 1.00x |
| 0.30 | 700 | 0.82x |
| 0.50 | 650 | 0.76x |
Simulation: y ~ x1 + x2 + x1:x2, main effects=0.15, interaction=0.10, 1600 simulations, seed=42.
# Try it yourself: correlation impact on interaction power
from mcpower import MCPower

for r in [0.0, 0.30, 0.50]:
    model = MCPower("y ~ x1 + x2 + x1:x2")
    model.set_effects("x1=0.15, x2=0.15, x1:x2=0.10")
    if r > 0:  # unspecified pairs already default to r = 0
        model.set_correlations(f"corr(x1, x2)={r}")
    model.set_seed(42)
    model.set_simulations(1600)
    model.find_sample_size(
        from_size=50, to_size=3000, by=50,
        target_test="x1:x2",
    )
If your model has both main effects and interactions, correlation creates a trade-off: main effects need more observations while interactions need fewer. Neither the sign of the correlation nor the sign of the effects matters; only the magnitude |r| does.
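The claim that correlated predictors produce a more variable interaction term can be checked numerically. For two standard-normal predictors correlated at r, the product x1*x2 has variance 1 + r², which depends only on the magnitude of r. The sketch below assumes normally distributed predictors and an interaction term formed as a simple product; it does not attempt to reproduce the simulated sample sizes above.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 200_000  # illustrative sample size

def product_variance(r):
    """Empirical variance of x1*x2 for standard-normal x1, x2 correlated at r."""
    cov = np.array([[1.0, r], [r, 1.0]])
    x = rng.multivariate_normal(mean=[0.0, 0.0], cov=cov, size=n)
    return np.var(x[:, 0] * x[:, 1])

for r in [0.0, 0.3, 0.5]:
    print(f"r={r:.1f}  var(x1*x2)={product_variance(r):.3f}  theory={1 + r**2:.3f}")
```

A more variable interaction term carries more information about its coefficient, which is the mechanism behind the lower sample size requirement.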
Impact with Multiple Predictors
With three correlated predictors, the impact compounds:
| Pairwise Correlation | Required N | Multiplier |
|---|---|---|
| All r=0.00 | 350 | 1.00x |
| All r=0.30 | 400 | 1.14x |
| All r=0.50 | 550 | 1.57x |
Simulation: y ~ x1 + x2 + x3, all effects=0.15, 1600 simulations, seed=42.
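For more than two predictors, the variance inflation factor of each predictor is the corresponding diagonal element of the inverse correlation matrix. The sketch below applies this standard result to three equally correlated predictors as an approximate cross-check on the table above. `vif_from_corr` and `equicorrelated` are hypothetical helpers written for this example, and the comparison assumes a linear model with fixed residual variance.

```python
import numpy as np

def vif_from_corr(corr):
    """VIF of each predictor: the diagonal of the inverse correlation matrix."""
    return np.diag(np.linalg.inv(corr))

def equicorrelated(p, r):
    """p x p correlation matrix with every off-diagonal entry equal to r."""
    return np.full((p, p), r) + (1.0 - r) * np.eye(p)

for r in [0.0, 0.3, 0.5]:
    vif = vif_from_corr(equicorrelated(3, r))[0]
    print(f"all r={r:.2f}  VIF={vif:.2f}")
```

At r = 0.5 the VIF is 1.5 with three predictors, compared with 1.33 for two, which is in line with the larger simulated multipliers.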
Learn More
Uploading Data – correlation preservation from empirical data
API Reference: set_correlations – full parameter documentation