Correlations

What It Is

In real research, predictors are rarely independent. Income and education are correlated. Age and work experience move together. When predictors share variance, each one explains less unique variance in the outcome, which increases standard errors and reduces statistical power.

MCPower generates correlated predictor data using Cholesky decomposition, then transforms each variable to its target distribution while preserving the correlation structure. The effect of correlations on power depends on what you are testing:

  • Main effects: Correlation between predictors reduces power (shared variance → less unique variance per predictor → larger standard errors).

  • Interaction effects: Correlation between predictors increases power (correlated predictors produce a more variable interaction term → the interaction effect is easier to detect).

One important constraint: factor variables (categorical with 3+ levels) cannot be correlated through the standard correlation interface. If you need correlated categorical and continuous variables, upload empirical data with preserve_correlation="strict", which bootstraps whole rows to preserve the exact relationships in your dataset.
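
The Cholesky step mentioned above can be sketched with plain NumPy. This is a minimal illustration, not MCPower's internal code: the names `draw_correlated` and `target` are hypothetical, and the sketch covers only the normal-variate stage, not the later transformation to other distributions.

```python
import numpy as np

def draw_correlated(target, n, seed=0):
    """Draw n rows of standard-normal data with correlation matrix `target`."""
    rng = np.random.default_rng(seed)
    # Lower-triangular factor: target = lower @ lower.T
    # (np.linalg.cholesky needs a strictly positive definite matrix)
    lower = np.linalg.cholesky(target)
    z = rng.standard_normal((n, target.shape[0]))  # independent columns
    return z @ lower.T                             # mix columns to induce correlation

target = np.array([[1.0, 0.5],
                   [0.5, 1.0]])
x = draw_correlated(target, n=100_000)
sample_r = np.corrcoef(x, rowvar=False)[0, 1]
print(f"target r = 0.50, sample r = {sample_r:.3f}")
```

With a large `n`, the sample correlation should land close to the target; small deviations are ordinary sampling noise.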


How It Works in MCPower

Set correlations with set_correlations():

model.set_correlations("corr(income, education)=0.5, corr(income, age)=0.3")

Or provide a full correlation matrix as a numpy array:

import numpy as np
model.set_correlations(np.array([[1.0, 0.4], [0.4, 1.0]]))

Unspecified pairs default to r = 0 (independent). The matrix must be positive semi-definite (PSD).
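
With more than two predictors, it can help to build the matrix from named pairs. The helper below is hypothetical and not part of MCPower. It assumes the variable list is given in formula order and fills unspecified pairs with 0, mirroring the default described above.

```python
import numpy as np

def build_corr_matrix(variables, pairs):
    """Return a symmetric correlation matrix for `variables` (formula order)."""
    index = {name: i for i, name in enumerate(variables)}
    matrix = np.eye(len(variables))        # 1.0 on the diagonal, 0.0 elsewhere
    for (a, b), r in pairs.items():
        i, j = index[a], index[b]
        matrix[i, j] = matrix[j, i] = r    # keep the matrix symmetric
    return matrix

corr = build_corr_matrix(
    ["income", "education", "age"],
    {("income", "education"): 0.5, ("income", "age"): 0.3},
)
print(corr)
```

The resulting array could then be passed to set_correlations() in the same way as the two-variable matrix shown earlier.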


Guidelines

Correlation Magnitudes (Main Effects)

| Range | Label | Sample Size Impact | Action |
|---|---|---|---|
| 0.00–0.20 | Negligible | ~1.00x (no increase needed) | Safe to ignore |
| 0.20–0.40 | Small | 1.07–1.20x | Include if known |
| 0.40–0.60 | Moderate | 1.20–1.53x | Always include |
| 0.60–0.70 | Large | 1.53–1.87x | Include; consider dropping one predictor |
| 0.70+ | Very large | >1.87x; multicollinearity risk | Check if both predictors are needed |

Positive Semi-Definite (PSD) Requirement

A correlation matrix is PSD when all the correlations are mutually consistent – i.e., they could actually occur together in real data. MCPower checks this automatically and raises an error if not.

Example of an invalid (non-PSD) combination: If A and B are strongly correlated (r=0.9) and A and C are strongly correlated (r=0.9), then B and C must also be positively correlated – they can’t be negatively correlated (r=-0.9) because that contradicts the first two relationships.

If you get a PSD error, reduce the most extreme correlations or check that the signs are logically consistent.
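
To see why the example above fails, you can check the eigenvalues yourself. This is a sketch of one common PSD test (no negative eigenvalues), not necessarily the check MCPower performs internally; `is_psd` is a hypothetical helper.

```python
import numpy as np

def is_psd(matrix, tol=1e-10):
    """True if the symmetric matrix has no eigenvalue below -tol."""
    return bool(np.linalg.eigvalsh(matrix).min() >= -tol)

# corr(A, B) = 0.9, corr(A, C) = 0.9, corr(B, C) = -0.9: the inconsistent case
bad = np.array([[1.0, 0.9, 0.9],
                [0.9, 1.0, -0.9],
                [0.9, -0.9, 1.0]])

# Same matrix with corr(B, C) flipped to +0.9
good = np.array([[1.0, 0.9, 0.9],
                 [0.9, 1.0, 0.9],
                 [0.9, 0.9, 1.0]])

print(is_psd(bad), np.linalg.eigvalsh(bad).min())
print(is_psd(good), np.linalg.eigvalsh(good).min())
```

The inconsistent matrix has a smallest eigenvalue of about -0.8, so it is not PSD. Flipping the sign of the B–C correlation raises the smallest eigenvalue to about 0.1, and the matrix becomes valid.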

Constraints

  • Factor variables cannot be correlated through set_correlations().

  • Correlations are symmetric: corr(x1, x2)=0.3 and corr(x2, x1)=0.3 are identical.

  • Matrix dimensions must match the number of non-factor variables, in formula order.

  • For correlated factors, use upload_data() with preserve_correlation="strict".


Common Patterns

Correlation Preservation Modes with Uploaded Data

| Mode | Source | Best For |
|---|---|---|
| "no" | Manual only | Full manual control |
| "partial" | Computed from data + manual overrides | Empirical baseline with adjustments |
| "strict" (default) | Bootstrapped rows | Most realistic simulation from pilot data |
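
The idea behind row bootstrapping can be illustrated with NumPy alone. This sketch is not MCPower's implementation, and the pilot data here is synthetic. It shows only that resampling whole rows keeps each pair of values together, so the correlation in the resampled data stays close to the original.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical pilot data: two columns built to correlate at roughly 0.6
n_pilot = 500
x1 = rng.standard_normal(n_pilot)
x2 = 0.6 * x1 + 0.8 * rng.standard_normal(n_pilot)
pilot = np.column_stack([x1, x2])

# Resample whole rows with replacement, so each (x1, x2) pair stays intact
rows = rng.integers(0, n_pilot, size=2000)
boot = pilot[rows]

r_pilot = np.corrcoef(pilot, rowvar=False)[0, 1]
r_boot = np.corrcoef(boot, rowvar=False)[0, 1]
print(f"pilot r = {r_pilot:.3f}, bootstrap r = {r_boot:.3f}")
```

Because no column is shuffled independently, the same approach works when one column is categorical, which is why row resampling suits correlated factors.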

Typical Correlations by Domain

| Domain | Predictor Pair | Typical r |
|---|---|---|
| Education | SES and test scores | 0.30–0.50 |
| Psychology | Anxiety and depression | 0.40–0.70 |
| Medicine | Age and blood pressure | 0.20–0.40 |
| Social science | Income and education | 0.40–0.60 |
| Marketing | Ad spend and brand awareness | 0.20–0.40 |

Impact on Required Sample Size (Main Effects)

| Correlation (r) | Required N | Multiplier |
|---|---|---|
| 0.00 | 375 | 1.00x |
| 0.10 | 375 | 1.00x |
| 0.20 | 375 | 1.00x |
| 0.30 | 400 | 1.07x |
| 0.40 | 450 | 1.20x |
| 0.50 | 500 | 1.33x |
| 0.60 | 575 | 1.53x |
| 0.70 | 700 | 1.87x |

Simulation: y ~ x1 + x2, both effects=0.15, 1600 simulations, seed=42.

In this simulation, correlations up to 0.20 leave the required N unchanged, and r = 0.30 adds about 7%. From r = 0.50 upward, the increase becomes substantial (1.33x or more).

# Try it yourself: correlation impact on required sample size
from mcpower import MCPower

for r in [0.0, 0.10, 0.30, 0.50, 0.70]:
    model = MCPower("y ~ x1 + x2")
    model.set_effects("x1=0.15, x2=0.15")
    if r > 0:
        model.set_correlations(f"corr(x1, x2)={r}")
    model.set_seed(42)
    model.set_simulations(1600)
    model.find_sample_size(
        from_size=50, to_size=2000, by=25,
        target_test="x1",
    )
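
As a rough analytic cross-check on the table above (MCPower itself works by simulation), the textbook variance inflation factor for two predictors is 1 / (1 - r²). Under ordinary least squares, the sampling variance of each coefficient scales with this factor, so it approximates the sample-size multiplier. The function name below is hypothetical.

```python
# Variance inflation factor for two predictors: VIF = 1 / (1 - r^2)
def vif_two_predictors(r):
    return 1.0 / (1.0 - r ** 2)

# Simulated multipliers copied from the table above
simulated = {0.0: 1.00, 0.3: 1.07, 0.5: 1.33, 0.7: 1.87}

for r, sim in simulated.items():
    print(f"r={r:.2f}  VIF={vif_two_predictors(r):.2f}  simulated={sim:.2f}")
```

The analytic values (1.00, 1.10, 1.33, 1.96) sit close to the simulated ones. Some gap is expected, because the search moves in steps of 25 and uses a finite number of simulations.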

Impact on Interaction Power

For interactions, the effect is reversed – correlation helps:

| Correlation (\|r\|) | Required N for interaction | Multiplier |
|---|---|---|
| 0.00 | 850 | 1.00x |
| 0.30 | 700 | 0.82x |
| 0.50 | 650 | 0.76x |

Simulation: y ~ x1 + x2 + x1:x2, main effects=0.15, interaction=0.10, 1600 simulations, seed=42.

# Try it yourself: correlation impact on interaction power
from mcpower import MCPower

for r in [0.0, 0.30, 0.50]:
    model = MCPower("y ~ x1 + x2 + x1:x2")
    model.set_effects("x1=0.15, x2=0.15, x1:x2=0.10")
    if r > 0:
        model.set_correlations(f"corr(x1, x2)={r}")
    model.set_seed(42)
    model.set_simulations(1600)
    model.find_sample_size(
        from_size=50, to_size=3000, by=50,
        target_test="x1:x2",
    )

If your model has both main effects and interactions, correlation creates a trade-off: main effects need more observations while interactions need fewer. Neither the sign of the correlation nor the signs of the effects matter; only |r| does.
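
One way to see why correlation helps interactions: for two standard-normal predictors with correlation r, the product term x1·x2 has variance 1 + r², which grows with |r| whatever the sign. The sketch below checks this numerically. It assumes normal predictors and is not MCPower code; `product_variance` is a hypothetical helper.

```python
import numpy as np

def product_variance(r, n=200_000, seed=0):
    """Sample variance of x1*x2 for standard-normal x1, x2 with correlation r."""
    rng = np.random.default_rng(seed)
    x1 = rng.standard_normal(n)
    noise = rng.standard_normal(n)
    x2 = r * x1 + np.sqrt(1.0 - r ** 2) * noise   # corr(x1, x2) = r
    return float(np.var(x1 * x2))

for r in [0.0, 0.3, 0.5, -0.5]:
    print(f"r={r:+.2f}  simulated={product_variance(r):.3f}  theory={1 + r ** 2:.3f}")
```

A more variable interaction term carries more information about its coefficient, which is consistent with the lower required N in the table above.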

Impact with Multiple Predictors

With three correlated predictors, the impact compounds:

| Pairwise Correlation | Required N | Multiplier |
|---|---|---|
| All r=0.00 | 350 | 1.00x |
| All r=0.30 | 400 | 1.14x |
| All r=0.50 | 550 | 1.57x |

Simulation: y ~ x1 + x2 + x3, all effects=0.15, 1600 simulations, seed=42.
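
For more than two predictors, the variance inflation factor of each predictor is the matching diagonal entry of the inverse correlation matrix. The sketch below applies that standard result to three equally correlated predictors as a rough comparison with the simulated multipliers. Both helper names are hypothetical, and this is not how MCPower computes its results.

```python
import numpy as np

def vif_from_corr(corr):
    """VIF of each predictor: the diagonal of the inverse correlation matrix."""
    return np.diag(np.linalg.inv(corr))

def equicorrelated(p, r):
    """p x p matrix with 1 on the diagonal and r everywhere else."""
    return np.full((p, p), r) + (1.0 - r) * np.eye(p)

for r in [0.0, 0.3, 0.5]:
    vif = vif_from_corr(equicorrelated(3, r))
    print(f"all r={r:.2f}  VIF per predictor={vif[0]:.2f}")
```

The analytic factors (1.00, 1.16, 1.50) are in the same range as the simulated multipliers (1.00x, 1.14x, 1.57x). Exact agreement is not expected given the step size of the search and simulation noise.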


Learn More