Tutorial: Correlated Predictors

Goal: Your predictors are correlated and you want to account for that in your power analysis.

Table of Contents

Full Working Example
Step-by-Step Walkthrough
Output Interpretation
Using Matrix Format
Correlated vs. Uncorrelated: A Comparison
Common Variations
Next Steps

Full Working Example

A social science researcher is studying predictors of health outcomes. The model includes income, education, and social support – variables that are correlated in the real world. Income and education are moderately correlated (r=0.50), and both are weakly correlated with social support.

from mcpower import MCPower

# 1. Define the model
model = MCPower("health = income + education + social_support")
model.set_simulations(400)

# 2. Set effect sizes
model.set_effects("income=0.30, education=0.25, social_support=0.40")

# 3. Set pairwise correlations between predictors
model.set_correlations(
    "(income, education)=0.50, "
    "(income, social_support)=0.20, "
    "(education, social_support)=0.30"
)

# 4. Find the required sample size
model.find_sample_size(
    target_test="all",
    from_size=50,
    to_size=400,
    by=40,
)

Effects: income=0.3, education=0.25, social_support=0.4
Correlations: 3 set
Model settings applied successfully

================================================================================
SAMPLE SIZE ANALYSIS RESULTS
================================================================================

Sample Size Requirements:
Test                                     Required N  
-----------------------------------------------------
overall                                  50          
income                                   130         
education                                210         
social_support                           90          

Step-by-Step Walkthrough

Steps 1-2: Model and effects

model = MCPower("health = income + education + social_support")
model.set_simulations(400)
model.set_effects("income=0.30, education=0.25, social_support=0.40")

A three-predictor model. All variables are continuous by default:

income=0.30 – a medium effect
education=0.25 – a medium effect
social_support=0.40 – a large effect

Step 3: Set correlations (string format)

model.set_correlations(
    "(income, education)=0.50, "
    "(income, social_support)=0.20, "
    "(education, social_support)=0.30"
)

This specifies three pairwise correlations:

Pair	Correlation	Meaning
income, education	0.50	Moderately correlated (people with more education tend to earn more)
income, social_support	0.20	Weakly correlated
education, social_support	0.30	Weakly-to-moderately correlated

The shorthand (x, y)=r is equivalent to the full form corr(x, y)=r. Any pair not specified defaults to 0 (independent).

MCPower uses Cholesky decomposition to generate data that matches this correlation structure. It also validates that the resulting correlation matrix is positive semi-definite – contradictory correlations (e.g., A-B=0.9, A-C=0.9, B-C=-0.9) will raise an error.
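To make the idea concrete, here is a minimal NumPy sketch of Cholesky-based sampling and of why the contradictory example is rejected. It illustrates the general technique only; MCPower's internal implementation may differ, and the helper name `sample_correlated` is hypothetical.

```python
import numpy as np

def sample_correlated(corr, n, seed=0):
    """Draw n rows of standard-normal data whose population correlation is `corr`."""
    rng = np.random.default_rng(seed)
    # Lower-triangular factor with chol @ chol.T == corr;
    # np.linalg.cholesky raises LinAlgError if corr is not positive definite.
    chol = np.linalg.cholesky(corr)
    independent = rng.standard_normal((n, corr.shape[0]))
    return independent @ chol.T

# The correlation structure from this tutorial (income, education, social_support).
valid = np.array([
    [1.00, 0.50, 0.20],
    [0.50, 1.00, 0.30],
    [0.20, 0.30, 1.00],
])
data = sample_correlated(valid, n=50_000)
sample_corr = np.corrcoef(data, rowvar=False)  # close to `valid`, up to sampling noise

# A-B=0.9, A-C=0.9, B-C=-0.9 cannot all hold at once.
contradictory = np.array([
    [1.0,  0.9,  0.9],
    [0.9,  1.0, -0.9],
    [0.9, -0.9,  1.0],
])
min_eigenvalue = np.linalg.eigvalsh(contradictory).min()  # negative => not a valid matrix
try:
    np.linalg.cholesky(contradictory)
    cholesky_failed = False
except np.linalg.LinAlgError:
    cholesky_failed = True
```

A negative smallest eigenvalue is the numerical signature of an impossible correlation structure: no dataset can have A strongly positive with both B and C while B and C are strongly negative with each other.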

Step 4: Find sample sizes

model.find_sample_size(
    target_test="all",
    from_size=50,
    to_size=400,
    by=40,
)

With target_test="all", MCPower reports the minimum sample size for each individual predictor and the overall F-test.
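The reported sample sizes can only be values that the search actually visits. The sketch below lists the candidate sizes under the assumption that the search steps from from_size to to_size in increments of by; this matches the outputs shown on this page but is an assumption about the search, not a documented guarantee.

```python
# Assumed search grid: from_size, from_size + by, ... without exceeding to_size.
from_size, to_size, by = 50, 400, 40
candidate_sizes = list(range(from_size, to_size + 1, by))
print(candidate_sizes)  # [50, 90, 130, 170, 210, 250, 290, 330, 370]
```

A smaller by gives finer resolution at the cost of more simulation time.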

Output Interpretation

Sample Size Requirements:
Test                                     Required N
-----------------------------------------------------
overall                                  50
income                                   130
education                                210
social_support                           90

Key observations:

education requires the largest N (210) despite having a similar effect size to income (0.25 vs. 0.30). Its effect is slightly smaller, and it shares substantial variance with income (r=0.50) and with social_support (r=0.30), leaving less unique variance for detecting its independent effect.
social_support requires the smallest N of the individual predictors (90) because it has the largest effect size (0.40) and the weakest correlations with the other predictors.
The overall F-test requires only 50 participants, the smallest size searched, because it tests whether the model as a whole explains significant variance. With several non-zero effects, it is usually more powerful than any individual coefficient test.
To power all individual tests at 80%, you need N=210, the maximum across all effects.
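One way to quantify "less unique variance" is the variance inflation factor (VIF), a standard regression diagnostic that is not part of the MCPower output. For standardized predictors, the VIFs are the diagonal of the inverse correlation matrix. The following sketch uses only NumPy and the correlations from this example:

```python
import numpy as np

predictors = ["income", "education", "social_support"]
corr = np.array([
    [1.00, 0.50, 0.20],   # income
    [0.50, 1.00, 0.30],   # education
    [0.20, 0.30, 1.00],   # social_support
])

# VIF_j = 1 / (1 - R_j^2), which equals the j-th diagonal entry of inv(corr).
vif = dict(zip(predictors, np.diag(np.linalg.inv(corr))))
for name, value in vif.items():
    print(f"{name}: VIF = {value:.2f}")
```

Education has the highest VIF (about 1.41, versus about 1.34 for income and 1.10 for social_support), so the variance of its coefficient estimate is inflated the most by overlap with the other predictors.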

Using Matrix Format

For models with many predictors, a numpy correlation matrix can be more convenient:

import numpy as np
from mcpower import MCPower

model = MCPower("health = income + education + social_support")
model.set_simulations(400)
model.set_effects("income=0.30, education=0.25, social_support=0.40")

# Correlation matrix: rows/columns in formula order
corr_matrix = np.array([
    [1.00, 0.50, 0.20],   # income
    [0.50, 1.00, 0.30],   # education
    [0.20, 0.30, 1.00],   # social_support
])
model.set_correlations(corr_matrix)

model.find_sample_size(
    target_test="all",
    from_size=50,
    to_size=400,
    by=40,
)

Effects: income=0.3, education=0.25, social_support=0.4
Correlation matrix set
Model settings applied successfully

================================================================================
SAMPLE SIZE ANALYSIS RESULTS
================================================================================

Sample Size Requirements:
Test                                     Required N  
-----------------------------------------------------
overall                                  50          
income                                   130         
education                                210         
social_support                           90          

The matrix must be:

Square – rows and columns match the number of non-factor predictors
Symmetric – corr_matrix[i,j] == corr_matrix[j,i]
Diagonal = 1.0 – each variable correlates perfectly with itself
Positive semi-definite – the matrix represents a valid correlation structure
Ordered by formula – columns correspond to predictors in the order they appear in the formula (here: income, education, social_support)
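The first four requirements can be checked before the matrix is passed to MCPower. The helper below is a hypothetical convenience function (`check_corr_matrix` is not part of MCPower) and is only a sketch of such a pre-check; MCPower performs its own validation.

```python
import numpy as np

def check_corr_matrix(matrix, n_predictors, tol=1e-8):
    """Return a list of problems found; an empty list means the matrix passes."""
    m = np.asarray(matrix, dtype=float)
    if m.ndim != 2 or m.shape != (n_predictors, n_predictors):
        return [f"expected shape ({n_predictors}, {n_predictors}), got {m.shape}"]
    problems = []
    if not np.allclose(m, m.T, atol=tol):
        problems.append("not symmetric")
    if not np.allclose(np.diag(m), 1.0, atol=tol):
        problems.append("diagonal is not all 1.0")
    # Symmetrize before the eigenvalue test so it is well defined for any input.
    if np.linalg.eigvalsh((m + m.T) / 2).min() < -tol:
        problems.append("not positive semi-definite")
    return problems

corr_matrix = np.array([
    [1.00, 0.50, 0.20],   # income
    [0.50, 1.00, 0.30],   # education
    [0.20, 0.30, 1.00],   # social_support
])
ok_result = check_corr_matrix(corr_matrix, n_predictors=3)        # []
bad_result = check_corr_matrix([[1.0, 0.5], [0.4, 1.0]], 2)       # ["not symmetric"]
```

The fifth requirement, ordering by formula, cannot be verified from the numbers alone, so it is worth keeping the row comments in sync with the formula.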

Correlated vs. Uncorrelated: A Comparison

To see the impact of correlations, compare the same model with and without them:

from mcpower import MCPower

# --- Without correlations (unrealistic but illustrative) ---
model_uncorr = MCPower("health = income + education + social_support")
model_uncorr.set_simulations(400)
model_uncorr.set_effects("income=0.30, education=0.25, social_support=0.40")
# No set_correlations call -- all correlations default to 0

model_uncorr.find_sample_size(
    target_test="all",
    from_size=50,
    to_size=400,
    by=40,
)

# --- With correlations (realistic) ---
model_corr = MCPower("health = income + education + social_support")
model_corr.set_simulations(400)
model_corr.set_effects("income=0.30, education=0.25, social_support=0.40")
model_corr.set_correlations(
    "(income, education)=0.50, "
    "(income, social_support)=0.20, "
    "(education, social_support)=0.30"
)

model_corr.find_sample_size(
    target_test="all",
    from_size=50,
    to_size=400,
    by=40,
)

Effects: income=0.3, education=0.25, social_support=0.4
Model settings applied successfully

================================================================================
SAMPLE SIZE ANALYSIS RESULTS
================================================================================

Sample Size Requirements:
Test                                     Required N  
-----------------------------------------------------
overall                                  50          
income                                   130         
education                                170         
social_support                           90          

Effects: income=0.3, education=0.25, social_support=0.4
Correlations: 3 set
Model settings applied successfully

================================================================================
SAMPLE SIZE ANALYSIS RESULTS
================================================================================

Sample Size Requirements:
Test                                     Required N  
-----------------------------------------------------
overall                                  50          
income                                   130         
education                                210         
social_support                           90          

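Results from the two runs above:

Test	Min N (uncorrelated)	Min N (correlated)	Increase
income	130	130	0%
education	170	210	+24%
social_support	90	90	0%
overall	50	50	0%

In these runs, only education's required N changes on the 40-participant search grid (170 to 210). Education is correlated with both other predictors (r=0.50 with income, r=0.30 with social_support), so it loses the most unique variance. Smaller increases for income and social_support can be hidden by the coarse step (by=40) and by Monte Carlo noise at 400 simulations; a finer step and more simulations would resolve them.

Key takeaway: Ignoring predictor correlations overestimates power, so a study planned that way risks being underpowered.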
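The relative increase in required N can be computed directly from the two printed result tables. This small sketch copies the Required N values from the two output blocks above; the helper name `percent_increase` is made up for illustration.

```python
# Required N copied from the two output blocks above.
uncorrelated = {"overall": 50, "income": 130, "education": 170, "social_support": 90}
correlated = {"overall": 50, "income": 130, "education": 210, "social_support": 90}

def percent_increase(before, after):
    """Relative change from `before` to `after`, rounded to a whole percent."""
    return round(100 * (after - before) / before)

increase = {
    test: percent_increase(uncorrelated[test], correlated[test])
    for test in uncorrelated
}
print(increase)
```

Because each Required N is the first grid point that reaches the target power, these percentages are only as precise as the step size used in the search.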

Common Variations

Negative correlations

model.set_correlations("(stress, coping)=-0.40")

Negative correlations are common in psychology (e.g., stress and coping strategies). They affect power the same way – any non-zero correlation reduces power for the correlated predictors.
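For a model with exactly two correlated predictors, the standard variance inflation factor is 1 / (1 - r^2), which depends only on the magnitude of r. A quick check, independent of MCPower (the function name is hypothetical):

```python
def vif_two_predictors(r):
    """Variance inflation factor when a predictor has a single correlated partner."""
    return 1.0 / (1.0 - r ** 2)

vif_negative = vif_two_predictors(-0.40)
vif_positive = vif_two_predictors(0.40)
print(f"r=-0.40: {vif_negative:.3f}, r=+0.40: {vif_positive:.3f}")
```

Both values are about 1.19. With three or more mutually correlated predictors the pattern of signs can matter, so this symmetry is exact only in the two-predictor case.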

Correlation with a binary variable

model = MCPower("outcome = treatment + age + severity")
model.set_simulations(400)
model.set_variable_type("treatment=binary")
model.set_effects("treatment=0.50, age=0.20, severity=0.30")
model.set_correlations("(age, severity)=0.40, (treatment, age)=0.10")

Correlations between continuous and binary variables work the same way. Here, age and severity are moderately correlated, and treatment assignment has a slight correlation with age (perhaps older patients were more likely to accept the treatment).

Combine correlations with scenario analysis

model.set_correlations("(income, education)=0.50")

model.find_sample_size(
    target_test="income",
    from_size=50,
    to_size=400,
    by=40,
    scenarios=True,
)

Effects: income=0.3, education=0.25, social_support=0.4
Correlations: 1 set
Model settings applied successfully

================================================================================
SCENARIO-BASED MONTE CARLO POWER ANALYSIS RESULTS
================================================================================

================================================================================
SCENARIO SUMMARY
================================================================================

Uncorrected Sample Sizes:
Test                                     Optimistic   Realistic    Doomer      
-------------------------------------------------------------------------------
income                                   130          250          370         
================================================================================

Scenario analysis adds correlation noise on top of your specified correlations, simulating the possibility that your correlation estimates are slightly off. This provides a more robust sample size estimate.

Correlations with uploaded data

If you have pilot data, you can let MCPower compute correlations directly from the data:

import pandas as pd

data = pd.read_csv("pilot_data.csv")

model = MCPower("health = income + education + social_support")
model.upload_data(data[["income", "education", "social_support"]])
# In "strict" mode (the default), correlations are preserved via bootstrapping
# In "partial" mode, correlations are computed and can be overridden

model.set_effects("income=0.30, education=0.25, social_support=0.40")
model.find_sample_size(target_test="all", from_size=50, to_size=400, by=10)

See Uploading Data for details on correlation preservation modes.

Check if correlations are causing problems

If power is unexpectedly low, check whether high correlations are the cause by running the same analysis without correlations and comparing the results. If the gap is large, consider whether you truly need both correlated predictors in the model, or whether one could be dropped.

Next Steps

Tutorial: Your First Power Analysis – The basics of find_power
Tutorial: Finding the Right Sample Size – Systematic sample size search
Tutorial: Testing Interactions – Interactions between correlated predictors
Correlations – Full reference on correlation specification, matrix format, and limitations
Uploading Data – Using empirical data with preserved correlations
Scenario Analysis – Correlation noise and robustness testing
API Reference – Full parameter documentation for set_correlations