# Tutorial: Preparing Your CSV for MCPower

## Goal

You have data in Excel, Google Sheets, SPSS, or another tool, and you want to export it as a CSV file that MCPower can read without errors.

This page covers:
1. How to **export a CSV** from common tools
2. What **format rules** your CSV must follow
3. How MCPower **auto-detects** variable types from your data

---

## Creating Your CSV File

### From Excel

1. Open your spreadsheet in Excel
2. Make sure your data starts in row 1 with column headers (no blank rows above)
3. Remove any merged cells, formulas, or special formatting
4. Go to **File > Save As**
5. Choose **CSV UTF-8 (Comma delimited) (*.csv)** as the file type
6. Click **Save**

> **Tip:** If you only see "CSV (Comma delimited)" without "UTF-8", that older format usually works too, but UTF-8 is safer for special characters (accents, non-English text).

### From Google Sheets

1. Open your spreadsheet in Google Sheets
2. Make sure your data starts in row 1 with column headers
3. Go to **File > Download > Comma Separated Values (.csv)**
4. The file is saved to your Downloads folder

Google Sheets exports as UTF-8 with comma separators by default — no extra settings needed.

### From SPSS

1. Open your dataset in SPSS
2. Go to **File > Save As**
3. Set the file type to **Comma Separated Values (*.csv)**
4. Check **"Write variable names to spreadsheet"** so the header row is included
5. Click **Save**

Alternatively, use SPSS syntax:
```
SAVE TRANSLATE
  /OUTFILE="pilot_data.csv"
  /TYPE=CSV
  /ENCODING="UTF8"
  /FIELDNAMES
  /REPLACE.
```

> **Note:** SPSS may encode missing values as `.` (dot). MCPower does not accept missing values — remove or impute them before exporting (see [Step 4](#4-no-missing-values-or-empty-cells) below).

---

## Full Working Example

A correctly formatted CSV file (`pilot_data.csv`):

```text
treatment,score,age,education
1,72.5,34,high_school
0,68.1,29,college
1,81.3,41,graduate
0,65.9,26,college
1,77.2,38,high_school
0,70.4,31,graduate
1,84.6,45,college
0,62.3,24,high_school
```

Loading and using it in MCPower:

```python
import pandas as pd
from mcpower import MCPower

data = pd.read_csv("pilot_data.csv")

model = MCPower("outcome = treatment + score + age + education")
model.upload_data(data[["treatment", "score", "age", "education"]])
# treatment: 2 unique values → binary
# score: 8 unique values → continuous
# age: 8 unique values → continuous
# education: 3 unique values ["college", "graduate", "high_school"] → factor
#   Reference: "college" (first alphabetically)
#   Dummies: education[graduate], education[high_school]

model.set_effects(
    "treatment=0.50, score=0.25, age=0.10, "
    "education[graduate]=0.40, education[high_school]=0.20"
)
model.find_power(sample_size=200)
```

---

## Step-by-Step Walkthrough

### 1. Use comma separators

MCPower expects standard comma-separated values. Do not use semicolons, tabs, or pipes.

**Correct:**
```text
group,score,age
A,85.2,34
B,72.1,28
```

**Wrong:**
```text
group;score;age
A;85.2;34
B;72.1;28
```

If your file uses a different delimiter, convert it first:

```python
# Read with semicolon delimiter, then save as comma-separated
data = pd.read_csv("data.csv", sep=";")
data.to_csv("data_fixed.csv", index=False)
```

### 2. Include a header row

The first row must contain column names. These names become the variable names in your formula.

**Correct:**
```text
treatment,score,age
1,72.5,34
0,68.1,29
```

**Wrong (no header):**
```text
1,72.5,34
0,68.1,29
```

### 3. Keep types consistent within each column

Every value in a column should be the same type -- all numbers or all strings. Do not mix.

**Correct:**
```text
group,score
control,72.5
treatment,81.3
control,68.1
```

**Wrong (mixed types in the group column):**
```text
group,score
control,72.5
1,81.3
treatment,68.1
```

### 4. No missing values or empty cells

MCPower does not handle missing data. Remove or impute incomplete rows before uploading.

**Correct:**
```text
treatment,score,age
1,72.5,34
0,68.1,29
1,81.3,41
```

**Wrong (missing values):**
```text
treatment,score,age
1,72.5,34
0,,29
1,81.3,
```

**Wrong (NA or NULL markers):**
```text
treatment,score,age
1,72.5,34
0,NA,29
1,81.3,NULL
```

Clean your data in pandas before uploading:

```python
data = pd.read_csv("data_with_missing.csv")
data = data.dropna(subset=["treatment", "score", "age"])  # drop rows with NAs
```

### 5. Save with UTF-8 encoding

Save your CSV as UTF-8. pandas tries `utf-8-sig` first (MCPower does not read CSV files directly), but UTF-8 is the safest choice.

In Excel: **File > Save As > CSV UTF-8 (Comma delimited)**

In R:
```r
write.csv(data, "pilot_data.csv", row.names = FALSE, fileEncoding = "UTF-8")
```

### 6. No trailing commas

Ensure rows do not end with extra commas. This is a common issue when exporting from spreadsheet software.

**Correct:**
```text
group,score,age
A,85.2,34
B,72.1,28
```

**Wrong (trailing commas create a phantom empty column):**
```text
group,score,age,
A,85.2,34,
B,72.1,28,
```

### 7. Use simple column names

Stick to letters, numbers, and underscores. Avoid spaces, special characters, and names that start with numbers.

**Good names:** `treatment`, `score`, `baseline_pain`, `group_A`, `age`

**Problematic names:** `treatment group`, `score (%)`, `1st_measure`, `age+1`

---

## How Auto-Detection Works

MCPower classifies each column based on the count of unique values:

| Unique Values | Detected Type | Example |
|---|---|---|
| 1 | Dropped with warning | A column that is all `1` |
| 2 | Binary | `treatment` with values `0, 1` |
| 3–6 | Factor | `education` with values `"high_school", "college", "graduate"` |
| 7+ | Continuous | `age` with values `24, 26, 29, 31, 34, 38, 41, 45` |

**String columns** with 2–20 unique values are auto-detected as factors. String columns with more than 20 unique values raise an error.

If auto-detection gets it wrong, override with `data_types`:

```python
# "rating" has 5 unique values, auto-detected as factor
# Override to continuous
model.upload_data(data, data_types={"rating": "continuous"})
```

---

## Quick Checklist

Before uploading your CSV, verify:

- [ ] Comma-separated (not semicolons or tabs)
- [ ] Header row present with column names
- [ ] Consistent types per column (all numbers or all strings)
- [ ] No missing values, empty cells, `NA`, or `NULL`
- [ ] UTF-8 encoding
- [ ] No trailing commas
- [ ] Column names use only letters, numbers, and underscores

---

## Common Pitfalls

| Problem | Symptom | Fix |
|---|---|---|
| Semicolons instead of commas | All data lands in one column | Re-export with comma separator |
| Missing header | First data row used as names | Add a header row |
| Mixed types in a column | Unexpected type detection | Clean column to one type |
| Empty cells or NA values | Error during upload | Drop or impute missing rows |
| Trailing commas | Phantom empty column detected | Remove trailing commas |
| BOM encoding marker | Column name has invisible prefix | Save as UTF-8 (not UTF-8 with BOM), or use `utf-8-sig` encoding |
| Too many unique strings (>20) | ValueError on upload | Override with `data_types={"col": "continuous"}` or reduce categories |

---

## Next Steps

- **[Tutorial: Power Analysis with Your Own Data](own-data.md)** — using your uploaded data in a full analysis
- **[`upload_data()` API Reference](../api/data.md)** — complete parameter documentation