
# Differential Privacy Tutorial

<a id="tutorial-differential-privacy" />

Learn how to apply differential privacy so that the privacy protection of your synthetic data is backed by a mathematical guarantee. This tutorial explores the privacy-utility tradeoff and demonstrates how to configure differential privacy parameters for your use case.

If you have not yet completed the [Safe Synthesizer 101](/documentation/synthesize-safe-data/tutorials/safe-synthesizer-101) tutorial, consider starting there first.

## Prerequisites

* Understanding of [differential privacy](/documentation/synthesize-safe-data/about/data-synthesis)
* Safe Synthesizer deployment with GPU resources

***

## What You'll Learn

* Understanding differential privacy concepts (epsilon, delta)
* Configuring privacy hyperparameters
* Analyzing privacy-utility tradeoffs
* Interpreting privacy metrics in evaluation reports

***

## Understanding Differential Privacy

Differential privacy (DP) provides a mathematical guarantee that bounds how much the synthetic data can reveal about any individual record in the training data.

**Key Concepts:**

* **Epsilon (ε)**: Privacy budget - lower values mean stronger privacy
  * ε = 1: Very strong privacy
  * ε = 6-10: Moderate privacy
  * ε > 10: Weak privacy
  * **Recommended starting range: ε ∈ \[8, 12]** - adjust downward based on privacy needs

* **Delta (δ)**: Upper bound on the probability that the ε guarantee does not hold
  * Typically set to 1/n^1.2 where n is dataset size
  * Use `"auto"` for automatic calculation (recommended)
  * Manual values typically between 1e-6 and 1e-4

* **Noise**: Random noise added during training to prevent memorization
  * Calibrated based on epsilon, delta, and gradient clipping threshold
  * Higher privacy (lower epsilon) requires more noise
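
To make the role of clipping and noise concrete, here is a minimal sketch of one noisy gradient aggregation step in the style of DP-SGD. It assumes the standard recipe (clip each per-sample gradient, sum, add Gaussian noise scaled by the clipping threshold, average). The function names and the toy gradients are hypothetical, and this is not a description of Safe Synthesizer's internal implementation.

```python
import math
import random


def clip_gradient(grad, max_norm):
    """Scale a per-sample gradient so its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grad))
    scale = min(1.0, max_norm / norm) if norm > 0 else 1.0
    return [g * scale for g in grad]


def private_mean_gradient(per_sample_grads, max_norm, noise_multiplier, rng):
    """Clip each gradient, sum them, add Gaussian noise, then average."""
    clipped = [clip_gradient(g, max_norm) for g in per_sample_grads]
    dim = len(clipped[0])
    summed = [sum(g[i] for g in clipped) for i in range(dim)]
    # The noise standard deviation is proportional to the clipping threshold.
    sigma = noise_multiplier * max_norm
    noisy = [s + rng.gauss(0.0, sigma) for s in summed]
    return [v / len(clipped) for v in noisy]


rng = random.Random(0)  # fixed seed so the sketch is repeatable
grads = [[3.0, 4.0], [0.3, 0.4], [-6.0, 8.0]]  # toy per-sample gradients
print(private_mean_gradient(grads, max_norm=1.0, noise_multiplier=1.1, rng=rng))
```

In this recipe, a lower epsilon translates into a larger `noise_multiplier`, which is why stronger privacy costs more utility.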

### Record-Level vs Group-Level Privacy

By default, NeMo Safe Synthesizer uses **record-level** differential privacy, which protects individual records. For datasets where multiple records belong to the same entity (e.g., a patient with multiple visits), you can use **group-level** privacy by setting `group_training_examples_by` to the column that identifies each entity. See [Group-Level Privacy](#group-level-privacy) in the Advanced Configuration section for a code example.

**When to use group-level privacy:**

* Multiple records per person/entity in your dataset
* Privacy guarantees should apply to entire entities, not individual records
* Examples: patient medical histories, customer transaction logs
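
A quick way to check the first criterion is to count records per entity before choosing a privacy level. The sketch below uses only the standard library. The helper name and the `patient_ids` sample are hypothetical, and in practice you would pass your own identifier column (for example `df["patient_id"].tolist()`).

```python
from collections import Counter


def summarize_entity_counts(entity_ids):
    """Return (num_entities, max_records_per_entity, share_with_repeats)."""
    counts = Counter(entity_ids)
    if not counts:
        return 0, 0, 0.0
    repeated = sum(1 for c in counts.values() if c > 1)
    return len(counts), max(counts.values()), repeated / len(counts)


# Hypothetical identifier column with repeated entities
patient_ids = ["p1", "p2", "p1", "p3", "p1", "p2"]
n_entities, max_records, repeat_share = summarize_entity_counts(patient_ids)
print(
    f"{n_entities} entities, up to {max_records} records each, "
    f"{repeat_share:.0%} with more than one record"
)
```

If a meaningful share of entities has more than one record, record-level privacy protects each row but not the entity as a whole, which is the case group-level privacy is meant for.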

***

## Setup

Install the NeMo Platform SDK with Safe Synthesizer support:

```shell
if command -v uv &> /dev/null; then
  uv pip install "nemo-platform[all]" kagglehub matplotlib
else
  pip install "nemo-platform[all]" kagglehub matplotlib
fi
```

```python
import os
import pandas as pd
from nemo_platform import NeMoPlatform
from nemo_safe_synthesizer_plugin.sdk.job_builder import SafeSynthesizerJobBuilder

# Configure client
client = NeMoPlatform(
    base_url=os.environ.get("NMP_BASE_URL", "http://localhost:8080"),
    workspace="default",
)
```

***

## Load and Prepare Data

```python
# Load sample dataset
import kagglehub  # type: ignore[import-not-found]

path = kagglehub.dataset_download("nicapotato/womens-ecommerce-clothing-reviews")
df = pd.read_csv(f"{path}/Womens Clothing E-Commerce Reviews.csv", index_col=0)

print(f"Dataset size: {len(df)} records")
print(f"Recommended delta: {1 / (len(df) ** 1.2):.2e}")
```
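
If you set delta manually, the 1/n^1.2 heuristic from the Key Concepts section can be wrapped in a small helper. This is a sketch: `recommended_delta` is a hypothetical name, not an SDK function, and `delta="auto"` remains the recommended option.

```python
def recommended_delta(n_records, exponent=1.2):
    """Heuristic delta = 1 / n**exponent; shrinks as the dataset grows."""
    if n_records <= 0:
        raise ValueError("n_records must be positive")
    return 1.0 / (n_records**exponent)


for n in (1_000, 10_000, 100_000):
    delta = recommended_delta(n)
    # A common sanity check is that delta stays below 1/n.
    print(f"n={n:>7}: delta={delta:.2e} (1/n={1 / n:.2e})")
```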

***

## Experiment 1: No Differential Privacy (Baseline)

First, create a baseline without differential privacy:

```python
import time

print("🔬 Experiment 1: No Differential Privacy (Baseline)")

builder_baseline = (
    SafeSynthesizerJobBuilder(client)
    .with_data_source(df)
    .with_replace_pii()
    .synthesize()
)

# Create a project for our jobs (creates if it doesn't exist)
project_name = "test-project"
try:
    client.projects.create(workspace="default", name=project_name)
except Exception:
    pass  # Project may already exist

job_baseline = builder_baseline.create_job(
    name=f"dp-baseline-{int(time.time())}", project=project_name
)
print(f"✅ Baseline job created: {job_baseline.job_name}")

job_baseline.wait_for_completion()
summary_baseline = job_baseline.fetch_summary()

print(f"\n📊 Baseline Results:")
print(f" SQS (Quality): {summary_baseline.synthetic_data_quality_score}")
print(f" DPS (Privacy): {summary_baseline.data_privacy_score}")
```

***

## Experiment 2: Moderate Privacy (ε=6)

Apply moderate differential privacy:

```python
print("\n🔬 Experiment 2: Moderate Privacy (ε=6)")

builder_moderate = (
    SafeSynthesizerJobBuilder(client)
    .with_data_source(df)
    .with_replace_pii()
    .with_differential_privacy(epsilon=6.0, delta=1e-5)
    .synthesize()
)

job_moderate = builder_moderate.create_job(
    name=f"dp-moderate-{int(time.time())}", project="test-project"
)
print(f"✅ Moderate privacy job created: {job_moderate.job_name}")

job_moderate.wait_for_completion()
summary_moderate = job_moderate.fetch_summary()

print(f"\n📊 Moderate Privacy Results:")
print(f" SQS (Quality): {summary_moderate.synthetic_data_quality_score}")
print(f" DPS (Privacy): {summary_moderate.data_privacy_score}")
```

***

## Experiment 3: Strong Privacy (ε=1)

Apply strong differential privacy:

```python
print("\n🔬 Experiment 3: Strong Privacy (ε=1)")

builder_strong = (
    SafeSynthesizerJobBuilder(client)
    .with_data_source(df)
    .with_replace_pii()
    .with_differential_privacy(epsilon=1.0, delta=1e-5)
    .synthesize()
)

job_strong = builder_strong.create_job(
    name=f"dp-strong-{int(time.time())}", project="test-project"
)
print(f"✅ Strong privacy job created: {job_strong.job_name}")

job_strong.wait_for_completion()
summary_strong = job_strong.fetch_summary()

print(f"\n📊 Strong Privacy Results:")
print(f" SQS (Quality): {summary_strong.synthetic_data_quality_score}")
print(f" DPS (Privacy): {summary_strong.data_privacy_score}")
```

***

## Compare Results

Visualize the privacy-utility tradeoff:

```python
import matplotlib.pyplot as plt

experiments = ["Baseline\n(No DP)", "Moderate\n(ε=6)", "Strong\n(ε=1)"]
sqs_scores = [
    summary_baseline.synthetic_data_quality_score,
    summary_moderate.synthetic_data_quality_score,
    summary_strong.synthetic_data_quality_score,
]
dps_scores = [
    summary_baseline.data_privacy_score,
    summary_moderate.data_privacy_score,
    summary_strong.data_privacy_score,
]


def _safe_scores(scores):
    """Replace None values with 0 so matplotlib and format strings don't error."""
    return [s if s is not None else 0 for s in scores]


safe_sqs = _safe_scores(sqs_scores)
safe_dps = _safe_scores(dps_scores)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 4))

# SQS comparison
ax1.bar(experiments, safe_sqs, color=["blue", "green", "red"], alpha=0.7)
ax1.set_ylabel("Score")
ax1.set_title("Synthetic Quality Score (SQS)")
ax1.set_ylim([0, 100])
ax1.axhline(y=70, color="gray", linestyle="--", label="Good threshold")
ax1.legend()

# DPS comparison
ax2.bar(experiments, safe_dps, color=["blue", "green", "red"], alpha=0.7)
ax2.set_ylabel("Score")
ax2.set_title("Data Privacy Score (DPS)")
ax2.set_ylim([0, 100])
ax2.axhline(y=70, color="gray", linestyle="--", label="Good threshold")
ax2.legend()

plt.tight_layout()
plt.show()

print("\n📈 Privacy-Utility Tradeoff Summary:")
print(f"{'Experiment':<20} {'SQS (Utility)':<15} {'DPS (Privacy)':<15}")
print("-" * 50)
for i, exp in enumerate(experiments):
    label = exp.replace("\n", " ")  # Labels contain a newline for the plot axis
    sqs_val = f"{sqs_scores[i]:<15.1f}" if sqs_scores[i] is not None else f"{'N/A':<15}"
    dps_val = f"{dps_scores[i]:<15.1f}" if dps_scores[i] is not None else f"{'N/A':<15}"
    print(f"{label:<20} {sqs_val} {dps_val}")
```
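
To put a number on the tradeoff, you can express each run's score as a percent change from the baseline. The sketch below is standalone. `relative_change` is a hypothetical helper, and the scores in `example_sqs` are placeholders for illustration only, not measured results.

```python
def relative_change(baseline, value):
    """Percent change of value relative to baseline, or None if unavailable."""
    if baseline is None or value is None or baseline == 0:
        return None
    return 100.0 * (value - baseline) / baseline


# Placeholder scores for illustration only; substitute the values from your runs.
example_sqs = {"baseline": 80.0, "moderate": 72.0, "strong": None}
for name, score in example_sqs.items():
    change = relative_change(example_sqs["baseline"], score)
    label = "N/A" if change is None else f"{change:+.1f}%"
    print(f"{name:<10} {label}")
```

With your own results, pass `summary_baseline.synthetic_data_quality_score` as the baseline and each DP run's score as the value.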

***

## Advanced Configuration

### Custom Privacy Budget

Configure differential privacy with custom parameters:

```python
from nemo_safe_synthesizer_plugin.sdk.config import DifferentialPrivacyHyperparams

# Create custom privacy configuration
privacy_config = DifferentialPrivacyHyperparams(
    dp_enabled=True,
    epsilon=3.0,
    delta=1e-5,
    per_sample_max_grad_norm=1.0,  # Gradient clipping threshold
)

# Use with SafeSynthesizerJobBuilder
builder_custom = (
    SafeSynthesizerJobBuilder(client)
    .with_data_source(df)
    .with_replace_pii()
    .with_differential_privacy(config=privacy_config)
    .synthesize()
)
```

### Group-Level Privacy

For datasets where multiple records belong to the same entity (e.g., a patient with multiple visits), group-level privacy protects entire entities rather than individual records:

```python
# Group-level privacy for multi-record entities
builder_grouped = (
    SafeSynthesizerJobBuilder(client)
    .with_data_source(df)
    .with_train(
        group_training_examples_by="patient_id"  # Group records by patient
    )
    .with_differential_privacy(epsilon=8.0)
    .synthesize()
)
```

### Privacy Budget Composition

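When you release the outputs of multiple DP runs trained on the same original data, the privacy budget adds up across releases:

```python
# Basic sequential composition: the epsilons of released DP runs add up.
# The non-DP baseline carries no DP guarantee, so it has no finite epsilon
# to add here and should not be released alongside the DP outputs.
total_epsilon = 0.0
total_epsilon += 6.0  # Moderate privacy
total_epsilon += 1.0  # Strong privacy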

print(f"\n🔐 Total Privacy Budget Consumed: ε = {total_epsilon}")
print("Note: Each additional release compounds the privacy budget")
print("Best practice: Only release one synthetic dataset per original dataset")
```

***

## Interpreting Privacy Metrics

### Membership Inference Attack (MIA)

Measures if an attacker can determine whether a record was in training data:

```python
# Fetch detailed evaluation reports
baseline_report = job_baseline.fetch_summary()
moderate_report = job_moderate.fetch_summary()

print("\n🛡️ Membership Inference Protection:")
print(f"Baseline: {baseline_report.membership_inference_protection_score}")
print(f"Moderate (ε=6): {moderate_report.membership_inference_protection_score}")

print("\nInterpretation:")
print("- Higher score = Better protection")
print("- Score > 0.5 means attacker cannot reliably identify training records")
```

### Attribute Inference Attack (AIA)

Measures if sensitive attributes can be inferred from other attributes:

```python
print("\n🔍 Attribute Inference Protection:")
print(f"Baseline: {baseline_report.attribute_inference_protection_score}")
print(f"Moderate (ε=6): {moderate_report.attribute_inference_protection_score}")

print("\nInterpretation:")
print("- Higher score = Better protection")
print("- Measures difficulty of inferring sensitive values from known attributes")
```

***

## Best Practices

### Data Size Requirements

Differential privacy works best with larger datasets:

```python
def check_data_requirements(dataset_size):
    """Check if dataset size is suitable for DP."""
    print(f"📏 Dataset Size Analysis: {dataset_size} records")

    if dataset_size >= 10000:
        print("✅ Excellent - Dataset size is ideal for DP")
        print(" Expected: Good quality with ε ∈ [8, 12]")
    elif dataset_size >= 5000:
        print("⚠️ Moderate - Dataset may work with DP")
        print(" Recommendation: Start with higher epsilon (ε=10-12)")
    else:
        print("❌ Small - DP may significantly reduce quality")
        print(" Consider: Collecting more data or training without DP")

    print(f"\n Recommended delta: {1 / (dataset_size ** 1.2):.2e}")


check_data_requirements(len(df))
```

**Guidelines:**

* **10,000+ records**: Ideal for differential privacy
* **5,000-10,000 records**: May work, use higher epsilon
* **\< 5,000 records**: Consider quality trade-offs carefully

### Choosing Epsilon

```python
def recommend_epsilon(dataset_size, sensitivity):
    """
    Recommend epsilon based on dataset characteristics.

    Args:
        dataset_size: Number of records
        sensitivity: 'high' for medical/financial, 'medium' for general, 'low' for public
    """
    recommendations = {"high": (1.0, 3.0), "medium": (3.0, 6.0), "low": (6.0, 10.0)}

    epsilon_range = recommendations[sensitivity]
    delta = 1 / (dataset_size**1.2)

    print(f"📋 Recommendations for {dataset_size} records, {sensitivity} sensitivity:")
    print(f" Epsilon range: {epsilon_range[0]} - {epsilon_range[1]}")
    print(f" Delta: {delta:.2e}")
    print(f" Stronger privacy: Use lower epsilon within range")
    print(f" Better utility: Use higher epsilon within range")
    print(f"\n Starting point: ε = {(epsilon_range[0] + epsilon_range[1]) / 2:.1f}")


recommend_epsilon(len(df), "medium")
```

**Explicit Epsilon Guidance:**

* Start at **ε ∈ \[8, 12]** for most use cases
* Reduce epsilon gradually if stronger privacy is required
* Monitor SQS scores to understand quality impact
* Delta calculation: use `"auto"` or `1/n^1.2` where n is dataset size
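
Once you have run a few epsilon values, the "reduce gradually while monitoring SQS" guidance can be turned into a simple selection rule. The sketch below picks the smallest epsilon whose SQS still meets a quality threshold. The helper name is hypothetical and the `sweep` values are placeholders, not measured results.

```python
def lowest_acceptable_epsilon(results, min_sqs):
    """Return the smallest epsilon whose SQS meets min_sqs, or None.

    results: iterable of (epsilon, sqs) pairs; sqs may be None for failed runs.
    """
    acceptable = [eps for eps, sqs in results if sqs is not None and sqs >= min_sqs]
    return min(acceptable) if acceptable else None


# Placeholder (epsilon, SQS) pairs for illustration, not measured results.
sweep = [(12.0, 78.0), (10.0, 75.0), (8.0, 71.0), (6.0, 64.0), (4.0, None)]
print(lowest_acceptable_epsilon(sweep, min_sqs=70))  # prints 8.0
```

Keep in mind that only the dataset you actually release should count as your final output; the exploratory runs are for choosing the setting.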

### Training Optimization

Differential privacy training requires special considerations:

```python
# Optimal DP training configuration
builder_optimized = (
    SafeSynthesizerJobBuilder(client)
    .with_data_source(df)
    .with_train(
        batch_size=256,  # Larger batch sizes benefit DP
        num_epochs=10,  # May need more epochs for convergence
    )
    .with_differential_privacy(epsilon=8.0, delta="auto", per_sample_max_grad_norm=1.0)
    .synthesize()
)
```

**Training Tips:**

1. **Use larger batch sizes** - DP benefits from larger batches (reduces noise variance)
   * Default batch size may be too small for optimal DP training
   * Try batch\_size=256 or 512 if GPU memory allows
   * If memory errors occur, reduce batch size gradually

2. **Monitor convergence** - DP training may converge differently
   * Watch training and validation loss
   * May require more epochs than non-DP training
   * Lower learning rate if training is unstable

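3. **Adjust gradient clipping** - Bounds each example's influence on a training step
   * `per_sample_max_grad_norm=1.0` is a good default
   * Lower values (0.5) = stronger clipping and proportionally less added noise, but more gradient signal is discarded, potentially lowering quality
   * Higher values (1.5) = less clipping but proportionally more added noise, which may or may not improve quality
   * In standard DP training, the (ε, δ) guarantee is determined by `epsilon` and `delta`; the clipping threshold changes how the noise is scaled, not the guarantee itself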
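
The batch-size tip can be illustrated with a little arithmetic. Assuming the standard DP-SGD recipe, Gaussian noise with standard deviation `noise_multiplier * clip_norm` is added to the summed gradient, so after averaging the per-coordinate noise shrinks with the batch size. The helper below is a hypothetical sketch that holds the noise multiplier fixed, which is a simplification: at a fixed epsilon, the required noise multiplier can itself change with batch size.

```python
def averaged_noise_std(noise_multiplier, clip_norm, batch_size):
    """Std of the Gaussian noise per coordinate after averaging over the batch."""
    return noise_multiplier * clip_norm / batch_size


for batch_size in (32, 256, 512):
    std = averaged_noise_std(noise_multiplier=1.0, clip_norm=1.0, batch_size=batch_size)
    print(f"batch_size={batch_size:>3}: noise std per coordinate = {std:.5f}")
```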

### Privacy Budget Management

1. **Single Release**: Only release one synthetic dataset per original dataset
2. **Composition**: If multiple releases needed, divide privacy budget accordingly
3. **Documentation**: Track all data releases and cumulative privacy budget
4. **Renewal**: Privacy budget doesn't reset - consider this in data lifecycle
5. **Testing**: Test with higher epsilon before final release with lower epsilon
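
One lightweight way to follow the composition and documentation points is to keep a ledger of releases. The sketch below assumes basic sequential composition, where the epsilons of separate releases add up. `PrivacyBudgetLedger` is a hypothetical class, not part of the SDK, and it tracks epsilon only (delta values also add up under basic composition).

```python
class PrivacyBudgetLedger:
    """Track cumulative epsilon under basic sequential composition."""

    def __init__(self, total_epsilon):
        self.total_epsilon = total_epsilon
        self.releases = []  # list of (name, epsilon) tuples

    @property
    def spent(self):
        return sum(eps for _, eps in self.releases)

    @property
    def remaining(self):
        return self.total_epsilon - self.spent

    def record_release(self, name, epsilon):
        """Record a release, refusing any that would exceed the total budget."""
        if epsilon <= 0:
            raise ValueError("epsilon must be positive")
        if epsilon > self.remaining:
            raise ValueError(
                f"release '{name}' needs epsilon={epsilon}, only {self.remaining} left"
            )
        self.releases.append((name, epsilon))


ledger = PrivacyBudgetLedger(total_epsilon=8.0)
ledger.record_release("dp-moderate", 6.0)
ledger.record_release("dp-strong", 1.0)
print(f"spent ε={ledger.spent}, remaining ε={ledger.remaining}")
```

Persisting the ledger alongside the original dataset keeps the cumulative budget visible across the data lifecycle.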

***

## Troubleshooting

### Low SQS with DP Enabled

If synthetic quality drops significantly:

```python
# Try these approaches:
# 1. Increase epsilon (reduce privacy slightly)
# 2. Increase training data size
# 3. Increase training epochs
# 4. Adjust per_sample_max_grad_norm for gradient clipping

builder_improved = (
    SafeSynthesizerJobBuilder(client)
    .with_data_source(df)
    .with_train(
        num_epochs=10  # More training
    )
    .with_replace_pii()
    .with_differential_privacy(
        epsilon=6.0,  # Slightly higher
        delta=1e-5,
        per_sample_max_grad_norm=1.5,  # Less aggressive clipping
    )
    .synthesize()
)
```