Differential Privacy Tutorial

Learn how to apply differential privacy to generate synthetic data with mathematically provable privacy guarantees. This tutorial explores the privacy-utility tradeoff and shows how to configure differential privacy parameters for your use case.

If you have not yet completed the Safe Synthesizer 101 tutorial, consider starting there first.

Prerequisites


What You’ll Learn

  • Understanding differential privacy concepts (epsilon, delta)
  • Configuring privacy hyperparameters
  • Analyzing privacy-utility tradeoffs
  • Interpreting privacy metrics in evaluation reports

Understanding Differential Privacy

Differential privacy (DP) provides a mathematical guarantee that bounds how much any single training record can influence the synthetic data, which limits what the output can reveal about individual records.

Key Concepts:

  • Epsilon (ε): Privacy budget - lower values mean stronger privacy

    • ε = 1: Very strong privacy
    • ε = 6-10: Moderate privacy
    • ε > 10: Weak privacy
    • Recommended starting range: ε ∈ [8, 12] - adjust downward based on privacy needs
  • Delta (δ): Small probability with which the ε guarantee is allowed to fail

    • Typically set to 1/n^1.2 where n is dataset size
    • Use "auto" for automatic calculation (recommended)
    • Manual values typically between 1e-6 and 1e-4
  • Noise: Random noise added during training to prevent memorization

    • Calibrated based on epsilon, delta, and gradient clipping threshold
    • Higher privacy (lower epsilon) requires more noise
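
To make the two parameters concrete: an (ε, δ)-DP guarantee says that adding or removing one record can scale the probability of any output by at most a factor of e^ε, plus an additive slack of δ. The sketch below prints that factor for the ε values listed above and evaluates the 1/n^1.2 rule of thumb. Both function names are hypothetical helpers for illustration; in particular, `rule_of_thumb_delta` is not the SDK's "auto" implementation, which may compute delta differently.

```python
import math


def rule_of_thumb_delta(n: int) -> float:
    """Delta from the 1/n^1.2 rule of thumb (hypothetical helper, not an SDK call)."""
    if n <= 0:
        raise ValueError("n must be a positive record count")
    return 1.0 / (n ** 1.2)


def max_probability_ratio(epsilon: float) -> float:
    """Multiplicative bound e^epsilon (the additive delta term is ignored here)."""
    return math.exp(epsilon)


for eps in (1.0, 6.0, 10.0):
    ratio = max_probability_ratio(eps)
    print(f"epsilon={eps:>4}: output probabilities may differ by up to x{ratio:,.1f}")

# Example dataset size chosen for illustration only.
print(f"delta for n=20,000: {rule_of_thumb_delta(20_000):.2e}")
```

The factor grows exponentially with ε, which is why small changes at the low end of the range matter more than the raw numbers suggest.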

Record-Level vs Group-Level Privacy

By default, NeMo Safe Synthesizer uses record-level differential privacy, which protects individual records. For datasets where multiple records belong to the same entity (e.g., a patient with multiple visits), you can use group-level privacy by setting group_training_examples_by to the column that identifies each entity. See Group-Level Privacy in the Advanced Configuration section for a code example.

When to use group-level privacy:

  • Multiple records per person/entity in your dataset
  • Privacy guarantees should apply to entire entities, not individual records
  • Examples: patient medical histories, customer transaction logs
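
A quick way to check the first criterion is to count rows per entity before configuring the job. This is a minimal pandas sketch on toy data; `records_per_entity` and the `patient_id` / `diagnosis` columns are illustrative names, not part of the SDK.

```python
import pandas as pd


def records_per_entity(df: pd.DataFrame, entity_column: str) -> pd.Series:
    """Count how many rows each entity contributes."""
    return df.groupby(entity_column).size()


# Toy data: patient "a" has three visits, "b" has one, "c" has two.
visits = pd.DataFrame(
    {
        "patient_id": ["a", "a", "a", "b", "c", "c"],
        "diagnosis": ["flu", "cold", "flu", "asthma", "flu", "cold"],
    }
)

counts = records_per_entity(visits, "patient_id")
needs_group_level = bool((counts > 1).any())

print(f"Entities: {len(counts)}, max records per entity: {counts.max()}")
print(f"Consider group-level privacy: {needs_group_level}")
```

If any entity contributes more than one row, record-level DP protects each row separately, so the entity as a whole gets a weaker guarantee than the configured ε suggests.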

Setup

Install the NeMo Platform SDK with Safe Synthesizer support:

if command -v uv &> /dev/null; then
    uv pip install "nemo-platform[all]" kagglehub matplotlib
else
    pip install "nemo-platform[all]" kagglehub matplotlib
fi

import os
import pandas as pd
from nemo_platform import NeMoPlatform
from nemo_safe_synthesizer_plugin.sdk.job_builder import SafeSynthesizerJobBuilder

# Configure client
client = NeMoPlatform(
    base_url=os.environ.get("NMP_BASE_URL", "http://localhost:8080"),
    workspace="default",
)

Load and Prepare Data

# Load sample dataset
import kagglehub  # type: ignore[import-not-found]

path = kagglehub.dataset_download("nicapotato/womens-ecommerce-clothing-reviews")
df = pd.read_csv(f"{path}/Womens Clothing E-Commerce Reviews.csv", index_col=0)

print(f"Dataset size: {len(df)} records")
# Rule of thumb from the Key Concepts section: delta = 1/n^1.2
print(f"Recommended delta: {1 / (len(df) ** 1.2):.2e}")

Experiment 1: No Differential Privacy (Baseline)

First, create a baseline without differential privacy:

import time

print("🔬 Experiment 1: No Differential Privacy (Baseline)")

builder_baseline = (
    SafeSynthesizerJobBuilder(client)
    .with_data_source(df)
    .with_replace_pii()
    .synthesize()
)

# Create a project for our jobs (creates if it doesn't exist)
project_name = "test-project"
try:
    client.projects.create(workspace="default", name=project_name)
except Exception:
    pass  # Project may already exist

job_baseline = builder_baseline.create_job(
    name=f"dp-baseline-{int(time.time())}", project=project_name
)
print(f"✅ Baseline job created: {job_baseline.job_name}")

job_baseline.wait_for_completion()
summary_baseline = job_baseline.fetch_summary()

print("\n📊 Baseline Results:")
print(f" SQS (Quality): {summary_baseline.synthetic_data_quality_score}")
print(f" DPS (Privacy): {summary_baseline.data_privacy_score}")

Experiment 2: Moderate Privacy (ε=6)

Apply moderate differential privacy:

print("\n🔬 Experiment 2: Moderate Privacy (ε=6)")

builder_moderate = (
    SafeSynthesizerJobBuilder(client)
    .with_data_source(df)
    .with_replace_pii()
    .with_differential_privacy(epsilon=6.0, delta=1e-5)
    .synthesize()
)

job_moderate = builder_moderate.create_job(
    name=f"dp-moderate-{int(time.time())}", project="test-project"
)
print(f"✅ Moderate privacy job created: {job_moderate.job_name}")

job_moderate.wait_for_completion()
summary_moderate = job_moderate.fetch_summary()

print("\n📊 Moderate Privacy Results:")
print(f" SQS (Quality): {summary_moderate.synthetic_data_quality_score}")
print(f" DPS (Privacy): {summary_moderate.data_privacy_score}")

Experiment 3: Strong Privacy (ε=1)

Apply strong differential privacy:

print("\n🔬 Experiment 3: Strong Privacy (ε=1)")

builder_strong = (
    SafeSynthesizerJobBuilder(client)
    .with_data_source(df)
    .with_replace_pii()
    .with_differential_privacy(epsilon=1.0, delta=1e-5)
    .synthesize()
)

job_strong = builder_strong.create_job(
    name=f"dp-strong-{int(time.time())}", project="test-project"
)
print(f"✅ Strong privacy job created: {job_strong.job_name}")

job_strong.wait_for_completion()
summary_strong = job_strong.fetch_summary()

print("\n📊 Strong Privacy Results:")
print(f" SQS (Quality): {summary_strong.synthetic_data_quality_score}")
print(f" DPS (Privacy): {summary_strong.data_privacy_score}")

Compare Results

Visualize the privacy-utility tradeoff:

import matplotlib.pyplot as plt

experiments = ["Baseline\n(No DP)", "Moderate\n(ε=6)", "Strong\n(ε=1)"]
sqs_scores = [
    summary_baseline.synthetic_data_quality_score,
    summary_moderate.synthetic_data_quality_score,
    summary_strong.synthetic_data_quality_score,
]
dps_scores = [
    summary_baseline.data_privacy_score,
    summary_moderate.data_privacy_score,
    summary_strong.data_privacy_score,
]


def _safe_scores(scores):
    """Replace None values with 0 so matplotlib and format strings don't error."""
    return [s if s is not None else 0 for s in scores]


safe_sqs = _safe_scores(sqs_scores)
safe_dps = _safe_scores(dps_scores)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 4))

# SQS comparison
ax1.bar(experiments, safe_sqs, color=["blue", "green", "red"], alpha=0.7)
ax1.set_ylabel("Score")
ax1.set_title("Synthetic Quality Score (SQS)")
ax1.set_ylim([0, 100])
ax1.axhline(y=70, color="gray", linestyle="--", label="Good threshold")
ax1.legend()

# DPS comparison
ax2.bar(experiments, safe_dps, color=["blue", "green", "red"], alpha=0.7)
ax2.set_ylabel("Score")
ax2.set_title("Data Privacy Score (DPS)")
ax2.set_ylim([0, 100])
ax2.axhline(y=70, color="gray", linestyle="--", label="Good threshold")
ax2.legend()

plt.tight_layout()
plt.show()

print("\n📈 Privacy-Utility Tradeoff Summary:")
print(f"{'Experiment':<20} {'SQS (Utility)':<15} {'DPS (Privacy)':<15}")
print("-" * 50)
for exp, sqs, dps in zip(experiments, sqs_scores, dps_scores):
    # Plot labels contain a newline; flatten it so each table row stays on one line
    label = exp.replace("\n", " ")
    sqs_val = f"{sqs:<15.1f}" if sqs is not None else f"{'N/A':<15}"
    dps_val = f"{dps:<15.1f}" if dps is not None else f"{'N/A':<15}"
    print(f"{label:<20} {sqs_val} {dps_val}")

Advanced Configuration

Custom Privacy Budget

Configure differential privacy with custom parameters:

from nemo_safe_synthesizer_plugin.sdk.config import DifferentialPrivacyHyperparams

# Create custom privacy configuration
privacy_config = DifferentialPrivacyHyperparams(
    dp_enabled=True,
    epsilon=3.0,
    delta=1e-5,
    per_sample_max_grad_norm=1.0,  # Gradient clipping threshold
)

# Use with SafeSynthesizerJobBuilder
builder_custom = (
    SafeSynthesizerJobBuilder(client)
    .with_data_source(df)
    .with_replace_pii()
    .with_differential_privacy(config=privacy_config)
    .synthesize()
)

Group-Level Privacy

For datasets where multiple records belong to the same entity (e.g., a patient with multiple visits), group-level privacy protects entire entities rather than individual records:

# Group-level privacy for multi-record entities
builder_grouped = (
    SafeSynthesizerJobBuilder(client)
    .with_data_source(df)
    .with_train(
        group_training_examples_by="patient_id"  # Group records by patient
    )
    .with_differential_privacy(epsilon=8.0)
    .synthesize()
)

Privacy Budget Composition

When you release the outputs of multiple DP runs trained on the same data, their privacy budgets add up under basic sequential composition:

# Total privacy budget across the DP experiments.
# The no-DP baseline has no finite epsilon, so it is left out of the sum;
# releasing it alongside the DP outputs would void the overall guarantee.
total_epsilon = 0.0
total_epsilon += 6.0  # Moderate privacy
total_epsilon += 1.0  # Strong privacy

print(f"\n🔐 Total Privacy Budget Consumed: ε = {total_epsilon}")
print("Note: Each additional release compounds the privacy budget")
print("Best practice: Only release one synthetic dataset per original dataset")
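
The additive rule can be wrapped in a small helper that also handles a release trained without DP. This is a sketch under the assumption of basic sequential composition, which is a simple upper bound; tighter accounting methods exist and may give a smaller total. `composed_epsilon` is a hypothetical name, not an SDK function.

```python
import math
from typing import Optional, Sequence


def composed_epsilon(release_epsilons: Sequence[Optional[float]]) -> float:
    """Total epsilon under basic sequential composition (epsilons add).

    None marks a release trained without DP, which has no finite epsilon.
    """
    total = 0.0
    for eps in release_epsilons:
        if eps is None:
            return math.inf  # One non-DP release removes the overall guarantee
        if eps < 0:
            raise ValueError("epsilon must be non-negative")
        total += eps
    return total


print(composed_epsilon([6.0, 1.0]))        # prints 7.0
print(composed_epsilon([None, 6.0, 1.0]))  # prints inf
```

The second call shows why the baseline in this tutorial should be treated as an internal comparison point only.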

Interpreting Privacy Metrics

Membership Inference Attack (MIA)

Measures if an attacker can determine whether a record was in training data:

# Fetch detailed evaluation reports
baseline_report = job_baseline.fetch_summary()
moderate_report = job_moderate.fetch_summary()

print("\n🛡️ Membership Inference Protection:")
print(f"Baseline: {baseline_report.membership_inference_protection_score}")
print(f"Moderate (ε=6): {moderate_report.membership_inference_protection_score}")

print("\nInterpretation:")
print("- Higher score = Better protection")
print("- Score > 0.5 means attacker cannot reliably identify training records")

Attribute Inference Attack (AIA)

Measures if sensitive attributes can be inferred from other attributes:

print("\n🔍 Attribute Inference Protection:")
print(f"Baseline: {baseline_report.attribute_inference_protection_score}")
print(f"Moderate (ε=6): {moderate_report.attribute_inference_protection_score}")

print("\nInterpretation:")
print("- Higher score = Better protection")
print("- Measures difficulty of inferring sensitive values from known attributes")

Best Practices

Data Size Requirements

Differential privacy works best with larger datasets:

def check_data_requirements(dataset_size):
    """Check if dataset size is suitable for DP."""
    print(f"📏 Dataset Size Analysis: {dataset_size} records")

    if dataset_size >= 10000:
        print("✅ Excellent - Dataset size is ideal for DP")
        print(" Expected: Good quality with ε ∈ [8, 12]")
    elif dataset_size >= 5000:
        print("⚠️ Moderate - Dataset may work with DP")
        print(" Recommendation: Start with higher epsilon (ε=10-12)")
    else:
        print("❌ Small - DP may significantly reduce quality")
        print(" Consider: Collecting more data or training without DP")

    print(f"\n Recommended delta: {1 / (dataset_size ** 1.2):.2e}")


check_data_requirements(len(df))

Guidelines:

  • 10,000+ records: Ideal for differential privacy
  • 5,000-10,000 records: May work, use higher epsilon
  • < 5,000 records: Consider quality trade-offs carefully

Choosing Epsilon

def recommend_epsilon(dataset_size, sensitivity):
    """
    Recommend epsilon based on dataset characteristics.

    Args:
        dataset_size: Number of records
        sensitivity: 'high' for medical/financial, 'medium' for general, 'low' for public
    """
    recommendations = {"high": (1.0, 3.0), "medium": (3.0, 6.0), "low": (6.0, 10.0)}

    epsilon_range = recommendations[sensitivity]
    delta = 1 / (dataset_size**1.2)

    print(f"📋 Recommendations for {dataset_size} records, {sensitivity} sensitivity:")
    print(f" Epsilon range: {epsilon_range[0]} - {epsilon_range[1]}")
    print(f" Delta: {delta:.2e}")
    print(" Stronger privacy: Use lower epsilon within range")
    print(" Better utility: Use higher epsilon within range")
    print(f"\n Starting point: ε = {(epsilon_range[0] + epsilon_range[1]) / 2:.1f}")


recommend_epsilon(len(df), "medium")

Explicit Epsilon Guidance:

  • Start at ε ∈ [8, 12] for most use cases
  • Reduce epsilon gradually if stronger privacy is required
  • Monitor SQS scores to understand quality impact
  • Delta calculation: use "auto" or 1/n^1.2 where n is dataset size
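
One way to act on "reduce epsilon gradually while monitoring SQS" is to record the SQS from each run and pick the lowest ε that still clears a quality bar. The helper below is a hypothetical sketch, and the scores in `example_scores` are made-up placeholders for illustration, not measured results; in practice they would come from `fetch_summary()` on each job.

```python
from typing import Dict, Optional


def smallest_acceptable_epsilon(
    sqs_by_epsilon: Dict[float, Optional[float]], min_sqs: float
) -> Optional[float]:
    """Return the lowest epsilon whose SQS meets the quality bar, or None if none does."""
    acceptable = [
        eps
        for eps, sqs in sqs_by_epsilon.items()
        if sqs is not None and sqs >= min_sqs
    ]
    return min(acceptable) if acceptable else None


# Placeholder values only; replace with SQS scores from your own runs.
example_scores = {12.0: 82.0, 10.0: 80.0, 8.0: 76.0, 6.0: 68.0}

print(smallest_acceptable_epsilon(example_scores, min_sqs=70.0))  # prints 8.0
```

Keep in mind that the runs in such a sweep are for tuning; if more than one of the resulting datasets is released, their budgets compose.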

Training Optimization

Differential privacy training requires special considerations:

# Optimal DP training configuration
builder_optimized = (
    SafeSynthesizerJobBuilder(client)
    .with_data_source(df)
    .with_train(
        batch_size=256,  # Larger batch sizes benefit DP
        num_epochs=10,  # May need more epochs for convergence
    )
    .with_differential_privacy(epsilon=8.0, delta="auto", per_sample_max_grad_norm=1.0)
    .synthesize()
)

Training Tips:

  1. Use larger batch sizes - DP benefits from larger batches (reduces noise variance)

    • Default batch size may be too small for optimal DP training
    • Try batch_size=256 or 512 if GPU memory allows
    • If memory errors occur, reduce batch size gradually
  2. Monitor convergence - DP training may converge differently

    • Watch training and validation loss
    • May require more epochs than non-DP training
    • Lower learning rate if training is unstable
  3. Adjust gradient clipping - Controls sensitivity bound

    • per_sample_max_grad_norm=1.0 is a good default
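    • Lower values (0.5) = stronger clipping and proportionally less added noise, but more gradient distortion; potentially lower quality
    • Higher values (1.5) = less clipping but proportionally more added noise; potentially better quality
    • With ε fixed, the threshold shifts the quality tradeoff rather than the formal privacy guarantee, because the noise scales with it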
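
To show how batch size and the clipping threshold interact, here is a minimal NumPy sketch of the gradient aggregation step in generic DP-SGD: clip each per-sample gradient, sum, add Gaussian noise scaled to the threshold, then average. It illustrates the general technique only and is not confirmed to match Safe Synthesizer's internals; `dp_sgd_noisy_gradient` and `noise_multiplier` are assumed names for this example.

```python
import numpy as np


def dp_sgd_noisy_gradient(per_sample_grads, max_grad_norm, noise_multiplier, rng):
    """Clip each sample's gradient, sum, add Gaussian noise, and average over the batch."""
    grads = np.asarray(per_sample_grads, dtype=float)
    norms = np.linalg.norm(grads, axis=1, keepdims=True)
    # Scale down any gradient whose L2 norm exceeds the threshold.
    scale = np.minimum(1.0, max_grad_norm / np.maximum(norms, 1e-12))
    clipped = grads * scale
    # Noise standard deviation is proportional to the clipping threshold.
    noise = rng.normal(0.0, noise_multiplier * max_grad_norm, size=grads.shape[1])
    return (clipped.sum(axis=0) + noise) / grads.shape[0]


rng = np.random.default_rng(0)
toy_grads = rng.normal(size=(256, 4)) * 3.0  # toy per-sample gradients

noisy = dp_sgd_noisy_gradient(
    toy_grads, max_grad_norm=1.0, noise_multiplier=1.1, rng=rng
)
print(noisy)
```

Because the noise is added once per batch and then divided by the batch size, its standard deviation in the averaged gradient is `noise_multiplier * max_grad_norm / batch_size`, which is why larger batches reduce the relative noise.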

Privacy Budget Management

  1. Single Release: Only release one synthetic dataset per original dataset
  2. Composition: If multiple releases needed, divide privacy budget accordingly
  3. Documentation: Track all data releases and cumulative privacy budget
  4. Renewal: Privacy budget doesn’t reset - consider this in data lifecycle
  5. Testing: Test with higher epsilon before final release with lower epsilon
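
Points 2 through 4 can be supported with a simple ledger that records each release and refuses one that would exceed a fixed total. This is a sketch assuming basic sequential composition; `PrivacyLedger` and the release names are hypothetical and not part of the SDK.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class PrivacyLedger:
    """Track epsilon spent per release against a fixed total budget."""

    total_budget: float
    releases: List[Tuple[str, float]] = field(default_factory=list)

    @property
    def spent(self) -> float:
        return sum(eps for _, eps in self.releases)

    @property
    def remaining(self) -> float:
        return self.total_budget - self.spent

    def record_release(self, name: str, epsilon: float) -> None:
        """Log a release, rejecting it if the budget would be exceeded."""
        if epsilon <= 0:
            raise ValueError("epsilon must be positive")
        if epsilon > self.remaining:
            raise ValueError(
                f"release '{name}' needs ε={epsilon}, only {self.remaining} left"
            )
        self.releases.append((name, epsilon))


ledger = PrivacyLedger(total_budget=8.0)
ledger.record_release("partner-share-v1", epsilon=5.0)
ledger.record_release("internal-analytics-v1", epsilon=3.0)
print(f"Spent: {ledger.spent}, remaining: {ledger.remaining}")
```

The ledger never resets, which mirrors the point that the budget is tied to the original dataset for its whole lifecycle.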

Troubleshooting

Low SQS with DP Enabled

If synthetic quality drops significantly:

# Try these approaches:
# 1. Increase epsilon (reduce privacy slightly)
# 2. Increase training data size
# 3. Increase training epochs
# 4. Adjust per_sample_max_grad_norm for gradient clipping

builder_improved = (
    SafeSynthesizerJobBuilder(client)
    .with_data_source(df)
    .with_train(
        num_epochs=10  # More training
    )
    .with_replace_pii()
    .with_differential_privacy(
        epsilon=6.0,  # Slightly higher
        delta=1e-5,
        per_sample_max_grad_norm=1.5,  # Less aggressive clipping
    )
    .synthesize()
)