Differential Privacy Tutorial | NVIDIA NeMo Platform

Learn how to apply differential privacy to achieve the maximum level of privacy with mathematical guarantees. This tutorial explores the privacy-utility tradeoff and demonstrates how to configure differential privacy parameters for optimal results.

If you have not yet completed the Safe Synthesizer 101 tutorial, consider starting there first.

Prerequisites

Understanding of differential privacy
Safe Synthesizer deployment with GPU resources

What You’ll Learn

Understanding differential privacy concepts (epsilon, delta)
Configuring privacy hyperparameters
Analyzing privacy-utility tradeoffs
Interpreting privacy metrics in evaluation reports

Understanding Differential Privacy

Differential privacy (DP) provides a mathematical guarantee that limits how much the synthetic data can reveal about any individual record in the training data.
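For reference, the standard (ε, δ) definition behind this guarantee can be written as below. The symbols are notation introduced here for illustration, not names from the SDK: M is the randomized mechanism (training plus generation), D and D' are any two datasets that differ in one record, and S is any set of possible outputs.

```latex
\Pr[\,M(D) \in S\,] \;\le\; e^{\varepsilon}\,\Pr[\,M(D') \in S\,] + \delta
```

Smaller ε forces the two output distributions closer together, and δ is the small slack by which that bound is allowed to fail.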

Key Concepts:

Epsilon (ε): Privacy budget - lower values mean stronger privacy
- ε = 1: Very strong privacy
- ε = 6-10: Moderate privacy
- ε > 10: Weak privacy
- Recommended starting range: ε ∈ [8, 12] - adjust downward based on privacy needs
Delta (δ): Small probability with which the ε guarantee is allowed to fail
- Typically set to 1/n^1.2 where n is dataset size
- Use "auto" for automatic calculation (recommended)
- Manual values typically between 1e-6 and 1e-4
Noise: Random noise added during training to prevent memorization
- Calibrated based on epsilon, delta, and gradient clipping threshold
- Higher privacy (lower epsilon) requires more noise
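To make the relationship between epsilon, delta, and noise concrete, here is a minimal sketch. It uses the 1/n^1.2 rule quoted above and the classic closed-form Gaussian mechanism, which only holds for 0 < ε < 1. This is an illustration, not how Safe Synthesizer calibrates noise: DP training normally relies on a privacy accountant, and the function names here are hypothetical.

```python
import math


def suggested_delta(n_records: int) -> float:
    """Delta from the 1/n^1.2 rule of thumb quoted above."""
    if n_records < 1:
        raise ValueError("n_records must be positive")
    return 1.0 / (n_records ** 1.2)


def gaussian_sigma(epsilon: float, delta: float, clip_norm: float = 1.0) -> float:
    """Noise std for the classic Gaussian mechanism (valid for 0 < epsilon < 1).

    Illustrative only: DP training in practice uses a privacy accountant
    rather than this closed form.
    """
    if not 0 < epsilon < 1:
        raise ValueError("this closed form only holds for 0 < epsilon < 1")
    return clip_norm * math.sqrt(2.0 * math.log(1.25 / delta)) / epsilon


delta = suggested_delta(20_000)
for eps in (0.9, 0.5, 0.1):
    print(f"epsilon={eps}: sigma={gaussian_sigma(eps, delta):.2f}")
```

The printed noise scale grows as epsilon shrinks, which is the "higher privacy requires more noise" relationship described in the list.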

Record-Level vs Group-Level Privacy

By default, NeMo Safe Synthesizer uses record-level differential privacy, which protects individual records. For datasets where multiple records belong to the same entity (e.g., a patient with multiple visits), you can use group-level privacy by setting group_training_examples_by to the column that identifies each entity. See Group-Level Privacy in the Advanced Configuration section for a code example.

When to use group-level privacy:

Multiple records per person/entity in your dataset
Privacy guarantees should apply to entire entities, not individual records
Examples: patient medical histories, customer transaction logs
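A quick way to check the first criterion is to count records per entity. The helper below is a hypothetical sketch using only the standard library; with pandas you could pass a column such as df["patient_id"] as the iterable.

```python
from collections import Counter
from typing import Hashable, Iterable


def needs_group_level_privacy(entity_ids: Iterable[Hashable]) -> bool:
    """Return True if any entity owns more than one record."""
    counts = Counter(entity_ids)
    return any(c > 1 for c in counts.values())


# Hypothetical patient IDs, one per row of a visits table
visit_patient_ids = ["p1", "p2", "p1", "p3", "p2", "p1"]
print(needs_group_level_privacy(visit_patient_ids))  # True: p1 and p2 repeat
```

If the check returns False, every entity appears once and record-level privacy already covers whole entities.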

Setup

Install the NeMo Platform SDK with Safe Synthesizer support:

if command -v uv &> /dev/null; then
  uv pip install "nemo-platform[all]" kagglehub matplotlib
else
  pip install "nemo-platform[all]" kagglehub matplotlib
fi

import os
import pandas as pd
from nemo_platform import NeMoPlatform
from nemo_safe_synthesizer_plugin.sdk.job_builder import SafeSynthesizerJobBuilder

# Configure client
client = NeMoPlatform(
    base_url=os.environ.get("NMP_BASE_URL", "http://localhost:8080"),
    workspace="default",
)

Load and Prepare Data

# Load sample dataset
import kagglehub  # type: ignore[import-not-found]

path = kagglehub.dataset_download("nicapotato/womens-ecommerce-clothing-reviews")
df = pd.read_csv(f"{path}/Womens Clothing E-Commerce Reviews.csv", index_col=0)

print(f"Dataset size: {len(df)} records")
# Delta follows the 1/n^1.2 rule used throughout this tutorial
print(f"Recommended delta: {1 / (len(df) ** 1.2):.2e}")

Experiment 1: No Differential Privacy (Baseline)

First, create a baseline without differential privacy:

import time

print("🔬 Experiment 1: No Differential Privacy (Baseline)")

builder_baseline = (
    SafeSynthesizerJobBuilder(client)
    .with_data_source(df)
    .with_replace_pii()
    .synthesize()
)

# Create a project for our jobs (creates if it doesn't exist)
project_name = "test-project"
try:
    client.projects.create(workspace="default", name=project_name)
except Exception:
    pass  # Project may already exist

# Timestamp suffix keeps job names unique across reruns
job_baseline = builder_baseline.create_job(
    name=f"dp-baseline-{int(time.time())}", project=project_name
)
print(f"✅ Baseline job created: {job_baseline.job_name}")

job_baseline.wait_for_completion()
summary_baseline = job_baseline.fetch_summary()

print("\n📊 Baseline Results:")
print(f" SQS (Quality): {summary_baseline.synthetic_data_quality_score}")
print(f" DPS (Privacy): {summary_baseline.data_privacy_score}")

Experiment 2: Moderate Privacy (ε=6)

Apply moderate differential privacy:

print("\n🔬 Experiment 2: Moderate Privacy (ε=6)")

builder_moderate = (
    SafeSynthesizerJobBuilder(client)
    .with_data_source(df)
    .with_replace_pii()
    .with_differential_privacy(epsilon=6.0, delta=1e-5)
    .synthesize()
)

job_moderate = builder_moderate.create_job(
    name=f"dp-moderate-{int(time.time())}", project="test-project"
)
print(f"✅ Moderate privacy job created: {job_moderate.job_name}")

job_moderate.wait_for_completion()
summary_moderate = job_moderate.fetch_summary()

print("\n📊 Moderate Privacy Results:")
print(f" SQS (Quality): {summary_moderate.synthetic_data_quality_score}")
print(f" DPS (Privacy): {summary_moderate.data_privacy_score}")

Experiment 3: Strong Privacy (ε=1)

Apply strong differential privacy:

print("\n🔬 Experiment 3: Strong Privacy (ε=1)")

builder_strong = (
    SafeSynthesizerJobBuilder(client)
    .with_data_source(df)
    .with_replace_pii()
    .with_differential_privacy(epsilon=1.0, delta=1e-5)
    .synthesize()
)

job_strong = builder_strong.create_job(
    name=f"dp-strong-{int(time.time())}", project="test-project"
)
print(f"✅ Strong privacy job created: {job_strong.job_name}")

job_strong.wait_for_completion()
summary_strong = job_strong.fetch_summary()

print("\n📊 Strong Privacy Results:")
print(f" SQS (Quality): {summary_strong.synthetic_data_quality_score}")
print(f" DPS (Privacy): {summary_strong.data_privacy_score}")

Compare Results

Visualize the privacy-utility tradeoff:

import matplotlib.pyplot as plt

experiments = ["Baseline\n(No DP)", "Moderate\n(ε=6)", "Strong\n(ε=1)"]
sqs_scores = [
    summary_baseline.synthetic_data_quality_score,
    summary_moderate.synthetic_data_quality_score,
    summary_strong.synthetic_data_quality_score,
]
dps_scores = [
    summary_baseline.data_privacy_score,
    summary_moderate.data_privacy_score,
    summary_strong.data_privacy_score,
]


def _safe_scores(scores):
    """Replace None values with 0 so matplotlib can plot them."""
    return [s if s is not None else 0 for s in scores]


safe_sqs = _safe_scores(sqs_scores)
safe_dps = _safe_scores(dps_scores)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 4))

# SQS comparison
ax1.bar(experiments, safe_sqs, color=["blue", "green", "red"], alpha=0.7)
ax1.set_ylabel("Score")
ax1.set_title("Synthetic Quality Score (SQS)")
ax1.set_ylim([0, 100])
ax1.axhline(y=70, color="gray", linestyle="--", label="Good threshold")
ax1.legend()

# DPS comparison
ax2.bar(experiments, safe_dps, color=["blue", "green", "red"], alpha=0.7)
ax2.set_ylabel("Score")
ax2.set_title("Data Privacy Score (DPS)")
ax2.set_ylim([0, 100])
ax2.axhline(y=70, color="gray", linestyle="--", label="Good threshold")
ax2.legend()

plt.tight_layout()
plt.show()

print("\n📈 Privacy-Utility Tradeoff Summary:")
print(f"{'Experiment':<20} {'SQS (Utility)':<15} {'DPS (Privacy)':<15}")
print("-" * 50)
for exp, sqs, dps in zip(experiments, sqs_scores, dps_scores):
    label = exp.replace("\n", " ")  # Labels contain newlines for the plot axes
    sqs_val = f"{sqs:<15.1f}" if sqs is not None else f"{'N/A':<15}"
    dps_val = f"{dps:<15.1f}" if dps is not None else f"{'N/A':<15}"
    print(f"{label:<20} {sqs_val} {dps_val}")

Advanced Configuration

Custom Privacy Budget

Configure differential privacy with custom parameters:

from nemo_safe_synthesizer_plugin.sdk.config import DifferentialPrivacyHyperparams

# Create custom privacy configuration
privacy_config = DifferentialPrivacyHyperparams(
    dp_enabled=True,
    epsilon=3.0,
    delta=1e-5,
    per_sample_max_grad_norm=1.0,  # Gradient clipping threshold
)

# Use with SafeSynthesizerJobBuilder
builder_custom = (
    SafeSynthesizerJobBuilder(client)
    .with_data_source(df)
    .with_replace_pii()
    .with_differential_privacy(config=privacy_config)
    .synthesize()
)

Group-Level Privacy

For datasets where multiple records belong to the same entity (e.g., a patient with multiple visits), group-level privacy protects entire entities rather than individual records:

# Group-level privacy for multi-record entities
# "patient_id" is a placeholder: use the entity column from your own dataset
builder_grouped = (
    SafeSynthesizerJobBuilder(client)
    .with_data_source(df)
    .with_train(
        group_training_examples_by="patient_id"  # Group records by patient
    )
    .with_differential_privacy(epsilon=8.0)
    .synthesize()
)

Privacy Budget Composition

When you release several DP synthetic datasets trained on the same source data, their privacy budgets add up (basic sequential composition):

# Total privacy budget across the DP experiments (epsilons add up).
# The non-DP baseline carries no formal guarantee, so it has no finite
# epsilon to add; releasing its output is not covered by this accounting.
total_epsilon = 0.0
total_epsilon += 6.0  # Moderate privacy
total_epsilon += 1.0  # Strong privacy

print(f"\n🔐 Total Privacy Budget Consumed: ε = {total_epsilon}")
print("Note: Each additional release compounds the privacy budget")
print("Best practice: Only release one synthetic dataset per original dataset")

Interpreting Privacy Metrics

Membership Inference Attack (MIA)

Measures if an attacker can determine whether a record was in training data:

# Fetch detailed evaluation reports
baseline_report = job_baseline.fetch_summary()
moderate_report = job_moderate.fetch_summary()

print("\n🛡️ Membership Inference Protection:")
print(f"Baseline: {baseline_report.membership_inference_protection_score}")
print(f"Moderate (ε=6): {moderate_report.membership_inference_protection_score}")

print("\nInterpretation:")
print("- Higher score = Better protection")
print("- Score > 0.5 means attacker cannot reliably identify training records")

Attribute Inference Attack (AIA)

Measures if sensitive attributes can be inferred from other attributes:

print("\n🔍 Attribute Inference Protection:")
print(f"Baseline: {baseline_report.attribute_inference_protection_score}")
print(f"Moderate (ε=6): {moderate_report.attribute_inference_protection_score}")

print("\nInterpretation:")
print("- Higher score = Better protection")
print("- Measures difficulty of inferring sensitive values from known attributes")

Best Practices

Data Size Requirements

Differential privacy works best with larger datasets:

def check_data_requirements(dataset_size):
    """Check if dataset size is suitable for DP."""
    print(f"📏 Dataset Size Analysis: {dataset_size} records")

    if dataset_size >= 10000:
        print("✅ Excellent - Dataset size is ideal for DP")
        print(" Expected: Good quality with ε ∈ [8, 12]")
    elif dataset_size >= 5000:
        print("⚠️ Moderate - Dataset may work with DP")
        print(" Recommendation: Start with higher epsilon (ε=10-12)")
    else:
        print("❌ Small - DP may significantly reduce quality")
        print(" Consider: Collecting more data or training without DP")

    print(f"\n Recommended delta: {1 / (dataset_size ** 1.2):.2e}")


check_data_requirements(len(df))

Guidelines:

10,000+ records: Ideal for differential privacy
5,000-10,000 records: May work, use higher epsilon
< 5,000 records: Consider quality trade-offs carefully

Choosing Epsilon

def recommend_epsilon(dataset_size, sensitivity):
    """
    Recommend epsilon based on dataset characteristics.

    Args:
        dataset_size: Number of records
        sensitivity: 'high' for medical/financial, 'medium' for general, 'low' for public
    """
    recommendations = {"high": (1.0, 3.0), "medium": (3.0, 6.0), "low": (6.0, 10.0)}

    epsilon_range = recommendations[sensitivity]
    delta = 1 / (dataset_size**1.2)

    print(f"📋 Recommendations for {dataset_size} records, {sensitivity} sensitivity:")
    print(f" Epsilon range: {epsilon_range[0]} - {epsilon_range[1]}")
    print(f" Delta: {delta:.2e}")
    print(" Stronger privacy: Use lower epsilon within range")
    print(" Better utility: Use higher epsilon within range")
    print(f"\n Starting point: ε = {(epsilon_range[0] + epsilon_range[1]) / 2:.1f}")


recommend_epsilon(len(df), "medium")

Explicit Epsilon Guidance:

Start at ε ∈ [8, 12] for most use cases
Reduce epsilon gradually if stronger privacy is required
Monitor SQS scores to understand quality impact
Delta calculation: use "auto" or 1/n^1.2 where n is dataset size
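The "start high, reduce gradually, watch SQS" guidance can be sketched as a small selection helper. The function name and the example scores are hypothetical and for illustration only; the default threshold of 70 mirrors the "Good threshold" line drawn in the comparison plot.

```python
from typing import Mapping, Optional


def lowest_acceptable_epsilon(
    sqs_by_epsilon: Mapping[float, Optional[float]], min_sqs: float = 70.0
) -> Optional[float]:
    """Return the smallest epsilon whose SQS is at least min_sqs, else None."""
    acceptable = [
        eps
        for eps, sqs in sqs_by_epsilon.items()
        if sqs is not None and sqs >= min_sqs
    ]
    return min(acceptable) if acceptable else None


# Hypothetical scores for illustration only, not measured results
example_scores = {12.0: 81.0, 8.0: 76.0, 4.0: 68.0, 1.0: None}
print(lowest_acceptable_epsilon(example_scores))  # 8.0
```

Keep in mind that every run whose output is released counts toward the cumulative privacy budget, so a sweep like this is best treated as internal tuning.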

Training Optimization

Differential privacy training requires special considerations:

# Optimal DP training configuration
builder_optimized = (
    SafeSynthesizerJobBuilder(client)
    .with_data_source(df)
    .with_train(
        batch_size=256,  # Larger batch sizes benefit DP
        num_epochs=10,  # May need more epochs for convergence
    )
    .with_differential_privacy(epsilon=8.0, delta="auto", per_sample_max_grad_norm=1.0)
    .synthesize()
)

Training Tips:

Use larger batch sizes - DP benefits from larger batches (reduces noise variance)
- Default batch size may be too small for optimal DP training
- Try batch_size=256 or 512 if GPU memory allows
- If memory errors occur, reduce batch size gradually
Monitor convergence - DP training may converge differently
- Watch training and validation loss
- May require more epochs than non-DP training
- Lower learning rate if training is unstable
Adjust gradient clipping - Controls sensitivity bound
- per_sample_max_grad_norm=1.0 is a good default
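- Lower values (0.5) = stronger clipping: less noise is added, but more gradient signal is clipped away; potentially lower quality
- Higher values (1.5) = less clipping, but proportionally more noise is added; may improve quality if gradients were being clipped heavily
- At a fixed epsilon, the clipping threshold mainly changes utility; the formal guarantee is set by epsilon and delta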
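The two knobs in these tips, per-sample clipping and batch size, can be illustrated with a minimal DP-SGD style gradient step. This is a generic sketch in plain Python, not Safe Synthesizer's training loop; all function names and the noise_multiplier parameter are assumptions made for illustration.

```python
import math
import random
from typing import List, Sequence


def clip_gradient(grad: Sequence[float], max_norm: float) -> List[float]:
    """Scale a per-sample gradient so its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grad))
    scale = min(1.0, max_norm / norm) if norm > 0 else 1.0
    return [g * scale for g in grad]


def dp_average_gradient(
    per_sample_grads: Sequence[Sequence[float]],
    max_norm: float,
    noise_multiplier: float,
    rng: random.Random,
) -> List[float]:
    """Clip each gradient, sum, add Gaussian noise, then average over the batch."""
    batch_size = len(per_sample_grads)
    dim = len(per_sample_grads[0])
    clipped = [clip_gradient(g, max_norm) for g in per_sample_grads]
    summed = [sum(g[i] for g in clipped) for i in range(dim)]
    sigma = noise_multiplier * max_norm  # noise std on the summed gradient
    noisy = [s + rng.gauss(0.0, sigma) for s in summed]
    return [v / batch_size for v in noisy]


def noise_std_on_average(
    max_norm: float, noise_multiplier: float, batch_size: int
) -> float:
    """Std of the noise in each coordinate of the averaged gradient."""
    return noise_multiplier * max_norm / batch_size


for b in (32, 256, 512):
    print(b, noise_std_on_average(max_norm=1.0, noise_multiplier=1.0, batch_size=b))
```

At a fixed noise multiplier, the noise on the averaged gradient shrinks in proportion to the batch size. In a real run the accountant may raise the noise multiplier for larger batches to hold epsilon constant, so the benefit is typically smaller than this 1/B scaling suggests.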

Privacy Budget Management

Single Release: Only release one synthetic dataset per original dataset
Composition: If multiple releases needed, divide privacy budget accordingly
Documentation: Track all data releases and cumulative privacy budget
Renewal: Privacy budget doesn’t reset - consider this in data lifecycle
Testing: Test with higher epsilon before final release with lower epsilon
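One way to follow the documentation and composition points is to keep a simple ledger of releases. The class below is a hypothetical sketch, not part of the SDK, and it assumes basic sequential composition where epsilons simply add.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class PrivacyBudgetLedger:
    """Track cumulative epsilon across releases under basic sequential composition."""

    total_budget: float
    releases: List[Tuple[str, float]] = field(default_factory=list)

    @property
    def spent(self) -> float:
        return sum(eps for _, eps in self.releases)

    @property
    def remaining(self) -> float:
        return self.total_budget - self.spent

    def record_release(self, name: str, epsilon: float) -> None:
        """Log a release, refusing any that would exceed the total budget."""
        if epsilon <= 0:
            raise ValueError("epsilon must be positive")
        if epsilon > self.remaining:
            raise ValueError(f"release '{name}' would exceed the budget")
        self.releases.append((name, epsilon))


ledger = PrivacyBudgetLedger(total_budget=8.0)
ledger.record_release("dp-moderate", 6.0)
ledger.record_release("dp-strong", 1.0)
print(f"spent={ledger.spent}, remaining={ledger.remaining}")
```

Because the budget does not reset, a ledger like this would be kept for the lifetime of the source dataset rather than per project.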

Troubleshooting

Low SQS with DP Enabled

If synthetic quality drops significantly:

# Try these approaches:
# 1. Increase epsilon (reduce privacy slightly)
# 2. Increase training data size
# 3. Increase training epochs
# 4. Adjust per_sample_max_grad_norm for gradient clipping

builder_improved = (
    SafeSynthesizerJobBuilder(client)
    .with_data_source(df)
    .with_train(
        num_epochs=10  # More training
    )
    .with_replace_pii()
    .with_differential_privacy(
        epsilon=6.0,  # Slightly higher
        delta=1e-5,
        per_sample_max_grad_norm=1.5,  # Less aggressive clipping
    )
    .synthesize()
)