Differential Privacy Tutorial
Learn how to apply differential privacy to get the strongest privacy protection available, backed by mathematical guarantees. This tutorial explores the privacy-utility tradeoff and shows how to configure differential privacy parameters to balance privacy against synthetic data quality.
If you have not yet completed the Safe Synthesizer 101 tutorial, consider starting there first.
Prerequisites
- Understanding of differential privacy
- Safe Synthesizer deployment with GPU resources
What You’ll Learn
- Understanding differential privacy concepts (epsilon, delta)
- Configuring privacy hyperparameters
- Analyzing privacy-utility tradeoffs
- Interpreting privacy metrics in evaluation reports
Understanding Differential Privacy
Differential privacy (DP) provides a mathematical guarantee that limits how much the synthetic data can reveal about any individual record in the training data.
Key Concepts:
- Epsilon (ε): Privacy budget - lower values mean stronger privacy
  - ε = 1: Very strong privacy
  - ε = 6-10: Moderate privacy
  - ε > 10: Weak privacy
  - Recommended starting range: ε ∈ [8, 12] - adjust downward based on privacy needs
- Delta (δ): Probability of privacy breach
  - Typically set to 1/n^1.2 where n is dataset size
  - Use "auto" for automatic calculation (recommended)
  - Manual values typically between 1e-6 and 1e-4
- Noise: Random noise added during training to prevent memorization
  - Calibrated based on epsilon, delta, and gradient clipping threshold
  - Higher privacy (lower epsilon) requires more noise
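To make these numbers concrete, the sketch below computes the 1/n^1.2 heuristic for delta and the e^ε factor that bounds how much a single record can change the probability of any output. The function names are illustrative and not part of the Safe Synthesizer SDK, and it is an assumption that the "auto" setting follows this same heuristic.

```python
import math


def heuristic_delta(n_records: int) -> float:
    """Return the 1/n^1.2 heuristic for delta (n = number of training records)."""
    if n_records <= 0:
        raise ValueError("n_records must be positive")
    return 1.0 / (n_records ** 1.2)


def max_probability_ratio(epsilon: float) -> float:
    """Return e^epsilon, the DP bound on the ratio of output probabilities
    between two datasets that differ in a single record."""
    return math.exp(epsilon)


if __name__ == "__main__":
    for n in (5_000, 10_000, 100_000):
        print(f"n={n:>7}: delta ~ {heuristic_delta(n):.2e}")
    for eps in (1, 6, 10):
        print(f"epsilon={eps:>2}: probability ratio bound = {max_probability_ratio(eps):.1f}")
```

The bound grows exponentially with epsilon, which is why small reductions in epsilon translate into noticeably stronger guarantees.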
Record-Level vs Group-Level Privacy
By default, NeMo Safe Synthesizer uses record-level differential privacy, which protects individual records. For datasets where multiple records belong to the same entity (e.g., a patient with multiple visits), you can use group-level privacy by setting group_training_examples_by to the column that identifies each entity. See Group-Level Privacy in the Advanced Configuration section for a code example.
When to use group-level privacy:
- Multiple records per person/entity in your dataset
- Privacy guarantees should apply to entire entities, not individual records
- Examples: patient medical histories, customer transaction logs
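A quick way to check whether a dataset falls into this category is to count records per entity before training. The following is a minimal sketch in plain Python; the helper names and the `patient_id` column are hypothetical, and only `group_training_examples_by` comes from the product configuration.

```python
from collections import Counter
from typing import Hashable, Iterable, Mapping


def records_per_entity(rows: Iterable[Mapping[str, Hashable]], entity_column: str) -> Counter:
    """Count how many records each entity contributes."""
    return Counter(row[entity_column] for row in rows)


def needs_group_level_privacy(rows, entity_column: str) -> bool:
    """Return True if any entity appears in more than one record."""
    counts = records_per_entity(rows, entity_column)
    return any(count > 1 for count in counts.values())


if __name__ == "__main__":
    # Toy rows for illustration only; "patient_id" is a hypothetical column name.
    visits = [
        {"patient_id": "p1", "visit": "2024-01-03"},
        {"patient_id": "p1", "visit": "2024-02-10"},
        {"patient_id": "p2", "visit": "2024-01-15"},
    ]
    if needs_group_level_privacy(visits, "patient_id"):
        print("Consider setting group_training_examples_by to 'patient_id'")
```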
Setup
Install the NeMo Platform SDK with Safe Synthesizer support:
Load and Prepare Data
Experiment 1: No Differential Privacy (Baseline)
First, create a baseline without differential privacy:
Experiment 2: Moderate Privacy (ε=6)
Apply moderate differential privacy:
Experiment 3: Strong Privacy (ε=1)
Apply strong differential privacy:
Compare Results
Visualize the privacy-utility tradeoff:
Advanced Configuration
Custom Privacy Budget
Configure differential privacy with custom parameters:
Group-Level Privacy
For datasets where multiple records belong to the same entity (e.g., a patient with multiple visits), group-level privacy protects entire entities rather than individual records:
Privacy Budget Composition
When running multiple experiments, the privacy budget compounds:
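As a rough illustration of how budgets add up, the sketch below applies basic sequential composition, where the epsilons and deltas of every release made from the same original data are summed. This is a worst-case upper bound and an assumption about how you account for the budget; tighter accounting methods exist, and `compose_basic` is a hypothetical helper, not an SDK function.

```python
from typing import Iterable, Tuple


def compose_basic(releases: Iterable[Tuple[float, float]]) -> Tuple[float, float]:
    """Basic sequential composition: epsilons and deltas of releases add up."""
    total_epsilon = 0.0
    total_delta = 0.0
    for epsilon, delta in releases:
        if epsilon < 0 or delta < 0:
            raise ValueError("epsilon and delta must be non-negative")
        total_epsilon += epsilon
        total_delta += delta
    return total_epsilon, total_delta


if __name__ == "__main__":
    # Two DP runs on the same original data with epsilon=6 and epsilon=1.
    # The delta values are placeholders chosen for illustration.
    total_eps, total_delta = compose_basic([(6.0, 1e-5), (1.0, 1e-5)])
    print(f"cumulative epsilon <= {total_eps}, cumulative delta <= {total_delta:.0e}")
```

Under this accounting, releasing the outputs of both runs costs more than either run alone, which is the reason to limit the number of released datasets.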
Interpreting Privacy Metrics
Membership Inference Attack (MIA)
Measures if an attacker can determine whether a record was in training data:
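For intuition, here is a minimal sketch of one generic membership inference attack: the attacker guesses that a record was in the training data when it lies unusually close to some synthetic record. This is an illustrative distance-based attack on numeric records, not necessarily how the evaluation report computes its MIA score, and all names in it are hypothetical.

```python
import math
from typing import List, Sequence

Record = Sequence[float]


def distance_to_closest(record: Record, synthetic: List[Record]) -> float:
    """Euclidean distance from a record to its nearest synthetic record."""
    return min(math.dist(record, s) for s in synthetic)


def mia_auc(train: List[Record], holdout: List[Record], synthetic: List[Record]) -> float:
    """AUC of an attacker who guesses 'member' when a record is close to synthetic data.

    0.5 means the attacker does no better than chance; values near 1.0 mean
    training records sit measurably closer to the synthetic data than unseen ones.
    """
    train_d = [distance_to_closest(r, synthetic) for r in train]
    holdout_d = [distance_to_closest(r, synthetic) for r in holdout]
    wins = 0.0
    for t in train_d:
        for h in holdout_d:
            if t < h:
                wins += 1.0  # training record is closer: attacker ranks it correctly
            elif t == h:
                wins += 0.5  # tie: counts as a coin flip
    return wins / (len(train_d) * len(holdout_d))
```

A synthetic dataset that simply copies its training records would score 1.0 here, while one that treats training and holdout records alike would score about 0.5.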
Attribute Inference Attack (AIA)
Measures if sensitive attributes can be inferred from other attributes:
Best Practices
Data Size Requirements
Differential privacy works best with larger datasets:
Guidelines:
- 10,000+ records: Ideal for differential privacy
- 5,000-10,000 records: May work, use higher epsilon
- < 5,000 records: Consider quality trade-offs carefully
Choosing Epsilon
Explicit Epsilon Guidance:
- Start at ε ∈ [8, 12] for most use cases
- Reduce epsilon gradually if stronger privacy is required
- Monitor SQS scores to understand quality impact
- Delta calculation: use "auto" or 1/n^1.2, where n is the dataset size
Training Optimization
Differential privacy training requires special considerations:
Training Tips:
- Use larger batch sizes - DP benefits from larger batches (reduces noise variance)
  - Default batch size may be too small for optimal DP training
  - Try batch_size=256 or 512 if GPU memory allows
  - If memory errors occur, reduce batch size gradually
- Monitor convergence - DP training may converge differently
  - Watch training and validation loss
  - May require more epochs than non-DP training
  - Lower learning rate if training is unstable
- Adjust gradient clipping - Controls sensitivity bound
  - per_sample_max_grad_norm=1.0 is a good default
  - Lower values (0.5) = stronger clipping, more privacy, potentially lower quality
  - Higher values (1.5) = less clipping, less privacy, potentially better quality
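The role of batch size and gradient clipping is easier to see in a single DP-SGD style update. The sketch below is a generic, pure-Python illustration and not the Safe Synthesizer training loop: `max_grad_norm` plays the role of per_sample_max_grad_norm, while `noise_multiplier` is an assumed parameter name for the ratio of noise to the clipping bound.

```python
import math
import random
from typing import List, Optional

Vector = List[float]


def clip_gradient(grad: Vector, max_norm: float) -> Vector:
    """Scale a per-sample gradient so its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grad))
    scale = min(1.0, max_norm / norm) if norm > 0 else 1.0
    return [g * scale for g in grad]


def dp_average_gradient(
    per_sample_grads: List[Vector],
    max_grad_norm: float,
    noise_multiplier: float,
    rng: Optional[random.Random] = None,
) -> Vector:
    """Clip each per-sample gradient, sum, add Gaussian noise, then average."""
    if rng is None:
        rng = random.Random()
    batch_size = len(per_sample_grads)
    dim = len(per_sample_grads[0])
    clipped = [clip_gradient(g, max_grad_norm) for g in per_sample_grads]
    summed = [sum(g[i] for g in clipped) for i in range(dim)]
    noise_std = noise_multiplier * max_grad_norm  # noise scales with the clipping bound
    noisy = [s + rng.gauss(0.0, noise_std) for s in summed]
    return [v / batch_size for v in noisy]
```

Because the noise is added once to the summed gradient and then divided by the batch size, its standard deviation in the averaged gradient is noise_multiplier * max_grad_norm / batch_size. For a fixed noise multiplier, a larger batch therefore yields a less noisy update.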
Privacy Budget Management
- Single Release: Only release one synthetic dataset per original dataset
- Composition: If multiple releases needed, divide privacy budget accordingly
- Documentation: Track all data releases and cumulative privacy budget
- Renewal: Privacy budget doesn’t reset - consider this in data lifecycle
- Testing: Test with higher epsilon before final release with lower epsilon
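The advice to start with a higher epsilon and lower it gradually while watching quality can be sketched as a simple loop. Here `train_and_score` is a hypothetical callable that you would supply: it is assumed to run one training job at the given epsilon and return a quality score such as SQS. Nothing in this sketch calls the Safe Synthesizer SDK.

```python
from typing import Callable, List, Optional, Tuple


def pick_lowest_acceptable_epsilon(
    candidate_epsilons: List[float],
    train_and_score: Callable[[float], float],
    min_quality: float,
) -> Tuple[Optional[float], List[Tuple[float, float]]]:
    """Try epsilons from highest to lowest and stop at the first quality drop.

    Returns the lowest epsilon whose score stayed at or above min_quality
    (None if even the highest epsilon fell short) plus the (epsilon, score) history.
    """
    best: Optional[float] = None
    history: List[Tuple[float, float]] = []
    for epsilon in sorted(candidate_epsilons, reverse=True):
        score = train_and_score(epsilon)
        history.append((epsilon, score))
        if score < min_quality:
            break  # quality is no longer acceptable; keep the previous epsilon
        best = epsilon
    return best, history
```

Keeping the returned history gives you a record of every run, which helps when documenting the cumulative privacy budget for any outputs you decide to release.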
Troubleshooting
Low SQS with DP Enabled
If synthetic quality drops significantly: