# Differential Privacy Tutorial

Learn how to apply differential privacy to achieve the strongest level of privacy protection, backed by mathematical guarantees. This tutorial explores the privacy-utility tradeoff and demonstrates how to configure differential privacy parameters.

If you have not yet completed the [Safe Synthesizer 101](/documentation/synthesize-safe-data/tutorials/safe-synthesizer-101) tutorial, consider starting there first.

## Prerequisites

* Understanding of [differential privacy](/documentation/synthesize-safe-data/about/data-synthesis)
* Safe Synthesizer deployment with GPU resources

***

## What You'll Learn

* Understanding differential privacy concepts (epsilon, delta)
* Configuring privacy hyperparameters
* Analyzing privacy-utility tradeoffs
* Interpreting privacy metrics in evaluation reports

***

## Understanding Differential Privacy

Differential privacy (DP) provides a mathematical guarantee that limits how much the synthetic data can reveal about any individual record in the training data.

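For reference, the guarantee is usually stated as follows. This is the standard textbook (ε, δ) definition, not something specific to Safe Synthesizer, and the symbols $M$, $D$, $D'$, and $S$ are notation introduced here for illustration: a randomized training procedure $M$ is (ε, δ)-differentially private if, for any two datasets $D$ and $D'$ that differ in a single record and any set of possible outputs $S$,

$$
\Pr[M(D) \in S] \le e^{\varepsilon} \, \Pr[M(D') \in S] + \delta
$$

A smaller ε forces the two probabilities closer together, so the presence or absence of one record changes the output distribution less. δ is the small slack by which that bound is allowed to fail.
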
**Key Concepts:**

* **Epsilon (ε)**: Privacy budget - lower values mean stronger privacy
  * ε = 1: Very strong privacy
  * ε = 6-10: Moderate privacy
  * ε > 10: Weak privacy
  * **Recommended starting range: ε ∈ \[8, 12]** - adjust downward based on privacy needs
* **Delta (δ)**: Upper bound on the probability that the epsilon guarantee fails to hold
  * Typically set to 1/n^1.2 where n is dataset size
  * Use `"auto"` for automatic calculation (recommended)
  * Manual values typically between 1e-6 and 1e-4
* **Noise**: Random noise added during training to prevent memorization
  * Calibrated based on epsilon, delta, and gradient clipping threshold
  * Higher privacy (lower epsilon) requires more noise

### Record-Level vs Group-Level Privacy

By default, NeMo Safe Synthesizer uses **record-level** differential privacy, which protects individual records. For datasets where multiple records belong to the same entity (e.g., a patient with multiple visits), you can use **group-level** privacy by setting `group_training_examples_by` to the column that identifies each entity. See [Group-Level Privacy](#group-level-privacy) in the Advanced Configuration section for a code example.

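One quick way to tell whether a dataset has multiple records per entity is to count records per identifier. The sketch below uses only the standard library and toy data; the function name `entity_record_stats` and the patient IDs are hypothetical and not part of the Safe Synthesizer SDK.

```python
from collections import Counter


def entity_record_stats(entity_ids):
    """Summarize how many records each entity contributes."""
    counts = Counter(entity_ids)
    multi = sum(1 for c in counts.values() if c > 1)
    return {
        "entities": len(counts),
        "multi_record_entities": multi,
        "max_records_per_entity": max(counts.values(), default=0),
    }


# Toy data: hypothetical patient IDs, one entry per visit record
visits = ["p1", "p2", "p1", "p3", "p1", "p2"]
stats = entity_record_stats(visits)
print(stats)

# If any entity contributes more than one record, group-level privacy is worth considering
needs_group_level = stats["multi_record_entities"] > 0
print(f"Consider group-level privacy: {needs_group_level}")
```

With a pandas DataFrame, the same function can be called on the identifier column, for example `entity_record_stats(df["patient_id"])`, where `patient_id` stands in for whatever column identifies entities in your data.
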
**When to use group-level privacy:**

* Multiple records per person/entity in your dataset
* Privacy guarantees should apply to entire entities, not individual records
* Examples: patient medical histories, customer transaction logs

***

## Setup

Install the NeMo Platform SDK with Safe Synthesizer support:

```shell
# Quote the package spec so shells such as zsh do not expand the brackets
if command -v uv &> /dev/null; then
    uv pip install "nemo-platform[all]" kagglehub matplotlib
else
    pip install "nemo-platform[all]" kagglehub matplotlib
fi
```

```python
import os

import pandas as pd
from nemo_platform import NeMoPlatform
from nemo_safe_synthesizer_plugin.sdk.job_builder import SafeSynthesizerJobBuilder

# Configure client
client = NeMoPlatform(
    base_url=os.environ.get("NMP_BASE_URL", "http://localhost:8080"),
    workspace="default",
)
```

***

## Load and Prepare Data

```python
# Load sample dataset
import kagglehub  # type: ignore[import-not-found]

path = kagglehub.dataset_download("nicapotato/womens-ecommerce-clothing-reviews")
df = pd.read_csv(f"{path}/Womens Clothing E-Commerce Reviews.csv", index_col=0)

print(f"Dataset size: {len(df)} records")
# Delta heuristic used throughout this tutorial: 1 / n^1.2
print(f"Recommended delta: {1 / (len(df) ** 1.2):.2e}")
```

***

## Experiment 1: No Differential Privacy (Baseline)

First, create a baseline without differential privacy:

```python
import time

print("🔬 Experiment 1: No Differential Privacy (Baseline)")

builder_baseline = (
    SafeSynthesizerJobBuilder(client)
    .with_data_source(df)
    .with_replace_pii()
    .synthesize()
)

# Create a project for our jobs (creates if it doesn't exist)
project_name = "test-project"
try:
    client.projects.create(workspace="default", name=project_name)
except Exception:
    pass  # Project may already exist

job_baseline = builder_baseline.create_job(
    name=f"dp-baseline-{int(time.time())}", project=project_name
)

print(f"✅ Baseline job created: {job_baseline.job_name}")
job_baseline.wait_for_completion()

summary_baseline = job_baseline.fetch_summary()
print("\n📊 Baseline Results:")
print(f"  SQS (Quality): {summary_baseline.synthetic_data_quality_score}")
print(f"  DPS (Privacy): {summary_baseline.data_privacy_score}")
```

***

## Experiment 2: Moderate Privacy (ε=6)

Apply moderate differential privacy:

```python
print("\n🔬 Experiment 2: Moderate Privacy (ε=6)")

builder_moderate = (
    SafeSynthesizerJobBuilder(client)
    .with_data_source(df)
    .with_replace_pii()
    .with_differential_privacy(epsilon=6.0, delta=1e-5)
    .synthesize()
)

job_moderate = builder_moderate.create_job(
    name=f"dp-moderate-{int(time.time())}", project=project_name
)

print(f"✅ Moderate privacy job created: {job_moderate.job_name}")
job_moderate.wait_for_completion()

summary_moderate = job_moderate.fetch_summary()
print("\n📊 Moderate Privacy Results:")
print(f"  SQS (Quality): {summary_moderate.synthetic_data_quality_score}")
print(f"  DPS (Privacy): {summary_moderate.data_privacy_score}")
```

***

## Experiment 3: Strong Privacy (ε=1)

Apply strong differential privacy:

```python
print("\n🔬 Experiment 3: Strong Privacy (ε=1)")

builder_strong = (
    SafeSynthesizerJobBuilder(client)
    .with_data_source(df)
    .with_replace_pii()
    .with_differential_privacy(epsilon=1.0, delta=1e-5)
    .synthesize()
)

job_strong = builder_strong.create_job(
    name=f"dp-strong-{int(time.time())}", project=project_name
)

print(f"✅ Strong privacy job created: {job_strong.job_name}")
job_strong.wait_for_completion()

summary_strong = job_strong.fetch_summary()
print("\n📊 Strong Privacy Results:")
print(f"  SQS (Quality): {summary_strong.synthetic_data_quality_score}")
print(f"  DPS (Privacy): {summary_strong.data_privacy_score}")
```

***

## Compare Results

Visualize the privacy-utility tradeoff:

```python
import matplotlib.pyplot as plt

experiments = ["Baseline\n(No DP)", "Moderate\n(ε=6)", "Strong\n(ε=1)"]
sqs_scores = [
    summary_baseline.synthetic_data_quality_score,
    summary_moderate.synthetic_data_quality_score,
    summary_strong.synthetic_data_quality_score,
]
dps_scores = [
    summary_baseline.data_privacy_score,
    summary_moderate.data_privacy_score,
    summary_strong.data_privacy_score,
]


def _safe_scores(scores):
    """Replace None values with 0 so matplotlib and format strings don't error."""
    return [s if s is not None else 0 for s in scores]


safe_sqs = _safe_scores(sqs_scores)
safe_dps = _safe_scores(dps_scores)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 4))

# SQS comparison
ax1.bar(experiments, safe_sqs, color=["blue", "green", "red"], alpha=0.7)
ax1.set_ylabel("Score")
ax1.set_title("Synthetic Quality Score (SQS)")
ax1.set_ylim([0, 100])
ax1.axhline(y=70, color="gray", linestyle="--", label="Good threshold")
ax1.legend()

# DPS comparison
ax2.bar(experiments, safe_dps, color=["blue", "green", "red"], alpha=0.7)
ax2.set_ylabel("Score")
ax2.set_title("Data Privacy Score (DPS)")
ax2.set_ylim([0, 100])
ax2.axhline(y=70, color="gray", linestyle="--", label="Good threshold")
ax2.legend()

plt.tight_layout()
plt.show()

print("\n📈 Privacy-Utility Tradeoff Summary:")
print(f"{'Experiment':<20} {'SQS (Utility)':<15} {'DPS (Privacy)':<15}")
print("-" * 50)
for i, exp in enumerate(experiments):
    # Chart labels contain a newline; flatten it so each table row stays on one line
    label = exp.replace("\n", " ")
    sqs_val = f"{sqs_scores[i]:<15.1f}" if sqs_scores[i] is not None else f"{'N/A':<15}"
    dps_val = f"{dps_scores[i]:<15.1f}" if dps_scores[i] is not None else f"{'N/A':<15}"
    print(f"{label:<20} {sqs_val} {dps_val}")
```

***

## Advanced Configuration

### Custom Privacy Budget

Configure differential privacy with custom parameters:

```python
from nemo_safe_synthesizer_plugin.sdk.config import DifferentialPrivacyHyperparams

# Create custom privacy configuration
privacy_config = DifferentialPrivacyHyperparams(
    dp_enabled=True,
    epsilon=3.0,
    delta=1e-5,
    per_sample_max_grad_norm=1.0,  # Gradient clipping threshold
)

# Use with SafeSynthesizerJobBuilder
builder_custom = (
    SafeSynthesizerJobBuilder(client)
    .with_data_source(df)
    .with_replace_pii()
    .with_differential_privacy(config=privacy_config)
    .synthesize()
)
```

### Group-Level Privacy

For datasets where multiple records belong to the same entity (e.g., a patient with multiple visits), group-level privacy protects entire entities rather than individual records:

```python
# Group-level privacy for multi-record entities
# "patient_id" is an illustrative column name; replace it with the column
# that identifies entities in your own dataset.
builder_grouped = (
    SafeSynthesizerJobBuilder(client)
    .with_data_source(df)
    .with_train(
        group_training_examples_by="patient_id"  # Group records by patient
    )
    .with_differential_privacy(epsilon=8.0)
    .synthesize()
)
```

### Privacy Budget Composition

When you run multiple DP experiments on the same original dataset and release their outputs, the privacy budgets add up. Under basic sequential composition, the epsilon values sum:

```python
# Total privacy budget across the DP experiments.
# The non-DP baseline has no formal guarantee at all; it is left out of the sum
# on the assumption that its output is never released.
total_epsilon = 0.0
total_epsilon += 6.0  # Moderate privacy
total_epsilon += 1.0  # Strong privacy

print(f"\n🔐 Total Privacy Budget Consumed: ε = {total_epsilon}")
print("Note: Each additional release adds to the privacy budget")
print("Best practice: Only release one synthetic dataset per original dataset")
```

***

## Interpreting Privacy Metrics

### Membership Inference Attack (MIA)

Measures whether an attacker can determine if a record was in the training data:

```python
# Fetch detailed evaluation reports
baseline_report = job_baseline.fetch_summary()
moderate_report = job_moderate.fetch_summary()

print("\n🛡️ Membership Inference Protection:")
print(f"Baseline: {baseline_report.membership_inference_protection_score}")
print(f"Moderate (ε=6): {moderate_report.membership_inference_protection_score}")

print("\nInterpretation:")
print("- Higher score = Better protection")
print("- Score > 0.5 means attacker cannot reliably identify training records")
```

### Attribute Inference Attack (AIA)

Measures whether sensitive attributes can be inferred from other attributes:

```python
print("\n🔍 Attribute Inference Protection:")
print(f"Baseline: {baseline_report.attribute_inference_protection_score}")
print(f"Moderate (ε=6): {moderate_report.attribute_inference_protection_score}")

print("\nInterpretation:")
print("- Higher score = Better protection")
print("- Measures difficulty of inferring sensitive values from known attributes")
```

***

## Best Practices

### Data Size Requirements

Differential privacy works best with larger datasets:

```python
def check_data_requirements(dataset_size):
    """Check if dataset size is suitable for DP."""
    print(f"📏 Dataset Size Analysis: {dataset_size} records")

    if dataset_size >= 10000:
        print("✅ Excellent - Dataset size is ideal for DP")
        print("  Expected: Good quality with ε ∈ [8, 12]")
    elif dataset_size >= 5000:
        print("⚠️ Moderate - Dataset may work with DP")
        print("  Recommendation: Start with higher epsilon (ε=10-12)")
    else:
        print("❌ Small - DP may significantly reduce quality")
        print("  Consider: Collecting more data or training without DP")

    print(f"\n  Recommended delta: {1 / (dataset_size ** 1.2):.2e}")


check_data_requirements(len(df))
```

**Guidelines:**

* **10,000+ records**: Ideal for differential privacy
* **5,000-10,000 records**: May work, use higher epsilon
* **\< 5,000 records**: Consider quality trade-offs carefully

### Choosing Epsilon

```python
def recommend_epsilon(dataset_size, sensitivity):
    """
    Recommend epsilon based on dataset characteristics.

    Args:
        dataset_size: Number of records (used to compute delta)
        sensitivity: 'high' for medical/financial, 'medium' for general, 'low' for public
    """
    recommendations = {"high": (1.0, 3.0), "medium": (3.0, 6.0), "low": (6.0, 10.0)}

    epsilon_range = recommendations[sensitivity]
    delta = 1 / (dataset_size**1.2)

    print(f"📋 Recommendations for {dataset_size} records, {sensitivity} sensitivity:")
    print(f"  Epsilon range: {epsilon_range[0]} - {epsilon_range[1]}")
    print(f"  Delta: {delta:.2e}")
    print("  Stronger privacy: Use lower epsilon within range")
    print("  Better utility: Use higher epsilon within range")
    print(f"\n  Starting point: ε = {(epsilon_range[0] + epsilon_range[1]) / 2:.1f}")


recommend_epsilon(len(df), "medium")
```

**Explicit Epsilon Guidance:**

* Start at **ε ∈ \[8, 12]** for most use cases
* Reduce epsilon gradually if stronger privacy is required
* Monitor SQS scores to understand quality impact
* Delta calculation: use `"auto"` or `1/n^1.2` where n is dataset size

### Training Optimization

Differential privacy training requires special considerations:

```python
# Example DP-oriented training configuration
builder_optimized = (
    SafeSynthesizerJobBuilder(client)
    .with_data_source(df)
    .with_train(
        batch_size=256,  # Larger batch sizes benefit DP
        num_epochs=10,  # May need more epochs for convergence
    )
    .with_differential_privacy(epsilon=8.0, delta="auto", per_sample_max_grad_norm=1.0)
    .synthesize()
)
```

**Training Tips:**

1. **Use larger batch sizes** - DP benefits from larger batches (the added noise is averaged over more examples, which improves the signal-to-noise ratio)
   * Default batch size may be too small for optimal DP training
   * Try `batch_size=256` or 512 if GPU memory allows
   * If memory errors occur, reduce batch size gradually
2. **Monitor convergence** - DP training may converge differently
   * Watch training and validation loss
   * May require more epochs than non-DP training
   * Lower learning rate if training is unstable
3. **Adjust gradient clipping** - Controls sensitivity bound
   * `per_sample_max_grad_norm=1.0` is a good default
   * Lower values (0.5) = stronger clipping, so more gradient signal is cut off and quality may drop
   * Higher values (1.5) = less clipping, but the noise is calibrated to the threshold and grows with it
   * The formal guarantee is set by epsilon and delta; the clipping threshold mainly shifts the balance between clipping bias and noise

### Privacy Budget Management

1. **Single Release**: Only release one synthetic dataset per original dataset
2. **Composition**: If multiple releases are needed, divide the privacy budget accordingly
3. **Documentation**: Track all data releases and cumulative privacy budget
4. **Renewal**: Privacy budget doesn't reset - consider this in data lifecycle
5. **Testing**: Test with higher epsilon before final release with lower epsilon

***

## Troubleshooting

### Low SQS with DP Enabled

If synthetic quality drops significantly:

```python
# Try these approaches:
# 1. Increase epsilon (reduce privacy slightly)
# 2. Increase training data size
# 3. Increase training epochs
# 4. Adjust per_sample_max_grad_norm for gradient clipping
builder_improved = (
    SafeSynthesizerJobBuilder(client)
    .with_data_source(df)
    .with_train(
        num_epochs=10  # More training
    )
    .with_replace_pii()
    .with_differential_privacy(
        epsilon=6.0,  # Slightly higher
        delta=1e-5,
        per_sample_max_grad_norm=1.5,  # Less aggressive clipping
    )
    .synthesize()
)
```
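
The tuning knobs above (the clipping threshold, the batch size, and indirectly epsilon) interact through the way DP training typically injects noise. The sketch below is a generic, pure-Python illustration of one DP-SGD-style gradient step, not Safe Synthesizer's actual implementation. The function names and `noise_multiplier` are hypothetical; in practice the noise multiplier would be derived from epsilon, delta, and the number of training steps by a privacy accountant, which this sketch omits.

```python
import math
import random


def clip(grad, max_norm):
    """Scale a gradient vector down so its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grad))
    scale = min(1.0, max_norm / norm) if norm > 0 else 1.0
    return [g * scale for g in grad]


def noisy_mean_gradient(per_sample_grads, max_norm, noise_multiplier, rng):
    """Clip each per-sample gradient, sum, add Gaussian noise, then average."""
    dim = len(per_sample_grads[0])
    clipped = [clip(g, max_norm) for g in per_sample_grads]
    summed = [sum(g[i] for g in clipped) for i in range(dim)]
    # Noise standard deviation scales with the clipping threshold, not the batch size
    sigma = noise_multiplier * max_norm
    noisy = [s + rng.gauss(0.0, sigma) for s in summed]
    return [v / len(per_sample_grads) for v in noisy]


def noise_std_per_coordinate(batch_size, max_norm, noise_multiplier):
    """Standard deviation of the noise remaining in the averaged gradient."""
    return noise_multiplier * max_norm / batch_size


rng = random.Random(0)
grads = [[3.0, 4.0], [0.3, 0.4], [-6.0, 8.0]]  # toy per-sample gradients

print(clip(grads[0], max_norm=1.0))  # norm 5.0 is scaled down to norm 1.0
print(noisy_mean_gradient(grads, max_norm=1.0, noise_multiplier=1.1, rng=rng))

for batch_size in (32, 256, 512):
    std = noise_std_per_coordinate(batch_size, max_norm=1.0, noise_multiplier=1.1)
    print(f"batch_size={batch_size:<4} noise std in averaged gradient: {std:.4f}")
```

Under these assumptions, the noise left in the averaged gradient shrinks as the batch grows, and raising the clipping threshold keeps more of each gradient but increases the injected noise by the same factor.
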