Safe Synthesizer 101


Learn the fundamentals of NeMo Safe Synthesizer by creating your first Safe Synthesizer job using provided defaults. In this tutorial, you’ll upload sample customer data, replace personally identifiable information, fine-tune a model, generate synthetic records, and review the evaluation report.

Prerequisites

Before you begin, make sure that you have:

  • Access to a deployment of NeMo Safe Synthesizer (see getting-started)
  • An NVIDIA GPU with 80 GB+ VRAM. Safe Synthesizer requires GPU access for model training, even when using remote inference for other services. Verify with nvidia-smi.
  • Python environment with nemo-platform SDK installed
  • Basic understanding of Python and pandas

What You’ll Learn

By the end of this tutorial, you’ll understand how to:

  • Upload datasets for processing
  • Run Safe Synthesizer jobs using the Python SDK
  • Track job progress and retrieve results
  • Interpret evaluation reports

Step 1: Install the SDK

Install the NeMo Platform SDK with Safe Synthesizer support. Run the following command in a terminal (shell):

if command -v uv &> /dev/null; then
    uv pip install "nemo-platform[all]" kagglehub matplotlib
else
    pip install "nemo-platform[all]" kagglehub matplotlib
fi
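
To confirm the install worked before moving on, a small check like the following can help. It is a minimal sketch that uses only the standard library and assumes the distribution names match the ones in the install command above; `installed_version` is a hypothetical helper, not part of the SDK.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(dist_name: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None


# Distribution names are taken from the install command above.
for name in ("nemo-platform", "kagglehub", "matplotlib"):
    found = installed_version(name)
    print(f"{name}: {found or 'NOT INSTALLED'}")
```

If any package is reported as missing, rerun the install command in the same environment that your notebook or script uses.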

Step 2: Configure the Client

Set up the client to connect to your Safe Synthesizer deployment:

import os
from nemo_platform import NeMoPlatform

# Configure the client
client = NeMoPlatform(
    base_url=os.environ.get("NMP_BASE_URL", "http://localhost:8080"),
    workspace="default",
    access_token=os.environ.get("NMP_ACCESS_TOKEN"),
)
# set to none by default, update it if you need an hf_token
hf_secret_name = None

print("✅ Client configured successfully")

Step 3: Verify Service Connection

Test the connection to ensure Safe Synthesizer is accessible:

try:
    jobs = client.safe_synthesizer.jobs.list(workspace="default")
    print("✅ Successfully connected to Safe Synthesizer service")
    print(f"Found {len(jobs.data)} existing jobs")
except Exception as e:
    print(f"❌ Cannot connect to service: {e}")
    print("Please verify base_url and service status")

Step 4: Load Sample Dataset

For this tutorial, we’ll use a women’s clothing reviews dataset from Kaggle that contains some PII:

import pandas as pd
import kagglehub  # type: ignore[import-not-found]

# Download the dataset
path = kagglehub.dataset_download("nicapotato/womens-ecommerce-clothing-reviews")
df = pd.read_csv(f"{path}/Womens Clothing E-Commerce Reviews.csv", index_col=0)

print(f"✅ Loaded dataset with {len(df)} records")
print("\nDataset preview:")
print(df.head())

Dataset details:

  • Contains customer reviews of women’s clothing
  • Includes age, product category, rating, and review text
  • Some reviews contain PII like height, weight, age, and location
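
To get a feel for the kind of free-text PII described above, you can scan a few reviews yourself before running a job. The sketch below is illustration only: the regular expressions and the `flag_reviews` helper are hypothetical, they are not the detectors Safe Synthesizer uses, and they will miss many phrasings.

```python
import re

# Hypothetical patterns for illustration only; not Safe Synthesizer's detectors.
PATTERNS = {
    "height": re.compile(r"\b[4-6]'\s?\d{1,2}\b"),
    "weight": re.compile(r"\b\d{2,3}\s?(?:lbs?|pounds)\b", re.IGNORECASE),
}


def flag_reviews(reviews):
    """Return {row_index: [matched kinds]} for reviews that match any pattern."""
    flagged = {}
    for idx, text in enumerate(reviews):
        if not isinstance(text, str):  # skip missing values such as None or NaN
            continue
        kinds = [kind for kind, pattern in PATTERNS.items() if pattern.search(text)]
        if kinds:
            flagged[idx] = kinds
    return flagged


# Made-up sample reviews, not rows from the Kaggle dataset.
sample = [
    "I'm 5'4\" and 120 lbs, and the medium fits well.",
    "Lovely color, true to size.",
]
print(flag_reviews(sample))
```

With the tutorial DataFrame, and assuming the review column is named "Review Text", you could call `flag_reviews(df["Review Text"].tolist())` to spot-check a handful of rows.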

Step 5: Configure Column Classification

Before running jobs, set up column classification for accurate PII detection.

Column classification uses an LLM to automatically detect column types and improve PII detection accuracy. Without this setup, you may see classification errors and reduced detection quality.

# Use the pre-configured NVIDIA Build model provider
# This provider is set up automatically during platform deployment
provider_name = os.environ.get("NSS_CLASSIFY_MODEL_PROVIDER", "default/nvidia-build")
if "/" in provider_name:
    provider_workspace, provider_id = provider_name.split("/", 1)
    client.inference.providers.retrieve(provider_id, workspace=provider_workspace)
else:
    provider_id = provider_name
    client.inference.providers.retrieve(provider_id)
print(f"✅ Using model provider: {provider_name}")

If you prefer not to send column data to build.nvidia.com, you can deploy your own LLM and create a custom model provider. Pass the fully-qualified provider name (workspace/provider-name) to .with_classify_model_provider() instead.
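
The workspace/provider-name format can be handled in one place if you switch between providers often. The following is a minimal sketch of the same splitting logic used in this step; `split_provider_name` and its `default_workspace` fallback are assumptions made for illustration, not SDK functions.

```python
from typing import Tuple


def split_provider_name(name: str, default_workspace: str = "default") -> Tuple[str, str]:
    """Split 'workspace/provider-name' into (workspace, provider_id).

    A bare provider name falls back to default_workspace, which is an
    assumption of this sketch rather than documented SDK behavior.
    """
    if "/" in name:
        workspace, provider_id = name.split("/", 1)
    else:
        workspace, provider_id = default_workspace, name
    if not workspace or not provider_id:
        raise ValueError(f"Invalid provider name: {name!r}")
    return workspace, provider_id


print(split_provider_name("default/nvidia-build"))
print(split_provider_name("my-llm"))
```

Validating the name early gives a clearer error than a failed lookup later in the job.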


Step 6: HuggingFace Token Usage (Optional)

If you’re using private HuggingFace models or want to avoid rate limits, create a secret for your HuggingFace token:

import os
import time

# Create a unique secret name (use hyphens, not underscores)
hf_secret_name = f"hf-token-{int(time.time())}"
hf_token = os.environ.get("HF_TOKEN")

if hf_token:
    # Store your HuggingFace token as a platform secret
    client.secrets.create(workspace="default", name=hf_secret_name, value=hf_token)
    print(f"✓ Created secret: {hf_secret_name}")
else:
    hf_secret_name = None
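
If you build secret names from user input, normalizing them first avoids the underscore problem mentioned in the comment above. This is a sketch under an assumption: it treats lowercase letters, digits, and hyphens as safe, but the platform's exact naming rules are not stated in this tutorial. `make_secret_name` is a hypothetical helper.

```python
import re
import time
from typing import Optional


def make_secret_name(prefix: str, timestamp: Optional[int] = None) -> str:
    """Build a hyphen-only, timestamp-suffixed secret name.

    Assumes lowercase letters, digits, and hyphens are acceptable; the
    platform's real naming rules may differ.
    """
    ts = int(time.time()) if timestamp is None else timestamp
    cleaned = re.sub(r"[^a-z0-9]+", "-", prefix.lower()).strip("-")
    if not cleaned:
        raise ValueError("prefix must contain at least one letter or digit")
    return f"{cleaned}-{ts}"


print(make_secret_name("hf_token"))
```

Passing an explicit `timestamp` makes the name reproducible, which is convenient in tests.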

Step 7: Create and Run a Safe Synthesizer Job

Use the SafeSynthesizerJobBuilder to configure and create a job:

import pandas as pd
from nemo_safe_synthesizer_plugin.sdk.job_builder import SafeSynthesizerJobBuilder

# Create a project for our jobs (creates if it doesn't exist)
project_name = "test-project"
try:
    client.projects.create(workspace="default", name=project_name)
except Exception:
    pass  # Project may already exist

# Build the job configuration
job_name = f"synthesis-test-{pd.Timestamp.now().strftime('%Y%m%d-%H%M%S')}"
builder = (
    SafeSynthesizerJobBuilder(client)
    .with_data_source(df)
    .with_classify_model_provider(provider_name)  # Enable column classification
    .with_replace_pii()  # Enable PII detection and replacement
    .synthesize()  # Enable synthesis
)

if hf_secret_name:
    # add the token secret if an HF token was specified
    builder = builder.with_hf_token_secret(hf_secret_name)

# Create and start the job
job = builder.create_job(name=job_name, project=project_name)
print(f"✅ Job created: {job.job_name}")

What happens next:

  1. Dataset is uploaded to the fileset storage
  2. PII detection and replacement
  3. Model fine-tuning on your data
  4. Synthetic data generation
  5. Quality and privacy evaluation

Step 8: Monitor Job Progress

Check the job status:

status = job.fetch_status()
print(f"Current status: {status}")

Job States:

  • created: Job has been created
  • pending: Waiting for GPU resources
  • active: Processing your data
  • completed: Finished successfully
  • error: Encountered an error
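
If you want your own timeout or progress messages while a job moves through these states, a polling loop is one option. The sketch below is generic: `poll_until_done` is a hypothetical helper that takes any zero-argument callable, and it assumes the status converts to one of the bare state names listed above (plus cancelled) when passed to `str()`.

```python
import time
from typing import Callable

# Terminal states; "cancelled" is assumed alongside the states listed above.
TERMINAL_STATES = {"completed", "error", "cancelled"}


def poll_until_done(
    fetch_status: Callable[[], object],
    interval_s: float = 30.0,
    timeout_s: float = 3600.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Call fetch_status repeatedly until a terminal state or the timeout."""
    deadline = time.monotonic() + timeout_s
    while True:
        state = str(fetch_status()).lower()
        print(f"Current status: {state}")
        if state in TERMINAL_STATES:
            return state
        if time.monotonic() >= deadline:
            raise TimeoutError(f"Job still '{state}' after {timeout_s} seconds")
        sleep(interval_s)


# Simulated status sequence so the sketch runs without a deployment.
fake_states = iter(["created", "pending", "active", "completed"])
final = poll_until_done(lambda: next(fake_states), interval_s=0, sleep=lambda s: None)
print(f"Final state: {final}")
```

Against a real job you would pass `job.fetch_status` as the callable, after checking that its return value has the string form this sketch assumes.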

View real-time logs:

job.print_logs()

Wait for completion (this may take 15-30 minutes depending on data size):

print("⏳ Waiting for job to complete...")
try:
    job.wait_for_completion()
    print("✅ Job completed!")
except RuntimeError as e:
    print(f"❌ Job failed: {e}")
    raise

wait_for_completion() raises RuntimeError if the job ends in an error or cancelled state. Check the printed status output and logs above for the cause.

If the job fails with “No GPUs available on this system”, ensure your quickstart is configured with GPU access:

nemo setup --start-services

Verify GPU access with nvidia-smi on the host.
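
The nvidia-smi check can also be scripted from Python on the host. This is a minimal sketch: it uses nvidia-smi's CSV query mode, which reports memory in MiB, and `MIN_VRAM_MIB` is a rough stand-in for the 80 GB+ prerequisite chosen for this example rather than a threshold the service enforces.

```python
import shutil
import subprocess
from typing import List, Tuple

# Rough stand-in for "80 GB+"; an assumption, since cards marketed as 80 GB
# report slightly different MiB totals.
MIN_VRAM_MIB = 80_000


def parse_gpu_memory(csv_text: str) -> List[Tuple[str, int]]:
    """Parse 'name, memory_mib' lines from nvidia-smi's noheader,nounits CSV."""
    gpus = []
    for line in csv_text.strip().splitlines():
        name, mem = line.rsplit(",", 1)
        gpus.append((name.strip(), int(mem)))
    return gpus


def query_gpus() -> List[Tuple[str, int]]:
    """Return visible GPUs, or an empty list if nvidia-smi is missing or fails."""
    if shutil.which("nvidia-smi") is None:
        return []
    try:
        result = subprocess.run(
            ["nvidia-smi", "--query-gpu=name,memory.total", "--format=csv,noheader,nounits"],
            capture_output=True, text=True, check=True,
        )
    except (subprocess.CalledProcessError, OSError):
        return []
    return parse_gpu_memory(result.stdout)


gpus = query_gpus()
if not gpus:
    print("No GPUs visible to nvidia-smi")
for name, mem_mib in gpus:
    verdict = "OK" if mem_mib >= MIN_VRAM_MIB else "below the assumed minimum"
    print(f"{name}: {mem_mib} MiB ({verdict})")
```

An empty result here points to a driver or container-runtime problem on the host rather than a Safe Synthesizer configuration issue.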


Step 9: Retrieve Synthetic Data

Once the job is complete, retrieve the generated synthetic data:

synthetic_df = job.fetch_data()

print(f"✅ Generated {len(synthetic_df)} synthetic records")
print("\nSynthetic data preview:")
print(synthetic_df.head())

Compare with original data structure:

print("\n📊 Data Comparison:")
print(f"Original shape: {df.shape}")
print(f"Synthetic shape: {synthetic_df.shape}")
print(f"\nOriginal columns: {list(df.columns)}")
print(f"Synthetic columns: {list(synthetic_df.columns)}")
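
Comparing the two column lists by eye gets tedious on wide datasets. The following sketch reports differences explicitly; `compare_columns` is a hypothetical helper that works on any two sequences of column names, and the sample names are made up.

```python
from typing import Dict, Iterable


def compare_columns(original: Iterable[str], synthetic: Iterable[str]) -> Dict[str, object]:
    """Report columns missing from, or added to, the synthetic data."""
    orig, synth = list(original), list(synthetic)
    return {
        "missing": [c for c in orig if c not in synth],
        "extra": [c for c in synth if c not in orig],
        "same_order": orig == synth,
    }


# Made-up column names for illustration.
report = compare_columns(["Age", "Rating", "Review Text"], ["Age", "Review Text"])
print(report)
```

With the tutorial objects you could call `compare_columns(df.columns, synthetic_df.columns)`.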

Step 10: Review Evaluation Report

Fetch the job summary with high-level metrics:

summary = job.fetch_summary()

print("📈 Evaluation Summary:")
print(f" Synthetic Quality Score: {summary.synthetic_data_quality_score}")
print(f" Data Privacy Score: {summary.data_privacy_score}")
print(f" Valid Records: {summary.num_valid_records}/{summary.num_prompts}")

Download the full HTML evaluation report:

job.save_report("./evaluation_report.html")
print("✅ Evaluation report saved to evaluation_report.html")

If using Jupyter, display the report inline:

job.display_report_in_notebook()

The evaluation report includes:

  • Synthetic Quality Score (SQS): Measures data utility
    • Column correlation stability
    • Deep structure stability
    • Column distribution stability
    • Text semantic similarity
    • Text structure similarity
  • Data Privacy Score (DPS): Measures privacy protection
    • Membership inference protection
    • Attribute inference protection
    • PII replay detection

Understanding the Results

Interpreting Scores

The evaluation report contains two high-level scores: Synthetic Quality Score (SQS) and Data Privacy Score (DPS). Both are measured out of 10, and higher is better.
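
If you run several jobs, it can help to check each score against a minimum that you choose for your own use case. The sketch below only encodes the 0 to 10 scale described above; `check_score` is a hypothetical helper, the minimums and sample values are arbitrary, and treating a missing score as None is an assumption of this example.

```python
from typing import Optional


def check_score(name: str, score: Optional[float], minimum: float) -> str:
    """Format a 0-10 score and compare it with a caller-chosen minimum."""
    if score is None:
        # Assumption: a score that was not computed is represented as None.
        return f"{name}: not available"
    if not 0 <= score <= 10:
        raise ValueError(f"{name} should be between 0 and 10, got {score}")
    verdict = "meets" if score >= minimum else "is below"
    return f"{name}: {score:.1f}/10 ({verdict} your minimum of {minimum:.1f})"


# Arbitrary sample values and minimums, for illustration only.
print(check_score("SQS", 8.4, 7.0))
print(check_score("DPS", 6.1, 8.0))
```

With a real job you would pass `summary.synthetic_data_quality_score` and `summary.data_privacy_score` as the scores.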


Next Steps

Now that you’ve completed your first Safe Synthesizer job, you can try the local CLI path (see Local and Subprocess Execution) or explore the options below.

Try These Next

  1. Customize PII replacement: Configure specific entity types and replacement strategies
  2. Enable differential privacy: Add formal privacy guarantees with epsilon and delta parameters
  3. Tune generation parameters: Adjust temperature and sampling for better synthetic data
  4. Use your own data: Replace the sample dataset with your sensitive data

Cleanup

List and optionally delete completed jobs:

# List all jobs
all_jobs = client.safe_synthesizer.jobs.list(workspace="default")
print(f"Total jobs: {len(all_jobs.data)}")

# Delete this job (optional)
# client.safe_synthesizer.jobs.delete(job.job_name, workspace="default")
# print(f"✅ Job {job.job_name} deleted")

Troubleshooting

Common Issues

Connection errors:

  • Verify NMP_BASE_URL is correct
  • Check that the Safe Synthesizer service is running
  • Ensure network connectivity
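
A malformed NMP_BASE_URL is an easy mistake to rule out before looking at the network. The following sketch checks only the shape of the URL using the standard library; `check_base_url` is a hypothetical helper and it does not contact the service.

```python
import os
from typing import List
from urllib.parse import urlparse


def check_base_url(url: str) -> List[str]:
    """Return a list of problems found in the URL's shape (empty if none)."""
    problems = []
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https"):
        problems.append("URL should start with http:// or https://")
    if not parsed.hostname:
        problems.append("URL has no host name")
    return problems


# Same default as the client configuration in Step 2.
base_url = os.environ.get("NMP_BASE_URL", "http://localhost:8080")
issues = check_base_url(base_url)
print(f"{base_url}: {'looks well-formed' if not issues else '; '.join(issues)}")
```

A URL that passes this check can still be unreachable, so follow up with the connection test from Step 3.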

Job failures:

  • Check logs with job.print_logs()
  • Verify dataset format (CSV with proper columns)
  • Ensure sufficient GPU memory for model size

Slow performance:

  • Reduce dataset size for testing
  • Use a smaller model (adjust training.pretrained_model)
  • Check GPU availability

For local CLI failures, see Local and Subprocess Execution.

Error: “Dataset must have at least 200 records to use holdout.”

This occurs when synthesis is enabled on datasets with fewer than 200 records. Holdout validation splits your data into training and test sets to measure quality, requiring a minimum dataset size.

Solution:

builder = (
    SafeSynthesizerJobBuilder(client)
    .with_data_source(df)
    .with_data(holdout=0)  # Disable holdout for small datasets
    .with_replace_pii()
    .synthesize()
)

Disabling holdout means you won’t get quality metrics like privacy scores and synthetic data quality scores. For production use, ensure your dataset has at least 200 records.
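
You can check the record count up front instead of waiting for the job to fail. This is a minimal sketch: the 200-record threshold comes from the error message above, and `can_use_holdout` is a hypothetical helper, not an SDK function.

```python
# Threshold taken from the error message quoted above.
MIN_RECORDS_FOR_HOLDOUT = 200


def can_use_holdout(num_records: int) -> bool:
    """Return True if the dataset is large enough for holdout validation."""
    return num_records >= MIN_RECORDS_FOR_HOLDOUT


for n in (150, 200, 5000):
    action = "keep holdout enabled" if can_use_holdout(n) else "disable holdout or add data"
    print(f"{n} records: {action}")
```

In the tutorial flow you would call `can_use_holdout(len(df))` before building the job and add `.with_data(holdout=0)` only when it returns False.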