Safe Synthesizer 101
Learn the fundamentals of NeMo Safe Synthesizer by creating your first Safe Synthesizer job using provided defaults. In this tutorial, you’ll upload sample customer data, replace personally identifiable information, fine-tune a model, generate synthetic records, and review the evaluation report.
Prerequisites
Before you begin, make sure that you have:
- Access to a deployment of NeMo Safe Synthesizer (see getting-started)
- An NVIDIA GPU with 80 GB+ VRAM. Safe Synthesizer requires GPU access for model training, even when using remote inference for other services. Verify with nvidia-smi.
- Python environment with the nemo-platform SDK installed
- Basic understanding of Python and pandas
What You’ll Learn
By the end of this tutorial, you’ll understand how to:
- Upload datasets for processing
- Run Safe Synthesizer jobs using the Python SDK
- Track job progress and retrieve results
- Interpret evaluation reports
Step 1: Install the SDK
Install the NeMo Platform SDK with Safe Synthesizer support. Run the following command in a terminal (shell):
Step 2: Configure the Client
Set up the client to connect to your Safe Synthesizer deployment:
Step 3: Verify Service Connection
Test the connection to ensure Safe Synthesizer is accessible:
Step 4: Load Sample Dataset
For this tutorial, we’ll use a women’s clothing reviews dataset from Kaggle that contains some PII:
Dataset details:
- Contains customer reviews of women’s clothing
- Includes age, product category, rating, and review text
- Some reviews contain PII like height, weight, age, and location
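To get a feel for the kind of free-text PII described above, the following is a rough sketch that flags reviews mentioning a height or weight. The miniature DataFrame, its column names, and the regular expressions are all assumptions made for illustration. This is a simple heuristic for eyeballing data and is not how Safe Synthesizer detects PII.

```python
import re

import pandas as pd

# Hypothetical miniature stand-in for the reviews dataset; column names are assumptions.
reviews = pd.DataFrame(
    {
        "age": [34, 52, 27],
        "rating": [5, 3, 4],
        "review_text": [
            "I'm 5'4\" and 125 lbs, and the medium fits perfectly.",
            "Lovely fabric, but the color faded after one wash.",
            "At 5 ft 9 I usually need a tall size; this one worked.",
        ],
    }
)

# Rough patterns for height (5'4", 5 ft 9) and weight (125 lbs) mentions.
HEIGHT = re.compile(r"\b\d\s*(?:'|ft)\s*\d{1,2}\b", re.IGNORECASE)
WEIGHT = re.compile(r"\b\d{2,3}\s*(?:lbs?|pounds)\b", re.IGNORECASE)


def flag_body_measurements(text: str) -> bool:
    """Return True if the text appears to mention a height or weight."""
    return bool(HEIGHT.search(text) or WEIGHT.search(text))


reviews["mentions_measurements"] = reviews["review_text"].apply(flag_body_measurements)
print(reviews[["review_text", "mentions_measurements"]])
```

A quick scan like this only shows where obvious measurements appear; it will miss many phrasings and is no substitute for the PII replacement step in the job.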
Step 5: Configure Column Classification
Before running jobs, set up column classification for accurate PII detection.
Column classification uses an LLM to automatically detect column types and improve PII detection accuracy. Without this setup, you may see classification errors and reduced detection quality.
If you prefer not to send column data to build.nvidia.com, you can deploy your own LLM and create a custom model provider. Pass the fully-qualified provider name (workspace/provider-name) to .with_classify_model_provider() instead.
Step 6: HuggingFace Token Usage (Optional)
If you’re using private HuggingFace models or want to avoid rate limits, create a secret for your HuggingFace token:
Step 7: Create and Run a Safe Synthesizer Job
Use the SafeSynthesizerJobBuilder to configure and create a job:
What happens next:
- The dataset is uploaded to fileset storage
- PII is detected and replaced
- A model is fine-tuned on your data
- Synthetic data is generated
- Quality and privacy are evaluated
Step 8: Monitor Job Progress
Check the job status:
Job States:
- created: Job has been created
- pending: Waiting for GPU resources
- active: Processing your data
- completed: Finished successfully
- error: Encountered an error
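The states above split naturally into "still running" and "finished". The sketch below shows one way to poll until a finished state is reached. The helper names (is_terminal, poll_until_terminal) and the get_state callable are hypothetical and not part of the SDK, and the cancelled state is taken from the wait_for_completion() note later in this step. The usage lines simulate a job with a fixed sequence of states.

```python
import time
from typing import Callable

# States listed in this tutorial; "cancelled" appears in the wait_for_completion() note.
TERMINAL_STATES = {"completed", "error", "cancelled"}
IN_PROGRESS_STATES = {"created", "pending", "active"}


def is_terminal(state: str) -> bool:
    """Return True once a job can no longer change state."""
    return state in TERMINAL_STATES


def poll_until_terminal(
    get_state: Callable[[], str],
    interval_s: float = 30.0,
    max_polls: int = 120,
) -> str:
    """Call get_state() until it reports a terminal state or polls run out."""
    for _ in range(max_polls):
        state = get_state()
        if state not in TERMINAL_STATES | IN_PROGRESS_STATES:
            raise ValueError(f"Unexpected job state: {state!r}")
        if is_terminal(state):
            return state
        time.sleep(interval_s)
    raise TimeoutError(f"Job still running after {max_polls} polls")


# Usage with a simulated sequence of states instead of a real job.
simulated = iter(["created", "pending", "active", "active", "completed"])
final_state = poll_until_terminal(lambda: next(simulated), interval_s=0.0)
print(final_state)
```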
View real-time logs:
Wait for completion (this may take 15-30 minutes depending on data size):
wait_for_completion() raises RuntimeError if the job ends in an error or cancelled state. Check the printed status output and logs above for the cause.
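The error-handling pattern this implies can be sketched as follows. StubJob is a hypothetical stand-in written for this example, not the SDK job class. Only the method names wait_for_completion() and print_logs() come from this tutorial, and their exact signatures in the SDK are assumed here.

```python
class StubJob:
    """Hypothetical stand-in for a job object; not the SDK class."""

    def __init__(self, final_state: str) -> None:
        self.final_state = final_state

    def wait_for_completion(self) -> None:
        # Mirrors the documented behavior: raise on error or cancelled.
        if self.final_state in {"error", "cancelled"}:
            raise RuntimeError(f"Job ended in state: {self.final_state}")

    def print_logs(self) -> None:
        print(f"[logs] final state was {self.final_state}")


def run_to_completion(job) -> bool:
    """Wait for the job; on failure print the logs and return False."""
    try:
        job.wait_for_completion()
    except RuntimeError as exc:
        print(f"Job failed: {exc}")
        job.print_logs()
        return False
    return True


print(run_to_completion(StubJob("completed")))
print(run_to_completion(StubJob("error")))
```

Catching the exception at the call site keeps a notebook or script from stopping before the logs have been printed.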
If the job fails with “No GPUs available on this system”, ensure your quickstart is configured with GPU access:
Verify GPU access with nvidia-smi on the host.
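If you want to run that check from Python, a minimal sketch is shown below. It calls nvidia-smi with its CSV query options and reports total memory per GPU. The function names and the APPROX_MIN_MIB threshold are assumptions for illustration: 80 GB cards report slightly different totals depending on the model, so the cutoff is deliberately approximate.

```python
import shutil
import subprocess

# Approximate cutoff for an "80 GB" card, in MiB; an assumption, not an official limit.
APPROX_MIN_MIB = 80_000


def parse_memory_mib(output: str) -> list[int]:
    """Parse one memory.total value (MiB) per line, skipping non-numeric lines."""
    return [int(line.strip()) for line in output.splitlines() if line.strip().isdigit()]


def gpu_memory_mib() -> list[int]:
    """Return total memory per GPU in MiB, or an empty list if nvidia-smi is unavailable."""
    if shutil.which("nvidia-smi") is None:
        return []
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.total", "--format=csv,noheader,nounits"],
        capture_output=True,
        text=True,
        check=False,
    )
    if result.returncode != 0:
        return []
    return parse_memory_mib(result.stdout)


memory = gpu_memory_mib()
if not memory:
    print("No GPU visible: nvidia-smi is missing or returned an error")
else:
    for index, mib in enumerate(memory):
        status = "ok" if mib >= APPROX_MIN_MIB else "below the 80 GB guideline"
        print(f"GPU {index}: {mib} MiB ({status})")
```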
Step 9: Retrieve Synthetic Data
Once the job is complete, retrieve the generated synthetic data:
Compare with original data structure:
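One possible way to make this comparison with pandas is sketched below. The compare_structure helper and the two small DataFrames are hypothetical and stand in for your original and synthetic data; the sketch does not use any Safe Synthesizer API.

```python
import pandas as pd


def compare_structure(original: pd.DataFrame, synthetic: pd.DataFrame) -> dict:
    """Summarize column and dtype differences between two DataFrames."""
    shared = [c for c in original.columns if c in synthetic.columns]
    return {
        "missing_in_synthetic": [c for c in original.columns if c not in synthetic.columns],
        "extra_in_synthetic": [c for c in synthetic.columns if c not in original.columns],
        "dtype_mismatches": {
            c: (str(original[c].dtype), str(synthetic[c].dtype))
            for c in shared
            if original[c].dtype != synthetic[c].dtype
        },
        "row_counts": (len(original), len(synthetic)),
    }


# Hypothetical data: the synthetic frame stores age as float instead of int.
original_df = pd.DataFrame(
    {"age": [34, 52], "rating": [5, 3], "review_text": ["Great fit", "Too small"]}
)
synthetic_df = pd.DataFrame(
    {"age": [41.0, 29.0, 36.0], "rating": [4, 5, 2], "review_text": ["Nice", "Runs large", "Soft"]}
)

report = compare_structure(original_df, synthetic_df)
print(report)
```

A dtype mismatch such as an integer column coming back as float is worth checking before feeding synthetic data into code written for the original schema.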
Step 10: Review Evaluation Report
Fetch the job summary with high-level metrics:
Download the full HTML evaluation report:
If using Jupyter, display the report inline:
The evaluation report includes:
- Synthetic Quality Score (SQS): Measures data utility
  - Column correlation stability
  - Deep structure stability
  - Column distribution stability
  - Text semantic similarity
  - Text structure similarity
- Data Privacy Score (DPS): Measures privacy protection
  - Membership inference protection
  - Attribute inference protection
  - PII replay detection
Understanding the Results
Interpreting Scores
The evaluation report contains two high-level scores: Synthetic Quality Score (SQS) and Data Privacy Score (DPS). Both are measured out of 10, and higher is better.
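If you track scores across runs, a small check like the one below can flag results that fall short of a bar you set yourself. The check_scores function, the example score values, and the minimums of 7.0 are all hypothetical. The tutorial does not define pass or fail thresholds, and the lower bound of 0 is assumed from "out of 10".

```python
def check_scores(sqs: float, dps: float, min_sqs: float, min_dps: float) -> list[str]:
    """Return messages for scores that fall below caller-chosen minimums."""
    messages = []
    for name, value, minimum in (("SQS", sqs, min_sqs), ("DPS", dps, min_dps)):
        # Scores are reported out of 10; a lower bound of 0 is assumed.
        if not 0 <= value <= 10:
            raise ValueError(f"{name} must be between 0 and 10, got {value}")
        if value < minimum:
            messages.append(f"{name} {value:.1f}/10 is below the chosen minimum {minimum:.1f}")
    return messages


# Made-up example values, not results from a real job.
issues = check_scores(sqs=8.2, dps=6.5, min_sqs=7.0, min_dps=7.0)
print(issues or "Both scores meet the chosen minimums")
```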
Next Steps
Now that you’ve completed your first Safe Synthesizer job, try the local CLI path:
- Local and Subprocess Execution - run Safe Synthesizer directly on a host GPU
- Getting Started - review local runtime prerequisites
Try These Next
- Customize PII replacement: Configure specific entity types and replacement strategies
- Enable differential privacy: Add formal privacy guarantees with epsilon and delta parameters
- Tune generation parameters: Adjust temperature and sampling for better synthetic data
- Use your own data: Replace the sample dataset with your sensitive data
Cleanup
List and optionally delete completed jobs:
Troubleshooting
Common Issues
Connection errors:
- Verify NMP_BASE_URL is correct
- Check that the Safe Synthesizer service is running
- Ensure network connectivity
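The first of these checks can be partly automated. The sketch below assumes NMP_BASE_URL is exposed as an environment variable, which may not match your setup; if you define it as a Python variable instead, pass that value directly. The validate_base_url helper is hypothetical and only checks that the URL is well formed, not that the service is reachable.

```python
import os
from urllib.parse import urlparse


def validate_base_url(url):
    """Return a list of problems found in the URL; empty means it looks well formed."""
    if not url:
        return ["NMP_BASE_URL is not set"]
    parsed = urlparse(url)
    problems = []
    if parsed.scheme not in ("http", "https"):
        problems.append(f"scheme should be http or https, got {parsed.scheme!r}")
    if not parsed.hostname:
        problems.append("no hostname found")
    return problems


# Check the environment value, then a deliberately malformed example.
print(validate_base_url(os.environ.get("NMP_BASE_URL")))
print(validate_base_url("localhost:8080"))
```

A missing scheme, as in the second call, is an easy mistake to make when copying a host and port from another tool.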
Job failures:
- Check logs with job.print_logs()
- Verify dataset format (CSV with proper columns)
- Ensure sufficient GPU memory for model size
Slow performance:
- Reduce dataset size for testing
- Use a smaller model (adjust training.pretrained_model)
- Check GPU availability
For local CLI failures, see Local and Subprocess Execution.
Error: “Dataset must have at least 200 records to use holdout.”
This occurs when synthesis is enabled on datasets with fewer than 200 records. Holdout validation splits your data into training and test sets to measure quality, requiring a minimum dataset size.
Solution:
Disabling holdout means you won’t get quality metrics like privacy scores and synthetic data quality scores. For production use, ensure your dataset has at least 200 records.
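Before submitting a job, a quick row count can tell you whether the dataset meets this minimum. The sketch below uses plain pandas. The can_use_holdout helper and the example DataFrames are hypothetical; only the figure of 200 records comes from the error message above.

```python
import pandas as pd

MIN_HOLDOUT_RECORDS = 200  # minimum quoted in the error message above


def can_use_holdout(df: pd.DataFrame) -> bool:
    """Return True if the dataset has enough rows for holdout validation."""
    return len(df) >= MIN_HOLDOUT_RECORDS


# Hypothetical datasets on either side of the limit.
small = pd.DataFrame({"rating": range(150)})
large = pd.DataFrame({"rating": range(500)})
print(can_use_holdout(small), can_use_holdout(large))
```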