PII Replacement
PII (Personally Identifiable Information) replacement is a critical privacy protection step that detects and replaces sensitive information in your datasets before synthesis. Because the replaced values never reach training, the model has no chance of learning the most sensitive information, such as names, addresses, and other identifiers.
How It Works
The PII replacement pipeline operates in multiple stages:
- Detection: Identifies PII entities using configurable detection methods
- Classification: Categorizes detected entities by type (name, email, address, and so on)
- Transformation: Replaces or redacts PII using configurable rules
- Validation: Verifies that sensitive information has been properly handled
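As a rough illustration of how the four stages fit together, the following minimal sketch uses only the Python standard library. It is not the NeMo Safe Synthesizer implementation: the `PATTERNS` table, the `Entity` class, and all function names are hypothetical, and the sketch assumes detected spans do not overlap.

```python
import re
from dataclasses import dataclass

# Hypothetical patterns for illustration only; a real pipeline would use
# its configured detectors instead of these two regexes.
PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}

@dataclass
class Entity:
    entity_type: str  # classification label, for example "email"
    start: int        # span start offset in the source text
    end: int          # span end offset (exclusive)
    text: str         # the matched value

def detect_and_classify(text: str) -> list[Entity]:
    """Detection + classification: find spans and label each with its type."""
    found = []
    for entity_type, pattern in PATTERNS.items():
        for m in pattern.finditer(text):
            found.append(Entity(entity_type, m.start(), m.end(), m.group()))
    return sorted(found, key=lambda e: e.start)

def transform(text: str, entities: list[Entity]) -> str:
    """Transformation: redact each span with a placeholder token."""
    out, cursor = [], 0
    for e in entities:  # assumes spans are sorted and non-overlapping
        out.append(text[cursor:e.start])
        out.append(f"<{e.entity_type.upper()}>")
        cursor = e.end
    out.append(text[cursor:])
    return "".join(out)

def validate(text: str) -> bool:
    """Validation: confirm that no known pattern still matches."""
    return not any(p.search(text) for p in PATTERNS.values())

record = "Contact jane@example.com, SSN 123-45-6789."
clean = transform(record, detect_and_classify(record))
print(clean)            # Contact <EMAIL>, SSN <SSN>.
print(validate(clean))  # True
```

Keeping validation as a separate pass means a missed or partially replaced value surfaces as a failed check rather than silently reaching training data.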
Detection Methods
NeMo Safe Synthesizer supports multiple PII detection approaches:
Nemotron PII Detection
Uses the Nemotron PII model for entity recognition:
- Zero-shot entity detection
- Supports custom entity types
- High accuracy for standard PII categories
- Configurable confidence thresholds
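To show what a confidence threshold does in practice, here is a small sketch that filters scored detections. The `Detection` record, the score values, and the `filter_by_threshold` helper are assumptions made for illustration; they do not reflect the Nemotron PII model's actual output format or the product's configuration keys.

```python
from dataclasses import dataclass

@dataclass
class Detection:
    entity_type: str
    text: str
    score: float  # hypothetical model confidence in [0, 1]

def filter_by_threshold(detections, default=0.5, per_type=None):
    """Keep detections whose score meets the threshold for their type."""
    per_type = per_type or {}
    return [
        d for d in detections
        if d.score >= per_type.get(d.entity_type, default)
    ]

detections = [
    Detection("email", "jane@example.com", 0.98),
    Detection("first_name", "May", 0.41),   # ambiguous: a name or a month
    Detection("ssn", "123-45-6789", 0.45),
]

# Use a lower threshold for high-risk types so that borderline hits are kept.
kept = filter_by_threshold(detections, default=0.5, per_type={"ssn": 0.3})
print([d.entity_type for d in kept])  # ['email', 'ssn']
```

Lowering a threshold favors recall (fewer missed identifiers) at the cost of more false positives; raising it does the opposite.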
LLM Classification
Leverages language models for PII detection:
- Contextual understanding of entities
- Handles complex PII patterns
- Flexible entity definitions
- Configurable prompts and models
Regex Detection
Pattern-based detection for structured PII:
- Fast and deterministic
- Ideal for known formats (SSN, phone numbers)
- Customizable patterns
- Low computational overhead
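The sketch below illustrates pattern-based detection with Python's standard `re` module. The two patterns are simplified, US-centric assumptions chosen for readability, and `find_structured_pii` is a hypothetical helper rather than part of the product.

```python
import re

# Simplified illustrative patterns; real-world formats vary by country
# and data source, so production patterns usually need to be broader.
SSN_RE = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")
PHONE_RE = re.compile(r"\b\d{3}[-. ]\d{3}[-. ]\d{4}\b")

def find_structured_pii(text):
    """Return (label, matched_text, (start, end)) tuples in document order."""
    hits = []
    for label, pattern in (("ssn", SSN_RE), ("phone_number", PHONE_RE)):
        hits.extend((label, m.group(), m.span()) for m in pattern.finditer(text))
    return sorted(hits, key=lambda h: h[2])

sample = "Call 555-867-5309 about SSN 078-05-1120."
for label, value, span in find_structured_pii(sample):
    print(label, value, span)
```

The same input always yields the same matches, which is what makes this approach deterministic and cheap, but it only finds values that follow the formats the patterns anticipate.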
Replacement Strategies
After detection, PII can be handled in multiple ways:
- Replacement: Generate realistic replacements using the Faker library or custom expressions.
- Redaction: Substitute with placeholder tokens.
- Hashing: Convert to a unique digital fingerprint (one-way).
- Custom Rules: Define your own transformation logic.
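The four strategies can be sketched as plain Python functions. This is an illustrative sketch, not the product's transformation API: every function name is hypothetical, and the realistic-replacement step uses a small hard-coded name list as a stand-in for the Faker library so that the example stays standard-library only.

```python
import hashlib
import random

def redact(value: str, entity_type: str) -> str:
    """Redaction: substitute a placeholder token for the value."""
    return f"<{entity_type.upper()}>"

def hash_value(value: str, salt: str = "demo-salt") -> str:
    """Hashing: one-way SHA-256 fingerprint, truncated for readability."""
    digest = hashlib.sha256((salt + value).encode("utf-8")).hexdigest()
    return digest[:16]

def replace_first_name(value: str, seed: int = 0) -> str:
    """Replacement: pick a realistic substitute (stand-in for Faker).

    Seeding with the original value maps the same input to the same
    substitute, which keeps repeated values consistent across rows.
    """
    names = ["Alex", "Jordan", "Sam", "Taylor"]
    rng = random.Random(f"{seed}:{value}")
    return rng.choice(names)

def mask_email(value: str) -> str:
    """Custom rule: keep the first character and the domain."""
    local, _, domain = value.partition("@")
    return f"{local[:1]}***@{domain}"

print(redact("123-45-6789", "ssn"))    # <SSN>
print(hash_value("123-45-6789"))       # 16 hex characters
print(replace_first_name("Jane"))      # one of the stand-in names
print(mask_email("jane@example.com"))  # j***@example.com
```

Note that hashing low-entropy values such as SSNs without a secret salt can be reversed by brute force, so the salt in a real deployment should be kept private.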
Supported Entity Types
Nemotron PII has been specifically fine-tuned to recognize many entity types out of the box, organized by category:
Personal Information
- first_name: Given names
- last_name: Surnames and family names
- name: Full names
- email: Email addresses
- phone_number: Phone numbers in various formats
- fax_number: Fax numbers in various formats
Addresses
- address: Complete physical addresses (for example, 123 Main Street, Anytown, CA 90210)
- street_address: Street addresses (for example, 123 Main Street)
- city: City names
- county: County names
- state: State/province names
- postcode: Postal/ZIP codes
- country: Country names
Personal Identifiers
- ssn: Social Security Numbers
- national_id: National ID numbers
- tax_id: Tax ID numbers
- certificate_license_number: Driver’s license numbers
- unique_identifier: Generic unique IDs
- customer_id: Customer identifiers
- employee_id: Employee identifiers
Financial Information
- credit_debit_card: Credit and debit card numbers
- cvv: Credit card verification codes
- pin: Personal identification numbers
- account_number: Bank account numbers
- bank_routing_number: Bank routing numbers
- swift_bic: SWIFT/BIC codes
- iban: International bank account numbers
Medical Information
- medical_record_number: Medical record numbers
- health_plan_beneficiary_number: Insurance IDs
- biometric_identifier: Biometric data references
Technical Identifiers
- url: Web URLs
- ipv4: IPv4 addresses
- ipv6: IPv6 addresses
- mac_address: Hardware MAC addresses
- api_key: API keys and tokens
- user_name: Usernames
- password: Passwords
- http_cookie: HTTP cookies
- device_identifier: Device IDs
Vehicle Identifiers
- vehicle_identifier: Vehicle identification numbers (VINs)
- license_plate: License plates
Geographic Information
- latitude: Latitude coordinates
- longitude: Longitude coordinates
- coordinate: Coordinate pairs
Quasi Identifiers
- date: Date values
- date_time: Date and time values
- date_of_birth: Birth dates
- time: Time values
- age: Ages
- blood_type: Blood type information
- gender: Gender information
- sexuality: Sexual orientation
- political_view: Political affiliations
- race_ethnicity: Race and ethnicity information
- religious_belief: Religious affiliations
- language: Language preferences
- education_level: Education level
- occupation: Professional titles
- employment_status: Employment information
- company_name: Organization names
Custom Entity Types
Beyond these built-in types, you can define custom entities using:
- Nemotron PII: Fast, accurate zero-shot NER for standard and custom entity types
- Regex: Deterministic pattern matching, best for consistent formats (SSN, credit cards)
- LLM: Contextual understanding, handles complex patterns and ambiguous cases
Example Custom Entity:
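As a hypothetical illustration of the regex approach, the sketch below defines a custom entity for an internal ticket number. It is plain Python, not the actual replace_pii configuration schema; the entity name `ticket_id`, its `TCK-` format, and the `replace_custom` helper are all assumptions.

```python
import re

# Hypothetical custom entity: internal ticket IDs such as "TCK-004211".
CUSTOM_ENTITIES = {
    "ticket_id": re.compile(r"\bTCK-\d{6}\b"),
}

def replace_custom(text: str, entities=CUSTOM_ENTITIES) -> str:
    """Redact every match of each custom entity with a placeholder token."""
    for name, pattern in entities.items():
        text = pattern.sub(f"<{name.upper()}>", text)
    return text

print(replace_custom("See TCK-004211 for details."))
# See <TICKET_ID> for details.
```

A regex suits this case because the format is fixed; an entity with no consistent format would be a better fit for the Nemotron PII or LLM approaches listed above.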
Configuration
PII replacement is configured through the replace_pii section. For the full schema, refer to the reference documentation.
When to Use PII Replacement
Consider using PII replacement when:
- Your data contains names, addresses, or other direct identifiers
- Compliance requires PII removal before processing
- You want to ensure the model cannot memorize sensitive values
- You need to share synthetic data with external parties
PII replacement is always recommended as a preprocessing step before synthesis.
Related Topics
- safe-synthesizer-101: Getting started tutorial with PII replacement
- index: More tutorials