Run an Anonymizer Job
This tutorial walks through the anonymizer.run job: defining a run spec, executing it locally or on the NeMo Platform Jobs worker, and loading the parquet artifacts it produces.
For detection, rewrite, and replacement strategy details, see the open-source library documentation.
Prerequisites
Complete the tutorial prerequisites, which cover:
- A running NeMo Platform cluster with the nemo anonymizer CLI available (see Setup).
- An inference provider configured (default examples use nvidia-build).
- A fileset named anonymizer-inputs with anonymizer-input.csv uploaded.
What run Does
anonymizer.run executes the full Anonymizer pipeline on every record of an input file and writes the output as job artifacts.
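There are three run commands: run run executes the job in the local CLI process, run submit sends it to the NeMo Platform Jobs worker, and run explain prints the job schema without running anything.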
Job artifacts (under the artifacts/ directory):
Step 1: Build an AnonymizerRequest
AnonymizerRequest contains the execution fields shared by preview and run (config, data, model_configs, and selected_models). A run processes the full input file, so it does not include num_records:
Step 2: Write the Spec to YAML
The CLI run commands read a YAML spec file. Serialize the AnonymizerRequest directly:
Step 3: Run the Job
Choose one execution path. Option A runs in the local CLI process. Option B submits the same request to the NeMo Platform Jobs worker.
Option A: Run Locally
The local job context calls the Anonymizer library's Anonymizer.run(...) in-process, then writes artifacts through the generated local job results manager.
Expected output:
run run does not echo the artifact path on stdout. The local job results manager logs the path to stderr in the form:
Use that path in the next step.
Option B: Submit to the Jobs Worker
To execute the same spec on the NeMo Platform Jobs worker instead of in the CLI process, use run submit:
The command prints the assigned job name. You need that name to poll status and download artifacts in Step 4.
The SDK equivalent is sdk.anonymizer.run(request). It posts the request to the plugin’s /jobs/run endpoint and returns an AnonymizerJobResource:
Compared to run run, the submit path:
- Rejects local file paths in data.source; use a fileset reference (<fileset>#<path>) or an http(s) URL.
- Requires explicit model_configs referencing Inference Gateway providers, because the job runs outside the CLI process and cannot inherit Data Designer's locally-defined providers.
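The first restriction can be checked before submitting. The sketch below is an illustration only, not the plugin's validator: classify_source and check_remote_source are hypothetical helpers, and it assumes that any string that is neither an http(s) URL nor of the form <fileset>#<path> should be treated as a local path.

```python
from urllib.parse import urlparse


def classify_source(source: str) -> str:
    """Classify a data.source string as 'url', 'fileset', or 'local'.

    Hypothetical helper: the real plugin may apply stricter rules.
    """
    parsed = urlparse(source)
    if parsed.scheme in ("http", "https") and parsed.netloc:
        return "url"
    # Assumed fileset form: <fileset>#<path>, with both parts non-empty.
    fileset, sep, path = source.partition("#")
    if sep and fileset and path:
        return "fileset"
    return "local"


def check_remote_source(source: str) -> None:
    """Raise ValueError for a source that a remote submit would reject."""
    if classify_source(source) == "local":
        raise ValueError(
            f"{source!r} looks like a local path; "
            "use <fileset>#<path> or an http(s) URL"
        )


print(classify_source("anonymizer-inputs#anonymizer-input.csv"))  # fileset
print(classify_source("https://example.com/data.csv"))            # url
print(classify_source("./anonymizer-input.csv"))                  # local
```

Running a check like this on the client side gives a faster failure than waiting for the submit call to reject the spec.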
Step 4: Get Results
Option A Results: Local Run
For run run, the result already exists on the local filesystem. Use the artifact directory printed in stderr:
Then load the parquet artifacts from that directory:
The trace dataset (and the dataset itself for annotate / substitute strategies) contains pyarrow-backed struct<entities: list<...>> columns. If you need plain Python dict/list values for JSON output, use pyarrow.parquet:
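A minimal sketch of that conversion follows. parquet_to_records and records_to_json are hypothetical helper names, and the caller supplies the parquet path, since the artifact file names are not assumed here. The sketch relies on pyarrow.parquet.read_table and Table.to_pylist, which return struct columns as dicts and list columns as lists.

```python
import json


def parquet_to_records(path):
    """Read a parquet file into a list of plain Python dicts.

    Struct columns come back as dicts and list columns as lists,
    so the result can be handed to json.dumps.
    """
    # Imported lazily so this module loads even where pyarrow is absent.
    import pyarrow.parquet as pq

    table = pq.read_table(str(path))
    return table.to_pylist()


def records_to_json(records):
    """Serialize records to JSON; default=str covers dates and decimals."""
    return json.dumps(records, default=str, indent=2)
```

Pass the path of a parquet file inside the artifact directory to parquet_to_records, then feed the result to records_to_json.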
Option B Results: Remote Run
For run submit, track the platform job first. The job is ready for artifact download when its status is completed:
To download from the CLI, fetch the artifacts result and extract it:
Then point AnonymizerJobResults at the extracted artifacts directory:
If you used the SDK, use the AnonymizerJobResource methods directly. get_job_status() reads the current status, check_if_complete() tests whether artifacts are ready, wait_until_done() blocks until a terminal state, and download_artifacts() downloads and extracts the result:
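Blocking until a terminal state amounts to a polling loop. The following is a generic sketch of that pattern, not the SDK's implementation: wait_for_status, its parameters, and every status name other than completed are assumptions made for illustration.

```python
import time
from typing import Callable, Iterable


def wait_for_status(
    get_status: Callable[[], str],
    terminal: Iterable[str] = ("completed", "failed", "cancelled"),
    interval: float = 5.0,
    timeout: float = 3600.0,
) -> str:
    """Poll get_status() until it returns a terminal status or time runs out."""
    terminal_states = set(terminal)
    deadline = time.monotonic() + timeout
    while True:
        status = get_status()
        if status in terminal_states:
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(f"job still {status!r} after {timeout} seconds")
        time.sleep(interval)


# Stand-in for a real status call: replays a made-up sequence of statuses.
fake_statuses = iter(["pending", "active", "completed"])
final = wait_for_status(lambda: next(fake_statuses), interval=0.01)
print(final)  # completed
```

With the SDK, prefer the built-in wait_until_done(); a hand-written loop like this is mainly useful when polling through some other status source.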
AnonymizerJobResults exposes load_dataset(), load_trace(), load_failed_records(), and display_record() over the same underlying files. See SDK Resources.
Inspect the Schema Without Running
run explain prints the job key, submit endpoint, and JSON schemas for AnonymizerRequest and the canonical AnonymizerStepConfig:
This is useful when authoring a spec programmatically or wiring the job into another tool.
How the Job Compiles
For each request, the plugin:
- Validates the Anonymizer library AnonymizerConfig.
- Validates the input source (rejects local paths on remote execution; checks fileset refs).
- Validates that selected_models overrides also have model_configs.
- Resolves model_configs providers: locally-defined Data Designer providers first, then Inference Gateway providers. Remote execution (run submit) resolves only through the Inference Gateway.
- Renders a unified model_configs YAML body for the library.
- Stores the resolved providers and YAML in the internal AnonymizerStepConfig consumed by the worker (in-process for run run, or on the Jobs worker for run submit).
For run submit, provider endpoints are re-resolved at runtime so the job uses the in-cluster Inference Gateway address rather than the address captured at submission time.
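The provider resolution order described above can be pictured with a small lookup function. This is a sketch of the precedence rule only: resolve_provider, the dictionary shapes, and the example endpoints are all hypothetical and do not reflect the plugin's internal data structures.

```python
from typing import Mapping, Optional


def resolve_provider(
    name: str,
    gateway: Mapping[str, str],
    local: Optional[Mapping[str, str]] = None,
) -> str:
    """Return the endpoint for a provider name.

    Pass local=None to model remote execution, where only the
    Inference Gateway providers are consulted.
    """
    if local is not None and name in local:
        return local[name]  # locally-defined providers take precedence
    if name in gateway:
        return gateway[name]
    raise KeyError(f"provider {name!r} not found")


# Made-up endpoints for illustration.
local_providers = {"nvidia-build": "https://local.example/v1"}
gateway_providers = {"nvidia-build": "https://gateway.example/v1"}

# Local execution: the locally-defined provider wins.
print(resolve_provider("nvidia-build", gateway_providers, local_providers))
# Remote execution: only the gateway is consulted.
print(resolve_provider("nvidia-build", gateway_providers))
```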
Next Steps
- Iterate faster with preview before scaling to a full job.
- Refer to SDK Resources for AnonymizerJobResource and AnonymizerJobResults details.
- Replacement strategy parameters and rewrite mode are documented in the library docs.