Run an Anonymizer Job | NVIDIA NeMo Platform

This tutorial walks through the anonymizer.run job: defining a run spec, executing it locally or on the NeMo Platform Jobs worker, and loading the parquet artifacts it produces.

For detection, rewrite, and replacement strategy details, see the open-source library documentation.

Prerequisites

Complete the tutorial prerequisites, which cover:

A running NeMo Platform cluster with the nemo anonymizer CLI available (see Setup).
An inference provider configured (default examples use nvidia-build).
A fileset named anonymizer-inputs with anonymizer-input.csv uploaded.

What `run` Does

anonymizer.run executes the full Anonymizer pipeline on every record of an input file and writes the output as job artifacts.

There are three run commands:

Command	Where it runs	Local paths	`model_configs` required	Artifacts
`nemo anonymizer run run`	Local CLI process via generated `is_local` path	Allowed	Optional	Written under `persistent/results/artifacts` locally
`nemo anonymizer run submit`	NeMo Platform Jobs worker	Rejected	Required	Stored in NeMo Platform job artifact storage; pull with `download_artifacts()`
`nemo anonymizer run explain`	Local schema introspection	n/a	n/a	Prints job key, submit endpoint, and input/spec schemas

Job artifacts (under the artifacts/ directory):

File	Description
`dataset.parquet`	User-facing anonymized dataframe (replace/rewrite output).
`trace.parquet`	Internal trace dataframe with detection details.
`metadata.json`	Run metadata (includes the original text column name).
`failed_records.json`	Per-record failures with reasons. Only written when records failed.
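
To confirm that a run produced the files listed above, you can check the artifacts directory directly. The sketch below relies only on the file names from the table; `summarize_artifacts` is a hypothetical helper written for illustration, not part of the plugin.

```python
from pathlib import Path

# File names taken from the artifacts table. failed_records.json is optional
# because it is only written when at least one record failed.
REQUIRED = ("dataset.parquet", "trace.parquet", "metadata.json")
OPTIONAL = ("failed_records.json",)


def summarize_artifacts(artifacts_dir: Path) -> dict:
    """Report which expected artifact files are present and which required ones are missing."""
    present = [name for name in REQUIRED + OPTIONAL if (artifacts_dir / name).is_file()]
    missing = [name for name in REQUIRED if name not in present]
    return {"present": present, "missing": missing}
```

An empty `missing` list means the three always-written files are in place; the presence of `failed_records.json` signals that some records need attention.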

Step 1: Build an `AnonymizerRequest`

AnonymizerRequest contains the execution fields shared by preview and run (config, data, model_configs, and selected_models). A run processes the full input file, so it does not include num_records:

import os
from anonymizer.config.anonymizer_config import AnonymizerConfig
from anonymizer.config.replace_strategies import Redact
from data_designer.config import ModelConfig
from nemo_anonymizer_plugin.app.input import AnonymizerInputSpec
from nemo_anonymizer_plugin.app.task_config import AnonymizerRequest

# Workspace and provider fall back to the defaults used in the prerequisites.
WORKSPACE = os.environ.get("NMP_WORKSPACE", "default")
MODEL_PROVIDER = os.environ.get("NMP_ANON_PROVIDER", "nvidia-build")

config = AnonymizerConfig(
    replace=Redact(format_template="[REDACTED_{label}]"),
)

model_configs = [
    ModelConfig(alias="gliner-pii-detector", provider=MODEL_PROVIDER, model="nvidia/gliner-pii"),
    ModelConfig(alias="gpt-oss-120b", provider=MODEL_PROVIDER, model="openai/gpt-oss-120b"),
    ModelConfig(alias="nemotron-30b-thinking", provider=MODEL_PROVIDER, model="nvidia/nemotron-3-nano-30b-a3b"),
]

# No num_records here: a run processes the full input file.
request = AnonymizerRequest(
    config=config,
    data=AnonymizerInputSpec(
        source=f"fileset://{WORKSPACE}/anonymizer-inputs#anonymizer-input.csv",
        text_column="biography",
        id_column="id",
    ),
    model_configs=model_configs,
)
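
If you build several requests against different filesets, it can help to construct the `source` string in one place. The sketch below mirrors the `fileset://<workspace>/<fileset>#<path>` form used in the request above; `fileset_source` is a hypothetical helper, and the empty-value check is an assumption rather than documented plugin behavior.

```python
def fileset_source(workspace: str, fileset: str, path: str) -> str:
    """Build a source string in the fileset://<workspace>/<fileset>#<path> form."""
    # Assumption: none of the three parts may be empty.
    for name, value in (("workspace", workspace), ("fileset", fileset), ("path", path)):
        if not value:
            raise ValueError(f"{name} must be non-empty")
    return f"fileset://{workspace}/{fileset}#{path}"
```

For example, `fileset_source(WORKSPACE, "anonymizer-inputs", "anonymizer-input.csv")` reproduces the value passed to `AnonymizerInputSpec` above.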

Step 2: Write the Spec to YAML

The CLI run commands read a YAML spec file. Serialize the AnonymizerRequest directly:

import yaml
from pathlib import Path

spec_path = Path("/tmp/anonymizer-run.yaml")
spec_path.write_text(yaml.safe_dump(request.model_dump(mode="json", exclude_none=True)))

Step 3: Run the Job

Choose one execution path. Option A runs in the local CLI process. Option B submits the same request to the NeMo Platform Jobs worker.

Option A: Run Locally

$ nemo anonymizer run run --spec-file /tmp/anonymizer-run.yaml

The local job context runs the Anonymizer library's Anonymizer.run(...) in-process, then writes artifacts through the generated local job results manager.

Expected output:

{"exit_code": 0}

run run does not echo the artifact path on stdout. The local job results manager logs the path to stderr in the form:

Saved result 'artifacts' to file:///.../persistent/results/artifacts

Use that path in the next step.
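
If you capture stderr to a file (for example with `2> run.log`), the artifact path can be pulled out of the log line programmatically. This is a minimal sketch that assumes the log line matches the form shown above; `artifact_path_from_log` is a hypothetical helper, and any prefix the logger adds (timestamps, levels) is ignored by the pattern.

```python
import re
from pathlib import Path
from typing import Optional
from urllib.parse import unquote, urlparse

# Matches the "Saved result '<name>' to file://..." form shown above.
SAVED_RESULT = re.compile(r"Saved result '(?P<name>[^']+)' to (?P<uri>file://\S+)")


def artifact_path_from_log(stderr_text: str, result_name: str = "artifacts") -> Optional[Path]:
    """Return the local path logged for the named result, or None if no line matches."""
    for match in SAVED_RESULT.finditer(stderr_text):
        if match.group("name") == result_name:
            # Convert the file:// URI to a filesystem path, decoding %-escapes.
            return Path(unquote(urlparse(match.group("uri")).path))
    return None
```

A `None` return means the expected line was not found, in which case check the log by hand rather than guessing the directory.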

Option B: Submit to the Jobs Worker

To execute the same spec on the NeMo Platform Jobs worker instead of in the CLI process, use run submit:

$ nemo anonymizer run submit \
>   --spec-file /tmp/anonymizer-run.yaml \
>   --workspace "${NMP_WORKSPACE:-default}" \
>   --base-url "${NMP_BASE_URL:-http://localhost:8080}"

The command prints the assigned job name. You need that name to poll status and download artifacts in Step 4.

The SDK equivalent is sdk.anonymizer.run(request). It posts the request to the plugin’s /jobs/run endpoint and returns an AnonymizerJobResource:

import os
from nemo_platform import NeMoPlatform

# WORKSPACE and request are defined in Step 1.
sdk = NeMoPlatform(
    base_url=os.environ.get("NMP_BASE_URL", "http://localhost:8080"),
    workspace=WORKSPACE,
)
job = sdk.anonymizer.run(request)

Compared to run run, the submit path:

Rejects local file paths in data.source. Use a fileset reference (<fileset>#<path>) or an http(s) URL instead.
Requires explicit model_configs referencing Inference Gateway providers, because the job runs outside the CLI process and cannot inherit Data Designer’s locally-defined providers.
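
Before submitting, you can run a quick client-side check that `data.source` is not a local path. The sketch below is a heuristic based only on the two accepted forms named above (fileset reference or http(s) URL); `is_remote_safe_source` is a hypothetical helper, and the plugin's own validation may be stricter or differ in detail.

```python
from urllib.parse import urlparse

# Schemes treated as acceptable for remote execution in this sketch.
REMOTE_SCHEMES = {"fileset", "http", "https"}


def is_remote_safe_source(source: str) -> bool:
    """Heuristically decide whether a source is a fileset reference or http(s) URL."""
    scheme = urlparse(source).scheme.lower()
    if scheme:
        # Covers fileset:// and http(s):// and rejects file:// or drive letters.
        return scheme in REMOTE_SCHEMES
    # Assumption: a bare "<fileset>#<path>" reference has no path separator
    # in the fileset part and a non-empty path after the "#".
    fileset, sep, path = source.partition("#")
    return bool(sep and fileset and path and "/" not in fileset and "\\" not in fileset)
```

Anything that fails this check, such as `/tmp/anonymizer-input.csv`, is better uploaded to a fileset first or run with `run run` instead.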

Step 4: Get Results

Option A Results: Local Run

For run run, the result already exists on the local filesystem. Use the artifact directory logged to stderr:

$ ARTIFACTS_DIR=/path/to/persistent/results/artifacts
$ ls "$ARTIFACTS_DIR"

Then load the parquet artifacts from that directory:

import json
from pathlib import Path

import pandas as pd

artifacts_dir = Path("/path/to/persistent/results/artifacts")  # from the stderr log

metadata = json.loads((artifacts_dir / "metadata.json").read_text())
dataset = pd.read_parquet(artifacts_dir / "dataset.parquet", dtype_backend="pyarrow")
trace = pd.read_parquet(artifacts_dir / "trace.parquet", dtype_backend="pyarrow")

# failed_records.json is only written when records failed.
failed_path = artifacts_dir / "failed_records.json"
failed_records = json.loads(failed_path.read_text()) if failed_path.exists() else []

print(dataset.head())
print(f"records={len(dataset)} failures={len(failed_records)}")

The trace dataset (and the dataset itself for annotate / substitute strategies) contains pyarrow-backed struct<entities: list<...>> columns. If you need plain Python dict/list values for JSON output, use pyarrow.parquet:

import pyarrow.parquet as pq

# artifacts_dir is defined in the previous snippet.
table = pq.read_table(artifacts_dir / "dataset.parquet")
records = table.slice(0, 5).to_pylist()

Option B Results: Remote Run

For run submit, track the platform job first. The job is ready for artifact download when its status is completed:

$ # Replace with the job name printed by `run submit`.
$ nemo jobs get-status <job-name> --workspace "${NMP_WORKSPACE:-default}"
$ nemo jobs get-logs <job-name> --workspace "${NMP_WORKSPACE:-default}"
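
If you script the CLI path, the status check is usually wrapped in a polling loop. The sketch below is generic: `get_status` is any callable you supply that returns the current status string (for instance, one that shells out to `nemo jobs get-status`), since the CLI output format is not specified here. Only `completed` comes from the text above; `failed` and `cancelled` are assumed terminal states for illustration.

```python
import time
from typing import Callable, Iterable


def wait_for_terminal_status(
    get_status: Callable[[], str],
    terminal: Iterable[str] = ("completed", "failed", "cancelled"),
    interval_s: float = 5.0,
    timeout_s: float = 3600.0,
) -> str:
    """Poll get_status() until it returns a terminal status or the timeout elapses."""
    terminal_set = {status.lower() for status in terminal}
    deadline = time.monotonic() + timeout_s
    while True:
        status = get_status().strip().lower()
        if status in terminal_set:
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(f"job still '{status}' after {timeout_s} seconds")
        time.sleep(interval_s)
```

Download artifacts only when the returned status is `completed`. SDK users get the same behavior from `wait_until_done()`.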

To download from the CLI, fetch the artifacts result and extract it:

$ nemo jobs results download artifacts \
>   --job <job-name> \
>   --workspace "${NMP_WORKSPACE:-default}" \
>   --output-file /tmp/anonymizer-artifacts.tar.gz
$ mkdir -p /tmp/anonymizer-artifacts
$ tar -xzf /tmp/anonymizer-artifacts.tar.gz -C /tmp/anonymizer-artifacts
$ ls /tmp/anonymizer-artifacts/artifacts
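
The extraction step can also be done from Python with the standard library, which is convenient when the download is part of a larger script. This sketch assumes the archive contains a top-level `artifacts/` directory, as the `ls` command above implies; `extract_artifacts` is a hypothetical helper name.

```python
import tarfile
from pathlib import Path


def extract_artifacts(archive: Path, dest: Path) -> Path:
    """Extract a downloaded .tar.gz result and return the artifacts directory inside it."""
    dest.mkdir(parents=True, exist_ok=True)
    # Use the "data" extraction filter where this Python version supports it,
    # which blocks unsafe members such as absolute paths.
    kwargs = {"filter": "data"} if hasattr(tarfile, "data_filter") else {}
    with tarfile.open(archive, mode="r:gz") as tar:
        tar.extractall(dest, **kwargs)
    artifacts_dir = dest / "artifacts"
    if not artifacts_dir.is_dir():
        raise FileNotFoundError(f"no artifacts/ directory found under {dest}")
    return artifacts_dir
```

For example, `extract_artifacts(Path("/tmp/anonymizer-artifacts.tar.gz"), Path("/tmp/anonymizer-artifacts"))` returns the same directory that the shell commands above list.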

Then point AnonymizerJobResults at the extracted artifacts directory:

from pathlib import Path

from nemo_anonymizer_plugin.sdk.job_results import AnonymizerJobResults

results = AnonymizerJobResults(Path("/tmp/anonymizer-artifacts/artifacts"))

dataset = results.load_dataset()
trace   = results.load_trace()
failed  = results.load_failed_records()

If you used the SDK, use the AnonymizerJobResource methods directly. get_job_status() reads the current status, check_if_complete() tests whether artifacts are ready, wait_until_done() blocks until a terminal state, and download_artifacts() downloads and extracts the result:

job = sdk.anonymizer.run(request)

# Non-blocking checks: current status and whether artifacts are ready.
status = job.get_job_status()
is_done = job.check_if_complete()

# Block until a terminal state, then download and extract the result.
job.wait_until_done()
results = job.download_artifacts()

dataset = results.load_dataset()
trace   = results.load_trace()
failed  = results.load_failed_records()

AnonymizerJobResults exposes load_dataset(), load_trace(), load_failed_records(), and display_record() over the same underlying files. See SDK Resources.

Inspect the Schema Without Running

run explain prints the job key, submit endpoint, and JSON schemas for AnonymizerRequest and the canonical AnonymizerStepConfig:

$ nemo anonymizer run explain

This is useful when authoring a spec programmatically or wiring the job into another tool.

How the Job Compiles

For each request, the plugin:

Validates the Anonymizer library AnonymizerConfig.
Validates the input source (rejects local paths on remote execution; checks fileset refs).
Validates that selected_models overrides also have model_configs.
Resolves model_configs providers — locally-defined Data Designer providers first, then Inference Gateway providers. Remote execution (run submit) resolves only through the Inference Gateway.
Renders a unified model_configs YAML body for the library.
Stores the resolved providers and YAML in the internal AnonymizerStepConfig consumed by the worker (in-process for run run, or on the Jobs worker for run submit).

For run submit, provider endpoints are re-resolved at runtime so the job uses the in-cluster Inference Gateway address rather than the address captured at submission time.
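
The provider lookup order described above can be illustrated with a small sketch. This is not the plugin's code: `resolve_provider`, the two mapping arguments, and the endpoint strings in the usage note are hypothetical stand-ins that show only the precedence rule (local first for `run run`, Inference Gateway only for `run submit`).

```python
from typing import Mapping


def resolve_provider(
    name: str,
    local_providers: Mapping[str, str],
    gateway_providers: Mapping[str, str],
    remote: bool,
) -> str:
    """Return the endpoint for a provider name using the lookup order described above."""
    # Local execution checks locally defined providers first.
    if not remote and name in local_providers:
        return local_providers[name]
    # Local runs fall back to the gateway; remote runs use only the gateway.
    if name in gateway_providers:
        return gateway_providers[name]
    where = "the Inference Gateway" if remote else "local providers or the Inference Gateway"
    raise LookupError(f"provider '{name}' not found in {where}")
```

With a provider defined in both places, a local run resolves to the local endpoint while a remote run resolves to the gateway one; a provider that exists only locally raises `LookupError` for remote execution, which matches the reason `run submit` requires explicit `model_configs`.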

Next Steps

Iterate faster with preview before scaling to a full job.
Refer to SDK Resources for AnonymizerJobResource and AnonymizerJobResults details.
Replacement strategy parameters and rewrite mode are documented in the library docs.