Run an Anonymizer Job

This tutorial walks through the anonymizer.run job: defining a run spec, executing it locally or on the NeMo Platform Jobs worker, and loading the parquet artifacts it produces.

For detection, rewrite, and replacement strategy details, see the open-source library documentation.

Prerequisites

Complete the tutorial prerequisites, which cover:

  • A running NeMo Platform cluster with the nemo anonymizer CLI available (see Setup).
  • An inference provider configured (default examples use nvidia-build).
  • A fileset named anonymizer-inputs with anonymizer-input.csv uploaded.

What run Does

anonymizer.run executes the full Anonymizer pipeline on every record of an input file and writes the output as job artifacts.

There are three run commands:

| Command | Where it runs | Local paths | model_configs required | Artifacts |
| --- | --- | --- | --- | --- |
| nemo anonymizer run run | Local CLI process via generated is_local path | Allowed | Optional | Written under persistent/results/artifacts locally |
| nemo anonymizer run submit | NeMo Platform Jobs worker | Rejected | Required | Stored in NeMo Platform job artifact storage; pull with download_artifacts() |
| nemo anonymizer run explain | Local schema introspection | n/a | n/a | Prints job key, submit endpoint, and input/spec schemas |

Job artifacts (under the artifacts/ directory):

| File | Description |
| --- | --- |
| dataset.parquet | User-facing anonymized dataframe (replace/rewrite output). |
| trace.parquet | Internal trace dataframe with detection details. |
| metadata.json | Run metadata (includes the original text column name). |
| failed_records.json | Per-record failures with reasons. Only written when records failed. |
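
A quick way to see which of these files a run produced is to probe the artifacts directory. The following is a minimal sketch, not part of the plugin: the helper names `summarize_artifacts` and `missing_required` are hypothetical, and the code assumes that the three files other than failed_records.json are always written.

```python
from pathlib import Path

# File names taken from the artifact list above.
EXPECTED_ARTIFACTS = ("dataset.parquet", "trace.parquet", "metadata.json")
OPTIONAL_ARTIFACTS = ("failed_records.json",)  # only written when records failed


def summarize_artifacts(artifacts_dir):
    """Return {file name: present?} for every known artifact (hypothetical helper)."""
    root = Path(artifacts_dir)
    return {name: (root / name).is_file() for name in EXPECTED_ARTIFACTS + OPTIONAL_ARTIFACTS}


def missing_required(artifacts_dir):
    """List the artifacts assumed to be always written that are absent."""
    summary = summarize_artifacts(artifacts_dir)
    return [name for name in EXPECTED_ARTIFACTS if not summary[name]]
```

An empty list from `missing_required(...)` suggests the run wrote everything it was expected to; a present failed_records.json means at least one record failed.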

Step 1: Build an AnonymizerRequest

AnonymizerRequest contains the execution fields shared by preview and run (config, data, model_configs, and selected_models). A run processes the full input file, so it does not include num_records:

import os
from anonymizer.config.anonymizer_config import AnonymizerConfig
from anonymizer.config.replace_strategies import Redact
from data_designer.config import ModelConfig
from nemo_anonymizer_plugin.app.input import AnonymizerInputSpec
from nemo_anonymizer_plugin.app.task_config import AnonymizerRequest

WORKSPACE = os.environ.get("NMP_WORKSPACE", "default")
MODEL_PROVIDER = os.environ.get("NMP_ANON_PROVIDER", "nvidia-build")

config = AnonymizerConfig(
    replace=Redact(format_template="[REDACTED_{label}]"),
)

model_configs = [
    ModelConfig(alias="gliner-pii-detector", provider=MODEL_PROVIDER, model="nvidia/gliner-pii"),
    ModelConfig(alias="gpt-oss-120b", provider=MODEL_PROVIDER, model="openai/gpt-oss-120b"),
    ModelConfig(alias="nemotron-30b-thinking", provider=MODEL_PROVIDER, model="nvidia/nemotron-3-nano-30b-a3b"),
]

request = AnonymizerRequest(
    config=config,
    data=AnonymizerInputSpec(
        source=f"fileset://{WORKSPACE}/anonymizer-inputs#anonymizer-input.csv",
        text_column="biography",
        id_column="id",
    ),
    model_configs=model_configs,
)

Step 2: Write the Spec to YAML

The CLI run commands read a YAML spec file. Serialize the AnonymizerRequest directly:

import yaml
from pathlib import Path

spec_path = Path("/tmp/anonymizer-run.yaml")
spec_path.write_text(yaml.safe_dump(request.model_dump(mode="json", exclude_none=True)))

Step 3: Run the Job

Choose one execution path. Option A runs in the local CLI process. Option B submits the same request to the NeMo Platform Jobs worker.

Option A: Run Locally

nemo anonymizer run run --spec-file /tmp/anonymizer-run.yaml

The local job context runs the Anonymizer library Anonymizer.run(...) in-process, then writes artifacts through the generated local job results manager.

Expected output:

{"exit_code": 0}

run run does not echo the artifact path on stdout. The local job results manager logs the path to stderr in the form:

Saved result 'artifacts' to file:///.../persistent/results/artifacts

Use that path in the next step.
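
If you capture stderr to a file (for example with a shell redirect such as `2> run.log`), the path can be pulled out programmatically. The sketch below assumes the log line keeps exactly the form shown above; the helper name `find_artifacts_dir` and the sample path are hypothetical and only for illustration.

```python
import re
from pathlib import Path
from urllib.parse import unquote, urlparse

# Pattern for the log line shown above. The exact format is an assumption
# based on that single example and may differ between plugin versions.
_SAVED_RESULT = re.compile(r"Saved result 'artifacts' to (file://\S+)")


def find_artifacts_dir(stderr_text):
    """Return the artifacts directory found in captured stderr, or None if absent."""
    match = _SAVED_RESULT.search(stderr_text)
    if match is None:
        return None
    # Convert the file:// URI into a filesystem path.
    return Path(unquote(urlparse(match.group(1)).path))


# Hypothetical log line, for illustration only.
sample = "INFO Saved result 'artifacts' to file:///tmp/demo/persistent/results/artifacts\n"
print(find_artifacts_dir(sample))
```

Returning None instead of raising lets the caller decide how to handle a run whose log does not contain the line.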

Option B: Submit to the Jobs Worker

To execute the same spec on the NeMo Platform Jobs worker instead of in the CLI process, use run submit:

nemo anonymizer run submit \
  --spec-file /tmp/anonymizer-run.yaml \
  --workspace "${NMP_WORKSPACE:-default}" \
  --base-url "${NMP_BASE_URL:-http://localhost:8080}"

The command prints the assigned job name. You need that name to poll status and download artifacts in Step 4.

The SDK equivalent is sdk.anonymizer.run(request). It posts the request to the plugin’s /jobs/run endpoint and returns an AnonymizerJobResource:

import os
from nemo_platform import NeMoPlatform

sdk = NeMoPlatform(
    base_url=os.environ.get("NMP_BASE_URL", "http://localhost:8080"),
    workspace=WORKSPACE,
)
job = sdk.anonymizer.run(request)

Compared to run run, the submit path:

  • Rejects local file paths in data.source — use a fileset reference (<fileset>#<path>) or http(s) URL.
  • Requires explicit model_configs referencing Inference Gateway providers, because the job runs outside the CLI process and cannot inherit Data Designer’s locally-defined providers.
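
Both constraints can be checked on the spec dictionary before submitting. The following is a rough client-side sketch, not the plugin's own validation: the function names `looks_remote` and `submit_problems` are hypothetical, and the rule for recognizing a bare `<fileset>#<path>` reference is a heuristic assumption.

```python
from urllib.parse import urlparse

# Assumption: these are the URL schemes that `run submit` accepts.
REMOTE_SCHEMES = {"fileset", "http", "https"}


def looks_remote(source):
    """Heuristically decide whether data.source is a fileset reference or URL."""
    scheme = urlparse(source).scheme
    if scheme in REMOTE_SCHEMES:
        return True
    # Bare "<fileset>#<path>" reference: no scheme, a "#" separator, not path-like.
    return scheme == "" and "#" in source and not source.startswith(("/", ".", "~"))


def submit_problems(spec):
    """Return reasons a spec dict would likely be rejected by `run submit`."""
    problems = []
    source = spec.get("data", {}).get("source", "")
    if not looks_remote(source):
        problems.append(f"data.source looks like a local path: {source!r}")
    if not spec.get("model_configs"):
        problems.append("model_configs is empty or missing")
    return problems
```

To use it, load the YAML spec written in Step 2 into a dictionary and pass that to `submit_problems`; an empty list means neither of the two issues above was detected. The server remains the authority on what is accepted.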

Step 4: Get Results

Option A Results: Local Run

For run run, the result already exists on the local filesystem. Use the artifact directory logged to stderr:

ARTIFACTS_DIR=/path/to/persistent/results/artifacts
ls "$ARTIFACTS_DIR"

Then load the parquet artifacts from that directory:

import json
from pathlib import Path

import pandas as pd

artifacts_dir = Path("/path/to/persistent/results/artifacts")  # from the stderr log

metadata = json.loads((artifacts_dir / "metadata.json").read_text())
dataset = pd.read_parquet(artifacts_dir / "dataset.parquet", dtype_backend="pyarrow")
trace = pd.read_parquet(artifacts_dir / "trace.parquet", dtype_backend="pyarrow")

failed_path = artifacts_dir / "failed_records.json"
failed_records = json.loads(failed_path.read_text()) if failed_path.exists() else []

print(dataset.head())
print(f"records={len(dataset)} failures={len(failed_records)}")

The trace dataset (and the dataset itself for annotate / substitute strategies) contains pyarrow-backed struct<entities: list<...>> columns. If you need plain Python dict/list values for JSON output, read the file with pyarrow.parquet and convert the rows with to_pylist():

import pyarrow.parquet as pq

table = pq.read_table(artifacts_dir / "dataset.parquet")
records = table.slice(0, 5).to_pylist()
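
Once the rows are plain Python values, they can be written out as JSON Lines. This is a small sketch under the assumption that any value json cannot encode natively (for example timestamps) is acceptable as its string form; the helper name `write_jsonl` is hypothetical.

```python
import json
from pathlib import Path


def write_jsonl(records, path):
    """Write one JSON object per line; non-JSON types fall back to str()."""
    with Path(path).open("w", encoding="utf-8") as handle:
        for record in records:
            handle.write(json.dumps(record, ensure_ascii=False, default=str) + "\n")
```

Calling `write_jsonl(records, "/tmp/anonymizer-sample.jsonl")` with the list from to_pylist() would produce one line per record.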

Option B Results: Remote Run

For run submit, track the platform job first. The job is ready for artifact download when its status is completed:

# Replace <job-name> with the job name printed by `run submit`.
nemo jobs get-status <job-name> --workspace "${NMP_WORKSPACE:-default}"
nemo jobs get-logs <job-name> --workspace "${NMP_WORKSPACE:-default}"

To download from the CLI, fetch the artifacts result and extract it:

nemo jobs results download artifacts \
  --job <job-name> \
  --workspace "${NMP_WORKSPACE:-default}" \
  --output-file /tmp/anonymizer-artifacts.tar.gz

mkdir -p /tmp/anonymizer-artifacts
tar -xzf /tmp/anonymizer-artifacts.tar.gz -C /tmp/anonymizer-artifacts
ls /tmp/anonymizer-artifacts/artifacts
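
The extraction step can also be done from Python with the standard library. The sketch below assumes, as the shell commands above suggest, that the download is a gzip-compressed tar archive with a top-level artifacts/ directory; the helper name `extract_artifacts` is hypothetical.

```python
import tarfile
from pathlib import Path


def extract_artifacts(archive_path, dest_dir):
    """Extract a .tar.gz archive into dest_dir and return dest_dir / 'artifacts'."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive_path, "r:gz") as archive:
        if hasattr(tarfile, "data_filter"):
            # Use the safer "data" extraction filter where Python supports it.
            archive.extractall(dest, filter="data")
        else:
            archive.extractall(dest)
    return dest / "artifacts"
```

For example, `extract_artifacts("/tmp/anonymizer-artifacts.tar.gz", "/tmp/anonymizer-artifacts")` would mirror the mkdir and tar commands above.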

Then point AnonymizerJobResults at the extracted artifacts directory:

from pathlib import Path

from nemo_anonymizer_plugin.sdk.job_results import AnonymizerJobResults

results = AnonymizerJobResults(Path("/tmp/anonymizer-artifacts/artifacts"))

dataset = results.load_dataset()
trace = results.load_trace()
failed = results.load_failed_records()

If you used the SDK, use the AnonymizerJobResource methods directly. get_job_status() reads the current status, check_if_complete() tests whether artifacts are ready, wait_until_done() blocks until a terminal state, and download_artifacts() downloads and extracts the result:

job = sdk.anonymizer.run(request)

status = job.get_job_status()
is_done = job.check_if_complete()

job.wait_until_done()
results = job.download_artifacts()

dataset = results.load_dataset()
trace = results.load_trace()
failed = results.load_failed_records()

AnonymizerJobResults exposes load_dataset(), load_trace(), load_failed_records(), and display_record() over the same underlying files. See SDK Resources.
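
If you want a bounded wait rather than blocking indefinitely, a generic polling loop around a completion check is one option. This is a sketch only: `poll_until` is a hypothetical helper, and it assumes check_if_complete() returns a truthy value once artifacts are ready. Whether wait_until_done() already offers a timeout is not covered here.

```python
import time


def poll_until(is_done, timeout_s=600.0, interval_s=5.0, sleep=time.sleep, clock=time.monotonic):
    """Call is_done() repeatedly; return True once it is truthy, False on timeout."""
    deadline = clock() + timeout_s
    while True:
        if is_done():
            return True
        if clock() >= deadline:
            return False
        sleep(interval_s)


# Assumed usage with the job resource from the example above:
#   if poll_until(job.check_if_complete, timeout_s=1800):
#       results = job.download_artifacts()
```

The `sleep` and `clock` parameters exist so the loop can be exercised in tests without real waiting.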

Inspect the Schema Without Running

run explain prints the job key, submit endpoint, and JSON schemas for AnonymizerRequest and the canonical AnonymizerStepConfig:

nemo anonymizer run explain

This is useful when authoring a spec programmatically or wiring the job into another tool.

How the Job Compiles

For each request, the plugin:

  1. Validates the Anonymizer library AnonymizerConfig.
  2. Validates the input source (rejects local paths on remote execution; checks fileset refs).
  3. Validates that selected_models overrides also have model_configs.
  4. Resolves model_configs providers — locally-defined Data Designer providers first, then Inference Gateway providers. Remote execution (run submit) resolves only through the Inference Gateway.
  5. Renders a unified model_configs YAML body for the library.
  6. Stores the resolved providers and YAML in the internal AnonymizerStepConfig consumed by the worker (in-process for run run, or on the Jobs worker for run submit).

For run submit, provider endpoints are re-resolved at runtime so the job uses the in-cluster Inference Gateway address rather than the address captured at submission time.
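
The resolution order in step 4 can be pictured with a toy model. The code below is not the plugin's implementation; `resolve_provider`, the dictionaries, and the endpoint values are hypothetical and exist only to illustrate "local first, then Inference Gateway, gateway only for remote execution".

```python
def resolve_provider(name, local_providers, gateway_providers, remote):
    """Toy model of step 4: decide where a provider name resolves from."""
    # Local execution may use locally-defined providers; remote execution may not.
    if not remote and name in local_providers:
        return ("local", local_providers[name])
    if name in gateway_providers:
        return ("gateway", gateway_providers[name])
    raise KeyError(f"provider {name!r} is not defined for this execution mode")


# Hypothetical provider tables, for illustration only.
local = {"nvidia-build": "http://localhost:9000"}
gateway = {"nvidia-build": "http://gateway.internal/v1"}

print(resolve_provider("nvidia-build", local, gateway, remote=False))
print(resolve_provider("nvidia-build", local, gateway, remote=True))
```

In this model the same provider name resolves to the local definition for run run and to the gateway entry for run submit, which matches the behavior described in the list above.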

Next Steps

  • Iterate faster with preview before scaling to a full job.
  • Refer to SDK Resources for AnonymizerJobResource and AnonymizerJobResults details.
  • Replacement strategy parameters and rewrite mode are documented in the library docs.