Anonymizer NeMo Platform SDK Resources

The anonymizer.config module (from the NVIDIA NeMo Anonymizer library) builds AnonymizerConfig objects in a context-agnostic way. Once you are ready to execute that config against the NeMo Platform Anonymizer service, you use objects from the nemo_platform SDK. This page describes the NeMo Platform-specific objects.

AnonymizerResource

The AnonymizerResource is the entry point for working with Anonymizer on NeMo Platform. It wraps the streaming preview endpoint and job submission for the plugin service.

An AnonymizerResource is accessed directly from a NeMoPlatform instance:

    import os
    from nemo_platform import NeMoPlatform

    sdk = NeMoPlatform(
        base_url=os.environ.get("NMP_BASE_URL", "http://localhost:8080"),
        workspace="default",
    )
    anonymizer = sdk.anonymizer  # AnonymizerResource

An AsyncAnonymizerResource with the same surface is available via AsyncNeMoPlatform.anonymizer.

| Method | Description |
| --- | --- |
| preview(request, *, workspace=None) | Runs a streaming preview against the plugin service and returns an AnonymizerPreviewResult after the stream completes. |
| run(request, *, workspace=None, wait_until_done=False) | Submits an anonymizer.run job to the NeMo Platform Jobs worker. Returns an AnonymizerJobResource. When wait_until_done=True, blocks until the job reaches a terminal state. |
| get_job_resource(job_name, workspace=None) | Returns an AnonymizerJobResource for an existing job (by job name). |

request is a PreviewRequest or AnonymizerRequest instance from nemo_anonymizer_plugin.app.task_config. Both accept the same config, data, model_configs, and selected_models fields; PreviewRequest adds num_records.

Both preview and run call the plugin service, so they require model_configs and reject local file paths in data.source — use a fileset reference or http(s) URL.

AnonymizerPreviewResult

AnonymizerResource.preview collects the frame stream and returns an AnonymizerPreviewResult once the stream completes.

| Attribute / Method | Description |
| --- | --- |
| dataset | pandas.DataFrame of anonymized records (the preview_dataset frame contents). |
| trace_dataset | pandas.DataFrame with detection trace columns (the trace_dataset frame contents). |
| failed_records | list[dict] of per-record failures with reasons. Empty when nothing failed. |
| display_record(index=None) | Renders a single trace record as HTML in a notebook. When index is omitted, cycles through records. |

AnonymizerPreviewResult holds everything in memory; nothing is persisted to disk by default. The dataset and trace_dataset fields are regular pandas DataFrames and can be saved with to_csv / to_parquet.

AnonymizerJobResource

AnonymizerResource.run returns an AnonymizerJobResource. You can also use AnonymizerResource.get_job_resource to get one for an existing job.

    job = sdk.anonymizer.run(run_request)
    job.wait_until_done()
    results = job.download_artifacts()
    dataset = results.load_dataset()

| Method | Description |
| --- | --- |
| get_job() | Returns the raw job record from the jobs service. |
| get_job_status() | Returns the current PlatformJobStatus. |
| check_if_complete(*, raise_if_not_complete=False) | Returns True when the job has completed. Returns False (or raises, when raise_if_not_complete=True) if the job is still running or ended in another terminal state. |
| wait_until_done() | Polls the jobs service until the job reaches a terminal state. Logs progress as it goes. |
| get_logs() | Returns logs from the job as a list of dicts. Handles pagination automatically. |
| download_artifacts(path=None) | Downloads the job artifacts tarball and unarchives it. Returns an AnonymizerJobResults object. |

The async variant (AsyncAnonymizerJobResource) exposes the same surface with async def methods.
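The polling behavior described for wait_until_done and the contract described for check_if_complete can be pictured with a small stand-alone loop. This is a sketch only, not the SDK's implementation: the status strings, the poll_until_terminal and is_complete helpers, and the interval and timeout parameters are all assumptions made for illustration.

```python
import time
from typing import Callable, Iterable

# Assumed terminal status names, for illustration only; the real
# PlatformJobStatus values may differ.
TERMINAL_STATES = frozenset({"completed", "error", "cancelled"})


def poll_until_terminal(
    get_status: Callable[[], str],
    *,
    interval_s: float = 1.0,
    timeout_s: float = 60.0,
    terminal_states: Iterable[str] = TERMINAL_STATES,
) -> str:
    """Call get_status repeatedly until it returns a terminal state."""
    terminal = set(terminal_states)
    deadline = time.monotonic() + timeout_s
    while True:
        status = get_status()
        if status in terminal:
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(f"job still {status!r} after {timeout_s} seconds")
        time.sleep(interval_s)


def is_complete(status: str, *, raise_if_not_complete: bool = False) -> bool:
    """Apply the documented check_if_complete contract to a plain status string."""
    if status == "completed":
        return True
    if raise_if_not_complete:
        raise RuntimeError(f"job is not complete (status: {status!r})")
    return False
```

In this sketch a terminal state other than "completed" still ends the loop, which is why a separate completeness check is useful after waiting.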

AnonymizerJobResults

download_artifacts returns an AnonymizerJobResults object that loads parquet / JSON artifacts into memory. The same class also works for the local run flow: point it at the artifact directory that the local job results manager logs.

    from pathlib import Path
    from nemo_anonymizer_plugin.sdk.job_results import AnonymizerJobResults

    results = AnonymizerJobResults(Path("/path/to/persistent/results/artifacts"))
    dataset = results.load_dataset()

| Method | Description |
| --- | --- |
| load_dataset() | Returns the anonymized dataset as a pandas.DataFrame (dataset.parquet). |
| load_trace() | Returns the trace dataframe (trace.parquet). The original_text_column from metadata.json is attached for display_record. |
| load_failed_records() | Returns failed_records.json as list[dict]. Returns [] when the file isn’t present. |
| display_record(index=None) | Renders a single trace record as HTML in a notebook. When index is omitted, cycles through records. |

AnonymizerJobResults reads files lazily — methods load the corresponding parquet or JSON only when called. The underlying directory layout is:

    <artifacts_dir>/
        dataset.parquet
        trace.parquet
        metadata.json
        failed_records.json    # only when there were failures

By default, download_artifacts saves the tarball contents to a local directory named after the job; pass path= to override.
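To make the artifact layout concrete, the following sketch inspects an artifacts directory using only the standard library. It is not how AnonymizerJobResults is implemented: the summarize_artifacts and read_failed_records helpers are hypothetical, and only the file names are taken from the documented layout.

```python
import json
from pathlib import Path

# File names from the documented artifact layout.
REQUIRED_FILES = ("dataset.parquet", "trace.parquet", "metadata.json")
OPTIONAL_FILES = ("failed_records.json",)  # written only when records failed


def summarize_artifacts(artifacts_dir: Path) -> dict[str, bool]:
    """Report which of the documented artifact files exist in artifacts_dir."""
    return {
        name: (artifacts_dir / name).is_file()
        for name in REQUIRED_FILES + OPTIONAL_FILES
    }


def read_failed_records(artifacts_dir: Path) -> list[dict]:
    """Return the parsed failed_records.json, or [] when the file is absent."""
    path = artifacts_dir / "failed_records.json"
    if not path.is_file():
        return []
    with path.open(encoding="utf-8") as handle:
        return json.load(handle)
```

A check like summarize_artifacts can be handy before loading, since a missing failed_records.json is expected while a missing dataset.parquet likely is not.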

Request Models

Both request models live in nemo_anonymizer_plugin.app.task_config.

Request Fields

AnonymizerRequest defines the execution fields below. Run jobs use AnonymizerRequest directly and process the full input file.

| Field | Type | Description |
| --- | --- | --- |
| config | AnonymizerConfig | Upstream library config (replace strategy or rewrite, detection params). |
| data | AnonymizerInputSpec | Input source plus column metadata. See below. |
| model_configs | `list[data_designer.config.ModelConfig] \| None` | |
| selected_models | `SelectedModelsOverrides \| None` | |

PreviewRequest extends AnonymizerRequest with num_records:

| Field | Type | Description |
| --- | --- | --- |
| config | AnonymizerConfig | Upstream library config (replace strategy or rewrite, detection params). |
| data | AnonymizerInputSpec | Input source plus column metadata. See below. |
| model_configs | `list[data_designer.config.ModelConfig] \| None` | |
| selected_models | `SelectedModelsOverrides \| None` | |
| num_records | int (≥ 1, default 10) | Preview-only. Number of records to preview. Capped by the service’s preview_num_records.max. |

AnonymizerInputSpec

The plugin-owned API-boundary input spec:

| Field | Type | Description |
| --- | --- | --- |
| source | str | Local path, http(s) URL, or fileset reference for a CSV / Parquet file. |
| text_column | str (default "text") | Column containing text to anonymize. |
| id_column | `str \| None` | |
| data_summary | `str \| None` | |

Fileset references can take any of the three forms fileset://<workspace>/<fileset>#<path>, <workspace>/<fileset>#<path>, or <fileset>#<path>, and must resolve to a single .csv or .parquet file.
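The three reference forms can be told apart with a few string operations. The sketch below is a hypothetical client-side helper, not the plugin's own parser: the FilesetRef dataclass and parse_fileset_ref function are names invented here, and the check only looks at the file extension rather than confirming that the path resolves to a single file.

```python
from dataclasses import dataclass
from typing import Optional

FILESET_SCHEME = "fileset://"
ALLOWED_SUFFIXES = (".csv", ".parquet")


@dataclass(frozen=True)
class FilesetRef:
    workspace: Optional[str]  # None means the workspace was not spelled out
    fileset: str
    path: str


def parse_fileset_ref(ref: str) -> FilesetRef:
    """Split a fileset reference into workspace, fileset, and file path."""
    has_scheme = ref.startswith(FILESET_SCHEME)
    body = ref[len(FILESET_SCHEME):] if has_scheme else ref
    location, sep, path = body.partition("#")
    if not sep or not location or not path:
        raise ValueError(f"expected '<fileset>#<path>' in {ref!r}")
    if not path.lower().endswith(ALLOWED_SUFFIXES):
        raise ValueError(f"{path!r} is not a .csv or .parquet file")
    parts = location.split("/")
    if len(parts) == 2:
        workspace, fileset = parts
    elif len(parts) == 1 and not has_scheme:
        # Bare '<fileset>#<path>' form; the scheme form always names a workspace.
        workspace, fileset = None, parts[0]
    else:
        raise ValueError(f"unrecognized fileset location in {ref!r}")
    if not fileset or workspace == "":
        raise ValueError(f"empty workspace or fileset name in {ref!r}")
    return FilesetRef(workspace, fileset, path)
```

For example, parse_fileset_ref("reviews#train.csv") yields a FilesetRef with workspace None, leaving the caller to fill in whichever workspace applies.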

SelectedModelsOverrides

Partial role → alias overrides for the three workflows. Each section is optional and is merged on top of the bundled default selection by the library.

| Field | Type | Description |
| --- | --- | --- |
| detection | `dict[str, str \| list[str]] \| None` | |
| replace | `dict[str, str] \| None` | |
| rewrite | `dict[str, str] \| None` | |

Supplying overrides without model_configs raises a config validation error.
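The merge semantics can be illustrated with plain dictionaries. This is a sketch under stated assumptions, not the library's merge code: the role names and aliases in DEFAULT_SELECTION are placeholders rather than the bundled defaults, merge_selected_models is a hypothetical helper, and ValueError stands in for the config validation error.

```python
from typing import Mapping, Optional, Union

Alias = Union[str, list[str]]
Selection = dict[str, dict[str, Alias]]

# Placeholder defaults; role and alias names are invented for illustration.
DEFAULT_SELECTION: Selection = {
    "detection": {"role_a": "default-detector"},
    "replace": {"role_b": "default-replacer"},
    "rewrite": {"role_c": "default-rewriter"},
}


def merge_selected_models(
    overrides: Optional[Mapping[str, Optional[Mapping[str, Alias]]]],
    model_configs: Optional[list],
    defaults: Selection = DEFAULT_SELECTION,
) -> Selection:
    """Overlay partial per-workflow overrides on a copy of the defaults."""
    # Drop sections that were omitted (None) or left empty.
    sections = {k: v for k, v in (overrides or {}).items() if v}
    if sections and not model_configs:
        # Stand-in for the documented config validation error.
        raise ValueError("selected_models overrides require model_configs")
    unknown = set(sections) - set(defaults)
    if unknown:
        raise ValueError(f"unknown workflow sections: {sorted(unknown)}")
    merged = {workflow: dict(roles) for workflow, roles in defaults.items()}
    for workflow, roles in sections.items():
        merged[workflow].update(roles)
    return merged
```

Copying each default section before updating keeps the defaults untouched, so repeated merges start from the same baseline.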