Define and Run Custom Python Metrics
Custom Python metrics let you score model outputs with deterministic, domain-specific logic that is easier to express in code than in a generic metric or an LLM-as-a-judge prompt.
This tutorial shows how to evaluate a model that solves arithmetic word problems by returning a Python expression. This response format is useful for calculator-style systems because it makes the model output executable and easy to verify. You will define a metric that checks whether each expression is safe to evaluate and whether it produces the expected answer, test the metric locally, then submit the same metric as a durable Evaluator service job with a FilesetRef dataset and a ModelRef target.
What you will learn:
- Define a Python metric with the Evaluator SDK metric protocol
- Score model outputs with standard-library Python logic
- Run the same metric through the local Evaluator SDK
- Map dataset prediction columns into the evaluator’s candidate output field
- Package the metric with the Cloudpickle metric bundle packager for remote execution
- Submit a durable Evaluator service job with FilesetRef and ModelRef
- Inspect aggregate and row-level metric results
Keep custom metric code dependency-light. Local execution runs your metric in the current Python environment, and remote execution hydrates the serialized metric in the evaluator job runtime. This tutorial uses only the Python standard library.
Prerequisites
Install the Evaluator SDK:
Verify that the SDK imports:
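One way to run this check without raising on failure is to ask the import system whether the module can be located. This is a minimal sketch: the helper name sdk_available is hypothetical, and the module name nemo_evaluator_sdk is taken from the import path used later in this tutorial.

```python
import importlib.util


def sdk_available(module_name: str = "nemo_evaluator_sdk") -> bool:
    """Return True when the named top-level module can be located.

    find_spec() only locates the module; it does not import it, so this
    check has no side effects beyond a lookup on the import path.
    """
    return importlib.util.find_spec(module_name) is not None


# Report the status instead of failing with an ImportError traceback.
print("nemo_evaluator_sdk importable:", sdk_available())
```

If the check prints False, revisit the installation step before continuing.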
You do not need a running NeMo Platform instance to define the metric or run it locally. Start NeMo Platform and configure platform resources before submitting the durable remote job.
1. Understand the Metric Contract
A custom metric implements the Metric protocol from nemo_evaluator_sdk. The protocol has one identifier property and two methods:
- type: a public metric identifier used in result names and logs
- output_spec(): declares every row-level output the metric can emit
- compute_scores(...): scores one dataset row and one candidate output
compute_scores(...) receives a MetricInput object:
- metric_input.row.data contains the original dataset row, including any canonical fields produced by field mapping.
- metric_input.candidate.output_text contains the candidate output. For offline evaluations, the Evaluator SDK can populate this from a mapped dataset column. For online evaluations, it contains the generated model output.
The method returns a MetricResult whose outputs match the names declared by output_spec(). Output names must be stable because they become aggregate score names in the final result.
Start with the smallest possible shape:
You will fill in output_spec() first, then the scoring logic.
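The contract described in this step can be sketched with standard-library stand-ins. Everything below is an assumption made for illustration: the dataclasses Row, Candidate, MetricInput, and MetricResult only mimic the attribute names mentioned above, MetricShape is a hypothetical protocol, and the real SDK types may differ in signatures (for example, compute_scores may be asynchronous or take additional arguments).

```python
from dataclasses import dataclass, field
from typing import Any, Optional, Protocol


# Stand-in types that mirror only the attribute names used in this tutorial.
# They are NOT the SDK classes.
@dataclass
class Row:
    data: dict[str, Any]


@dataclass
class Candidate:
    output_text: Optional[str] = None


@dataclass
class MetricInput:
    row: Row
    candidate: Candidate


@dataclass
class MetricResult:
    outputs: dict[str, Any] = field(default_factory=dict)


class MetricShape(Protocol):
    """Hypothetical protocol: one identifier plus two methods."""

    @property
    def type(self) -> str: ...

    def output_spec(self) -> list[str]: ...

    def compute_scores(self, metric_input: MetricInput) -> MetricResult: ...


class ArithmeticExpressionCorrectness:
    # Public identifier used as the prefix for aggregate score names.
    type = "arithmetic-expression-correctness"

    def output_spec(self) -> list[str]:
        return []  # placeholder: output names are declared in the next step

    def compute_scores(self, metric_input: MetricInput) -> MetricResult:
        return MetricResult()  # placeholder: scoring logic comes later


metric: MetricShape = ArithmeticExpressionCorrectness()
empty = metric.compute_scores(MetricInput(Row({"answer": 12}), Candidate("3 * 4")))
print(metric.type, empty.outputs)
```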
2. Declare the Outputs
This metric should answer two yes-or-no questions for each model output:
- valid_expression: True when the output is a safe arithmetic expression, otherwise False
- correct_value: True when the expression evaluates to the expected answer, otherwise False
Declare those outputs with MetricOutputSpec.boolean(...):
These names are part of the result contract. Each row result must emit exactly these outputs, and the aggregate result prefixes them with the metric identifier, such as arithmetic-expression-correctness.correct_value. Boolean outputs aggregate as rates, so the aggregate mean for correct_value is the fraction of rows that were correct.
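To make the rate interpretation concrete, the following sketch averages boolean row outputs and applies the metric-identifier prefix. It is plain Python rather than the SDK's aggregation code: aggregate_rates is a hypothetical helper and the four rows are made-up illustrative values.

```python
METRIC_TYPE = "arithmetic-expression-correctness"

# Illustrative row-level outputs; each row emits exactly the declared names.
row_outputs = [
    {"valid_expression": True, "correct_value": True},
    {"valid_expression": True, "correct_value": False},
    {"valid_expression": False, "correct_value": False},
    {"valid_expression": True, "correct_value": True},
]


def aggregate_rates(rows: list[dict[str, bool]], metric_type: str) -> dict[str, float]:
    """Average each boolean output (True counts as 1, False as 0)."""
    names = list(rows[0].keys())
    return {
        f"{metric_type}.{name}": sum(1 for row in rows if row[name]) / len(rows)
        for name in names
    }


rates = aggregate_rates(row_outputs, METRIC_TYPE)
for name, rate in rates.items():
    print(f"{name}: {rate:.2f}")
```

Here three of four rows are valid expressions and two of four are correct, so the two means are 0.75 and 0.50.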
3. Write a Restricted Expression Evaluator
The metric needs to evaluate model output, but it must not execute arbitrary Python. ast.parse(...) parses Python source into an abstract syntax tree without executing it, but it is not a complete sandbox by itself. For untrusted model output, keep inputs bounded and evaluate only an explicit allowlist of AST node types.
This helper limits expression length, limits AST size, and allows only numeric constants and arithmetic operators.
This helper rejects function calls, names, attributes, comprehensions, imports, and other Python syntax because _evaluate_ast(...) raises on any node type it does not explicitly support.
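A minimal sketch of such a helper, using only the standard library, might look like the following. The limits MAX_EXPRESSION_CHARS and MAX_AST_NODES and the entry-point name safe_eval are assumptions chosen for illustration, and exponentiation is deliberately left off the allowlist because an expression such as 9 ** 9 ** 9 can consume large amounts of time and memory.

```python
import ast
import operator
from typing import Union

Number = Union[int, float]

MAX_EXPRESSION_CHARS = 200  # hypothetical bound on input length
MAX_AST_NODES = 50          # hypothetical bound on tree size

# Explicit allowlist: anything not listed here is rejected.
_BINARY_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}
_UNARY_OPS = {ast.UAdd: operator.pos, ast.USub: operator.neg}


def _evaluate_ast(node: ast.AST) -> Number:
    """Recursively evaluate an allowlisted node; raise on everything else."""
    if isinstance(node, ast.Expression):
        return _evaluate_ast(node.body)
    if isinstance(node, ast.Constant):
        # bool is a subclass of int, so exclude it explicitly.
        if isinstance(node.value, bool) or not isinstance(node.value, (int, float)):
            raise ValueError("only numeric constants are allowed")
        return node.value
    if isinstance(node, ast.BinOp) and type(node.op) in _BINARY_OPS:
        left = _evaluate_ast(node.left)
        right = _evaluate_ast(node.right)
        return _BINARY_OPS[type(node.op)](left, right)
    if isinstance(node, ast.UnaryOp) and type(node.op) in _UNARY_OPS:
        return _UNARY_OPS[type(node.op)](_evaluate_ast(node.operand))
    raise ValueError(f"unsupported syntax: {type(node).__name__}")


def safe_eval(expression: str) -> Number:
    """Parse and evaluate a bounded arithmetic expression without exec/eval."""
    if len(expression) > MAX_EXPRESSION_CHARS:
        raise ValueError("expression too long")
    tree = ast.parse(expression.strip(), mode="eval")  # may raise SyntaxError
    if sum(1 for _ in ast.walk(tree)) > MAX_AST_NODES:
        raise ValueError("expression too complex")
    return _evaluate_ast(tree)


print(safe_eval("3 * (4 + 2)"))
```

Callers should expect SyntaxError for unparsable text, ValueError for disallowed syntax or oversized input, and ZeroDivisionError for division by zero.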
4. Put the Metric Together
Now combine the protocol methods and the safe evaluator.
The scoring method uses metric_input.candidate.output_text, where the evaluator stores the candidate output. The metric does not need to look for dataset-specific prediction columns or implement a row-level fallback. For offline rows, use field mapping to normalize your prediction column into the evaluator’s canonical output field before the metric runs. If the evaluator does not provide a candidate output, the metric treats the row as a failure.
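The scoring decision itself can be sketched independently of the SDK types. In this illustration, score_expression is a hypothetical helper that takes the candidate text, the expected answer, and any evaluator callable; toy_evaluate is a deliberately tiny stand-in that only sums integers so the block stays self-contained. Treating evaluation errors such as division by zero as an invalid expression, and comparing with a small floating-point tolerance, are assumptions rather than documented behavior.

```python
import math
from typing import Callable, Optional


def score_expression(
    output_text: Optional[str],
    expected: float,
    evaluate: Callable[[str], float],
) -> dict[str, bool]:
    """Return the two boolean outputs for one row."""
    failed = {"valid_expression": False, "correct_value": False}
    # A missing or empty candidate output counts as a failure.
    if output_text is None or not output_text.strip():
        return failed
    try:
        value = evaluate(output_text)
    except (ValueError, SyntaxError, ZeroDivisionError, OverflowError):
        return failed
    return {
        "valid_expression": True,
        # Tolerant comparison so 0.1 + 0.2 can match an expected 0.3.
        "correct_value": math.isclose(value, expected, rel_tol=1e-9, abs_tol=1e-9),
    }


def toy_evaluate(text: str) -> float:
    """Hypothetical stand-in evaluator: handles only sums of integers."""
    return float(sum(int(part) for part in text.split("+")))


print(score_expression("3 + 4", 7, toy_evaluate))
print(score_expression("3 + 4", 8, toy_evaluate))
print(score_expression(None, 7, toy_evaluate))
```

Passing the evaluator as a parameter keeps the scoring rule testable on its own, separate from the parsing rules.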
5. Run the Metric with the Local Evaluator
Run the metric through the Evaluator SDK so you exercise the normal dataset loading, metric execution, aggregation, and result objects before submitting a service-side job.
For an offline evaluation, datasets often store predictions under task-specific column names such as model_expression, answer, or prediction. Use FieldMapping to map that column to the evaluator’s canonical output field. The Evaluator SDK then passes it to your metric as metric_input.candidate.output_text.
The local dataset includes correct expressions, a valid expression with the wrong value, and an invalid expression so you can see both metric outputs vary.
With this mapping in place, every row still preserves its original model_expression field in metric_input.row.data, and the metric receives the normalized candidate output through metric_input.candidate.output_text.
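Conceptually, the mapping copies one dataset column into the candidate output while leaving the row untouched. The sketch below imitates that behavior with plain dictionaries; it does not use FieldMapping itself, and the helper map_candidate, the key names row_data and candidate_output_text, and the three sample rows are all hypothetical.

```python
from typing import Any

# Illustrative rows: a correct expression, a valid expression with the
# wrong value, and an invalid expression.
rows = [
    {"question": "Three boxes hold 4 apples each. How many apples?",
     "answer": 12, "model_expression": "3 * 4"},
    {"question": "A shelf has 10 books and 5 are removed. How many remain?",
     "answer": 5, "model_expression": "10 + 5"},
    {"question": "Two bags hold 6 marbles each. How many marbles?",
     "answer": 12, "model_expression": "2 * 6 marbles"},
]


def map_candidate(row: dict[str, Any], source_column: str = "model_expression") -> dict[str, Any]:
    """Copy one column into the candidate slot and keep the original row intact."""
    return {
        "row_data": dict(row),                         # original fields preserved
        "candidate_output_text": row.get(source_column),  # None if the column is missing
    }


mapped = [map_candidate(row) for row in rows]
for item in mapped:
    print(item["candidate_output_text"], "| expected:", item["row_data"]["answer"])
```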
6. Prepare Platform Resources
To run the same metric remotely, install and start NeMo Platform using the Setup guide. You also need:
- A model entity that can be referenced as workspace/model-name. See Model Configuration for model setup details.
- A platform secret for the model, if the model requires an API key.
Create a Workspace
Create a workspace for the tutorial. If it already exists, continue using it.
Register a Dataset Fileset
The dataset contains word problems with the expected numeric answer. The model will generate a Python arithmetic expression, and the custom metric will evaluate that expression.
The nemo_evaluator_sdk package provides metric and runtime types, such as Metric, RunConfigOnlineModel, and ModelRef. The nemo_evaluator.sdk package provides platform submission helpers, such as FilesetRef, that are specific to durable evaluator jobs.
7. Submit a Durable Evaluator Job
Pass CloudpickleMetricBundlePackager() to client.evaluator.submit(...) so the SDK serializes the metric object into the evaluator job spec. The job runtime hydrates the metric bundle before scoring rows. Use ModelRef to reference the platform model entity you configured with NMP_EVAL_MODEL_REF.
Cloudpickle metric bundles execute serialized Python code when the job hydrates the metric. Use this path only for metric code that you fully understand and trust.
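The reason for this caution can be demonstrated with the standard-library pickle module, used here as a stand-in on the assumption that cloudpickle bundles are loaded through the same pickle protocol. The Payload class is hypothetical and harmless: it only evaluates a fixed arithmetic string, but the same mechanism can call any importable function.

```python
import pickle


class Payload:
    """An object that controls how it is rebuilt when unpickled."""

    def __reduce__(self):
        # Instructs the loader to call eval("6 * 7") during deserialization.
        return (eval, ("6 * 7",))


blob = pickle.dumps(Payload())

# Loading the bytes runs the callable named in the pickle stream. The
# caller gets back that call's return value, not a Payload instance.
restored = pickle.loads(blob)
print(type(restored).__name__, restored)
```

Because loading is itself code execution, reviewing the metric source before packaging it is the only meaningful safeguard on this path.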
8. Inspect Results
The aggregate scores show how often the outputs followed the expression contract and how often the expression evaluated to the expected value.
If valid_expression is low, inspect row scores before changing the metric. Some models return the right expression followed by explanation text, which is not a valid Python expression and should fail this metric.
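A quick way to see why trailing explanation text fails is to parse a few candidate outputs in expression mode. This sketch checks only whether the text parses as a single Python expression; the helper name is hypothetical, and parsing successfully is necessary but not sufficient for valid_expression, since the metric also rejects names, calls, and other syntax outside its allowlist.

```python
import ast


def parses_as_expression(text: str) -> bool:
    """Return True when the text is syntactically one Python expression."""
    try:
        ast.parse(text.strip(), mode="eval")
    except (SyntaxError, ValueError):
        return False
    return True


samples = [
    "3 * 4",
    "3 * 4 because there are three boxes",
    "The answer is 3 * 4",
]
for sample in samples:
    print(f"{sample!r}: {parses_as_expression(sample)}")
```

Only the first sample parses; the other two mix prose with the expression and are rejected before any evaluation happens.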
Use row scores to debug the expressions:
Use the local result object from Step 5 the same way if you want to inspect the local run instead.
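Once row scores are available, filtering on a single output narrows the search quickly. The record layout below is an assumption made for illustration (a list of dictionaries with index, output_text, and the two boolean outputs); the actual result objects returned by the SDK or the service may be shaped differently.

```python
from typing import Any

# Hypothetical row-score records with made-up values.
row_scores = [
    {"index": 0, "output_text": "3 * 4", "valid_expression": True, "correct_value": True},
    {"index": 1, "output_text": "10 + 5", "valid_expression": True, "correct_value": False},
    {"index": 2, "output_text": "2 * 6 marbles", "valid_expression": False, "correct_value": False},
]


def failing_rows(rows: list[dict[str, Any]], output_name: str) -> list[dict[str, Any]]:
    """Return the rows where the named boolean output is False."""
    return [row for row in rows if not row[output_name]]


# Format failures first: these rows never reached the value comparison.
for row in failing_rows(row_scores, "valid_expression"):
    print("invalid:", row["index"], repr(row["output_text"]))

# Then rows that were well formed but evaluated to the wrong value.
wrong_value = [
    row for row in failing_rows(row_scores, "correct_value") if row["valid_expression"]
]
for row in wrong_value:
    print("wrong value:", row["index"], repr(row["output_text"]))
```

Separating the two failure groups shows whether the model needs a stricter output format or is making arithmetic reasoning errors.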
Best Practices
- Run custom metrics locally with representative rows before submitting service-side jobs.
- Keep metric code deterministic and side-effect free.
- Prefer Python standard-library logic unless you know the dependency is available in the service runtime.
- Emit separate outputs for format validity and task correctness so failures are easier to diagnose.
- Pass an explicit metric bundle packager when submitting custom Python metrics as durable jobs.
- Use FilesetRef for reusable datasets and ModelRef for platform-managed model routing.
Next Steps
- Learn about built-in metric types in Evaluation Metrics.
- Learn how to configure model targets in Model Configuration.
- Learn how to inspect result artifacts in Evaluation Results.