Define and Run Custom Python Metrics

Custom Python metrics let you score model outputs with deterministic, domain-specific logic that is easier to express in code than in a generic metric or an LLM-as-a-judge prompt.

This tutorial shows how to evaluate a model that solves arithmetic word problems by returning a Python expression. This response format is useful for calculator-style systems because it makes the model output executable and easy to verify. You will define a metric that checks whether each expression is safe to evaluate and whether it produces the expected answer, test the metric locally, then submit the same metric as a durable Evaluator service job with a FilesetRef dataset and a ModelRef target.

What you will learn:

Define a Python metric with the Evaluator SDK metric protocol
Score model outputs with standard-library Python logic
Run the same metric through the local Evaluator SDK
Map dataset prediction columns into the evaluator’s candidate output field
Package the metric with the Cloudpickle metric bundle packager for remote execution
Submit a durable Evaluator service job with FilesetRef and ModelRef
Inspect aggregate and row-level metric results

Keep custom metric code dependency-light. Local execution runs your metric in the current Python environment, and remote execution hydrates the serialized metric in the evaluator job runtime. This tutorial uses only the Python standard library.

Prerequisites

Install the Evaluator SDK:

$ pip install "nemo-platform[all]"

Verify that the SDK imports:

import nemo_evaluator_sdk

print(nemo_evaluator_sdk.version)

You do not need a running NeMo Platform instance to define the metric or run it locally. Start NeMo Platform and configure platform resources before submitting the durable remote job.

1. Understand the Metric Contract

A custom metric implements the Metric protocol from nemo_evaluator_sdk. The protocol has one identifier property and two methods:

type: a public metric identifier used in result names and logs
output_spec(): declares every row-level output the metric can emit
compute_scores(...): scores one dataset row and one candidate output

compute_scores(...) receives a MetricInput object:

metric_input.row.data contains the original dataset row, including any canonical fields produced by field mapping.
metric_input.candidate.output_text contains the candidate output. For offline evaluations, the Evaluator SDK can populate this from a mapped dataset column. For online evaluations, it contains the generated model output.

The method returns a MetricResult whose outputs match the names declared by output_spec(). Output names must be stable because they become aggregate score names in the final result.

Start with the smallest possible shape:

from nemo_evaluator_sdk import Metric, MetricInput, MetricResult


class ArithmeticExpressionCorrectnessMetric(Metric):
    type = "arithmetic-expression-correctness"

    def output_spec(self): ...

    async def compute_scores(self, metric_input: MetricInput) -> MetricResult: ...

You will fill in output_spec() first, then the scoring logic.
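Because compute_scores(...) is declared with async def, calling it directly (for example, in a quick unit test) returns a coroutine that needs an event loop. The following is a minimal sketch of that calling pattern only. StubCandidate and StubMetric are hypothetical stand-ins written for this illustration; they are not the SDK's MetricInput or Metric types, and the real method takes a MetricInput.

```python
import asyncio
from dataclasses import dataclass


@dataclass
class StubCandidate:
    """Hypothetical stand-in for a candidate output; not an SDK type."""

    output_text: str


class StubMetric:
    """Hypothetical stand-in that mirrors only the async method shape."""

    type = "stub-metric"

    async def compute_scores(self, candidate: StubCandidate) -> dict[str, bool]:
        # Score a single candidate; a real metric would return a MetricResult.
        return {"non_empty": bool(candidate.output_text.strip())}


# asyncio.run creates an event loop, awaits the coroutine, and returns its result.
scores = asyncio.run(StubMetric().compute_scores(StubCandidate(output_text="1 + 1")))
print(scores)
```

When you run a metric through the Evaluator SDK, as later steps do, the SDK drives the coroutine for you, so this pattern is mainly useful for isolated tests.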

2. Declare the Outputs

This metric should answer two yes-or-no questions for each model output:

valid_expression: True when the output is a safe arithmetic expression, otherwise False
correct_value: True when the expression evaluates to the expected answer, otherwise False

Declare those outputs with MetricOutputSpec.boolean(...):

from nemo_evaluator_sdk import MetricOutputSpec


# Method of ArithmeticExpressionCorrectnessMetric; shown on its own for clarity.
def output_spec(self) -> list[MetricOutputSpec]:
    return [
        MetricOutputSpec.boolean("valid_expression"),
        MetricOutputSpec.boolean("correct_value"),
    ]

These names are part of the result contract. Each row result must emit exactly these outputs, and the aggregate result prefixes them with the metric identifier, such as arithmetic-expression-correctness.correct_value. Boolean outputs aggregate as rates, so the aggregate mean for correct_value is the fraction of rows that were correct.
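To make the naming and rate rules concrete, here is a minimal sketch in plain Python. It assumes each row result can be represented as a dict of booleans, and aggregate_boolean_outputs is a hypothetical helper written for this illustration. It is not the SDK's aggregation code, which may track more statistics than a mean.

```python
from statistics import fmean

METRIC_TYPE = "arithmetic-expression-correctness"
DECLARED_OUTPUTS = ("valid_expression", "correct_value")


def aggregate_boolean_outputs(rows: list[dict[str, bool]]) -> dict[str, float]:
    """Hypothetical helper: average each declared boolean output across rows."""
    for index, row in enumerate(rows):
        # Every row must emit exactly the declared output names.
        if set(row) != set(DECLARED_OUTPUTS):
            raise ValueError(f"row {index} outputs do not match the declared outputs")
    # True counts as 1.0 and False as 0.0, so each mean is a rate.
    return {
        f"{METRIC_TYPE}.{name}": fmean(float(row[name]) for row in rows)
        for name in DECLARED_OUTPUTS
    }


row_outputs = [
    {"valid_expression": True, "correct_value": True},
    {"valid_expression": True, "correct_value": False},
    {"valid_expression": True, "correct_value": True},
    {"valid_expression": False, "correct_value": False},
]
aggregates = aggregate_boolean_outputs(row_outputs)
print(aggregates)
```

With three valid rows out of four and two correct rows out of four, the sketch reports a valid_expression rate of 0.75 and a correct_value rate of 0.5.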

3. Write a Restricted Expression Evaluator

The metric needs to evaluate model output, but it must not execute arbitrary Python. ast.parse(...) parses Python source into an abstract syntax tree without executing it, but it is not a complete sandbox by itself. For untrusted model output, keep inputs bounded and evaluate only an explicit allowlist of AST node types.

This helper limits expression length, limits AST size, and allows only numeric constants and arithmetic operators.

import ast
import operator
from collections.abc import Callable


MAX_EXPRESSION_CHARS = 256
MAX_AST_NODES = 64

# Allowlisted binary operators, keyed by AST operator node type.
_BINARY_OPERATORS: dict[type[ast.operator], Callable[[float, float], float]] = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
    ast.FloorDiv: operator.floordiv,
    ast.Mod: operator.mod,
}

# Allowlisted unary operators, keyed by AST unary operator node type.
_UNARY_OPERATORS: dict[type[ast.unaryop], Callable[[float], float]] = {
    ast.UAdd: operator.pos,
    ast.USub: operator.neg,
}


def _validate_expression_length(expression: str) -> None:
    if len(expression) > MAX_EXPRESSION_CHARS:
        raise ValueError("expression is too long")


def _validate_ast_size(parsed: ast.AST) -> None:
    if sum(1 for _ in ast.walk(parsed)) > MAX_AST_NODES:
        raise ValueError("expression is too complex")


def _evaluate_ast(node: ast.AST) -> float:
    if isinstance(node, ast.Expression):
        return _evaluate_ast(node.body)

    # bool is a subclass of int, so exclude True and False explicitly.
    if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)) and not isinstance(node.value, bool):
        return float(node.value)

    if isinstance(node, ast.BinOp):
        operator_fn = _BINARY_OPERATORS.get(type(node.op))
        if operator_fn is None:
            raise ValueError(f"unsupported operator: {type(node.op).__name__}")
        return operator_fn(_evaluate_ast(node.left), _evaluate_ast(node.right))

    if isinstance(node, ast.UnaryOp):
        operator_fn = _UNARY_OPERATORS.get(type(node.op))
        if operator_fn is None:
            raise ValueError(f"unsupported unary operator: {type(node.op).__name__}")
        return operator_fn(_evaluate_ast(node.operand))

    # Anything not matched above is outside the allowlist.
    raise ValueError(f"unsupported expression: {type(node).__name__}")


def safe_eval_math_expression(expression: str) -> float:
    expression = expression.strip()
    # Check the length before parsing so oversized input never reaches the parser.
    _validate_expression_length(expression)
    parsed = ast.parse(expression, mode="eval")
    _validate_ast_size(parsed)
    return _evaluate_ast(parsed)

This helper rejects function calls, names, attributes, comprehensions, imports, and other Python syntax because _evaluate_ast(...) raises on any node type it does not explicitly support.
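To see why the allowlist catches such input, it can help to list the node types that ast.parse(...) produces. The sketch below uses only the standard-library ast module; node_type_names is a hypothetical helper for inspection and is not part of the metric.

```python
import ast


def node_type_names(expression: str) -> list[str]:
    """Return the sorted, de-duplicated AST node type names in an expression."""
    parsed = ast.parse(expression, mode="eval")  # parses only; nothing is executed
    return sorted({type(node).__name__ for node in ast.walk(parsed)})


safe_nodes = node_type_names("(12 * 4) + 7")
unsafe_nodes = node_type_names("__import__('os').system('echo nope')")

print(safe_nodes)
print(unsafe_nodes)
```

The arithmetic expression contains only Expression, BinOp, Constant, and operator nodes. The second expression parses successfully but contains Call, Attribute, and Name nodes, none of which appear in the allowlist, so an evaluator that handles only allowlisted node types raises before anything runs.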

4. Put the Metric Together

Now combine the protocol methods and the safe evaluator.

The scoring method uses metric_input.candidate.output_text, where the evaluator stores the candidate output. The metric does not need to look for dataset-specific prediction columns or implement a row-level fallback. For offline rows, use field mapping to normalize your prediction column into the evaluator’s canonical output field before the metric runs. If the evaluator does not provide a candidate output, the metric treats the row as a failure.

import math

from nemo_evaluator_sdk import (
    Metric,
    MetricInput,
    MetricOutput,
    MetricOutputSpec,
    MetricResult,
)


class ArithmeticExpressionCorrectnessMetric(Metric):
    type = "arithmetic-expression-correctness"

    def output_spec(self) -> list[MetricOutputSpec]:
        return [
            MetricOutputSpec.boolean("valid_expression"),
            MetricOutputSpec.boolean("correct_value"),
        ]

    async def compute_scores(self, metric_input: MetricInput) -> MetricResult:
        expression = metric_input.candidate.output_text
        # A missing or empty candidate output fails both checks.
        if not expression:
            return MetricResult(
                outputs=[
                    MetricOutput(name="valid_expression", value=False),
                    MetricOutput(name="correct_value", value=False),
                ]
            )

        try:
            # safe_eval_math_expression is the helper defined in Step 3.
            actual = safe_eval_math_expression(expression)
        except (SyntaxError, ValueError, TypeError, ZeroDivisionError, OverflowError, RecursionError):
            return MetricResult(
                outputs=[
                    MetricOutput(name="valid_expression", value=False),
                    MetricOutput(name="correct_value", value=False),
                ]
            )

        expected = float(metric_input.row.data["expected"])
        # The per-row tolerance is optional and defaults to 1e-6.
        tolerance = float(metric_input.row.data.get("tolerance", 1e-6))
        correct_value = math.isclose(actual, expected, rel_tol=tolerance, abs_tol=tolerance)

        return MetricResult(
            outputs=[
                MetricOutput(name="valid_expression", value=True),
                MetricOutput(name="correct_value", value=correct_value),
            ]
        )
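The comparison passes the row's tolerance as both rel_tol and abs_tol. math.isclose(a, b) accepts a pair when abs(a - b) <= max(rel_tol * max(abs(a), abs(b)), abs_tol), so the relative term dominates for large answers and the absolute term dominates near zero. The values below are made up for illustration:

```python
import math

tolerance = 1e-6

# Large expected value: the relative term allows a difference of about 1.0.
large_close = math.isclose(1_000_000.5, 1_000_000.0, rel_tol=tolerance, abs_tol=tolerance)

# Mid-sized values: both terms are at most 1e-6, so a 1e-5 gap is rejected.
small_far = math.isclose(0.5, 0.50001, rel_tol=tolerance, abs_tol=tolerance)

# Expected value of zero: the relative term is negligible, so abs_tol decides.
near_zero = math.isclose(5e-7, 0.0, rel_tol=tolerance, abs_tol=tolerance)

print(large_close, small_far, near_zero)
```

If large integer answers should match more strictly than this, one option is to set a smaller tolerance on those dataset rows.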

5. Run the Metric with the Local Evaluator

Run the metric through the Evaluator SDK so you exercise the normal dataset loading, metric execution, aggregation, and result objects before submitting a service-side job.

For an offline evaluation, datasets often store predictions under task-specific column names such as model_expression, answer, or prediction. Use FieldMapping to map that column to the evaluator’s canonical output field. The Evaluator SDK then passes it to your metric as metric_input.candidate.output_text.

The local dataset includes correct expressions, a valid expression with the wrong value, and an invalid expression so you can see both metric outputs vary.

from nemo_evaluator_sdk import Evaluator, FieldMapping, RunConfig


dataset = [
    {
        "question": "A box has 12 rows of pencils with 4 pencils in each row. Then 7 pencils are added. How many pencils are there?",
        "expected": 55,
        "tolerance": 1e-6,
        "model_expression": "(12 * 4) + 7",
    },
    {
        "question": "A server processed 125 requests, then processed 3 more batches of 25 requests. How many requests were processed?",
        "expected": 200,
        "tolerance": 1e-6,
        "model_expression": "125 + 25",
    },
    {
        "question": "A tank starts with 90 liters and loses 18 liters each hour for 3 hours. How many liters remain?",
        "expected": 36,
        "tolerance": 1e-6,
        "model_expression": "90 - (18 * 3)",
    },
    {
        "question": "A package contains 12 items and 4 packages are used. How many items are used?",
        "expected": 48,
        "tolerance": 1e-6,
        "model_expression": "__import__('os').system('echo nope')",
    },
]

metric = ArithmeticExpressionCorrectnessMetric()
evaluator = Evaluator()
result = evaluator.run_sync(
    metrics=metric,
    dataset=dataset,
    config=RunConfig(parallelism=1),
    field_mapping=FieldMapping(output="model_expression"),
)

result.print_summary()

With this mapping in place, every row still preserves its original model_expression field in metric_input.row.data, and the metric receives the normalized candidate output through metric_input.candidate.output_text.
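Conceptually, the mapping copies one column into the candidate output while leaving the row untouched. The sketch below illustrates that idea with plain dicts. map_output_field is a hypothetical helper written for this explanation and is not how FieldMapping is implemented.

```python
from typing import Any, Optional


def map_output_field(row: dict[str, Any], output_column: str) -> tuple[dict[str, Any], Optional[str]]:
    """Hypothetical helper: return a copy of the row plus the mapped candidate text."""
    candidate_text = row.get(output_column)
    # The row copy keeps every original column, including the mapped one.
    return dict(row), None if candidate_text is None else str(candidate_text)


row = {"expected": 55, "model_expression": "(12 * 4) + 7"}
row_data, output_text = map_output_field(row, "model_expression")

print(row_data["model_expression"])
print(output_text)
```

In this sketch, a row without the mapped column yields None as the candidate text, which corresponds to the case the metric scores as a failure.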

6. Prepare Platform Resources

To run the same metric remotely, install and start NeMo Platform using the Setup guide. You also need:

A model entity that can be referenced as workspace/model-name. See Model Configuration for model setup details.
A platform secret for the model, if the model requires an API key.

The following snippet reads the model reference from the NMP_EVAL_MODEL_REF environment variable and the platform URL from NMP_BASE_URL, falling back to default/my-model and http://localhost:8080 when they are not set.

import os

from nemo_platform import NeMoPlatform


WORKSPACE = "custom-python-metrics"
MODEL_REF = os.environ.get("NMP_EVAL_MODEL_REF", "default/my-model")

client = NeMoPlatform(
    base_url=os.environ.get("NMP_BASE_URL", "http://localhost:8080"),
    workspace=WORKSPACE,
)

Create a Workspace

Create a workspace for the tutorial. If it already exists, continue using it.

from nemo_platform import ConflictError


try:
    client.workspaces.create(name=WORKSPACE)
    print(f"Workspace '{WORKSPACE}' created")
except ConflictError:
    print(f"Workspace '{WORKSPACE}' already exists, continuing...")

Register a Dataset Fileset

The dataset contains word problems with the expected numeric answer. The model will generate a Python arithmetic expression, and the custom metric will evaluate that expression.

import json
from pathlib import Path

from nemo_evaluator.sdk import FilesetRef


DATASET_NAME = "arithmetic-expression-data"
DATASET_FILE = "math-expressions.jsonl"

remote_rows = [
    {
        "question": "A box has 12 rows of pencils with 4 pencils in each row. Then 7 pencils are added. How many pencils are there?",
        "expected": 55,
        "tolerance": 1e-6,
    },
    {
        "question": "A server processed 125 requests, then processed 3 more batches of 25 requests. How many requests were processed?",
        "expected": 200,
        "tolerance": 1e-6,
    },
    {
        "question": "A tank starts with 90 liters and loses 18 liters each hour for 3 hours. How many liters remain?",
        "expected": 36,
        "tolerance": 1e-6,
    },
]

dataset_path = Path(DATASET_FILE)
dataset_path.write_text(
    "".join(json.dumps(row) + "\n" for row in remote_rows),
    encoding="utf-8",
)

try:
    fileset = client.files.filesets.create(
        name=DATASET_NAME,
        description="Math expression evaluation dataset",
        purpose="dataset",
        metadata={
            "dataset": {
                "schema": {
                    "type": "object",
                    "properties": {
                        "question": {"type": "string"},
                        "expected": {"type": "number"},
                        "tolerance": {"type": "number"},
                    },
                    "required": ["question", "expected"],
                    "additionalProperties": True,
                }
            }
        },
    )
    print(f"Created fileset: {fileset.workspace}/{fileset.name}")
except ConflictError:
    fileset = client.files.filesets.retrieve(name=DATASET_NAME)
    print(f"Fileset exists: {fileset.workspace}/{fileset.name}")

client.files.upload(
    fileset=fileset.name,
    local_path=str(dataset_path),
    remote_path=DATASET_FILE,
)

dataset_ref = FilesetRef(root=f"{fileset.workspace}/{fileset.name}").with_fragment(DATASET_FILE)
print(f"Using dataset: {dataset_ref.root}")

The nemo_evaluator_sdk package provides metric and runtime types, such as Metric, RunConfigOnlineModel, and ModelRef. The nemo_evaluator.sdk package provides platform submission helpers, such as FilesetRef, that are specific to durable evaluator jobs.
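A quick local check of a JSONL dataset file can catch malformed rows before a job reads them. The sketch below uses only the standard library and mirrors the required fields from the schema above. check_jsonl_rows is a hypothetical helper and is not part of either SDK package; it runs against a throwaway file so the example is self-contained.

```python
import json
import tempfile
from pathlib import Path

REQUIRED_FIELDS = ("question", "expected")


def check_jsonl_rows(path: Path) -> int:
    """Hypothetical check: return the number of valid rows or raise ValueError."""
    count = 0
    with path.open(encoding="utf-8") as handle:
        for line_number, line in enumerate(handle, start=1):
            if not line.strip():
                continue  # tolerate blank lines
            row = json.loads(line)  # json.JSONDecodeError is a ValueError subclass
            if not isinstance(row, dict):
                raise ValueError(f"line {line_number} is not a JSON object")
            missing = [field for field in REQUIRED_FIELDS if field not in row]
            if missing:
                raise ValueError(f"line {line_number} is missing {missing}")
            expected = row["expected"]
            if isinstance(expected, bool) or not isinstance(expected, (int, float)):
                raise ValueError(f"line {line_number}: 'expected' must be a number")
            count += 1
    return count


# Demonstrate on a temporary file so the sketch runs on its own.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "sample.jsonl"
    sample.write_text(
        json.dumps({"question": "What is 2 + 2?", "expected": 4}) + "\n",
        encoding="utf-8",
    )
    checked_rows = check_jsonl_rows(sample)

print(checked_rows)
```

To apply the same check to the tutorial dataset, you could call the helper with dataset_path after writing the file.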

7. Submit a Durable Evaluator Job

Pass CloudpickleMetricBundlePackager() to client.evaluator.submit(...) so the SDK serializes the metric object into the evaluator job spec. The job runtime hydrates the metric bundle before scoring rows. Use ModelRef to reference the platform model entity you configured with NMP_EVAL_MODEL_REF.

from nemo_evaluator.shared.metric_bundles.cloudpickle import CloudpickleMetricBundlePackager
from nemo_evaluator_sdk import InferenceParams, ModelRef, RunConfigOnlineModel


job = client.evaluator.submit(
    metric=metric,
    dataset=dataset_ref,
    config=RunConfigOnlineModel(
        parallelism=2,
        limit_samples=3,
        inference=InferenceParams(
            temperature=0.0,
            max_tokens=32,
        ),
    ),
    target=ModelRef(root=MODEL_REF),
    prompt_template=(
        "Return exactly one valid Python arithmetic expression that evaluates to the answer.\n"
        "Start immediately with the expression. Do not start with a newline.\n"
        "Do not include markdown, code fences, prose, units, the final answer, or any explanation.\n"
        "Use only numbers, whitespace, parentheses, and these operators: +, -, *, /, //, %.\n\n"
        "Question: {{item.question}}\n"
        "Expression:"
    ),
    metric_bundle_packager=CloudpickleMetricBundlePackager(),
)

print(f"Submitted job: {job.name}")
job.wait_until_done()
remote_result = job.get_result()
remote_result.print_summary()

Cloudpickle metric bundles execute serialized Python code when the job hydrates the metric. Use this path only for metric code that you fully understand and trust.
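This caution follows from the pickle format that cloudpickle builds on: loading a payload can call a callable chosen by whoever created it. The sketch below demonstrates the mechanism with the standard-library pickle module and a deliberately harmless callable. The Payload class is hypothetical and does not represent how the Evaluator packages metrics.

```python
import pickle


class Payload:
    """Hypothetical object whose pickled form names a callable to run on load."""

    def __reduce__(self):
        # On load, pickle calls sorted([3, 1, 2]) instead of rebuilding a Payload.
        return (sorted, ([3, 1, 2],))


serialized = pickle.dumps(Payload())
loaded = pickle.loads(serialized)  # the callable runs here, during loading

print(loaded)
print(isinstance(loaded, Payload))
```

The loaded value is the callable's return value rather than a Payload instance. A payload from an untrusted source could name a harmful callable in the same way, which is why the serialized metric should come only from code you trust.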

8. Inspect Results

The aggregate scores show how often the outputs followed the expression contract and how often the expression evaluated to the expected value.

for score in remote_result.aggregate_scores.scores:
    print(f"{score.name}: mean={score.mean}, count={score.count}, nan_count={score.nan_count}")

If valid_expression is low, inspect row scores before changing the metric. Some models return the right expression followed by explanation text, which is not a valid Python expression and should fail this metric.
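The sketch below shows this failure mode with the standard-library ast module. is_single_expression is a hypothetical helper that checks syntax only; it does not apply the metric's allowlist, so it is a debugging aid rather than a replacement for the metric.

```python
import ast


def is_single_expression(text: str) -> bool:
    """Return True when the text parses as exactly one Python expression."""
    try:
        ast.parse(text.strip(), mode="eval")
    except SyntaxError:
        return False
    return True


clean = is_single_expression("(12 * 4) + 7")
with_prose = is_single_expression("(12 * 4) + 7 because there are 12 rows")
with_answer = is_single_expression("(12 * 4) + 7 = 55")

print(clean, with_prose, with_answer)
```

Only the first string parses. Trailing prose or an appended "= 55" makes the whole output a syntax error, so the metric reports valid_expression as False for such rows even when the leading expression is right.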

Use row scores to debug the expressions:

for row in remote_result.row_scores:
    print("Question:", row.item["question"])
    print("Model output:", row.sample.get("output_text"))
    print("Metric outputs:", row.metrics)
    print()

Use the local result object from Step 5 the same way if you want to inspect the local run instead.

Best Practices

Run custom metrics locally with representative rows before submitting service-side jobs.
Keep metric code deterministic and side-effect free.
Prefer Python standard-library logic unless you know the dependency is available in the service runtime.
Emit separate outputs for format validity and task correctness so failures are easier to diagnose.
Pass an explicit metric bundle packager when submitting custom Python metrics as durable jobs.
Use FilesetRef for reusable datasets and ModelRef for platform-managed model routing.

Next Steps

Learn about built-in metric types in Evaluation Metrics.
Learn how to configure model targets in Model Configuration.
Learn how to inspect result artifacts in Evaluation Results.