Define and Run Custom Python Metrics

Custom Python metrics let you score model outputs with deterministic, domain-specific logic that is easier to express in code than in a generic metric or an LLM-as-a-judge prompt.

This tutorial shows how to evaluate a model that solves arithmetic word problems by returning a Python expression. This response format is useful for calculator-style systems because it makes the model output executable and easy to verify. You will define a metric that checks whether each expression is safe to evaluate and whether it produces the expected answer, test the metric locally, then submit the same metric as a durable Evaluator service job with a FilesetRef dataset and a ModelRef target.

What you will learn:

  • Define a Python metric with the Evaluator SDK metric protocol
  • Score model outputs with standard-library Python logic
  • Run the same metric through the local Evaluator SDK
  • Map dataset prediction columns into the evaluator’s candidate output field
  • Package the metric with the Cloudpickle metric bundle packager for remote execution
  • Submit a durable Evaluator service job with FilesetRef and ModelRef
  • Inspect aggregate and row-level metric results

Keep custom metric code dependency-light. Local execution runs your metric in the current Python environment, and remote execution hydrates the serialized metric in the evaluator job runtime. This tutorial uses only the Python standard library.

Prerequisites

Install the Evaluator SDK:

$ pip install "nemo-platform[all]"

Verify that the SDK imports:

import nemo_evaluator_sdk


print(nemo_evaluator_sdk.version)

You do not need a running NeMo Platform instance to define the metric or run it locally. Start NeMo Platform and configure platform resources before submitting the durable remote job.

1. Understand the Metric Contract

A custom metric implements the Metric protocol from nemo_evaluator_sdk. The protocol has one identifier property and two methods:

  • type: a public metric identifier used in result names and logs
  • output_spec(): declares every row-level output the metric can emit
  • compute_scores(...): scores one dataset row and one candidate output

compute_scores(...) receives a MetricInput object:

  • metric_input.row.data contains the original dataset row, including any canonical fields produced by field mapping.
  • metric_input.candidate.output_text contains the candidate output. For offline evaluations, the Evaluator SDK can populate this from a mapped dataset column. For online evaluations, it contains the generated model output.

The method returns a MetricResult whose outputs match the names declared by output_spec(). Output names must be stable because they become aggregate score names in the final result.

Start with the smallest possible shape:

from nemo_evaluator_sdk import Metric, MetricInput, MetricResult


class ArithmeticExpressionCorrectnessMetric(Metric):
    type = "arithmetic-expression-correctness"

    def output_spec(self): ...

    async def compute_scores(self, metric_input: MetricInput) -> MetricResult: ...

You will fill in output_spec() first, then the scoring logic.

2. Declare the Outputs

This metric should answer two yes-or-no questions for each model output:

  • valid_expression: True when the output is a safe arithmetic expression, otherwise False
  • correct_value: True when the expression evaluates to the expected answer, otherwise False

Declare those outputs with MetricOutputSpec.boolean(...):

from nemo_evaluator_sdk import MetricOutputSpec


def output_spec(self) -> list[MetricOutputSpec]:
    return [
        MetricOutputSpec.boolean("valid_expression"),
        MetricOutputSpec.boolean("correct_value"),
    ]

These names are part of the result contract. Each row result must emit exactly these outputs, and the aggregate result prefixes them with the metric identifier, such as arithmetic-expression-correctness.correct_value. Boolean outputs aggregate as rates, so the aggregate mean for correct_value is the fraction of rows that were correct.
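
As a quick illustration of the rate semantics, the sketch below averages a handful of row-level values by hand. It does not call the SDK, and the row values are made up for the example.

```python
# Hypothetical row-level values for the correct_value output (not real results).
correct_value_by_row = [True, False, True, True]

# bool is a subclass of int in Python, so summing counts the True rows.
rate = sum(correct_value_by_row) / len(correct_value_by_row)

print(f"arithmetic-expression-correctness.correct_value: mean={rate}")  # mean=0.75
```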

3. Write a Restricted Expression Evaluator

The metric needs to evaluate model output, but it must not execute arbitrary Python. ast.parse(...) parses Python source into an abstract syntax tree without executing it, but it is not a complete sandbox by itself. For untrusted model output, keep inputs bounded and evaluate only an explicit allowlist of AST node types.

This helper limits expression length, limits AST size, and allows only numeric constants and arithmetic operators.

import ast
import operator
from collections.abc import Callable


MAX_EXPRESSION_CHARS = 256
MAX_AST_NODES = 64

_BINARY_OPERATORS: dict[type[ast.operator], Callable[[float, float], float]] = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
    ast.FloorDiv: operator.floordiv,
    ast.Mod: operator.mod,
}

_UNARY_OPERATORS: dict[type[ast.unaryop], Callable[[float], float]] = {
    ast.UAdd: operator.pos,
    ast.USub: operator.neg,
}


def _validate_expression_size(expression: str, parsed: ast.AST) -> None:
    if len(expression) > MAX_EXPRESSION_CHARS:
        raise ValueError("expression is too long")

    if sum(1 for _ in ast.walk(parsed)) > MAX_AST_NODES:
        raise ValueError("expression is too complex")


def _evaluate_ast(node: ast.AST) -> float:
    if isinstance(node, ast.Expression):
        return _evaluate_ast(node.body)

    if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)) and not isinstance(node.value, bool):
        return float(node.value)

    if isinstance(node, ast.BinOp):
        operator_fn = _BINARY_OPERATORS.get(type(node.op))
        if operator_fn is None:
            raise ValueError(f"unsupported operator: {type(node.op).__name__}")
        return operator_fn(_evaluate_ast(node.left), _evaluate_ast(node.right))

    if isinstance(node, ast.UnaryOp):
        operator_fn = _UNARY_OPERATORS.get(type(node.op))
        if operator_fn is None:
            raise ValueError(f"unsupported unary operator: {type(node.op).__name__}")
        return operator_fn(_evaluate_ast(node.operand))

    raise ValueError(f"unsupported expression: {type(node).__name__}")


def safe_eval_math_expression(expression: str) -> float:
    expression = expression.strip()
    parsed = ast.parse(expression, mode="eval")
    _validate_expression_size(expression, parsed)
    return _evaluate_ast(parsed)

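This helper rejects function calls, names, attributes, comprehensions, string and boolean constants, and other expression syntax because _evaluate_ast(...) raises ValueError on any node type it does not explicitly support. Statements such as import never reach that check: ast.parse(..., mode="eval") raises SyntaxError for input that is not a single expression.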
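
To see which node types an allowlist like this encounters, you can list the AST node classes in an expression without evaluating it. This sketch uses only ast from the standard library; node_type_names is a hypothetical helper for illustration, not part of the tutorial's metric.

```python
import ast


def node_type_names(expression: str) -> list[str]:
    """Parse an expression (without executing it) and return its distinct node type names."""
    parsed = ast.parse(expression, mode="eval")
    return sorted({type(node).__name__ for node in ast.walk(parsed)})


arith = node_type_names("(12 * 4) + 7")
unsafe = node_type_names("__import__('os').system('echo nope')")

print(arith)   # ['Add', 'BinOp', 'Constant', 'Expression', 'Mult']
print(unsafe)  # ['Attribute', 'Call', 'Constant', 'Expression', 'Load', 'Name']
```

The second expression parses successfully, which is why parsing alone is not enough: the Call, Attribute, and Name nodes are what an allowlist-based evaluator has to refuse.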

4. Put the Metric Together

Now combine the protocol methods and the safe evaluator.

The scoring method uses metric_input.candidate.output_text, where the evaluator stores the candidate output. The metric does not need to look for dataset-specific prediction columns or implement a row-level fallback. For offline rows, use field mapping to normalize your prediction column into the evaluator’s canonical output field before the metric runs. If the evaluator does not provide a candidate output, the metric treats the row as a failure.

import math

from nemo_evaluator_sdk import (
    Metric,
    MetricInput,
    MetricOutput,
    MetricOutputSpec,
    MetricResult,
)


class ArithmeticExpressionCorrectnessMetric(Metric):
    type = "arithmetic-expression-correctness"

    def output_spec(self) -> list[MetricOutputSpec]:
        return [
            MetricOutputSpec.boolean("valid_expression"),
            MetricOutputSpec.boolean("correct_value"),
        ]

    async def compute_scores(self, metric_input: MetricInput) -> MetricResult:
        expression = metric_input.candidate.output_text
        if not expression:
            return MetricResult(
                outputs=[
                    MetricOutput(name="valid_expression", value=False),
                    MetricOutput(name="correct_value", value=False),
                ]
            )

        try:
            actual = safe_eval_math_expression(expression)
        except (SyntaxError, ValueError, TypeError, ZeroDivisionError, OverflowError, RecursionError):
            return MetricResult(
                outputs=[
                    MetricOutput(name="valid_expression", value=False),
                    MetricOutput(name="correct_value", value=False),
                ]
            )

        expected = float(metric_input.row.data["expected"])
        tolerance = float(metric_input.row.data.get("tolerance", 1e-6))
        correct_value = math.isclose(actual, expected, rel_tol=tolerance, abs_tol=tolerance)

        return MetricResult(
            outputs=[
                MetricOutput(name="valid_expression", value=True),
                MetricOutput(name="correct_value", value=correct_value),
            ]
        )
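
The comparison passes the same tolerance as both rel_tol and abs_tol, so math.isclose accepts a value when the difference is at most max(tolerance * max(|actual|, |expected|), tolerance). The standalone sketch below shows what that means in practice; matches is a hypothetical helper that mirrors the call in the metric, and the sample numbers are illustrative.

```python
import math

TOLERANCE = 1e-6  # same default the metric uses when a row has no "tolerance" field


def matches(actual: float, expected: float, tolerance: float = TOLERANCE) -> bool:
    """Mirror the metric's comparison: one tolerance used as both rel_tol and abs_tol."""
    return math.isclose(actual, expected, rel_tol=tolerance, abs_tol=tolerance)


print(matches(0.1 + 0.2, 0.3))    # True: floating-point noise is absorbed
print(matches(150.0, 200.0))      # False: a genuinely wrong value fails
print(matches(1e9 + 100.0, 1e9))  # True: the relative term grows with magnitude
print(matches(1e-7, 0.0))         # True: the absolute term applies near zero
```

If large expected values need a strict match, a smaller per-row tolerance is one option, since the relative term scales with the magnitude of the numbers being compared.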

5. Run the Metric with the Local Evaluator

Run the metric through the Evaluator SDK so you exercise the normal dataset loading, metric execution, aggregation, and result objects before submitting a service-side job.

For an offline evaluation, datasets often store predictions under task-specific column names such as model_expression, answer, or prediction. Use FieldMapping to map that column to the evaluator’s canonical output field. The Evaluator SDK then passes it to your metric as metric_input.candidate.output_text.

The local dataset includes correct expressions, a valid expression with the wrong value, and an invalid expression so you can see both metric outputs vary.

from nemo_evaluator_sdk import Evaluator, FieldMapping, RunConfig


dataset = [
    {
        "question": "A box has 12 rows of pencils with 4 pencils in each row. Then 7 pencils are added. How many pencils are there?",
        "expected": 55,
        "tolerance": 1e-6,
        "model_expression": "(12 * 4) + 7",
    },
    {
        "question": "A server processed 125 requests, then processed 3 more batches of 25 requests. How many requests were processed?",
        "expected": 200,
        "tolerance": 1e-6,
        "model_expression": "125 + 25",
    },
    {
        "question": "A tank starts with 90 liters and loses 18 liters each hour for 3 hours. How many liters remain?",
        "expected": 36,
        "tolerance": 1e-6,
        "model_expression": "90 - (18 * 3)",
    },
    {
        "question": "A package contains 12 items and 4 packages are used. How many items are used?",
        "expected": 48,
        "tolerance": 1e-6,
        "model_expression": "__import__('os').system('echo nope')",
    },
]

metric = ArithmeticExpressionCorrectnessMetric()
evaluator = Evaluator()
result = evaluator.run_sync(
    metrics=metric,
    dataset=dataset,
    config=RunConfig(parallelism=1),
    field_mapping=FieldMapping(output="model_expression"),
)

result.print_summary()

With this mapping in place, every row still preserves its original model_expression field in metric_input.row.data, and the metric receives the normalized candidate output through metric_input.candidate.output_text.
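
Conceptually, the mapping copies one column into the candidate output and leaves the row untouched. The sketch below mimics that behavior with plain dictionaries; apply_output_mapping and the keys it returns are hypothetical stand-ins for illustration, not the SDK's implementation or data model.

```python
from typing import Any


def apply_output_mapping(row: dict[str, Any], output_column: str) -> dict[str, Any]:
    """Hypothetical stand-in for field mapping: pair a row with its candidate output."""
    return {
        # The original row is preserved, including the mapped column.
        "row_data": dict(row),
        # None when the column is missing, which the metric scores as a failure.
        "candidate_output_text": row.get(output_column),
    }


row = {"expected": 55, "model_expression": "(12 * 4) + 7"}
mapped = apply_output_mapping(row, output_column="model_expression")

print(mapped["candidate_output_text"])         # (12 * 4) + 7
print(mapped["row_data"]["model_expression"])  # (12 * 4) + 7
```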

6. Prepare Platform Resources

To run the same metric remotely, install and start NeMo Platform using the Setup guide. You also need:

  • A model entity that can be referenced as workspace/model-name. See Model Configuration for model setup details.
  • A platform secret for the model, if the model requires an API key.

import os

from nemo_platform import NeMoPlatform


WORKSPACE = "custom-python-metrics"
MODEL_REF = os.environ.get("NMP_EVAL_MODEL_REF", "default/my-model")

client = NeMoPlatform(
    base_url=os.environ.get("NMP_BASE_URL", "http://localhost:8080"),
    workspace=WORKSPACE,
)

Create a Workspace

Create a workspace for the tutorial. If it already exists, continue using it.

from nemo_platform import ConflictError


try:
    client.workspaces.create(name=WORKSPACE)
    print(f"Workspace '{WORKSPACE}' created")
except ConflictError:
    print(f"Workspace '{WORKSPACE}' already exists, continuing...")

Register a Dataset Fileset

The dataset contains word problems with the expected numeric answer. The model will generate a Python arithmetic expression, and the custom metric will evaluate that expression.

import json
from pathlib import Path

from nemo_evaluator.sdk import FilesetRef


DATASET_NAME = "arithmetic-expression-data"
DATASET_FILE = "math-expressions.jsonl"

remote_rows = [
    {
        "question": "A box has 12 rows of pencils with 4 pencils in each row. Then 7 pencils are added. How many pencils are there?",
        "expected": 55,
        "tolerance": 1e-6,
    },
    {
        "question": "A server processed 125 requests, then processed 3 more batches of 25 requests. How many requests were processed?",
        "expected": 200,
        "tolerance": 1e-6,
    },
    {
        "question": "A tank starts with 90 liters and loses 18 liters each hour for 3 hours. How many liters remain?",
        "expected": 36,
        "tolerance": 1e-6,
    },
]

dataset_path = Path(DATASET_FILE)
dataset_path.write_text(
    "".join(json.dumps(row) + "\n" for row in remote_rows),
    encoding="utf-8",
)

try:
    fileset = client.files.filesets.create(
        name=DATASET_NAME,
        description="Math expression evaluation dataset",
        purpose="dataset",
        metadata={
            "dataset": {
                "schema": {
                    "type": "object",
                    "properties": {
                        "question": {"type": "string"},
                        "expected": {"type": "number"},
                        "tolerance": {"type": "number"},
                    },
                    "required": ["question", "expected"],
                    "additionalProperties": True,
                }
            }
        },
    )
    print(f"Created fileset: {fileset.workspace}/{fileset.name}")
except ConflictError:
    fileset = client.files.filesets.retrieve(name=DATASET_NAME)
    print(f"Fileset exists: {fileset.workspace}/{fileset.name}")

client.files.upload(
    fileset=fileset.name,
    local_path=str(dataset_path),
    remote_path=DATASET_FILE,
)

dataset_ref = FilesetRef(root=f"{fileset.workspace}/{fileset.name}").with_fragment(DATASET_FILE)
print(f"Using dataset: {dataset_ref.root}")

The nemo_evaluator_sdk package provides metric and runtime types, such as Metric, RunConfigOnlineModel, and ModelRef. The nemo_evaluator.sdk package provides platform submission helpers, such as FilesetRef, that are specific to durable evaluator jobs.
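
Before uploading a dataset file, it can help to read the JSONL back and confirm that every row has the fields the metric and the prompt rely on. This is a minimal standard-library sketch; find_invalid_rows, the sample rows, and the temporary file are illustrative and not part of either SDK package.

```python
import json
import tempfile
from pathlib import Path

REQUIRED_FIELDS = ("question", "expected")  # the fields marked "required" in the fileset schema


def find_invalid_rows(path: Path) -> list[int]:
    """Return 1-based line numbers of rows that are missing a required field."""
    invalid = []
    with path.open(encoding="utf-8") as handle:
        for line_number, line in enumerate(handle, start=1):
            row = json.loads(line)
            if any(field not in row for field in REQUIRED_FIELDS):
                invalid.append(line_number)
    return invalid


# Made-up rows: the second one is deliberately missing "expected".
rows = [
    {"question": "What is 2 + 3?", "expected": 5},
    {"question": "Missing the expected answer"},
]

with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = Path(tmp_dir) / "sample.jsonl"
    sample_path.write_text("".join(json.dumps(r) + "\n" for r in rows), encoding="utf-8")
    invalid_lines = find_invalid_rows(sample_path)

print(invalid_lines)  # [2]
```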

7. Submit a Durable Evaluator Job

Pass CloudpickleMetricBundlePackager() to client.evaluator.submit(...) so the SDK serializes the metric object into the evaluator job spec. The job runtime hydrates the metric bundle before scoring rows. Use ModelRef to reference the platform model entity you configured with NMP_EVAL_MODEL_REF.

from nemo_evaluator.shared.metric_bundles.cloudpickle import CloudpickleMetricBundlePackager
from nemo_evaluator_sdk import InferenceParams, ModelRef, RunConfigOnlineModel


job = client.evaluator.submit(
    metric=metric,
    dataset=dataset_ref,
    config=RunConfigOnlineModel(
        parallelism=2,
        limit_samples=3,
        inference=InferenceParams(
            temperature=0.0,
            max_tokens=32,
        ),
    ),
    target=ModelRef(root=MODEL_REF),
    prompt_template=(
        "Return exactly one valid Python arithmetic expression that evaluates to the answer.\n"
        "Start immediately with the expression. Do not start with a newline.\n"
        "Do not include markdown, code fences, prose, units, the final answer, or any explanation.\n"
        "Use only numbers, whitespace, parentheses, and these operators: +, -, *, /, //, %.\n\n"
        "Question: {{item.question}}\n"
        "Expression:"
    ),
    metric_bundle_packager=CloudpickleMetricBundlePackager(),
)

print(f"Submitted job: {job.name}")
job.wait_until_done()
remote_result = job.get_result()
remote_result.print_summary()

Cloudpickle metric bundles execute serialized Python code when the job hydrates the metric. Use this path only for metric code that you fully understand and trust.
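
To see why this matters, the sketch below uses the standard-library pickle module, whose format cloudpickle builds on, to show that loading a payload can call a function chosen by whoever created it. RunsOnLoad is a hypothetical class for illustration, and the callable is deliberately harmless.

```python
import pickle


class RunsOnLoad:
    """Hypothetical payload: unpickling calls whatever __reduce__ names."""

    def __reduce__(self):
        # Harmless stand-in for arbitrary code: pickle will call len("loaded") on load.
        return (len, ("loaded",))


payload = pickle.dumps(RunsOnLoad())

# Loading does not rebuild a RunsOnLoad instance; it returns the result of the call.
restored = pickle.loads(payload)

print(restored)  # 6
```

A real payload could name any importable callable in place of len, which is why only trusted metric code should go through this path.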

8. Inspect Results

The aggregate scores show how often the outputs followed the expression contract and how often the expression evaluated to the expected value.

for score in remote_result.aggregate_scores.scores:
    print(f"{score.name}: mean={score.mean}, count={score.count}, nan_count={score.nan_count}")

If valid_expression is low, inspect row scores before changing the metric. Some models return the right expression followed by explanation text, which is not a valid Python expression and should fail this metric.
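
The sketch below shows how common output shapes fare at the parsing stage alone. It uses only ast from the standard library; is_parseable_expression is a hypothetical helper, and the sample outputs are invented rather than taken from a real model run.

```python
import ast


def is_parseable_expression(text: str) -> bool:
    """Return True when the text parses as a single Python expression."""
    try:
        ast.parse(text.strip(), mode="eval")
    except SyntaxError:
        return False
    return True


print(is_parseable_expression("(12 * 4) + 7"))                            # True
print(is_parseable_expression("(12 * 4) + 7 because there are 12 rows"))  # False
print(is_parseable_expression("```python\n(12 * 4) + 7\n```"))            # False
print(is_parseable_expression("The answer is 55"))                        # False
```

When most failures look like the last three cases, tightening the prompt is usually a better first step than loosening the metric.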

Use row scores to debug the expressions:

for row in remote_result.row_scores:
    print("Question:", row.item["question"])
    print("Model output:", row.sample.get("output_text"))
    print("Metric outputs:", row.metrics)
    print()

Use the local result object from Step 5 the same way if you want to inspect the local run instead.

Best Practices

  • Run custom metrics locally with representative rows before submitting service-side jobs.
  • Keep metric code deterministic and side-effect free.
  • Prefer Python standard-library logic unless you know the dependency is available in the service runtime.
  • Emit separate outputs for format validity and task correctness so failures are easier to diagnose.
  • Pass an explicit metric bundle packager when submitting custom Python metrics as durable jobs.
  • Use FilesetRef for reusable datasets and ModelRef for platform-managed model routing.

Next Steps