# Evaluation Metrics

<a id="eval-metrics-index" />

Metrics define how to score the outputs of your models, agents, or pipelines.

## What is a metric?

A metric is a scoring definition that evaluates model or agent outputs. In the Evaluator plugin SDK, metrics are inline Python objects passed directly to `evaluator.run(...)` or `evaluator.submit(...)`.

* **Inputs**: For custom metrics, the scoring logic reads dataset fields and model outputs. Judge-based custom metrics also take judge-model inputs (for example, judge prompts, rubrics, and configuration).
* **Outputs**: Row-level scores and aggregate statistics.
* **Execution**: Metric objects run against a `dataset`, with optional runtime configuration and an optional model or agent target.

Terminology on this page:

* **Metric definition**: The reusable scoring configuration.
* **Metric type**: The metric family (for example, exact-match, BLEU, or LLM-as-a-judge).
* **Metric score**: The numeric or rubric output produced at evaluation time.

## The Evaluation Workflow

```text
[1] Choose and configure a metric object
 |
 v
[2] Select a dataset and execution mode
 |
 v
[3] Create and run an evaluation job
 |
 v
[4] Review row-level and aggregate scores
```

## Quick Start

Minimal sync evaluation with a built-in metric:

```python
import os

from nemo_evaluator.sdk import Evaluator
from nemo_platform import NeMoPlatform
from nemo_evaluator_sdk import ExactMatchMetric

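# Connect to the platform; NMP_BASE_URL overrides the local default URL.
client = NeMoPlatform(
    base_url=os.environ.get("NMP_BASE_URL", "http://localhost:8080"),
    workspace="default",
)
evaluator: Evaluator = client.evaluator

# Compare each row's `expected` field against its `output` field.
metric = ExactMatchMetric(reference="{{item.expected}}", candidate="{{item.output}}")

# Run synchronously on inline rows; the first row matches, the second does not.
result = evaluator.run(
    metric=metric,
    dataset=[
        {"expected": "Paris", "output": "Paris"},
        {"expected": "Berlin", "output": "Munich"},
    ],
)

# Statistics computed over all rows.
print(result.aggregate_scores)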
```
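
The `{{item.expected}}` and `{{item.output}}` placeholders in the example refer to fields of each dataset row. The sketch below illustrates that idea with a small standard-library stand-in for a template engine. It assumes the placeholders behave like simple field lookups on the row; `render_placeholders` and `PLACEHOLDER` are hypothetical names, not part of the SDK, and the platform's actual template handling may differ.

```python
import re

# Matches placeholders such as {{item.expected}} or {{ item.output }}.
# Hypothetical stand-in for a real template engine, for illustration only.
PLACEHOLDER = re.compile(r"\{\{\s*item\.(\w+)\s*\}\}")


def render_placeholders(template: str, row: dict) -> str:
    """Replace each {{item.<field>}} with the matching value from the row."""
    return PLACEHOLDER.sub(lambda m: str(row[m.group(1)]), template)


row = {"expected": "Berlin", "output": "Munich"}
reference = render_placeholders("{{item.expected}}", row)
candidate = render_placeholders("{{item.output}}", row)
print(reference, candidate)  # Berlin Munich
```

Under this assumption, each row supplies its own reference and candidate strings, which is why the same metric object can be applied to every row in the dataset.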

## Execution Modes

Metrics can be executed in two modes:

| Mode                | Use Case                                                         | Response                    |
| ------------------- | ---------------------------------------------------------------- | --------------------------- |
| **Live Evaluation** | Rapid prototyping, developing metrics, testing configurations.   | Immediate (synchronous)     |
| **Job Evaluation**  | Production workloads, full datasets, durability, and persistence. | Async (poll for completion) |
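
The "poll for completion" pattern for job evaluation can be sketched generically as follows. This page does not show the SDK's job object or its status values, so everything here is an assumption: `wait_for_completion`, the `get_status` callable, and the status strings are hypothetical names used only to illustrate the loop.

```python
import time
from typing import Callable


def wait_for_completion(
    get_status: Callable[[], str],
    poll_interval_s: float = 2.0,
    timeout_s: float = 600.0,
    # Hypothetical terminal states; check the SDK for the real values.
    terminal_states: frozenset = frozenset({"completed", "failed", "cancelled"}),
) -> str:
    """Call get_status() until it returns a terminal state or the timeout passes."""
    deadline = time.monotonic() + timeout_s
    while True:
        status = get_status()
        if status in terminal_states:
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(f"job still '{status}' after {timeout_s} seconds")
        time.sleep(poll_interval_s)


# Stand-in for a real status call: reports "running" twice, then "completed".
_fake_statuses = iter(["running", "running", "completed"])
final_status = wait_for_completion(lambda: next(_fake_statuses), poll_interval_s=0.01)
print(final_status)  # completed
```

In real use, `get_status` would wrap whatever call the SDK provides for reading a submitted job's state. A timeout keeps a stuck job from blocking the caller indefinitely.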

### Online Job Targets: Model or Agent

Online evaluation jobs can target either a **model** (an OpenAI-compatible chat completions endpoint) or an **agent** (any HTTP endpoint, including agentic systems with tool use and multi-step reasoning). Provide one or the other; the platform routes the request to the correct job type automatically.

| Target    | When to use                                                                                                 |
| --------- | ----------------------------------------------------------------------------------------------------------- |
| **Model** | Standalone LLM endpoints using a standard chat completions API.                                             |
| **Agent** | Agentic systems, NeMo Agent Toolkit workflows, or custom HTTP endpoints with non-standard response formats. |

See [Model Configuration](/documentation/evaluate-models/metrics/model-configuration) and [Agent Configuration](/documentation/evaluate-models/metrics/agent-configuration) for setup details.

## Built-in vs. Custom Metrics

* **Built-in metrics**: Ready-to-use metrics provided by NeMo Platform (for example, `exact-match`, `bleu`, and `rouge`).
* **Custom metrics**: Metrics you define for domain-specific evaluation needs.

To configure inline metric objects, see [Manage Metrics](/documentation/evaluate-models/metrics/manage-metrics).
For custom metric creation guides, start with [Similarity Metrics](/documentation/evaluate-models/metrics/similarity-metrics), [LLM-as-a-Judge](/documentation/evaluate-models/metrics/llm-as-a-judge), or [Bring Your Own Metric](/documentation/evaluate-models/metrics/bring-your-own-metric).

## Datasets

Evaluation jobs need dataset input. You can provide data in two ways:

| Dataset Source  | Description                                                                                                          | Best For                              |
| --------------- | -------------------------------------------------------------------------------------------------------------------- | ------------------------------------- |
| **DatasetRows** | Inline rows sent directly in the request                                                                             | Quick testing and live evaluation     |
| **FilesetRef**  | Reference to a persisted [fileset](/documentation/get-started/core-concepts/manage-files) (`workspace/fileset-name`) | Production jobs and reusable datasets |

A `FilesetRef` can append `#` followed by a path or glob to select specific files within the fileset:

```python
# Include all files in subdirectory
dataset = "my-workspace/my-dataset#subdir/path"

# Single file
dataset = "my-workspace/my-dataset#file.jsonl"

# Single file in a subdirectory
dataset = "my-workspace/my-dataset#subdir/path/file.jsonl"

# Glob match files
dataset = "my-workspace/my-dataset#*.jsonl"

# Glob match files in subdirectory
dataset = "my-workspace/my-dataset#subdir/path/*.jsonl"
```
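
To make the structure of these references explicit, here is a minimal sketch that splits a reference string into its parts. It assumes the `workspace/fileset-name#path-or-glob` layout shown in the examples above; `parse_fileset_ref` and `ParsedFilesetRef` are hypothetical helpers, not SDK functions, and the sketch does not reproduce how the platform resolves globs.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ParsedFilesetRef:
    workspace: str
    fileset: str
    path: Optional[str]  # file, directory, or glob after '#'; None if absent
    is_glob: bool


def parse_fileset_ref(ref: str) -> ParsedFilesetRef:
    """Split 'workspace/fileset-name[#path-or-glob]' into its parts."""
    location, sep, fragment = ref.partition("#")
    workspace, slash, fileset = location.partition("/")
    if not slash or not workspace or not fileset:
        raise ValueError(f"expected 'workspace/fileset-name', got {location!r}")
    path = fragment if sep and fragment else None
    # Treat common wildcard characters as a sign that the path is a glob.
    is_glob = path is not None and any(ch in path for ch in "*?[")
    return ParsedFilesetRef(workspace, fileset, path, is_glob)


print(parse_fileset_ref("my-workspace/my-dataset#subdir/path/*.jsonl"))
```

A helper like this can be useful for validating references in your own tooling before submitting a job.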

## Available Metric Types

Use the metric-type pages below to create and configure custom metrics.

* Use another LLM to evaluate outputs with flexible scoring criteria. Define custom rubrics or numerical ranges. Tags: `custom-scoring`, `rubrics`.
* Evaluate agent workflows including tool calling accuracy, goal completion, and topic adherence. Tags: `RAGAS`, `tool-calling`.
* Evaluate RAG pipelines for retrieval quality and answer generation using RAGAS metrics. Tags: `faithfulness`, `relevancy`.
* Create metrics for text similarity, exact matching, and standard NLP evaluations using Jinja2 templating. Tags: `F1`, `ROUGE`, `BLEU`.
* Integrate custom evaluation endpoints for domain-specific scoring. Tags: `remote`, `custom`.
* Configure agent endpoints (generic or NeMo Agent Toolkit) as targets for online evaluation jobs. Tags: `agent`, `NAT`.

## Understanding Scores

Scores are the metric outputs produced during evaluation:

| Score Type           | Meaning                           | Typical Use                              |
| -------------------- | --------------------------------- | ---------------------------------------- |
| **Row scores**       | Score(s) for each dataset row     | Debugging failures and error analysis    |
| **Aggregate scores** | Statistics computed over all rows | Tracking overall quality and regressions |
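
The relationship between the two score types can be illustrated with plain Python. This is a conceptual sketch only: it assumes a strict string comparison and a few simple summary statistics, and it does not reproduce the normalization or the exact statistics that the built-in metrics report.

```python
from statistics import mean

rows = [
    {"expected": "Paris", "output": "Paris"},
    {"expected": "Berlin", "output": "Munich"},
]

# Row scores: one value per dataset row (1.0 for an exact string match, else 0.0).
row_scores = [1.0 if row["expected"] == row["output"] else 0.0 for row in rows]

# Aggregate scores: statistics computed over all row scores.
aggregate = {
    "count": len(row_scores),
    "mean": mean(row_scores),
    "min": min(row_scores),
    "max": max(row_scores),
}

print(row_scores)  # [1.0, 0.0]
print(aggregate)   # {'count': 2, 'mean': 0.5, 'min': 0.0, 'max': 1.0}
```

Row scores point to the specific rows that failed, while the aggregate values give a single number to compare across runs.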

## Manage Metric Definitions

Create inline metric objects that can be reused from Python helpers or modules. See [Manage Metrics](/documentation/evaluate-models/metrics/manage-metrics) for SDK patterns.