> For clean Markdown of any page, append .md to the page URL.
> For a complete documentation index, see https://nemo-platform.docs.buildwithfern.com/nemo/platform/llms.txt.
> For AI client integration (Claude Code, Cursor, etc.), connect to the MCP server at https://nemo-platform.docs.buildwithfern.com/nemo/platform/_mcp/server.

# About Evaluating

<a id="nemo-ms-evaluator-about" />

Evaluation is powered by NeMo Platform, a cloud-native platform for evaluating large language models (LLMs), RAG pipelines, and AI agents at enterprise scale. The evaluation API provides automated workflows for over 100 industry benchmarks, LLM-as-a-judge scoring, and specialized metrics for RAG and agent systems.

NeMo Platform enables real-time evaluations of your LLM application through APIs, guiding you in refining and optimizing LLMs for enhanced performance and real-world applicability. The NeMo Evaluator APIs can be seamlessly automated within development pipelines, enabling faster iterations without the need for live data. It is cost-effective and suitable for pre-deployment checks and regression testing.

**Tutorials**

**Open Source SDK**

***

## How It Works: Library + Platform

Evaluator separates **evaluation definition** from **execution**.

The code snippets below are for conceptual demonstration purposes only.
For runnable examples see the [tutorials](/documentation/evaluate-models/tutorials) and [SDK resources](/documentation/evaluate-models/sdk-resources).

### 1. Build RunConfig with the Library

Use the `nemo_evaluator_sdk` package to define your metric, dataset rows, runtime configuration, and optional model or agent target:

```python
from nemo_evaluator_sdk import RunConfig, ExactMatchMetric


# Define metric logic
metric = ExactMatchMetric(
    reference="{{item.expected}}",
    candidate="{{item.output}}",
)

# Build evaluation input
dataset = [
    {"expected": "Paris", "output": "Paris"},
    {"expected": "Berlin", "output": "Munich"},
]
config = RunConfig(limit_samples=100, parallelism=8)
```

**The library handles:** Metric definitions, dataset row schemas, prompt templates, model and agent targets, runtime parameters, retries, aggregation, and typed result objects.

### 2. Execute on the Platform

Submit your evaluation to the Evaluator service using the NeMo Platform SDK:

```python
from nemo_evaluator.sdk import Evaluator
from nemo_platform import NeMoPlatform


sdk = NeMoPlatform(base_url="...", workspace="default")
evaluator: Evaluator = sdk.evaluator

# Fast local iteration through the plugin runtime
local_result = evaluator.run(metric=metric, dataset=dataset, config=config)

# Production evaluation as a durable platform job
job = evaluator.submit(metric=metric, dataset=dataset, config=config)
job.wait_until_done()
result = job.get_result()
```

**The platform handles:** Job orchestration, inference routing through NeMo Platform's Inference Gateway, Fileset-based datasets, distributed execution, artifact storage, status monitoring, and result download.

## Key Differences from Standalone Library

When using Evaluator as a NeMo Platform plugin :

| Feature               | Standalone Library                   | NeMo Platform Plugin                                                                                                                      |
| --------------------- | ------------------------------------ | ----------------------------------------------------------------------------------------------------------------------------------------- |
| **Execution**         | Local Python process                 | Local plugin runs for local experimentation and durable platform jobs for production                                                      |
| **Inference**         | Direct model or agent endpoint calls | The same as standalone and can also route through NeMo Platform Inference Gateway and platform-managed endpoints                          |
| **Datasets**          | Inline rows and local files          | Inline rows, local paths resolved at submission time, and NeMo Platform [Filesets](/documentation/get-started/core-concepts/manage-files) |
| **Results Artifacts** | Results stored in memory             | NeMo Platform artifact storage with typed result download                                                                                 |
| **Authentication**    | Local environment variables          | Local environment variables for local runs and NeMo Platform Secrets service for remote jobs                                              |

***

## Evaluation Concepts

NeMo Platform supports two core evaluation primitives:

* **Metrics**: Scoring logic that evaluates model outputs. Use metrics when you need flexible, reusable scoring for your own datasets and task-specific criteria.

There are two execution modes and two evaluation patterns:

* **Live evaluation (synchronous)**: Submit a request and get results immediately. Best for fast iteration, metric development, and small payloads.

* **Jobs (asynchronous)**: Submit work, monitor status, and fetch results when complete. Best for production workloads, larger datasets, and recurring regression checks.

* **Offline evaluation**: Score existing dataset rows (for example, model outputs already generated).

* **Online evaluation**: Generate outputs from a model as part of evaluation, then score them.

For deeper details, see [Evaluation Metrics](/documentation/evaluate-models/metrics).

***

## Tutorials

After [setting up a local instance of the platform](/documentation/get-started), use the following tutorials to learn how to accomplish common evaluation tasks. These step-by-step guides help you evaluate models using different benchmarks and metrics.

Learn how to evaluate a fine-tuned model using the LLM Judge metric with a custom dataset.

<small>
  custom-dataset
</small>

Learn how to write a domain-specific Python metric, test it locally, and run it through the Evaluator service.

<small>
  custom-metric
</small>

***

## Recommended Evaluation Journey

Most teams get the best results by starting metric-first, then moving to benchmarks:

1. **Develop and validate your metrics first**

* Start with [Metrics](/documentation/evaluate-models/metrics) to define how quality should be scored for your use case.
* Use live evaluation (`POST /v2/workspaces/{workspace}/evaluation/metric-evaluate`) with small `DatasetRows` payloads to iterate quickly.

1. **Scale metric evaluation to jobs**

* When metrics are validated, run async metric jobs (`/evaluation/metric-jobs`) on larger datasets.
* Use filesets for production-scale inputs. See [Manage Files](/documentation/get-started/core-concepts/manage-files).

1. **Monitor and analyze results**

* Track job status and progress with job management APIs.
* Retrieve results and artifacts for analysis, reporting, and regression tracking.

***

## Where to Go Next

* For metric workflows, see [Metric Jobs](/documentation/evaluate-models/metrics) and Metric Results.
* For full endpoint details, see the [Evaluator API Reference](/documentation/reference/api-reference).

***

### Available Evaluations

Review configurations, data formats, and result examples for each evaluation.

* **Retrieval** — Evaluate document retrieval pipelines on standard or custom datasets.
* **RAG** — Evaluate Retrieval Augmented Generation pipelines (retrieval plus generation).
* **Agentic** — Assess agent-based and multi-step reasoning models, including topic adherence and tool use.
* **LLM-as-a-Judge** — Use another LLM to evaluate outputs with flexible scoring criteria. Define custom rubrics or numerical ranges.
* **Similarity Metrics** — Create metrics for text similarity, exact matching, and standard NLP evaluations using Jinja2 templating.