Evaluation Metrics
Metrics define how to score the outputs of your models, agents, or pipelines.
What is a metric?
A metric is a scoring definition that evaluates model or agent outputs. In the Evaluator plugin SDK, metrics are inline Python objects passed directly to evaluator.run(...) or evaluator.submit(...).
- Inputs: For custom metrics, the scoring logic draws on dataset fields and model outputs. Judge-based custom metrics also take judge-model inputs (for example, judge prompts, rubrics, and configuration).
- Outputs: Row-level scores and aggregate statistics.
- Execution: Metric objects run with a dataset, optional runtime configuration, and an optional model or agent target.
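As a rough illustration of the inputs and outputs described above, the sketch below scores a few rows with a hand-written exact-match function and reports row-level scores plus an aggregate mean. It is plain Python and does not use the Evaluator SDK: the names `exact_match` and `evaluate_rows`, the `reference` and `output` fields, and the trim-and-lowercase normalization are all assumptions made for this example, and the built-in exact-match metric may normalize differently.

```python
from statistics import mean


def exact_match(reference: str, output: str) -> float:
    """Return 1.0 when the strings match after trimming and lowercasing (assumed normalization)."""
    return 1.0 if reference.strip().lower() == output.strip().lower() else 0.0


def evaluate_rows(rows: list[dict]) -> dict:
    """Score every row, then summarize the row-level scores."""
    row_scores = [exact_match(r["reference"], r["output"]) for r in rows]
    return {
        "row_scores": row_scores,
        "aggregate": {"mean": mean(row_scores), "count": len(row_scores)},
    }


# Hypothetical dataset rows: each pairs a reference answer with a model output.
rows = [
    {"reference": "Paris", "output": "paris"},
    {"reference": "4", "output": "5"},
    {"reference": "blue", "output": " blue "},
    {"reference": "yes", "output": "no"},
]

result = evaluate_rows(rows)
print(result["row_scores"])
print(result["aggregate"])
```

Keeping the per-row function separate from the aggregation step mirrors the split between row-level scores and aggregate statistics.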
Terminology on this page:
- Metric definition: The reusable scoring configuration.
- Metric type: The metric family (for example, exact-match, BLEU, or LLM-as-a-judge).
- Metric score: The numeric or rubric output produced at evaluation time.
The Evaluation Workflow
Quick Start
Minimal sync evaluation with a built-in metric:
Execution Modes
Metrics can be executed in two modes:
Online Job Targets: Model or Agent
Online evaluation jobs can target either a model (an OpenAI-compatible chat completions endpoint) or an agent (any HTTP endpoint, including agentic systems with tool use and multi-step reasoning). Provide one or the other; the platform routes your request to the correct job type automatically.
See Model Configuration and Agent Configuration for setup details.
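The "one or the other" rule can be pictured as a small validation step. The sketch below is illustrative only: `resolve_target`, the dictionary shape, the placeholder URLs, and the returned labels are hypothetical and are not part of the SDK or the platform API.

```python
from typing import Optional


def resolve_target(model: Optional[dict] = None, agent: Optional[dict] = None) -> str:
    """Return a hypothetical job-type label for an online evaluation target."""
    # Reject both "neither given" and "both given": exactly one target is allowed.
    if (model is None) == (agent is None):
        raise ValueError("Provide exactly one of `model` or `agent`.")
    return "model" if model is not None else "agent"


# Placeholder endpoints, not real services.
print(resolve_target(model={"url": "http://localhost:8000/v1/chat/completions"}))
print(resolve_target(agent={"url": "http://localhost:9000/run"}))
```

Checking the target up front means a misconfigured job fails with a clear message before any request is sent.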
Built-in vs. Custom Metrics
- Built-in metrics: Ready-to-use metrics provided by NeMo Platform (for example, exact-match, bleu, rouge).
- Custom metrics: Metrics you define for domain-specific evaluation needs.
To configure inline metric objects, see Manage Metrics. For custom metric creation guides, start with Similarity Metrics, LLM-as-a-Judge, or Bring Your Own Metric.
Datasets
Evaluation jobs need dataset input. You can provide data in two ways:
Example of providing a FilesetRef to reference specific files or globs:
Available Metric Types
Use the metric-type pages below to create and configure custom metrics.
- Use another LLM to evaluate outputs with flexible scoring criteria. Define custom rubrics or numerical ranges. Tags: custom-scoring, rubrics.
- Evaluate agent workflows, including tool-calling accuracy, goal completion, and topic adherence. Tags: RAGAS, tool-calling.
- Evaluate RAG pipelines for retrieval quality and answer generation using RAGAS metrics. Tags: faithfulness, relevancy.
- Create metrics for text similarity, exact matching, and standard NLP evaluations using Jinja2 templating. Tags: F1, ROUGE, BLEU.
- Integrate custom evaluation endpoints for domain-specific scoring. Tags: remote, custom.
- Configure agent endpoints (generic or NeMo Agent Toolkit) as targets for online evaluation jobs. Tags: agent, NAT.
Understanding Scores
Scores are the metric outputs produced during evaluation:
Manage Metric Definitions
Create inline metric objects that can be reused from Python helpers or modules. See Manage Metrics for SDK patterns.