Agentic Evaluation Metrics
Evaluate agent-based and multi-step reasoning models using metrics powered by RAGAS. These metrics assess tool calling accuracy, goal completion, topic adherence, and answer correctness in agentic workflows.
Agent Workflow Evaluation Stages
Key stages of agent workflow evaluation:

1. Intermediate Steps Evaluation: Assesses the correctness of intermediate steps during agent execution:
- Tool Use: Validates that the agent invoked the right tools with correct arguments at each step. Refer to Tool Call Accuracy for implementation details.
2. Final-Step Evaluation: Evaluates the quality of the agent’s final output using:
- Agent Goal Accuracy: Measures whether the agent successfully completed the requested task. Refer to Agent Goal Accuracy.
- Topic Adherence: Assesses how well the agent maintained focus on the assigned topic throughout the conversation. Refer to Topic Adherence.
- Answer Accuracy: Evaluates the factual correctness of agent answers. Refer to Answer Accuracy.
- Custom Metrics: For domain-specific or custom evaluation criteria, use LLM-as-a-Judge with the data task type.
3. Trajectory Evaluation: Evaluates the agent’s decision-making process by analyzing the entire sequence of actions taken to accomplish a goal. This includes assessing whether the agent chose appropriate tools in the correct order. Refer to Trajectory Evaluation for the expected data format and current plugin SDK support.
Online vs Offline Evaluation
Agentic metrics support two execution patterns through the Evaluator plugin SDK:
Offline Evaluation
Offline evaluation scores pre-generated responses or tool calls already present in the dataset:
- Dataset rows are passed inline with dataset=[...], as a file Path, or as a FilesetRef.
- No model or agent generation is performed.
- Use this mode to evaluate existing agent outputs or compare different response strategies.
Online Target Generation
Online target generation first calls a model or agent target, then evaluates the generated output:
- Configure the target with Model or Agent from nemo_evaluator_sdk.
- Pass the target through target=....
- Use RunConfigOnlineModel for model targets and RunConfigOnline for agent targets.
- Include a prompt_template when the dataset row must be transformed into a model or agent request.
Response Usage: In online target generation, the generated response is used as the metric response. A dataset response column is optional and is superseded by the generated response for that run.
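The precedence described above can be pictured with a small standalone sketch. It does not call the plugin SDK; resolve_metric_response is a hypothetical helper, and generated stands in for whatever an online target would return.

```python
from typing import Optional


def resolve_metric_response(row: dict, generated: Optional[str]) -> Optional[str]:
    """Pick the response that would be scored for one dataset row.

    Hypothetical helper: `generated` stands in for the output of an online
    target call; None means no generation happened (offline evaluation).
    """
    if generated is not None:
        # Online: the generated output supersedes any dataset response column.
        return generated
    # Offline: fall back to the response already present in the row.
    return row.get("response")


row = {"user_input": "What is 2 + 2?", "response": "5"}
offline = resolve_metric_response(row, None)   # uses the dataset column
online = resolve_metric_response(row, "4")     # uses the generated output
print(offline, online)
```

In this sketch the dataset response is only consulted when nothing was generated, which mirrors the rule that a response column is optional for online runs.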
Overview
Agentic metrics evaluate different aspects of agent behavior:
Use evaluator.run(...) for local in-process evaluation and evaluator.submit(...) for durable remote platform jobs. The examples below use inline dataset rows through dataset=[...], but you can use a file Path or a FilesetRef instead.
Prerequisites
Before running agentic evaluations:
- Workspace: Have a workspace created. All remote resources, including secrets and jobs, are scoped to a workspace.
- Judge LLM endpoint (for most metrics): Have access to an LLM that will serve as your judge.
- API key secret (if judge requires auth): If your judge endpoint requires authentication, create a secret to store the API key. For local run versus remote submit behavior, see Model API Authentication.
- Initialize the SDK:
Creating a Secret for API Keys
If using external endpoints, such as Build NVIDIA API endpoints (https://integrate.api.nvidia.com/v1), create a secret first:
SDK Types Reference
The plugin SDK examples use context-agnostic metric classes and runtime value classes:
Use dataset=[...] for inline rows. For offline scoring options, use config=RunConfig(parallelism=...). When outputs must be generated before scoring, pass target=Model(...) or target=Agent(...) plus the corresponding online parameters. Use the same dataset, config, and target arguments for both evaluator.run(...) and evaluator.submit(...); durable jobs follow the same pattern as local runs and differ only in that you wait for the job to finish and then fetch its results.
Agentic Metrics
Tool Call Accuracy
Evaluates whether the agent invoked the correct tools with the correct arguments. This metric does not require a judge LLM.
Online/offline support: Tool Call Accuracy supports scoring existing tool calls directly. It can also score tool calls produced during online target generation when the target response includes the required tool-call fields.
Data Format
Run Locally
Submit Job
Result
Tool Calling (Template)
A template-based metric for evaluating tool/function call accuracy. Unlike the RAGAS Tool Call Accuracy metric, this metric uses configurable templates and produces multiple scores.
Online/offline support: Tool Calling supports scoring existing tool-call outputs directly. Use online target generation when you want the model or agent target to produce the response first.
Scores Produced
- function_name_accuracy - Accuracy of function names only
- function_name_and_args_accuracy - Accuracy of both function names and arguments
Data Format
Data must use OpenAI-compliant tool calling format:
- Function names with dots (.) must be replaced with underscores (_).
- Comparison is case-sensitive.
- Order of tool calls is ignored, which supports parallel tool calling.
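The three rules above can be illustrated with a standalone sketch. This is not the metric's implementation (which is template-based); normalize_call and score_row are hypothetical helpers, and treating each row as an all-or-nothing match is an assumption made for illustration. Only the two score names and the OpenAI-style tool_calls shape come from this page.

```python
import json


def normalize_call(call: dict) -> tuple:
    """Return a sortable (name, canonical-arguments) pair for one tool call."""
    fn = call["function"]
    name = fn["name"].replace(".", "_")  # dots must become underscores
    args = fn.get("arguments", {})
    if isinstance(args, str):  # OpenAI responses encode arguments as a JSON string
        args = json.loads(args) if args else {}
    # Canonical JSON so key order inside the arguments does not matter.
    return name, json.dumps(args, sort_keys=True)


def score_row(predicted: list, expected: list) -> dict:
    """Assumed per-row scoring: 1.0 on an exact multiset match, else 0.0."""
    pred = [normalize_call(c) for c in predicted]
    exp = [normalize_call(c) for c in expected]
    # Sorting ignores call order; strings are compared case-sensitively.
    names_match = sorted(n for n, _ in pred) == sorted(n for n, _ in exp)
    full_match = sorted(pred) == sorted(exp)
    return {
        "function_name_accuracy": float(names_match),
        "function_name_and_args_accuracy": float(full_match),
    }


expected = [
    {"type": "function", "function": {"name": "weather.get", "arguments": {"city": "Paris"}}},
    {"type": "function", "function": {"name": "get_time", "arguments": {"tz": "CET"}}},
]
predicted = [  # same tools in a different order, one argument differs
    {"type": "function", "function": {"name": "get_time", "arguments": '{"tz": "UTC"}'}},
    {"type": "function", "function": {"name": "weather_get", "arguments": '{"city": "Paris"}'}},
]
scores = score_row(predicted, expected)
print(scores)
```

Here the names match once dots are normalized and order is ignored, while the differing tz argument makes the combined name-and-arguments score fail.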
Run Locally
Submit Job
Result
Topic Adherence
Measures how well the agent maintained focus on assigned topics throughout a conversation. Supports F1, precision, or recall scoring modes.
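As a reminder of how the three scoring modes relate, the sketch below computes precision, recall, and F1 from raw counts. It is a generic illustration, not the metric's code: how the judge classifies each turn as a true positive, false positive, or false negative is determined by RAGAS, and the function name and mode strings here are assumptions.

```python
def topic_adherence_score(tp: int, fp: int, fn: int, mode: str = "f1") -> float:
    """Combine per-conversation counts into a single score.

    tp: on-topic queries the agent answered
    fp: off-topic queries the agent answered anyway
    fn: on-topic queries the agent failed to answer
    (These count definitions are an assumption for illustration.)
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if mode == "precision":
        return precision
    if mode == "recall":
        return recall
    if mode == "f1":
        total = precision + recall
        # Harmonic mean of precision and recall; 0.0 when both are zero.
        return 2 * precision * recall / total if total else 0.0
    raise ValueError(f"unknown mode: {mode!r}")


p = topic_adherence_score(3, 1, 2, mode="precision")  # 3 / 4
r = topic_adherence_score(3, 1, 2, mode="recall")     # 3 / 5
f1 = topic_adherence_score(3, 1, 2, mode="f1")
print(p, r, round(f1, 4))
```

Precision penalizes answering off-topic queries, recall penalizes failing on in-scope ones, and F1 balances the two.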
Data Format
Configuration Options
Run Locally
Submit Job
Online Target Generation
Result
Agent Goal Accuracy
Evaluates whether the agent successfully completed the requested task. Returns a binary score (0 or 1). Supports evaluation with or without a reference outcome.
With Reference
Compare the agent’s outcome against a known reference:
Data Format
Configuration Options
Run Locally
Submit Job
Result
Without Reference
The judge LLM infers the goal from the conversation context:
Data Format
Run Locally
Submit Job
Answer Accuracy
Evaluates the factual correctness of an agent’s answer by comparing it against a reference answer. Two LLM judges independently rate the agreement, and their ratings are averaged. Scores range from 0 (incorrect) through 0.5 (partial match) to 1 (exact match).
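The averaging step can be sketched as follows. This is a simplified illustration, assuming each judge's rating has already been mapped onto the 0 / 0.5 / 1 scale described above; answer_accuracy and ALLOWED_RATINGS are hypothetical names, and the real metric obtains the ratings from judge LLM calls.

```python
ALLOWED_RATINGS = (0.0, 0.5, 1.0)  # incorrect, partial match, exact match


def answer_accuracy(judge_a: float, judge_b: float) -> float:
    """Average two independent judge ratings into one score."""
    for rating in (judge_a, judge_b):
        if rating not in ALLOWED_RATINGS:
            raise ValueError(f"unexpected judge rating: {rating!r}")
    return (judge_a + judge_b) / 2


agree = answer_accuracy(1.0, 1.0)     # both judges see an exact match
disagree = answer_accuracy(1.0, 0.5)  # judges differ by one level
print(agree, disagree)
```

Under this assumption, a score between the three anchor values indicates that the two judges disagreed.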
Data Format
Run Locally
Submit Job
Online Target Generation
Result
Trajectory Evaluation
Evaluates the agent’s decision-making process by analyzing the entire sequence of actions (trajectory) taken to accomplish a goal. This system metric assesses whether the agent chose appropriate tools in the correct order.
NAT Format Requirement: This metric supports the NeMo Agent Toolkit (NAT) format with intermediate_steps containing detailed event traces.
Current plugin SDK support: The current plugin SDK does not expose a typed trajectory-evaluation metric class. Use the data-format details below when preparing datasets for environments where the system metric is enabled, but do not use the old generated evaluator job APIs for plugin SDK execution.
Data Format
Each data entry must follow the NeMo Agent Toolkit format:
Parameters
Judge Configuration
Most agentic metrics require a judge LLM. Configure the judge model in the metric definition:
Recommended model size: Use a 70B+ parameter model as the judge for reliable results. Smaller models may fail to follow the required output schema, causing parsing errors.
Using Reasoning Models
For models that support extended reasoning, such as nvidia/llama-3.3-nemotron-super-49b-v1, add system prompt and reasoning parameters to online model generation:
Managing Secrets for Authenticated Endpoints
If your judge endpoint requires an API key, store it as a secret:
For more details on secret management, refer to Managing Secrets.
For local run versus remote submit behavior of api_key_secret, see Model API Authentication.
Job Management
After submitting a durable remote job with evaluator.submit(metric=metric, dataset=dataset), use the returned job resource to monitor execution and retrieve results:
Refer to Metrics Job Management for more job lifecycle details.
Dataset Notes
The current plugin SDK examples use inline dataset rows through dataset=[...]. Keep each row shaped for the selected metric, including fields such as user_input, response, reference, reference_topics, reference_tool_calls, or OpenAI-style tool_calls as required.
Use RunConfig(limit_samples=...) when you want to test a small slice of a larger inline dataset before submitting the full request.
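Before running or submitting, it can help to check that every inline row carries the fields the selected metric expects. The sketch below is a standalone pre-flight check, not part of the SDK; missing_fields is a hypothetical helper, and the required list shown is an assumption for a reference-based metric, so adjust it to the metric's documented data format.

```python
def missing_fields(row: dict, required: list) -> list:
    """Return the required field names that are absent or empty in a row."""
    return [f for f in required if f not in row or row[f] in (None, "")]


# Assumed field set for a reference-based metric; not taken from the SDK.
required = ["user_input", "response", "reference"]

rows = [
    {"user_input": "Capital of France?", "response": "Paris", "reference": "Paris"},
    {"user_input": "Capital of Spain?", "response": ""},  # incomplete row
]

# Map row index -> missing fields, keeping only rows that have a problem.
problems = {}
for index, row in enumerate(rows):
    missing = missing_fields(row, required)
    if missing:
        problems[index] = missing
print(problems)
```

Running a check like this on a small slice first surfaces malformed rows before they reach a judge call.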
Important Notes
- Execution choice: Use run for local in-process evaluation and submit for durable remote jobs with wait_until_done() and get_result().
- Column Names: RAGAS metrics use specific column names:
  - user_input (not question)
  - response (not answer)
  - reference (not ground_truth)
- Judge Model Quality: For metrics requiring a judge, evaluation quality depends on the judge model’s ability to follow instructions. Larger models (70B+) produce more consistent results.
- RAGAS Dependency: These metrics are powered by RAGAS and may have version-specific behavior.
- NaN troubleshooting for judge-based metrics: If you see nan_count > 0 with mean = null, check judge model authentication first (API key secret, endpoint access, and model permissions). See Model API Authentication for api_key_secret behavior. Some RAGAS metrics are known to convert auth failures into NaN scores instead of raising a hard error.
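If an existing dataset still uses the older column names mentioned in the notes above, a small conversion step can rename them before evaluation. The sketch below is a standalone helper, not an SDK function; to_ragas_columns and LEGACY_TO_RAGAS are hypothetical names, and only the three column pairs come from this page.

```python
# Old name -> name expected by the RAGAS-backed metrics.
LEGACY_TO_RAGAS = {
    "question": "user_input",
    "answer": "response",
    "ground_truth": "reference",
}


def to_ragas_columns(row: dict) -> dict:
    """Return a copy of the row with legacy keys renamed; other keys are kept."""
    return {LEGACY_TO_RAGAS.get(key, key): value for key, value in row.items()}


legacy_row = {"question": "Capital of France?", "answer": "Paris", "ground_truth": "Paris", "id": 7}
converted = to_ragas_columns(legacy_row)
print(converted)
```

The helper leaves unrelated keys such as id untouched, so it can be applied to every row of a dataset without losing metadata.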
- Agent Configuration - Use agents (generic or NAT) as targets in online evaluation jobs
- Agentic Benchmarks - BFCL benchmark for tool-calling evaluation
- LLM-as-a-Judge - Custom judge-based evaluation
- Evaluation Results - Understanding results
- RAG Metrics - RAGAS metrics for RAG pipelines