Agentic Evaluation Metrics | NVIDIA NeMo Platform

Evaluate agent-based and multi-step reasoning models using metrics powered by RAGAS. These metrics assess tool calling accuracy, goal completion, topic adherence, and answer correctness in agentic workflows.

Agent Workflow Evaluation Stages

Key stages of agent workflow evaluation:

Figure: Agent Evaluation Framework

1. Intermediate Steps Evaluation: Assesses the correctness of intermediate steps during agent execution:

Tool Use: Validates that the agent invoked the right tools with correct arguments at each step. Refer to Tool Call Accuracy for implementation details.

2. Final-Step Evaluation: Evaluates the quality of the agent’s final output using:

Agent Goal Accuracy: Measures whether the agent successfully completed the requested task. Refer to Agent Goal Accuracy.
Topic Adherence: Assesses how well the agent maintained focus on the assigned topic throughout the conversation. Refer to Topic Adherence.
Answer Accuracy: Evaluates the factual correctness of agent answers. Refer to Answer Accuracy.
Custom Metrics: For domain-specific or custom evaluation criteria, use LLM-as-a-Judge with the data task type.

3. Trajectory Evaluation: Evaluates the agent’s decision-making process by analyzing the entire sequence of actions taken to accomplish a goal. This includes assessing whether the agent chose appropriate tools in the correct order. Refer to Trajectory Evaluation for the expected data format and current plugin SDK support.

Online vs Offline Evaluation

Agentic metrics support two execution patterns through the Evaluator plugin SDK:

Offline Evaluation

Offline evaluation scores pre-generated responses or tool calls already present in the dataset:

Dataset rows are passed inline with dataset=[...], as a file Path, or as a FilesetRef.
No model or agent generation is performed.
Use this mode to evaluate existing agent outputs or compare different response strategies.

Online Target Generation

Online target generation first calls a model or agent target, then evaluates the generated output:

Configure the target with Model or Agent from nemo_evaluator_sdk.
Pass the target through target=....
Use RunConfigOnlineModel for model targets and RunConfigOnline for agent targets.
Include a prompt_template when the dataset row must be transformed into a model or agent request.

Response Usage: In online target generation, the generated response is used as the metric response. A dataset response column is optional and is superseded by the generated response for that run.
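The precedence rule above can be pictured as a simple dictionary merge. The sketch below is illustrative only: `merge_generated_response` is a hypothetical helper, not part of the SDK, and it assumes dataset rows are plain dictionaries.

```python
from typing import Any


def merge_generated_response(row: dict[str, Any], generated: str) -> dict[str, Any]:
    """Return a copy of the row whose "response" is the generated output.

    Hypothetical helper: mirrors the documented rule that a dataset
    "response" column is optional and is superseded in online runs.
    """
    merged = dict(row)  # copy so the original dataset row is untouched
    merged["response"] = generated
    return merged


row = {"user_input": "What is the capital of France?", "response": "stale answer"}
scored_row = merge_generated_response(row, "The capital of France is Paris.")
print(scored_row["response"])
```

The original row is left unchanged, so the same dataset can be reused for an offline run that scores the stored response.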

Overview

Agentic metrics evaluate different aspects of agent behavior:

Metric	Use Case	Requires Judge	Plugin SDK Execution
Tool Call Accuracy	Evaluates tool/function call correctness	No	`run` + `submit`
Tool Calling (template)	Evaluates tool/function call correctness using Jinja templates	No	`run` + `submit`
Topic Adherence	Measures topic focus in multi-turn conversations	Yes	`run` + `submit`
Agent Goal Accuracy	Assesses goal completion, with or without a reference	Yes	`run` + `submit`
Answer Accuracy	Checks factual correctness	Yes	`run` + `submit`
Trajectory Evaluation	Evaluates decision-making across action sequence	Yes	Not exposed as a typed plugin SDK metric

Use evaluator.run(...) for local in-process evaluation and evaluator.submit(...) for durable remote platform jobs. The examples below use inline dataset rows through dataset=[...], but you can use a file Path or a FilesetRef instead.

Prerequisites

Before running agentic evaluations:

Workspace: Have a workspace created. All remote resources, including secrets and jobs, are scoped to a workspace.
Judge LLM endpoint (for most metrics): Have access to an LLM that will serve as your judge.
API key secret (if judge requires auth): If your judge endpoint requires authentication, create a secret to store the API key. For local run versus remote submit behavior, see Model API Authentication.
Initialize the SDK:

import os
from nemo_evaluator.sdk import Evaluator
from nemo_platform import NeMoPlatform


client = NeMoPlatform(
    base_url=os.environ.get("NMP_BASE_URL", "http://localhost:8080"),
    workspace="default",
)
evaluator: Evaluator = client.evaluator  # this object is an Evaluator resource

Creating a Secret for API Keys

If using external endpoints, such as Build NVIDIA API endpoints (https://integrate.api.nvidia.com/v1), create a secret first:

client.secrets.create(
    name="nvidia-api-key",
    value="nvapi-YOUR_API_KEY_HERE",
    description="NVIDIA Build API key for RAGAS metrics",
)

SDK Types Reference

The plugin SDK examples use context-agnostic metric classes and runtime value classes:

from nemo_evaluator_sdk import (
    ToolCallingMetric,
)
from nemo_evaluator_sdk.metrics.ragas import (
    AgentGoalAccuracyMetric,
    AnswerAccuracyMetric,
    ToolCallAccuracyMetric,
    TopicAdherenceMetric,
)
from nemo_evaluator_sdk import (
    Agent,
    RunConfigOnlineModel,
    RunConfigOnline,
    RunConfig,
    InferenceParams,
    Model,
    ReasoningParams,
)

Use dataset=[...] for inline rows. For offline scoring options, use config=RunConfig(parallelism=...). When outputs must be generated before scoring, pass target=Model(...) or target=Agent(...) plus the corresponding online parameters. Use the same dataset, config, and target arguments for both evaluator.run(...) and evaluator.submit(...); durable jobs follow the same pattern as local runs and differ only in that you wait for the job and then fetch its results.

Agentic Metrics

Tool Call Accuracy

Evaluates whether the agent invoked the correct tools with the correct arguments. This metric does not require a judge LLM.

Online/offline support: Tool Call Accuracy supports scoring existing tool calls directly. It can also score tool calls produced during online target generation when the target response includes the required tool-call fields.

Data Format

{
  "user_input": [
    {
      "content": "What's the weather like in New York?",
      "type": "human"
    },
    {
      "content": "Let me check that for you.",
      "type": "ai",
      "tool_calls": [
        {
          "name": "weather_check",
          "args": {
            "location": "New York"
          }
        }
      ]
    },
    {
      "content": "It's 75°F and partly cloudy.",
      "type": "tool"
    },
    {
      "content": "The weather in New York is 75°F and partly cloudy.",
      "type": "ai"
    }
  ],
  "reference_tool_calls": [
    {
      "name": "weather_check",
      "args": {
        "location": "New York"
      }
    }
  ]
}
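To make the row layout concrete, the following sketch pulls the tool calls out of the `ai` messages and compares them with `reference_tool_calls`. It is a simplified stand-in, not the RAGAS implementation: the function names are hypothetical, and the scoring rule (strict, in-order equality of name and arguments, averaged over the reference calls) is an assumption chosen for illustration.

```python
from typing import Any


def extract_tool_calls(user_input: list[dict[str, Any]]) -> list[dict[str, Any]]:
    """Collect tool calls from all "ai" messages, in conversation order."""
    calls = []
    for message in user_input:
        if message.get("type") == "ai":
            calls.extend(message.get("tool_calls", []))
    return calls


def simple_tool_call_score(row: dict[str, Any]) -> float:
    """Fraction of reference calls matched at the same position (assumed rule)."""
    predicted = extract_tool_calls(row["user_input"])
    reference = row["reference_tool_calls"]
    if not reference:
        return 0.0
    # A call matches only if both the name and the arguments are equal.
    # zip stops at the shorter list, so missing calls count as misses.
    matches = sum(
        1
        for pred, ref in zip(predicted, reference)
        if pred["name"] == ref["name"] and pred["args"] == ref["args"]
    )
    return matches / len(reference)


row = {
    "user_input": [
        {"content": "What's the weather like in New York?", "type": "human"},
        {
            "content": "Let me check that for you.",
            "type": "ai",
            "tool_calls": [{"name": "weather_check", "args": {"location": "New York"}}],
        },
        {"content": "It's 75°F and partly cloudy.", "type": "tool"},
    ],
    "reference_tool_calls": [{"name": "weather_check", "args": {"location": "New York"}}],
}
print(simple_tool_call_score(row))
```

A check like this is handy for sanity-testing a dataset before running the real metric, since rows with a missing `tool_calls` or `reference_tool_calls` field surface immediately.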

Run Locally

from nemo_evaluator_sdk.metrics.ragas import ToolCallAccuracyMetric

metric = ToolCallAccuracyMetric()

result = evaluator.run(
    metric=metric,
    dataset=[
        {
            "user_input": [
                {"content": "What's the weather in Paris?", "type": "human"},
                {
                    "content": "Let me check.",
                    "type": "ai",
                    "tool_calls": [{"name": "weather_api", "args": {"city": "Paris"}}],
                },
                {"content": "Sunny, 22°C", "type": "tool"},
                {"content": "It's sunny and 22°C in Paris.", "type": "ai"},
            ],
            "reference_tool_calls": [
                {"name": "weather_api", "args": {"city": "Paris"}}
            ],
        }
    ],
)
print(result.aggregate_scores)

Tool Calling (Template)

A template-based metric for evaluating tool/function call accuracy. Unlike the RAGAS Tool Call Accuracy metric, this metric uses configurable templates and produces multiple scores.

Online/offline support: Tool Calling supports scoring existing tool-call outputs directly. Use online target generation when you want the model or agent target to produce the response first.

Scores Produced

function_name_accuracy - Accuracy of function names only
function_name_and_args_accuracy - Accuracy of both function names and arguments

Data Format

Data must use OpenAI-compliant tool calling format:

{
  "messages": [
    {
      "role": "user",
      "content": "Book a table for 2 at 7pm."
    },
    {
      "role": "assistant",
      "content": "Booking a table...",
      "tool_calls": [
        {
          "function": {
            "name": "book_table",
            "arguments": {
              "people": 2,
              "time": "7pm"
            }
          }
        }
      ]
    }
  ],
  "tool_calls": [
    {
      "function": {
        "name": "book_table",
        "arguments": {
          "people": 2,
          "time": "7pm"
        }
      }
    }
  ],
  "response": {
    "choices": [
      {
        "message": {
          "tool_calls": [
            {
              "function": {
                "name": "book_table",
                "arguments": "{\"people\": 2, \"time\": \"7pm\"}"
              }
            }
          ]
        }
      }
    ]
  }
}

Dots (.) in function names must be replaced with underscores (_).
Comparison is case-sensitive.
Order of tool calls is ignored, which supports parallel tool calling.
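The comparison rules above (case-sensitive names, order ignored, arguments that may arrive as a JSON string) can be sketched in plain Python. This is not the metric's actual implementation: `normalize_call` and `compare_tool_calls` are hypothetical names, and treating each row as all-or-nothing (1.0 or 0.0) is an assumption made for illustration.

```python
import json
from typing import Any


def normalize_call(call: dict[str, Any]) -> tuple[str, str]:
    """Return (name, canonical JSON of arguments) for one OpenAI-style tool call."""
    function = call["function"]
    arguments = function.get("arguments", {})
    if isinstance(arguments, str):
        # Model responses often carry arguments as a JSON string.
        arguments = json.loads(arguments)
    # sort_keys makes the comparison independent of key order.
    return function["name"], json.dumps(arguments, sort_keys=True)


def compare_tool_calls(predicted: list[dict], reference: list[dict]) -> dict[str, float]:
    """Order-insensitive, case-sensitive comparison (assumed all-or-nothing per row)."""
    pred = sorted(normalize_call(c) for c in predicted)
    ref = sorted(normalize_call(c) for c in reference)
    names_match = [name for name, _ in pred] == [name for name, _ in ref]
    return {
        "function_name_accuracy": float(names_match),
        "function_name_and_args_accuracy": float(pred == ref),
    }


reference = [{"function": {"name": "book_table", "arguments": {"people": 2, "time": "7pm"}}}]
predicted = [{"function": {"name": "book_table", "arguments": '{"time": "7pm", "people": 2}'}}]
scores = compare_tool_calls(predicted, reference)
print(scores)
```

Sorting the normalized calls before comparing is what makes the order of parallel tool calls irrelevant, while leaving names untouched keeps the comparison case-sensitive.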

Run Locally

from nemo_evaluator_sdk import ToolCallingMetric

metric = ToolCallingMetric(reference="{{item.tool_calls}}")

result = evaluator.run(
    metric=metric,
    dataset=[
        {
            "messages": [
                {"role": "user", "content": "Book a table for 2 at 7pm."},
                {
                    "role": "assistant",
                    "content": "Booking...",
                    "tool_calls": [
                        {
                            "function": {
                                "name": "book_table",
                                "arguments": {"people": 2, "time": "7pm"},
                            }
                        }
                    ],
                },
            ],
            "tool_calls": [
                {
                    "function": {
                        "name": "book_table",
                        "arguments": {"people": 2, "time": "7pm"},
                    }
                }
            ],
            "response": {
                "choices": [
                    {
                        "message": {
                            "tool_calls": [
                                {
                                    "function": {
                                        "name": "book_table",
                                        "arguments": '{"people": 2, "time": "7pm"}',
                                    }
                                }
                            ]
                        }
                    }
                ]
            },
        }
    ],
)
print(result.aggregate_scores)

Topic Adherence

Measures how well the agent maintained focus on assigned topics throughout a conversation. Supports F1, precision, or recall scoring modes.

Data Format

{
  "user_input": [
    {
      "content": "How do I stay healthy?",
      "type": "human"
    },
    {
      "content": "Eat more fruits and vegetables, and exercise regularly.",
      "type": "ai"
    }
  ],
  "reference_topics": [
    "health",
    "nutrition",
    "fitness"
  ]
}

Configuration Options

Parameter	Type	Default	Description
`metric_mode`	string	`"f1"`	Scoring mode: `"f1"`, `"precision"`, or `"recall"`
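The three `metric_mode` values correspond to the standard precision, recall, and F1 formulas. The sketch below shows only that arithmetic. It assumes the judge has already produced counts of on-topic answers (`tp`), off-topic answers (`fp`), and on-topic queries the agent failed to answer (`fn`); how those counts are derived is up to the metric and is not reproduced here, and `topic_adherence_score` is a hypothetical name.

```python
def topic_adherence_score(tp: int, fp: int, fn: int, metric_mode: str = "f1") -> float:
    """Combine judge-derived counts into a precision, recall, or F1 score."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if metric_mode == "precision":
        return precision
    if metric_mode == "recall":
        return recall
    if metric_mode == "f1":
        if precision + recall == 0:
            return 0.0
        # F1 is the harmonic mean of precision and recall.
        return 2 * precision * recall / (precision + recall)
    raise ValueError(f"unsupported metric_mode: {metric_mode!r}")


# Hypothetical counts: 3 on-topic answers, 1 off-topic answer, 0 missed queries.
for mode in ("precision", "recall", "f1"):
    print(mode, round(topic_adherence_score(3, 1, 0, mode), 3))
```

Precision penalizes answering off-topic queries, recall penalizes refusing on-topic ones, and F1 balances the two, which is why it is a reasonable default.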

Run Locally

from nemo_evaluator_sdk import Model
from nemo_evaluator_sdk.metrics.ragas import TopicAdherenceMetric

judge_model = Model(
    url="https://integrate.api.nvidia.com/v1/chat/completions",
    name="meta/llama-3.1-70b-instruct",
    api_key_secret="nvidia-api-key",
)
metric = TopicAdherenceMetric(metric_mode="f1", judge_model=judge_model)

result = evaluator.run(
    metric=metric,
    dataset=[
        {
            "user_input": [
                {"content": "Tell me about healthy eating", "type": "human"},
                {
                    "content": "Eating fruits and vegetables is essential for good health.",
                    "type": "ai",
                },
            ],
            "reference_topics": ["health", "nutrition", "diet"],
        }
    ],
)
print(result.aggregate_scores)

Agent Goal Accuracy

Evaluates whether the agent successfully completed the requested task. Returns a binary score (0 or 1). Supports evaluation with or without a reference outcome.

With Reference

Compare the agent’s outcome against a known reference:

Data Format

{
  "user_input": [
    {
      "content": "Book a table at a Chinese restaurant for 8pm",
      "type": "human"
    },
    {
      "content": "I'll search for restaurants.",
      "type": "ai",
      "tool_calls": [
        {
          "name": "restaurant_search",
          "args": {}
        }
      ]
    },
    {
      "content": "Found: Italian Place",
      "type": "tool"
    },
    {
      "content": "Your table at Italian Place is booked for 8pm.",
      "type": "ai"
    }
  ],
  "reference": "Successfully booked a table at a restaurant for 8pm"
}

Configuration Options

Parameter	Type	Default	Description
`use_reference`	boolean	`True`	Whether to compare against a reference outcome

Run Locally

from nemo_evaluator_sdk import Model
from nemo_evaluator_sdk.metrics.ragas import AgentGoalAccuracyMetric

judge_model = Model(
    url="https://integrate.api.nvidia.com/v1/chat/completions",
    name="meta/llama-3.1-70b-instruct",
    api_key_secret="nvidia-api-key",
)
metric = AgentGoalAccuracyMetric(use_reference=True, judge_model=judge_model)

result = evaluator.run(
    metric=metric,
    dataset=[
        {
            "user_input": [
                {"content": "Book a table at a restaurant for 8pm", "type": "human"},
                {
                    "content": "I'll search for restaurants.",
                    "type": "ai",
                    "tool_calls": [{"name": "restaurant_search", "args": {}}],
                },
                {"content": "Found: Italian Place", "type": "tool"},
                {
                    "content": "Your table at Italian Place is booked for 8pm.",
                    "type": "ai",
                },
            ],
            "reference": "Successfully booked a table at a restaurant for 8pm",
        }
    ],
)
print(result.aggregate_scores)

Without Reference

The judge LLM infers the goal from the conversation context:

Data Format

{
  "user_input": [
    {
      "content": "Set a reminder for my dentist appointment tomorrow at 2pm",
      "type": "human"
    },
    {
      "content": "I'll set that reminder for you.",
      "type": "ai",
      "tool_calls": [
        {
          "name": "set_reminder",
          "args": {
            "title": "Dentist appointment",
            "date": "tomorrow",
            "time": "2pm"
          }
        }
      ]
    },
    {
      "content": "Reminder set successfully.",
      "type": "tool"
    },
    {
      "content": "Your reminder for the dentist appointment tomorrow at 2pm has been set.",
      "type": "ai"
    }
  ]
}

Run Locally

from nemo_evaluator_sdk import Model
from nemo_evaluator_sdk.metrics.ragas import AgentGoalAccuracyMetric

judge_model = Model(
    url="https://integrate.api.nvidia.com/v1/chat/completions",
    name="meta/llama-3.1-70b-instruct",
    api_key_secret="nvidia-api-key",
)
metric = AgentGoalAccuracyMetric(use_reference=False, judge_model=judge_model)

result = evaluator.run(
    metric=metric,
    dataset=[
        {
            "user_input": [
                {
                    "content": "Set a reminder for my dentist appointment tomorrow at 2pm",
                    "type": "human",
                },
                {
                    "content": "I'll set that reminder for you.",
                    "type": "ai",
                    "tool_calls": [
                        {
                            "name": "set_reminder",
                            "args": {
                                "title": "Dentist appointment",
                                "date": "tomorrow",
                                "time": "2pm",
                            },
                        }
                    ],
                },
                {"content": "Reminder set successfully.", "type": "tool"},
                {"content": "Your reminder has been set.", "type": "ai"},
            ],
        }
    ],
)
print(result.aggregate_scores)

Answer Accuracy

Evaluates the factual correctness of an agent’s answer by comparing it against a reference answer. Two LLM judges independently rate the agreement between the answer and the reference as 0 (incorrect), 0.5 (partial match), or 1 (exact match), and the two ratings are averaged to produce the final score.
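The averaging step is simple arithmetic and can be sketched as follows. This is an illustration under stated assumptions, not the metric's code: `average_judge_ratings` is a hypothetical helper, and it assumes each judge returns one of the three values 0, 0.5, or 1.

```python
# The three rating levels described for each judge.
RATING_LABELS = {0.0: "incorrect", 0.5: "partial match", 1.0: "exact match"}


def average_judge_ratings(rating_a: float, rating_b: float) -> float:
    """Average two independent judge ratings into one answer-accuracy score."""
    for rating in (rating_a, rating_b):
        if rating not in RATING_LABELS:
            raise ValueError(f"unexpected rating: {rating!r}")
    return (rating_a + rating_b) / 2


# One judge sees an exact match, the other only a partial match.
print(average_judge_ratings(1.0, 0.5))
```

Because two ratings are averaged, a per-row score can land between the three rating levels (for example 0.75) when the judges disagree.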

Data Format

{
  "user_input": "What is the capital of France?",
  "response": "The capital of France is Paris.",
  "reference": "Paris"
}

Run Locally

from nemo_evaluator_sdk import Model
from nemo_evaluator_sdk.metrics.ragas import AnswerAccuracyMetric

judge_model = Model(
    url="https://integrate.api.nvidia.com/v1/chat/completions",
    name="meta/llama-3.1-70b-instruct",
    api_key_secret="nvidia-api-key",
)
metric = AnswerAccuracyMetric(judge_model=judge_model)

result = evaluator.run(
    metric=metric,
    dataset=[
        {
            "user_input": "What is the capital of France?",
            "response": "The capital of France is Paris.",
            "reference": "Paris",
        }
    ],
)
print(result.aggregate_scores)

Trajectory Evaluation

Evaluates the agent’s decision-making process by analyzing the entire sequence of actions (trajectory) taken to accomplish a goal. This system metric assesses whether the agent chose appropriate tools in the correct order.

NAT Format Requirement: This metric supports the NeMo Agent Toolkit (NAT) format, in which intermediate_steps contains detailed event traces.

Current plugin SDK support: The current plugin SDK does not expose a typed trajectory-evaluation metric class. Use the data-format details below when preparing datasets for environments where the system metric is enabled, but do not use the old generated evaluator job APIs for plugin SDK execution.

Data Format

Each data entry must follow the NeMo Agent Toolkit format:

{
  "question": "What are LLMs",
  "generated_answer": "LLMs, or Large Language Models, are a type of artificial intelligence designed to process and generate human-like language. They are trained on vast amounts of text data and can be fine-tuned for specific tasks or guided by prompt engineering.",
  "answer": "LLMs stand for Large Language Models, which are a type of machine learning model designed for natural language processing tasks such as language generation.",
  "intermediate_steps": [
    {
      "payload": {
        "event_type": "LLM_END",
        "name": "nvidia/llama-3.3-nemotron-super-49b-v1",
        "data": {
          "input": "\nPrevious conversation history:\n\n\nQuestion: What are LLMs\n",
          "output": "Thought: I need to find information about LLMs to answer this question.\n\nAction: wikipedia_search\nAction Input: {'question': 'LLMs'}\n\n"
        }
      }
    },
    {
      "payload": {
        "event_type": "TOOL_END",
        "name": "wikipedia_search",
        "data": {
          "input": "{'question': 'LLMs'}",
          "output": "<Document source=\"https://en.wikipedia.org/wiki/Large_language_model\" page=\"\"/>\nA large language model (LLM) is a language model trained with self-supervised machine learning..."
        }
      }
    },
    {
      "payload": {
        "event_type": "LLM_END",
        "name": "meta/llama-3.1-70b-instruct",
        "data": {
          "input": "...",
          "output": "Thought: I now know the final answer\n\nFinal Answer: LLMs, or Large Language Models, are a type of artificial intelligence..."
        }
      }
    }
  ]
}

Parameters

Parameter	Required	Type	Description
`judge`	required	object	Judge LLM configuration
`trajectory_used_tools`	required	string	Comma-separated list of tools available to the agent. Example: `"wikipedia_search,current_datetime,code_generation"`
`trajectory_custom_tools`	optional	object	JSON mapping custom tool names to descriptions for non-default tools

Judge Configuration

Most agentic metrics require a judge LLM. Configure the judge model in the metric definition:

from nemo_evaluator_sdk import Model


judge_model = Model(
    url="https://integrate.api.nvidia.com/v1/chat/completions",
    name="meta/llama-3.1-70b-instruct",
    api_key_secret="nvidia-api-key",
)

Recommended model size: Use a 70B+ parameter model as the judge for reliable results. Smaller models may fail to follow the required output schema, causing parsing errors.

Using Reasoning Models

For models that support extended reasoning, such as nvidia/llama-3.3-nemotron-super-49b-v1, add a system prompt and reasoning parameters to the online model generation configuration:

from nemo_evaluator_sdk import RunConfigOnlineModel, InferenceParams, Model, ReasoningParams


judge_model = Model(
    url="https://integrate.api.nvidia.com/v1/chat/completions",
    name="nvidia/llama-3.3-nemotron-super-49b-v1",
    api_key_secret="nvidia-api-key",
)

config = RunConfigOnlineModel(
    system_prompt="detailed thinking on",
    reasoning=ReasoningParams(end_token="</think>"),
    inference=InferenceParams(max_tokens=1024),
)

Managing Secrets for Authenticated Endpoints

If your judge endpoint requires an API key, store it as a secret:

from nemo_evaluator_sdk import Model

client.secrets.create(name="nvidia-api-key", value="nvapi-YOUR_API_KEY_HERE")

judge_model = Model(
    url="https://integrate.api.nvidia.com/v1/chat/completions",
    name="meta/llama-3.1-70b-instruct",
    api_key_secret="nvidia-api-key",
)

For more details on secret management, refer to Managing Secrets.

For local run versus remote submit behavior of api_key_secret, see Model API Authentication.

Job Management

After submitting a durable remote job with evaluator.submit(metric=metric, dataset=dataset), use the returned job resource to monitor execution and retrieve results:

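# Submit the durable remote job, then block until it finishes.
job = evaluator.submit(metric=metric, dataset=dataset)
job.wait_until_done()
result = job.get_result()
print(result.aggregate_scores)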

Refer to Metrics Job Management for more job lifecycle details.

Dataset Notes

The current plugin SDK examples use inline dataset rows through dataset=[...]. Keep each row shaped for the selected metric, including fields such as user_input, response, reference, reference_topics, reference_tool_calls, or OpenAI-style tool_calls as required.

Use RunConfig(limit_samples=...) when you want to test a small slice of a larger inline dataset before submitting the full request.
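A quick shape check before running or submitting can catch rows that lack the fields a metric expects. The sketch below is a hypothetical helper, not an SDK feature. The metric labels are illustrative keys, and the field sets are taken from the data-format examples on this page for offline scoring (Agent Goal Accuracy is shown in its with-reference form).

```python
from typing import Any

# Illustrative labels mapped to the fields shown in each metric's data format.
REQUIRED_FIELDS = {
    "tool_call_accuracy": {"user_input", "reference_tool_calls"},
    "topic_adherence": {"user_input", "reference_topics"},
    "agent_goal_accuracy": {"user_input", "reference"},
    "answer_accuracy": {"user_input", "response", "reference"},
}


def missing_fields(row: dict[str, Any], metric_name: str) -> list[str]:
    """Return the required fields absent from a row, sorted for stable output."""
    return sorted(REQUIRED_FIELDS[metric_name] - set(row))


dataset = [
    {"user_input": "What is the capital of France?", "response": "Paris", "reference": "Paris"},
    {"question": "What is the capital of Spain?", "answer": "Madrid"},  # legacy column names
]
for index, row in enumerate(dataset):
    print(index, missing_fields(row, "answer_accuracy"))
```

The second row illustrates the most common mistake: using `question` and `answer` where the RAGAS-backed metrics expect `user_input` and `response`.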

Important Notes

Execution choice: Use run for local in-process evaluation and submit for durable remote jobs with wait_until_done() and get_result().
Column Names: RAGAS metrics use specific column names:

user_input (not question)
response (not answer)
reference (not ground_truth)

Judge Model Quality: For metrics requiring a judge, evaluation quality depends on the judge model’s ability to follow instructions. Larger models (70B+) produce more consistent results.
RAGAS Dependency: These metrics are powered by RAGAS and may have version-specific behavior.
NaN troubleshooting for judge-based metrics: If you see nan_count > 0 with mean = null, check judge model authentication first (API key secret, endpoint access, and model permissions). See Model API Authentication for api_key_secret behavior. Some RAGAS metrics are known to convert auth failures into NaN scores instead of raising a hard error.
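The `nan_count > 0` with `mean = null` pattern arises whenever every per-row score is NaN. The sketch below reproduces that arithmetic on a plain list of floats. It is only an illustration: `summarize_scores` is a hypothetical function, and the layout of the real result object is not assumed here.

```python
import math
from typing import Optional


def summarize_scores(scores: list[float]) -> dict[str, Optional[float]]:
    """Count NaN scores and average the remaining valid ones."""
    valid = [score for score in scores if not math.isnan(score)]
    return {
        "nan_count": len(scores) - len(valid),
        # With no valid scores there is nothing to average, hence None (null).
        "mean": sum(valid) / len(valid) if valid else None,
    }


# Every judge call failed, for example because of a rejected API key.
print(summarize_scores([float("nan"), float("nan")]))
# One failure among otherwise valid scores.
print(summarize_scores([1.0, float("nan"), 0.0]))
```

A summary where all rows are NaN points to a systematic problem such as authentication, whereas a few isolated NaN rows more likely indicate individual judge responses that could not be parsed.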

Agent Configuration - Use agents (generic or NAT) as targets in online evaluation jobs
Agentic Benchmarks - BFCL benchmark for tool-calling evaluation
LLM-as-a-Judge - Custom judge-based evaluation
Evaluation Results - Understanding results
RAG Metrics - RAGAS metrics for RAG pipelines