Agent Configuration | NVIDIA NeMo Platform

Online evaluations can target an agent instead of a model. An agent is an HTTP endpoint that accepts a request and returns a response, optionally with a trajectory of intermediate steps. Use agents when you want to evaluate an agentic system end to end rather than a standalone LLM endpoint.

Pass an Agent as the target argument (target=...). The metric, prompt template, target, dataset rows, and runtime parameters are all passed in the same Evaluator plugin SDK call.

Agent Formats

Two agent formats are supported:

Format	Value	Description
Generic	`generic`	Configurable HTTP POST with a Jinja-templated request body and JSONPath extraction for response and trajectory.
NeMo Agent Toolkit	`nemo_agent_toolkit`	Fixed protocol for NeMo Agent Toolkit endpoints.

Initialize the SDK

import os

from nemo_evaluator.sdk import Evaluator
from nemo_platform import NeMoPlatform


client = NeMoPlatform(
    base_url=os.environ.get("NMP_BASE_URL", "http://localhost:8080"),
    workspace="default",
)
evaluator: Evaluator = client.evaluator  # this object is an Evaluator resource

Managing Secrets for Agent Endpoints

If your agent endpoint requires authentication, configure api_key_secret on the Agent.

For local evaluator.run(...) calls, api_key_secret must name an environment variable available to the local Python process. For remote evaluator.submit(...) jobs, it must name a NeMo platform secret in the target workspace. See Model API Authentication for the local-versus-remote behavior.

For remote evaluator.submit(...) jobs, create the secret in the platform workspace before submitting the job:

client.secrets.create(
    name="my-agent-api-key",
    value=os.environ["MY_AGENT_API_KEY"],
)

For remote jobs, the secret name may be a workspace-local name such as "my-agent-api-key" or a full reference such as "my-workspace/my-agent-api-key".
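The local-versus-remote rule can be pictured with a small sketch. The helpers below (resolve_local_secret and split_secret_reference) are hypothetical and not part of the SDK. They only illustrate that a local run reads an environment variable, while a remote job refers to a platform secret by a workspace-local name or a workspace/name reference. How the platform actually parses references is an assumption here.

```python
import os


def resolve_local_secret(env_var_name: str) -> str:
    """Hypothetical helper: read the API key from the environment variable
    named in api_key_secret, as a local run is described to do."""
    value = os.environ.get(env_var_name)
    if value is None:
        raise KeyError(f"environment variable {env_var_name!r} is not set")
    return value


def split_secret_reference(reference: str, default_workspace: str) -> tuple[str, str]:
    """Hypothetical helper: split "workspace/name" into its parts, falling back
    to default_workspace for a workspace-local name (assumed behavior)."""
    workspace, sep, name = reference.partition("/")
    if not sep:
        return default_workspace, reference
    return workspace, name


if __name__ == "__main__":
    os.environ["MY_AGENT_API_KEY"] = "example-key"  # placeholder value for the demo
    print(resolve_local_secret("MY_AGENT_API_KEY"))
    print(split_secret_reference("my-agent-api-key", "default"))
    print(split_secret_reference("my-workspace/my-agent-api-key", "default"))
```

Raising on a missing variable, rather than returning an empty key, is a design choice in this sketch so that a misnamed api_key_secret fails before any request is sent.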

Generic Agent

A generic agent is any HTTP endpoint that:

Accepts a POST request with Content-Type: application/json.
Returns a JSON response containing the answer and, optionally, a trajectory.

You control the request shape with body and extract values from the response with JSONPath expressions.

Generic Agent Fields

Field	Required	Type	Description
`url`	Yes	string	Base URL of the agent endpoint.
`name`	Yes	string	Agent name or identifier.
`format`	No	string	`generic` (default) or `nemo_agent_toolkit`.
`api_key_secret`	No	string	API key reference. See Model API Authentication.
`body`	Yes	dict	Jinja template for the request payload. Use `{{ prompt }}`, `{{ messages }}`, or fields from the rendered prompt context.
`response_path`	Yes	string	JSONPath expression to extract the response text.
`trajectory_path`	No	string	JSONPath expression to extract the trajectory.

Run a Generic Agent Evaluation

from nemo_evaluator_sdk import Agent, ExactMatchMetric, RunConfigOnline

# Score each response against the expected_answer field of the dataset row.
metric = ExactMatchMetric(reference="{{item.expected_answer}}")

# Local run: api_key_secret names an environment variable.
agent = Agent(
    url="https://my-agent.example.com/invoke",
    name="qa-agent",
    format="generic",
    api_key_secret="MY_AGENT_API_KEY",
    body={"question": "{{ prompt }}"},
    response_path="$.answer",
    trajectory_path="$.reasoning_steps",
)

# Assumes `evaluator` was created as shown in "Initialize the SDK".
result = evaluator.run(
    metric=metric,
    dataset=[
        {"question": "What is the capital of France?", "expected_answer": "Paris"},
    ],
    config=RunConfigOnline(parallelism=4, request_timeout=60, max_retries=2),
    target=agent,
    prompt_template="Question: {{item.question}}\nAnswer:",
)
for score in result.aggregate_scores.scores:
    print(f"{score.name}: mean={score.mean}")

Use evaluator.submit(...) with the same argument shape when you want a durable remote job, but set api_key_secret to the name of a platform secret in the target workspace.

Example Generic Agent Endpoint

Your agent endpoint might look like this:

from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()


class AgentRequest(BaseModel):
    question: str


class AgentResponse(BaseModel):
    answer: str
    reasoning_steps: list[dict]


@app.post("/invoke")
async def invoke(request: AgentRequest) -> AgentResponse:
    return AgentResponse(
        answer="Paris",
        reasoning_steps=[
            {"step": "search", "result": "Found relevant documents"},
            {"step": "synthesize", "result": "Generated answer from context"},
        ],
    )
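To connect the Agent fields to an endpoint like this one, the following sketch renders a body template for one prompt and then pulls the answer and trajectory out of a response. It is a simplified illustration using only the standard library: render_body stands in for Jinja rendering (it handles only the literal placeholder {{ prompt }}), and extract stands in for JSONPath (it handles only "$" followed by ".key" segments). Both function names are hypothetical and are not the SDK's implementation.

```python
import json


def render_body(body: dict, prompt: str) -> dict:
    """Substitute the literal placeholder {{ prompt }} in string values.

    Simplified stand-in for Jinja rendering: no filters, loops, or nesting.
    """
    return {
        key: value.replace("{{ prompt }}", prompt) if isinstance(value, str) else value
        for key, value in body.items()
    }


def extract(document: dict, path: str):
    """Resolve a dotted path such as "$.answer" against a parsed JSON document.

    Simplified stand-in for JSONPath: only "$" followed by ".key" segments.
    """
    if not path.startswith("$"):
        raise ValueError(f"unsupported path: {path!r}")
    current = document
    for segment in path[1:].split("."):
        if segment:  # skip the empty piece before the first dot
            current = current[segment]
    return current


# Request body as configured on the Agent, rendered for one dataset row.
payload = render_body(
    {"question": "{{ prompt }}"},
    prompt="Question: What is the capital of France?\nAnswer:",
)

# Response shaped like the example endpoint's AgentResponse.
response = json.loads(
    '{"answer": "Paris", "reasoning_steps": '
    '[{"step": "search", "result": "Found relevant documents"}]}'
)

answer = extract(response, "$.answer")                # response_path
trajectory = extract(response, "$.reasoning_steps")  # trajectory_path
print(json.dumps(payload))
print(answer, len(trajectory))
```

The keys in body must match what the endpoint expects (here, question), and the two paths must match the response model (answer and reasoning_steps).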

NeMo Agent Toolkit Agent

Use the nemo_agent_toolkit format when evaluating agents built with the NeMo Agent Toolkit (NAT). This format uses the NAT streaming protocol:

Sends a POST to {url}/generate/full?filter_steps=none with {"input_message": "<text>"}.
Reads the SSE (Server-Sent Events) stream.
Extracts the final value from the last SSE data: chunk.
Returns it as the agent response.
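The steps above can be sketched with the standard library. This is an illustration, not the evaluator's implementation: build_nat_request and last_sse_data are hypothetical names, and the JSON payloads in the sample stream (including the "value" key) are made up for the demo, since the exact chunk schema is not described here.

```python
import json
from typing import Iterable, Optional


def build_nat_request(base_url: str, text: str) -> tuple[str, bytes]:
    """Return the URL and JSON body for the POST described above."""
    url = f"{base_url.rstrip('/')}/generate/full?filter_steps=none"
    body = json.dumps({"input_message": text}).encode("utf-8")
    return url, body


def last_sse_data(lines: Iterable[str]) -> Optional[str]:
    """Return the payload of the last data: event in an SSE stream.

    Events are separated by blank lines; consecutive data: lines within one
    event are joined with newlines, as in the SSE format.
    """
    last = None
    current = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line.startswith("data:"):
            value = line[len("data:"):]
            # A single leading space after the colon is not part of the value.
            current.append(value[1:] if value.startswith(" ") else value)
        elif line == "" and current:
            last = "\n".join(current)
            current = []
    if current:  # stream ended without a trailing blank line
        last = "\n".join(current)
    return last


if __name__ == "__main__":
    url, body = build_nat_request("https://my-nat-agent.example.com", "What is the capital of France?")
    print(url)
    # Made-up stream contents; the real chunk schema may differ.
    stream = [
        'data: {"value": "intermediate step"}\n',
        "\n",
        'data: {"value": "Paris"}\n',
        "\n",
    ]
    print(last_sse_data(stream))
```

Because only the last data: chunk is returned as the agent response, earlier chunks in the stream do not affect the extracted answer in this sketch.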

NeMo Agent Toolkit Fields

Field	Required	Type	Description
`url`	Yes	string	Base URL of the agent endpoint.
`name`	Yes	string	Agent name or identifier.
`format`	Yes	string	Set to `nemo_agent_toolkit`.
`api_key_secret`	No	string	API key reference. See Model API Authentication.

Run a NAT Agent Evaluation

from nemo_evaluator_sdk import Agent, ExactMatchMetric, RunConfigOnline

metric = ExactMatchMetric(reference="{{item.expected_answer}}")

# Remote job: api_key_secret names a platform secret in the target workspace.
agent = Agent(
    url="https://my-nat-agent.example.com",
    name="nat-research-agent",
    format="nemo_agent_toolkit",
    api_key_secret="my-nat-api-key",
)

# Assumes `evaluator` was created as shown in "Initialize the SDK".
job = evaluator.submit(
    metric=metric,
    dataset=[
        {"question": "What is the capital of France?", "expected_answer": "Paris"},
    ],
    config=RunConfigOnline(parallelism=4),
    target=agent,
    prompt_template={
        "messages": [
            {"role": "user", "content": "{{item.question}}"},
        ],
    },
)
job.wait_until_done()
result = job.get_result()

Model vs Agent: When to Use Which

Use Case	Use `Model`	Use `Agent`
Evaluate a standalone LLM endpoint	x
Evaluate an agentic system with tool use and multi-step reasoning		x
Evaluate a NeMo Agent Toolkit workflow		x
Evaluate a custom HTTP endpoint with non-standard response format		x
Use a standard chat completions API	x

Online evaluations accept either a model or an agent as the request target, never both.

Model Configuration - Inline model targets for LLM endpoints.
Agentic Evaluation Metrics - Metrics for evaluating agent tool calling, goal accuracy, and trajectory.
LLM-as-a-Judge - Custom judge-based evaluation with flexible scoring criteria.
Bring Your Own Metric - Integrate custom evaluation endpoints.