Agent Configuration

Online evaluations can target an agent instead of a model. An agent is an HTTP endpoint that accepts a request and returns a response, optionally with a trajectory of intermediate steps. Use agents when you want to evaluate an agentic system end to end rather than a standalone LLM endpoint.

Pass an Agent as the target argument (target=...) of the Evaluator plugin SDK call. The metric, prompt template, target, dataset rows, and runtime parameters are all passed through that same call.

Agent Formats

Two agent formats are supported:

| Format | Value | Description |
| --- | --- | --- |
| Generic | generic | Configurable HTTP POST with a Jinja-templated request body and JSONPath extraction for response and trajectory. |
| NeMo Agent Toolkit | nemo_agent_toolkit | Fixed protocol for NeMo Agent Toolkit endpoints. |

Initialize the SDK

import os

from nemo_evaluator.sdk import Evaluator
from nemo_platform import NeMoPlatform


client = NeMoPlatform(
    base_url=os.environ.get("NMP_BASE_URL", "http://localhost:8080"),
    workspace="default",
)
evaluator: Evaluator = client.evaluator  # this object is an Evaluator resource

Managing Secrets for Agent Endpoints

If your agent endpoint requires authentication, configure api_key_secret on the Agent.

For local evaluator.run(...) calls, api_key_secret must name an environment variable available to the local Python process. For remote evaluator.submit(...) jobs, it must name a NeMo platform secret in the target workspace. See Model API Authentication for the local-versus-remote behavior.
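Because a local run reads the key from the environment, it can help to fail fast when the variable is missing. The following is a minimal sketch using only the standard library; `require_env_secret` is a hypothetical helper, not part of the SDK, and the variable name `MY_AGENT_API_KEY` is an assumed example.

```python
import os


def require_env_secret(name: str) -> str:
    """Return the value of environment variable `name`, or raise if it is unset or empty.

    Hypothetical pre-flight check for local runs; not an SDK function.
    """
    value = os.environ.get(name)
    if not value:
        raise RuntimeError(
            f"Environment variable {name!r} is not set; "
            "a local run cannot authenticate to the agent without it."
        )
    return value
```

Calling `require_env_secret("MY_AGENT_API_KEY")` before `evaluator.run(...)` surfaces a missing key immediately instead of as an authentication failure partway through the evaluation.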

For remote evaluator.submit(...) jobs, create the secret in the platform workspace before submitting the job:

client.secrets.create(
    name="my-agent-api-key",
    value=os.environ["MY_AGENT_API_KEY"],
)

For remote jobs, the secret name may be either a workspace-local name such as "my-agent-api-key" or a full reference such as "my-workspace/my-agent-api-key".

Generic Agent

A generic agent is any HTTP endpoint that:

  1. Accepts a POST request with Content-Type: application/json.
  2. Returns a JSON response containing the answer and, optionally, a trajectory.

You control the request shape with body and extract values from the response with JSONPath expressions.
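To make the two mechanisms concrete, here is a deliberately simplified, standard-library sketch of what "render a templated body" and "extract by path" mean. It is an illustration only, not the evaluator's implementation: `render_body` handles only plain `{{ name }}` placeholders (real Jinja supports far more), and `extract` understands only a small JSONPath subset (`$`, `.key`, and `[index]`). Both function names are hypothetical.

```python
import json
import re
from typing import Any


def render_body(template: Any, context: dict[str, Any]) -> Any:
    """Recursively substitute {{ name }} placeholders in every string of a body template."""
    if isinstance(template, dict):
        return {key: render_body(value, context) for key, value in template.items()}
    if isinstance(template, list):
        return [render_body(value, context) for value in template]
    if isinstance(template, str):
        return re.sub(
            r"\{\{\s*(\w+)\s*\}\}",
            lambda match: str(context[match.group(1)]),
            template,
        )
    return template  # numbers, booleans, and None pass through unchanged


def extract(payload: Any, path: str) -> Any:
    """Resolve a tiny JSONPath subset: '$' followed by '.key' and '[index]' segments."""
    if not path.startswith("$"):
        raise ValueError("path must start with '$'")
    current = payload
    for key, index in re.findall(r"\.(\w+)|\[(\d+)\]", path[1:]):
        current = current[key] if key else current[int(index)]
    return current


# Request side: the rendered prompt is placed into the agent's request body.
body = render_body({"question": "{{ prompt }}"}, {"prompt": "What is the capital of France?"})
print(json.dumps(body))

# Response side: paths select the answer and the trajectory from the agent's JSON.
response = {"answer": "Paris", "reasoning_steps": [{"step": "search"}]}
print(extract(response, "$.answer"))
print(extract(response, "$.reasoning_steps"))
```

The request template and the response paths are independent of each other, so an agent with a different JSON layout only needs different `body`, `response_path`, and `trajectory_path` values.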

Generic Agent Fields

| Field | Required | Type | Description |
| --- | --- | --- | --- |
| url | Yes | string | Base URL of the agent endpoint. |
| name | Yes | string | Agent name or identifier. |
| format | No | string | generic (default) or nemo_agent_toolkit. |
| api_key_secret | No | string | API key reference. See Model API Authentication. |
| body | Yes | dict | Jinja template for the request payload. Use {{ prompt }}, {{ messages }}, or fields from the rendered prompt context. |
| response_path | Yes | string | JSONPath expression to extract the response text. |
| trajectory_path | No | string | JSONPath expression to extract the trajectory. |

Run a Generic Agent Evaluation

from nemo_evaluator_sdk import Agent, ExactMatchMetric, RunConfigOnline

# Score each response against the expected answer stored in the dataset row.
metric = ExactMatchMetric(reference="{{item.expected_answer}}")

agent = Agent(
    url="https://my-agent.example.com/invoke",
    name="qa-agent",
    format="generic",
    # Local run: this names an environment variable in the local Python process.
    api_key_secret="MY_AGENT_API_KEY",
    # Request payload template; {{ prompt }} is the rendered prompt.
    body={"question": "{{ prompt }}"},
    # JSONPath expressions applied to the agent's JSON response.
    response_path="$.answer",
    trajectory_path="$.reasoning_steps",
)

result = evaluator.run(
    metric=metric,
    dataset=[
        {"question": "What is the capital of France?", "expected_answer": "Paris"},
    ],
    config=RunConfigOnline(parallelism=4, request_timeout=60, max_retries=2),
    target=agent,
    prompt_template="Question: {{item.question}}\nAnswer:",
)

for score in result.aggregate_scores.scores:
    print(f"{score.name}: mean={score.mean}")

Use evaluator.submit(...) with the same argument shape when you want a durable remote job, but set api_key_secret to a platform secret name for the target workspace.

Example Generic Agent Endpoint

Your agent endpoint might look like this:

from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()


class AgentRequest(BaseModel):
    question: str


class AgentResponse(BaseModel):
    answer: str
    reasoning_steps: list[dict]


@app.post("/invoke")
async def invoke(request: AgentRequest) -> AgentResponse:
    return AgentResponse(
        answer="Paris",
        reasoning_steps=[
            {"step": "search", "result": "Found relevant documents"},
            {"step": "synthesize", "result": "Generated answer from context"},
        ],
    )
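Before pointing an evaluation at an endpoint like this, a quick smoke test can confirm that it accepts the expected request and returns the fields that `response_path` and `trajectory_path` will read. The sketch below uses only the standard library; the helper names are hypothetical, and the field names `question`, `answer`, and `reasoning_steps` are taken from the example endpoint above.

```python
import json
from typing import Any
from urllib import request


def build_invoke_request(url: str, question: str) -> request.Request:
    """Build (but do not send) the JSON POST that the example endpoint expects."""
    data = json.dumps({"question": question}).encode("utf-8")
    return request.Request(
        url,
        data=data,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def check_agent_response(payload: dict[str, Any]) -> None:
    """Raise ValueError if the payload lacks the fields read by $.answer and $.reasoning_steps."""
    if not isinstance(payload.get("answer"), str):
        raise ValueError("response must contain a string 'answer' field")
    if not isinstance(payload.get("reasoning_steps"), list):
        raise ValueError("response must contain a list 'reasoning_steps' field")
```

With the endpoint running locally (the address `http://localhost:8000/invoke` is an assumption about how you serve it), the check would be `with request.urlopen(build_invoke_request(url, "What is the capital of France?"), timeout=10) as resp: check_agent_response(json.load(resp))`.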

NeMo Agent Toolkit Agent

Use the nemo_agent_toolkit format when evaluating agents built with the NeMo Agent Toolkit. This format uses the NAT streaming protocol:

  1. Sends a POST to {url}/generate/full?filter_steps=none with {"input_message": "<text>"}.
  2. Reads the SSE (Server-Sent Events) stream.
  3. Extracts the final value from the last SSE data: chunk.
  4. Returns it as the agent response.
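The steps above can be sketched with the standard library as follows. This is an illustration of the general pattern, not the evaluator's code: the helper names are hypothetical, and the JSON shape of each `data:` payload (shown here as `{"value": ...}`) is an assumption made only for the sample stream.

```python
import json
from typing import Any, Iterable


def build_generate_url(base_url: str) -> str:
    """Append the streaming path and query string described in step 1."""
    return f"{base_url.rstrip('/')}/generate/full?filter_steps=none"


def last_sse_data(lines: Iterable[str]) -> Any:
    """Return the payload of the last `data:` event in an SSE stream.

    The payload is JSON-decoded when possible and returned as raw text otherwise.
    """
    last_event = None
    buffer: list[str] = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line.startswith("data:"):
            value = line[len("data:"):]
            # SSE allows one optional space after the colon.
            buffer.append(value[1:] if value.startswith(" ") else value)
        elif line == "" and buffer:
            # A blank line terminates an event; multi-line data is joined with newlines.
            last_event = "\n".join(buffer)
            buffer = []
    if buffer:
        # Tolerate a stream that ends without a trailing blank line.
        last_event = "\n".join(buffer)
    if last_event is None:
        raise ValueError("stream contained no data: events")
    try:
        return json.loads(last_event)
    except json.JSONDecodeError:
        return last_event


# Sample stream with a hypothetical payload shape.
sample = [
    'data: {"value": "Searching..."}\n',
    "\n",
    'data: {"value": "Paris"}\n',
    "\n",
]
print(build_generate_url("https://my-nat-agent.example.com"))
print(last_sse_data(sample))
```

Only the final event matters for the agent response, so earlier events are overwritten rather than collected.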

NeMo Agent Toolkit Fields

| Field | Required | Type | Description |
| --- | --- | --- | --- |
| url | Yes | string | Base URL of the agent endpoint. |
| name | Yes | string | Agent name or identifier. |
| format | Yes | string | Set to nemo_agent_toolkit. |
| api_key_secret | No | string | API key reference. See Model API Authentication. |

Run a NAT Agent Evaluation

from nemo_evaluator_sdk import Agent, ExactMatchMetric, RunConfigOnline

# Score each response against the expected answer stored in the dataset row.
metric = ExactMatchMetric(reference="{{item.expected_answer}}")

agent = Agent(
    url="https://my-nat-agent.example.com",
    name="nat-research-agent",
    format="nemo_agent_toolkit",
    # Remote job: this names a NeMo platform secret in the target workspace.
    api_key_secret="my-nat-api-key",
)

job = evaluator.submit(
    metric=metric,
    dataset=[
        {"question": "What is the capital of France?", "expected_answer": "Paris"},
    ],
    config=RunConfigOnline(parallelism=4),
    target=agent,
    prompt_template={
        "messages": [
            {"role": "user", "content": "{{item.question}}"},
        ],
    },
)

# Block until the remote job finishes, then fetch its result.
job.wait_until_done()
result = job.get_result()

Model vs Agent: When to Use Which

| Use Case | Use Model | Use Agent |
| --- | --- | --- |
| Evaluate a standalone LLM endpoint | x | |
| Evaluate an agentic system with tool use and multi-step reasoning | | x |
| Evaluate a NeMo Agent Toolkit workflow | | x |
| Evaluate a custom HTTP endpoint with non-standard response format | | x |
| Use a standard chat completions API | x | |

Online evaluations accept either a model or an agent as the request target, never both.