Model Configuration

Online evaluations use Model objects to describe model endpoints. A model can be the evaluation target that produces outputs, or it can be part of a judge-style metric such as LLM-as-a-Judge, RAG, or agentic metrics.

The Evaluator plugin SDK uses inline model objects from nemo_evaluator_sdk. Pass the model either as target=... or as a field on the metric class that needs a judge or embeddings model.

Initialize the SDK

import os

from nemo_evaluator.sdk import Evaluator
from nemo_platform import NeMoPlatform


client = NeMoPlatform(
    base_url=os.environ.get("NMP_BASE_URL", "http://localhost:8080"),
    workspace="default",
)
evaluator: Evaluator = client.evaluator  # this object is an Evaluator resource

Inline Model

Define the endpoint URL and model name directly:

from nemo_evaluator_sdk import Model


model = Model(
    url="https://integrate.api.nvidia.com/v1",
    name="meta/llama-3.1-70b-instruct",
    format="nim",
    api_key_secret="NVIDIA_API_KEY",
)

| Field | Required | Description |
| --- | --- | --- |
| url | Yes | Base URL of the inference endpoint. |
| name | Yes | Model name to send in inference requests. |
| format | No | API format: "nim", "openai", or "llama_stack". Defaults to "nim". |
| api_key_secret | No | Model API key reference. See Model API Authentication. |
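
To make the required and optional fields concrete, the following is a minimal sketch of the table as a plain dataclass. ModelSpec, ALLOWED_FORMATS, and the localhost URL are hypothetical names used only for illustration; they are not part of nemo_evaluator_sdk, and the real Model class may validate its fields differently.

```python
from dataclasses import dataclass
from typing import Optional

# Formats listed in the field table.
ALLOWED_FORMATS = ("nim", "openai", "llama_stack")


@dataclass(frozen=True)
class ModelSpec:
    """Hypothetical stand-in for Model; not part of nemo_evaluator_sdk."""

    url: str                              # required
    name: str                             # required
    format: str = "nim"                   # optional, defaults to "nim"
    api_key_secret: Optional[str] = None  # optional

    def __post_init__(self) -> None:
        if not self.url or not self.name:
            raise ValueError("url and name are required")
        if self.format not in ALLOWED_FORMATS:
            raise ValueError(f"unsupported format: {self.format!r}")


# Only the two required fields are given; the rest fall back to defaults.
spec = ModelSpec(url="http://localhost:8000/v1", name="my-model")
print(spec.format)  # nim
```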

Model API Authentication

api_key_secret is an optional property on the Model object. Omit it when the endpoint does not require API-key authentication.

For local evaluator.run(...) calls, api_key_secret must name an environment variable available to the local Python process. For example, api_key_secret="NVIDIA_API_KEY" reads os.environ["NVIDIA_API_KEY"].
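
The local lookup described above can be pictured with a short sketch. resolve_local_api_key and EXAMPLE_API_KEY are hypothetical names introduced for illustration; the SDK's own resolution logic and error messages may differ.

```python
import os
from typing import Optional


def resolve_local_api_key(api_key_secret: Optional[str]) -> Optional[str]:
    """Hypothetical helper: read the environment variable named by api_key_secret."""
    if api_key_secret is None:
        return None  # the endpoint does not require an API key
    try:
        return os.environ[api_key_secret]
    except KeyError:
        raise RuntimeError(
            f"environment variable {api_key_secret!r} is not set"
        ) from None


os.environ["EXAMPLE_API_KEY"] = "dummy-value"  # for demonstration only
key = resolve_local_api_key("EXAMPLE_API_KEY")
print(key)  # dummy-value
```

Checking the variable before starting a long local run surfaces a missing key early rather than on the first inference request.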

For remote evaluator.submit(...) jobs, api_key_secret must name a NeMo platform secret in the target workspace. Create the secret before submitting the job, then set api_key_secret to the secret's name ("nvidia-api-key" in this example):

client.secrets.create(
    name="nvidia-api-key",
    value=os.environ["NVIDIA_API_KEY"],
)

Model as the Evaluation Target

Use target=model when the evaluator should call the model to generate the sample output before scoring.

from nemo_evaluator_sdk import (
    RunConfigOnlineModel,
    ExactMatchMetric,
    InferenceParams,
    Model,
)


model = Model(
    url="https://integrate.api.nvidia.com/v1",
    name="meta/llama-3.1-70b-instruct",
    format="nim",
    api_key_secret="NVIDIA_API_KEY",
)

metric = ExactMatchMetric(reference="{{item.expected_answer}}")

result = evaluator.run(
    metric=metric,
    dataset=[
        {"question": "What is the capital of France?", "expected_answer": "Paris"},
    ],
    config=RunConfigOnlineModel(
        parallelism=4,
        inference=InferenceParams(temperature=0.1, max_tokens=64),
    ),
    target=model,
    prompt_template="Answer this question concisely: {{item.question}}",
)
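
The {{item.<field>}} placeholders in prompt_template and reference are filled from each dataset row. The sketch below illustrates that substitution with a plain regular expression; render and PLACEHOLDER are hypothetical names, and the SDK's actual template engine is not shown here and may support more syntax.

```python
import re

# Matches {{item.field}} with optional whitespace inside the braces.
PLACEHOLDER = re.compile(r"\{\{\s*item\.(\w+)\s*\}\}")


def render(template: str, item: dict) -> str:
    """Replace each {{item.<field>}} with the matching value from item."""
    return PLACEHOLDER.sub(lambda m: str(item[m.group(1)]), template)


row = {"question": "What is the capital of France?", "expected_answer": "Paris"}
prompt = render("Answer this question concisely: {{item.question}}", row)
reference = render("{{item.expected_answer}}", row)
print(prompt)     # Answer this question concisely: What is the capital of France?
print(reference)  # Paris
```

In this sketch a placeholder that names a field missing from the row raises KeyError, which makes dataset and template mismatches easy to spot.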

Model on a Judge Metric

Use a model field on the metric when the metric itself calls an LLM to score existing outputs.

from nemo_evaluator_sdk import Model, RangeScore, LLMJudgeMetric

judge_model = Model(
    url="https://integrate.api.nvidia.com/v1",
    name="meta/llama-3.1-70b-instruct",
    format="nim",
    api_key_secret="NVIDIA_API_KEY",
)

metric = LLMJudgeMetric(
    model=judge_model,
    scores=[
        RangeScore(
            name="correctness",
            description="Correctness from 1 to 5.",
            minimum=1,
            maximum=5,
        ),
    ],
    prompt_template={
        "messages": [
            {
                "role": "system",
                "content": "Return JSON with a correctness score from 1 to 5.",
            },
            {
                "role": "user",
                "content": "Question: {{item.question}}\nAnswer: {{item.output}}\nExpected: {{item.expected_answer}}",
            },
        ],
    },
)

result = evaluator.run(
    metric=metric,
    dataset=[
        {
            "question": "What is the capital of France?",
            "output": "Paris",
            "expected_answer": "Paris",
        },
    ],
)
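
The system prompt above asks the judge to return JSON, and the RangeScore bounds the value to 1 through 5. As a rough illustration of what checking such a reply involves, the sketch below parses and range-checks one score. parse_range_score is a hypothetical helper, and the reply shape {"correctness": 5} with integer scores is an assumption; the SDK's own parsing is not documented on this page.

```python
import json


def parse_range_score(raw: str, name: str, minimum: int, maximum: int) -> int:
    """Parse a judge reply such as '{"correctness": 4}' and range-check the score."""
    value = json.loads(raw)[name]
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(value, bool) or not isinstance(value, int):
        raise ValueError(f"{name} must be an integer, got {value!r}")
    if not minimum <= value <= maximum:
        raise ValueError(f"{name}={value} is outside [{minimum}, {maximum}]")
    return value


score = parse_range_score('{"correctness": 5}', "correctness", 1, 5)
print(score)  # 5
```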

Runtime Parameters

Use RunConfigOnlineModel for model-target evaluations:

from nemo_evaluator_sdk import (
    RunConfigOnlineModel,
    InferenceParams,
    ReasoningParams,
)


params = RunConfigOnlineModel(
    parallelism=4,
    request_timeout=60,
    max_retries=2,
    ignore_request_failure=False,
    inference=InferenceParams(temperature=0.2, max_tokens=256),
    reasoning=ReasoningParams(end_token="</think>"),
)
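
One plausible reading of ReasoningParams(end_token="</think>") is that the token marks where a reasoning model's trace ends, so that only the text after it is scored. The sketch below illustrates that idea only; strip_reasoning is a hypothetical helper, and this interpretation of end_token is an assumption rather than documented SDK behavior.

```python
def strip_reasoning(text: str, end_token: str = "</think>") -> str:
    """Return the text after the last end_token, or the whole text if the token is absent."""
    _, found, answer = text.rpartition(end_token)
    return answer.strip() if found else text.strip()


raw_output = "<think>France's capital city is well known.</think>\nParis"
print(strip_reasoning(raw_output))  # Paris
print(strip_reasoning("Paris"))     # Paris
```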

Use plain RunConfig for offline evaluations where the dataset already contains the output to score.
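
To give a feel for how parallelism, max_retries, and ignore_request_failure could interact during an online run, here is a standard-library sketch. call_with_retries, run_all, and flaky_model are hypothetical names, and the semantics shown (one initial attempt plus max_retries retries, with failures either raised or recorded as None) are assumptions, not the SDK's documented behavior. request_timeout is left out of the sketch.

```python
import threading
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Optional


def call_with_retries(fn: Callable[[str], str], prompt: str,
                      max_retries: int, ignore_request_failure: bool) -> Optional[str]:
    """Call fn once, then retry up to max_retries more times on failure."""
    for attempt in range(max_retries + 1):
        try:
            return fn(prompt)
        except Exception:
            if attempt == max_retries:
                if ignore_request_failure:
                    return None  # record the sample as failed and continue
                raise
    return None  # not reached; keeps type checkers satisfied


def run_all(fn: Callable[[str], str], prompts: List[str], parallelism: int,
            max_retries: int, ignore_request_failure: bool) -> List[Optional[str]]:
    """Send prompts with at most `parallelism` requests in flight; keep input order."""
    with ThreadPoolExecutor(max_workers=parallelism) as pool:
        futures = [
            pool.submit(call_with_retries, fn, p, max_retries, ignore_request_failure)
            for p in prompts
        ]
        return [f.result() for f in futures]


# A fake endpoint that fails the first time it sees each prompt.
attempts: dict = {}
lock = threading.Lock()


def flaky_model(prompt: str) -> str:
    with lock:
        attempts[prompt] = attempts.get(prompt, 0) + 1
        count = attempts[prompt]
    if count == 1:
        raise TimeoutError("simulated request failure")
    return prompt.upper()


outputs = run_all(flaky_model, ["a", "b", "c"], parallelism=4,
                  max_retries=2, ignore_request_failure=False)
print(outputs)  # ['A', 'B', 'C']
```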

Model References

The plugin SDK examples on this page use inline Model objects. If your deployment resolves platform model entities into model endpoint details, perform that lookup before constructing the Model, then pass the resulting inline model to the metric or request.

For evaluating agentic systems, use an Agent request target instead of a Model. See Agent Configuration.