RAG Evaluation Metrics | NVIDIA NeMo Platform

RAG (Retrieval Augmented Generation) metrics evaluate the quality of RAG pipelines by measuring both retrieval and answer generation performance. These metrics use RAGAS to assess how well retrieved contexts support generated answers.

Overview

RAGAS metrics support both offline and online evaluation modes:

Offline evaluation: Uses pre-generated responses from your dataset
Online evaluation: Responses are generated automatically using a model and prompt template before evaluation. In this mode:
- The job’s model and prompt_template are used to generate responses.
- The generated response (in sample["output_text"]) is automatically used as response in the RAGAS evaluation.
- RAG context variables can be included in the job’s prompt_template:
  - {{user_input}}: user question or input from the dataset
  - {{retrieved_contexts}}: retrieved context passages from the dataset

RAGAS metrics require:

Judge LLM: An LLM to evaluate answer quality (required for most metrics)
Judge Embeddings (optional): An embeddings model, required only for response_relevancy
Data: Questions, contexts, and either pre-generated responses (offline) or a model to generate them (online)

Prerequisites

Before running RAG evaluations:

Workspace: Have a workspace created. All resources (metrics, secrets, jobs) are scoped to a workspace.
Model Endpoints: Access to judge LLM endpoints (and embeddings model for some metrics)
API Keys (if required): Create secrets for any endpoints requiring authentication
Initialize the SDK:

import os

from nemo_evaluator_sdk import RunConfigOnlineModel, RunConfig, InferenceParams, Model
from nemo_evaluator_sdk.metrics.ragas import (
    ContextEntityRecallMetric,
    ContextPrecisionMetric,
    ContextRecallMetric,
    ContextRelevanceMetric,
    FaithfulnessMetric,
    NoiseSensitivityMetric,
    ResponseGroundednessMetric,
    ResponseRelevancyMetric,
)
from nemo_evaluator.sdk import Evaluator
from nemo_platform import NeMoPlatform

client = NeMoPlatform(
    base_url=os.environ.get("NMP_BASE_URL", "http://localhost:8080"),
    workspace="default",
)
evaluator: Evaluator = client.evaluator  # this object is an Evaluator resource

Use evaluator.run(metric=metric, dataset=dataset) for a local synchronous evaluation. Use evaluator.submit(metric=metric, dataset=dataset) when you need a durable remote job:

job = evaluator.submit(metric=metric, dataset=dataset)
job.wait_until_done()
result = job.get_result()

Creating a Secret for API Keys

If using external endpoints that require authentication, such as NVIDIA Build endpoints, create a secret first:

client.secrets.create(
    name="nvidia-api-key",
    value="nvapi-YOUR_API_KEY_HERE",
    description="NVIDIA Build API key for RAG metrics",
)

Reference secrets by name in your model configuration. For local run versus remote submit behavior, see Model API Authentication.

judge_model = Model(
    url="https://integrate.api.nvidia.com/v1/chat/completions",
    name="meta/llama-3.1-70b-instruct",
    api_key_secret="nvidia-api-key",
)

RAGAS metrics accept inline model definitions for judge_model and, where required, embeddings_model.

See Model Configuration for details.

Supported RAGAS Metrics

Use Case	Metric Type	Description	Required Columns*
Measure retrieval quality	`context_recall`	Coverage of reference information in retrieved context	user_input, retrieved_contexts, reference
Measure retrieval quality	`context_precision`	Whether all retrieved chunks are relevant to the question	user_input, retrieved_contexts, reference
Measure retrieval quality	`context_relevance`	Relevance of retrieved context to the question	user_input, retrieved_contexts
Measure retrieval quality	`context_entity_recall`	Recall of important entities from reference in context	retrieved_contexts, reference
Detect hallucinations	`faithfulness`	Measures factual consistency of response with retrieved context	user_input, response, retrieved_contexts
Detect hallucinations	`response_groundedness`	Evaluates whether response is grounded in context without hallucinations	response, retrieved_contexts
Detect hallucinations	`noise_sensitivity`	Robustness to noisy or irrelevant context	user_input, response, reference, retrieved_contexts
Check if answers address the question	`response_relevancy`**	Response relevancy to question using embeddings similarity	user_input, response, retrieved_contexts

* Required Columns: Dataset columns that must be present for the metric to be evaluated.

** Requires embeddings_model in addition to judge_model.
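
The required columns listed in the table can be encoded as a lookup and checked against each row before an evaluation is started. The following is a minimal sketch in plain Python; `REQUIRED_COLUMNS` and `missing_columns` are hypothetical names introduced for illustration and are not part of the SDK. In online evaluation the response is generated for you, so `response` would not need to be present in the dataset.

```python
# Required dataset columns per metric type, copied from the table above.
REQUIRED_COLUMNS = {
    "context_recall": {"user_input", "retrieved_contexts", "reference"},
    "context_precision": {"user_input", "retrieved_contexts", "reference"},
    "context_relevance": {"user_input", "retrieved_contexts"},
    "context_entity_recall": {"retrieved_contexts", "reference"},
    "faithfulness": {"user_input", "response", "retrieved_contexts"},
    "response_groundedness": {"response", "retrieved_contexts"},
    "noise_sensitivity": {"user_input", "response", "reference", "retrieved_contexts"},
    "response_relevancy": {"user_input", "response", "retrieved_contexts"},
}


def missing_columns(metric_type: str, rows: list[dict]) -> dict[int, list[str]]:
    """Map row index to the sorted list of required columns that row lacks."""
    required = REQUIRED_COLUMNS[metric_type]
    problems = {}
    for index, row in enumerate(rows):
        missing = sorted(required - set(row))
        if missing:
            problems[index] = missing
    return problems


rows = [
    {
        "user_input": "What is the capital of France?",
        "retrieved_contexts": ["Paris is the capital and largest city of France."],
        "response": "The capital of France is Paris.",
    }
]
print(missing_columns("faithfulness", rows))    # {}
print(missing_columns("context_recall", rows))  # {0: ['reference']}
```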

Shared Example Setup

For local run versus remote submit behavior of api_key_secret, see Model API Authentication.

The metric examples below use these inline values:

judge_model = Model(
    url="https://integrate.api.nvidia.com/v1/chat/completions",
    name="meta/llama-3.1-70b-instruct",
    api_key_secret="nvidia-api-key",
)

embeddings_model = Model(
    url="https://integrate.api.nvidia.com/v1/embeddings",
    name="nvidia/nv-embedqa-e5-v5",
    api_key_secret="nvidia-api-key",
)

generation_model = Model(
    url="https://integrate.api.nvidia.com/v1/chat/completions",
    name="nvidia/llama-3.3-nemotron-super-49b-v1",
    api_key_secret="nvidia-api-key",
)

Use offline rows when your RAG pipeline has already produced responses:

offline_rows = [
    {
        "user_input": "What is the capital of France?",
        "retrieved_contexts": ["Paris is the capital and largest city of France."],
        "response": "The capital of France is Paris.",
        "reference": "Paris is the capital of France.",
    }
]

Use online arguments when the evaluator should generate the response first:

online_dataset = [
    {
        "user_input": "What is the capital of France?",
        "retrieved_contexts": ["Paris is the capital and largest city of France."],
        "reference": "Paris is the capital of France.",
    }
]
online_prompt_template = {
    "messages": [
        {
            "role": "user",
            "content": "Context:\n{{item.retrieved_contexts | join('\n\n')}}\n\nQuestion: {{item.user_input}}\n\nAnswer:",
        }
    ]
}
online_config = RunConfigOnlineModel(
    parallelism=8,
    inference=InferenceParams(temperature=0.2, max_tokens=1024),
)

Context Recall

Measures the fraction of relevant content retrieved compared to the total relevant content in the reference.

Score name: context_recall
Score range: 0 to 1, with higher scores indicating better recall.
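
To make the ratio concrete: in the RAGAS formulation, as commonly described, the judge LLM splits the reference into claims and decides which of them can be attributed to the retrieved contexts. The sketch below skips the LLM and uses hand-labeled attributions, so the function name and the labels are illustrative assumptions rather than SDK behavior.

```python
def context_recall(attributed: list[bool]) -> float:
    """Fraction of reference claims that the retrieved contexts support."""
    if not attributed:
        raise ValueError("the reference produced no claims to check")
    return sum(attributed) / len(attributed)


# Hypothetical labels: four claims in the reference, three found in the contexts.
labels = [True, True, False, True]
recall_score = context_recall(labels)
print(recall_score)  # 0.75
```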

Data Format

{
  "user_input": "What is the capital of France?",
  "retrieved_contexts": [
    "Paris is the capital and largest city of France."
  ],
  "reference": "Paris is the capital of France."
}

Local Evaluation

metric = ContextRecallMetric(judge_model=judge_model)

result = evaluator.run(metric=metric, dataset=offline_rows, config=RunConfig(parallelism=8))

for score in result.aggregate_scores.scores:
    print(f"{score.name}: mean={score.mean}")

Context Precision

Measures the proportion of relevant chunks in the retrieved contexts (precision@k).

Score name: context_precision
Score range: 0 to 1, with higher scores indicating better precision.
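
Because the score is rank-aware, the same set of chunks can score differently depending on their order. A minimal sketch of the precision@k averaging, assuming the usual RAGAS definition (the mean of precision@k taken over the ranks that hold a relevant chunk) and hand-labeled relevance flags in place of the judge LLM; `context_precision` here is an illustrative function, not the SDK metric.

```python
def context_precision(relevance: list[bool]) -> float:
    """Average precision@k over the ranks k where the chunk is relevant."""
    hits = 0
    weighted_sum = 0.0
    for k, is_relevant in enumerate(relevance, start=1):
        if is_relevant:
            hits += 1
            weighted_sum += hits / k  # precision@k at this rank
    return weighted_sum / hits if hits else 0.0


# Same chunks, different order: relevant chunks ranked earlier score higher.
print(round(context_precision([True, False, True]), 4))  # 0.8333
print(round(context_precision([False, True, True]), 4))  # 0.5833
```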

Data Format

{
  "user_input": "What is the capital of France?",
  "retrieved_contexts": [
    "Paris is the capital and largest city of France."
  ],
  "reference": "Paris"
}

Local Evaluation

metric = ContextPrecisionMetric(judge_model=judge_model)

result = evaluator.run(metric=metric, dataset=offline_rows, config=RunConfig(parallelism=8))

for score in result.aggregate_scores.scores:
    print(f"{score.name}: mean={score.mean}")

Context Relevance

Measures how relevant the retrieved contexts are to the user input.

Score name: context_relevance
Score range: 0 to 1, with higher scores indicating better relevance.

Data Format

{
  "user_input": "What is the capital of France?",
  "retrieved_contexts": [
    "Paris is the capital and largest city of France."
  ]
}

Local Evaluation

metric = ContextRelevanceMetric(judge_model=judge_model)

result = evaluator.run(
    metric=metric,
    dataset=[
        {
            "user_input": "What is the capital of France?",
            "retrieved_contexts": ["Paris is the capital and largest city of France."],
        }
    ],
    config=RunConfig(parallelism=8),
)

for score in result.aggregate_scores.scores:
    print(f"{score.name}: mean={score.mean}")

Context Entity Recall

Measures how many important entities from the reference are present in the retrieved contexts.

Score name: context_entity_recall
Score range: 0 to 1, with higher scores indicating better entity recall.
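
Conceptually this is a set overlap: the share of reference entities that also appear in the retrieved contexts. The sketch below assumes the entities have already been extracted (the metric itself relies on the judge LLM for that step); the function and the entity sets are hypothetical and only illustrate the arithmetic.

```python
def context_entity_recall(reference_entities: set[str], context_entities: set[str]) -> float:
    """Share of reference entities that also occur in the retrieved contexts."""
    if not reference_entities:
        return 0.0
    return len(reference_entities & context_entities) / len(reference_entities)


# Hand-picked entities for illustration; "Seine" is missing from the contexts.
reference_entities = {"Paris", "France", "Seine"}
context_entities = {"Paris", "France"}
entity_score = context_entity_recall(reference_entities, context_entities)
print(round(entity_score, 3))  # 0.667
```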

Data Format

{
  "retrieved_contexts": [
    "Paris is the capital and largest city of France."
  ],
  "reference": "Paris is the capital of France."
}

Local Evaluation

metric = ContextEntityRecallMetric(judge_model=judge_model)

result = evaluator.run(
    metric=metric,
    dataset=[
        {
            "retrieved_contexts": ["Paris is the capital and largest city of France."],
            "reference": "Paris is the capital of France.",
        }
    ],
    config=RunConfig(parallelism=8),
)

for score in result.aggregate_scores.scores:
    print(f"{score.name}: mean={score.mean}")

Faithfulness

Measures factual consistency of the response with the retrieved context.

Score name: faithfulness
Score range: 0 to 1, with higher scores indicating the response is more faithful to the context.
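
As a rough illustration of what the score captures: in the RAGAS formulation, as commonly described, the response is broken into claims and each claim is checked against the retrieved context. The sketch below assumes the claims and their verdicts are already available (the judge LLM normally produces them); `faithfulness_ratio` and the second sample claim are made up for illustration.

```python
def faithfulness_ratio(claims: list[tuple[str, bool]]) -> tuple[float, list[str]]:
    """Return the supported-claim ratio and the claims the context does not support."""
    if not claims:
        raise ValueError("the response produced no claims to check")
    unsupported = [text for text, supported in claims if not supported]
    score = (len(claims) - len(unsupported)) / len(claims)
    return score, unsupported


claims = [
    ("The capital of France is Paris.", True),
    ("Paris has 10 million residents.", False),  # not stated in the context
]
faith_score, unsupported_claims = faithfulness_ratio(claims)
print(faith_score)         # 0.5
print(unsupported_claims)  # ['Paris has 10 million residents.']
```

Listing the unsupported claims alongside the ratio is a design choice of this sketch; it makes a low score easier to trace back to specific statements.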

Data Format

{
  "user_input": "What is the capital of France?",
  "retrieved_contexts": [
    "Paris is the capital and largest city of France."
  ],
  "response": "The capital of France is Paris."
}

Local Evaluation

metric = FaithfulnessMetric(judge_model=judge_model)

result = evaluator.run(metric=metric, dataset=offline_rows, config=RunConfig(parallelism=8))

for score in result.aggregate_scores.scores:
    print(f"{score.name}: mean={score.mean}")

Response Groundedness

Evaluates whether the response is grounded in the retrieved context without hallucinations.

Score name: response_groundedness
Score range: 0 to 1, with higher scores indicating stronger grounding.

Data Format

{
  "retrieved_contexts": [
    "Paris is the capital and largest city of France."
  ],
  "response": "The capital of France is Paris."
}

Local Evaluation

metric = ResponseGroundednessMetric(judge_model=judge_model)

result = evaluator.run(metric=metric, dataset=offline_rows, config=RunConfig(parallelism=8))

for score in result.aggregate_scores.scores:
    print(f"{score.name}: mean={score.mean}")

Noise Sensitivity

Measures robustness when retrieved contexts contain noisy or irrelevant information.

Score name: noise_sensitivity
Score range: 0 to 1. Lower scores usually indicate the response is less sensitive to noise.

Data Format

{
  "user_input": "What is the capital of France?",
  "retrieved_contexts": [
    "Paris is the capital and largest city of France.",
    "Berlin is the capital of Germany."
  ],
  "response": "The capital of France is Paris.",
  "reference": "Paris is the capital of France."
}

Local Evaluation

metric = NoiseSensitivityMetric(judge_model=judge_model)

result = evaluator.run(
    metric=metric,
    dataset=[
        {
            "user_input": "What is the capital of France?",
            "retrieved_contexts": [
                "Paris is the capital and largest city of France.",
                "Berlin is the capital of Germany.",
            ],
            "response": "The capital of France is Paris.",
            "reference": "Paris is the capital of France.",
        }
    ],
    config=RunConfig(parallelism=8),
)

for score in result.aggregate_scores.scores:
    print(f"{score.name}: mean={score.mean}")

Response Relevancy

Measures how relevant a response is to the user input using generated questions and embedding similarity.

Score name: response_relevancy
Score range: 0 to 1, with higher scores indicating better relevancy.
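
The "generated questions and embedding similarity" step can be pictured as follows: the judge LLM proposes questions that the response would answer, all questions are embedded, and the score is the mean cosine similarity between the original user input and the generated questions. The sketch below is an assumption-laden illustration using toy two-dimensional vectors in place of real embeddings; `cosine_similarity` and `response_relevancy` are hypothetical helpers, not SDK functions.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def response_relevancy(user_input_vec: list[float], generated_vecs: list[list[float]]) -> float:
    """Mean similarity between the user input and each generated question."""
    similarities = [cosine_similarity(user_input_vec, vec) for vec in generated_vecs]
    return sum(similarities) / len(similarities)


# Toy embeddings: one generated question matches the input, one is orthogonal.
user_input_vec = [1.0, 0.0]
generated_vecs = [[1.0, 0.0], [0.0, 1.0]]
relevancy_score = response_relevancy(user_input_vec, generated_vecs)
print(relevancy_score)  # 0.5
```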

Data Format

{
  "user_input": "What is the capital of France?",
  "retrieved_contexts": [
    "Paris is the capital and largest city of France."
  ],
  "response": "The capital of France is Paris."
}

Configuration Options

Parameter	Type	Default	Description
`strictness`	int	`1`	Number of parallel questions generated. NIM supports `1`.

Local Evaluation

metric = ResponseRelevancyMetric(
    judge_model=judge_model,
    embeddings_model=embeddings_model,
    strictness=1,
)

result = evaluator.run(metric=metric, dataset=offline_rows, config=RunConfig(parallelism=8))

for score in result.aggregate_scores.scores:
    print(f"{score.name}: mean={score.mean}")

Dataset Format

RAGAS metrics use specific column names:

Field	Type	Required	Description
`user_input`	string	Most metrics	User question or input. Not required by `context_entity_recall` or `response_groundedness`.
`retrieved_contexts`	list[string]	Some metrics	List of context passages
`response`	string	Some metrics	Generated answer. Required for offline response-quality metrics; generated as `sample.output_text` for online model requests.
`reference`	string	Some metrics	Reference answer or ground truth

Different metrics require different columns. See the Required Columns listed under Supported RAGAS Metrics for each metric's requirements.

Example Dataset

{
  "user_input": "What is the capital of France?",
  "retrieved_contexts": [
    "Paris is the capital and largest city of France."
  ],
  "response": "The capital of France is Paris.",
  "reference": "Paris"
}

Response Format

All evaluation responses follow this structure:

{
  "metric": {
    "type": "faithfulness",
    "judge_model": {
      "url": "...",
      "name": "..."
    }
  },
  "aggregate_scores": {
    "scores": [
      {
        "name": "faithfulness",
        "count": 1,
        "mean": 0.95,
        "min": 0.95,
        "max": 0.95,
        "sum": 0.95
      }
    ]
  },
  "row_scores": [
    {
      "row_index": 0,
      "metrics": {
        "faithfulness": [
          {"name": "faithfulness", "value": 0.95}
        ]
      },
      "error": null
    }
  ]
}

Working with Results

# Access aggregate scores
for score in result.aggregate_scores.scores:
    print(f"{score.name}: mean={score.mean}, count={score.count}")

# Access per-row scores
for row in result.row_scores:
    if row.metrics:
        print(f"Row {row.row_index}: {row.metrics}")
    elif row.error:
        print(f"Row {row.row_index} failed: {row.error}")
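
If a result has been serialized to JSON, the same inspection can be done on plain dictionaries. The following is a minimal sketch, assuming the payload follows the structure shown under Response Format; `summarize_rows` is a hypothetical helper rather than an SDK function, and the sample payload (including the error text) is made up for illustration.

```python
def summarize_rows(result: dict, metric_name: str) -> dict:
    """Collect scored values and failed row indices for one metric."""
    values = []
    failed_rows = []
    for row in result["row_scores"]:
        if row.get("error"):
            failed_rows.append(row["row_index"])
            continue
        for score in (row.get("metrics") or {}).get(metric_name, []):
            values.append(score["value"])
    mean = sum(values) / len(values) if values else None
    return {"mean": mean, "scored": len(values), "failed_rows": failed_rows}


sample_result = {
    "row_scores": [
        {
            "row_index": 0,
            "metrics": {"faithfulness": [{"name": "faithfulness", "value": 0.95}]},
            "error": None,
        },
        {"row_index": 1, "metrics": {}, "error": "judge call timed out"},
    ]
}
summary = summarize_rows(sample_result, "faithfulness")
print(summary)  # {'mean': 0.95, 'scored': 1, 'failed_rows': [1]}
```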

Managing Secrets for Authenticated Endpoints

Store API keys as secrets for secure authentication:

client.secrets.create(name="judge-api-key", value="<your-judge-key>")
client.secrets.create(name="embedding-api-key", value="<your-embedding-key>")

Reference secrets by name in your metric configuration. For local run versus remote submit behavior, see Model API Authentication.

judge_model = Model(
    url="https://integrate.api.nvidia.com/v1/chat/completions",
    name="meta/llama-3.1-70b-instruct",
    api_key_secret="judge-api-key",
)

Job Management

For durable remote execution, submit the same metric and dataset that you tested locally:

job = evaluator.submit(metric=metric, dataset=dataset)
job.wait_until_done()
artifacts_dir = job.download_artifacts(path="evaluation_artifacts")
print(f"Saved artifacts under {artifacts_dir}")

Troubleshooting

Common Errors

Error	Cause	Solution
`judge_model` is required	Missing judge LLM config for metric	Add `judge_model` to metric configuration
`embeddings_model` is required	Using `response_relevancy` without embeddings	Add `embeddings_model` to metric configuration
Job stuck in “pending”	Model endpoint not accessible	Verify endpoint URLs and API key secrets. See Model API Authentication
Authentication failed	Invalid or missing API key	Check `api_key_secret` for the execution mode. See Model API Authentication
`nan_count > 0` and `mean = null`	Judge/model call failures, such as auth, endpoint, quota, or timeout. Some RAGAS metrics are known to return `NaN` instead of raising on these failures.	Inspect row-level `error`; verify API key, endpoint, and model access
Low faithfulness scores	Context doesn’t support the response	Improve retrieval or response generation

If you see nan_count > 0 with mean = null, validate judge model authentication first: for some RAGAS metrics, auth failures are converted to NaN scores instead of surfacing as a hard error.
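
Because such failures can arrive as NaN values rather than errors, it can help to scan row-level scores explicitly. A minimal sketch, assuming row scores shaped like the `row_scores` entries under Response Format and that a missing score may appear as either NaN or null; `rows_with_nan` is a hypothetical helper and the sample rows are made up.

```python
import math


def rows_with_nan(row_scores: list[dict]) -> list[int]:
    """Return indices of rows where any metric value is NaN or missing."""
    flagged = []
    for row in row_scores:
        metrics = row.get("metrics") or {}
        values = [score["value"] for scores in metrics.values() for score in scores]
        if any(v is None or (isinstance(v, float) and math.isnan(v)) for v in values):
            flagged.append(row["row_index"])
    return flagged


sample_rows = [
    {"row_index": 0, "metrics": {"faithfulness": [{"name": "faithfulness", "value": 0.9}]}},
    {"row_index": 1, "metrics": {"faithfulness": [{"name": "faithfulness", "value": float("nan")}]}},
    {"row_index": 2, "metrics": {"faithfulness": [{"name": "faithfulness", "value": None}]}},
]
flagged_rows = rows_with_nan(sample_rows)
print(flagged_rows)  # [1, 2]
```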

Tips for Better Results

Use larger judge models (70B+) for more consistent scoring.
Start with inline datasets to test your configuration before large evaluations.
Set appropriate timeouts - judge LLM calls can take time with large contexts.
Use parallelism wisely - increase parallelism for faster evaluation, but respect rate limits.
Column names matter - RAGAS metrics use user_input, retrieved_contexts, response, and reference.

Important Notes

Secret Management: API keys should be referenced through api_key_secret, with different local run and remote submit behavior. See Model API Authentication. Never pass API keys directly in the request.
Column Names: RAGAS metrics use specific column names:
- user_input (not question)
- response (not answer)
- retrieved_contexts (not contexts)
- reference (not ground_truth)
Embeddings Model: Only response_relevancy requires an embeddings model. All other metrics use only the judge LLM.
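
Datasets that still use the older column names can be converted with a small renaming pass before evaluation. A minimal sketch in plain Python; `COLUMN_ALIASES` and `to_ragas_columns` are hypothetical names, and the mapping simply mirrors the pairs listed in the Column Names note.

```python
# Legacy name -> name expected by the RAGAS metrics (from the note above).
COLUMN_ALIASES = {
    "question": "user_input",
    "answer": "response",
    "contexts": "retrieved_contexts",
    "ground_truth": "reference",
}


def to_ragas_columns(row: dict) -> dict:
    """Rename legacy keys; keys that are already correct pass through unchanged."""
    return {COLUMN_ALIASES.get(key, key): value for key, value in row.items()}


legacy_row = {
    "question": "What is the capital of France?",
    "contexts": ["Paris is the capital and largest city of France."],
    "answer": "The capital of France is Paris.",
    "ground_truth": "Paris",
}
converted = to_ragas_columns(legacy_row)
print(sorted(converted))  # ['reference', 'response', 'retrieved_contexts', 'user_input']
```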

Limitations

Judge Model Quality: Evaluation quality depends on the judge model’s ability to follow instructions. Larger models (70B+) typically produce more consistent results.
Dataset Format: RAGAS metrics use specific column names (user_input, retrieved_contexts, response, reference). Ensure your data matches this structure.

See Also

LLM-as-a-Judge - Custom judge-based evaluation
Agentic Metrics - Evaluate agent workflows