
# RAG Evaluation Metrics

<a id="eval-metrics-rag" />

<a id="eval-flow-rag" />

RAG (Retrieval Augmented Generation) metrics evaluate the quality of RAG pipelines by measuring both retrieval and answer generation performance. These metrics use [RAGAS](https://github.com/explodinggradients/ragas) to assess how well retrieved contexts support generated answers.

## Overview

RAGAS metrics support both **offline** and **online** evaluation modes:

* **Offline evaluation**: Uses pre-generated responses from your dataset.
* **Online evaluation**: Generates responses with a model and prompt template before evaluation.

In online mode:

1. The job's **model** and **prompt\_template** are used to generate responses.
2. The **generated response** (in `sample["output_text"]`) is automatically used as `response` in the RAGAS evaluation.
3. **RAG context variables** can be included in the job's `prompt_template`:
   * `{{user_input}}` - User question/input from dataset
   * `{{retrieved_contexts}}` - Retrieved context passages from dataset
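The online flow described above can be sketched in plain Python. This is an illustration only, not the evaluator's implementation: `render_prompt`, `stub_model`, and `build_ragas_row` are hypothetical helpers, and the simple `{{name}}` substitution stands in for the real template engine.

```python
import re


def render_prompt(template, row):
    """Fill {{name}} placeholders from the row; list values are joined with blank lines."""
    def substitute(match):
        value = row[match.group(1)]
        return "\n\n".join(value) if isinstance(value, list) else str(value)

    return re.sub(r"\{\{\s*(\w+)\s*\}\}", substitute, template)


def stub_model(prompt):
    """Hypothetical stand-in for the call to the generation model."""
    return "The capital of France is Paris."


def build_ragas_row(row, template, generate):
    """Generate the output text, then expose it under the `response` column."""
    output_text = generate(render_prompt(template, row))
    return {**row, "response": output_text}


row = {
    "user_input": "What is the capital of France?",
    "retrieved_contexts": ["Paris is the capital and largest city of France."],
}
template = "Context:\n{{retrieved_contexts}}\n\nQuestion: {{user_input}}\n\nAnswer:"
ragas_row = build_ragas_row(row, template, stub_model)
print(ragas_row["response"])
```

The row that reaches the metric has the same shape as an offline row, which is why the same metric definitions work in both modes.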

RAGAS metrics require:

* **Judge LLM**: An LLM to evaluate answer quality (required for most metrics)
* **Judge Embeddings**: An embeddings model, needed only by metrics that compare embeddings, such as `response_relevancy`
* **Data**: Questions, contexts, and either pre-generated responses (offline) or a model to generate them (online)

## Prerequisites

Before running RAG evaluations:

1. **Workspace**: Have a workspace created. All resources (metrics, secrets, jobs) are scoped to a workspace.
2. **Model Endpoints**: Access to judge LLM endpoints (and embeddings model for some metrics)
3. **API Keys (if required)**: [Create secrets](#managing-secrets-for-authenticated-endpoints) for any endpoints requiring authentication
4. **Initialize the SDK**:

```python
import os

from nemo_evaluator_sdk import RunConfigOnlineModel, RunConfig, InferenceParams, Model
from nemo_evaluator_sdk.metrics.ragas import (
    ContextEntityRecallMetric,
    ContextPrecisionMetric,
    ContextRecallMetric,
    ContextRelevanceMetric,
    FaithfulnessMetric,
    NoiseSensitivityMetric,
    ResponseGroundednessMetric,
    ResponseRelevancyMetric,
)
from nemo_evaluator.sdk import Evaluator
from nemo_platform import NeMoPlatform

client = NeMoPlatform(
    base_url=os.environ.get("NMP_BASE_URL", "http://localhost:8080"),
    workspace="default",
)
evaluator: Evaluator = client.evaluator  # this object is an Evaluator resource
```

Use `evaluator.run(metric=metric, dataset=dataset)` for a local synchronous evaluation. Use `evaluator.submit(metric=metric, dataset=dataset)` when you need a durable remote job:

```python
job = evaluator.submit(metric=metric, dataset=dataset)
job.wait_until_done()
result = job.get_result()
```

### Creating a Secret for API Keys

If using external endpoints that require authentication, such as NVIDIA Build endpoints, create a secret first:

```python
client.secrets.create(
    name="nvidia-api-key",
    value="nvapi-YOUR_API_KEY_HERE",
    description="NVIDIA Build API key for RAG metrics",
)
```

Reference secrets by name in your model configuration. For local `run` versus remote `submit` behavior, see [Model API Authentication](/documentation/evaluate-models/metrics/model-configuration#model-api-authentication).

```python
judge_model = Model(
    url="https://integrate.api.nvidia.com/v1/chat/completions",
    name="meta/llama-3.1-70b-instruct",
    api_key_secret="nvidia-api-key",
)
```

RAGAS metrics accept inline model definitions for `judge_model` and, where required, `embeddings_model`.

See [Model Configuration](/documentation/evaluate-models/metrics/model-configuration) for details.

***

## Supported RAGAS Metrics

| Use Case                                  | Metric Type                                       | Description                                                              | Required Columns\*                                    |
| :---------------------------------------- | :------------------------------------------------ | :----------------------------------------------------------------------- | :---------------------------------------------------- |
| **Measure retrieval quality**             | [`context_recall`](#context-recall)               | Coverage of reference information in retrieved context                   | user\_input, retrieved\_contexts, reference           |
|                                           | [`context_precision`](#context-precision)         | Whether all retrieved chunks are relevant to the question                | user\_input, retrieved\_contexts, reference           |
|                                           | [`context_relevance`](#context-relevance)         | Relevance of retrieved context to the question                           | user\_input, retrieved\_contexts                      |
|                                           | [`context_entity_recall`](#context-entity-recall) | Recall of important entities from reference in context                   | retrieved\_contexts, reference                        |
| **Detect hallucinations**                 | [`faithfulness`](#faithfulness)                   | Measures factual consistency of response with retrieved context          | user\_input, response, retrieved\_contexts            |
|                                           | [`response_groundedness`](#response-groundedness) | Evaluates whether response is grounded in context without hallucinations | response, retrieved\_contexts                         |
|                                           | [`noise_sensitivity`](#noise-sensitivity)         | Robustness to noisy or irrelevant context                                | user\_input, response, reference, retrieved\_contexts |
| **Check if answers address the question** | [`response_relevancy`\*\*](#response-relevancy)   | Response relevancy to question using embeddings similarity               | user\_input, response, retrieved\_contexts            |

\* *Required Columns: Dataset columns that must be present for the metric to be evaluated.*

\*\* *Requires `embeddings_model` in addition to `judge_model`.*
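A quick pre-flight check can catch missing columns before any judge calls are made. The sketch below copies the Required Columns from the table above into a dictionary; `REQUIRED_COLUMNS` and `missing_columns` are hypothetical names for illustration and are not part of the SDK.

```python
# Required dataset columns per metric type, copied from the table above.
REQUIRED_COLUMNS = {
    "context_recall": {"user_input", "retrieved_contexts", "reference"},
    "context_precision": {"user_input", "retrieved_contexts", "reference"},
    "context_relevance": {"user_input", "retrieved_contexts"},
    "context_entity_recall": {"retrieved_contexts", "reference"},
    "faithfulness": {"user_input", "response", "retrieved_contexts"},
    "response_groundedness": {"response", "retrieved_contexts"},
    "noise_sensitivity": {"user_input", "response", "reference", "retrieved_contexts"},
    "response_relevancy": {"user_input", "response", "retrieved_contexts"},
}


def missing_columns(metric_type, rows):
    """Return {row_index: sorted missing column names} for rows that lack required columns."""
    required = REQUIRED_COLUMNS[metric_type]
    problems = {}
    for index, row in enumerate(rows):
        missing = sorted(required - set(row))
        if missing:
            problems[index] = missing
    return problems


rows = [{"user_input": "What is the capital of France?", "retrieved_contexts": ["Paris is the capital of France."]}]
print(missing_columns("faithfulness", rows))  # {0: ['response']}
```

For online evaluation, `response` is generated for you, so a missing `response` column in the input dataset is expected in that mode.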

***

## Shared Example Setup

The metric examples below share these inline model definitions. For local `run` versus remote `submit` behavior of `api_key_secret`, see [Model API Authentication](/documentation/evaluate-models/metrics/model-configuration#model-api-authentication).

```python
judge_model = Model(
    url="https://integrate.api.nvidia.com/v1/chat/completions",
    name="meta/llama-3.1-70b-instruct",
    api_key_secret="nvidia-api-key",
)

embeddings_model = Model(
    url="https://integrate.api.nvidia.com/v1/embeddings",
    name="nvidia/nv-embedqa-e5-v5",
    api_key_secret="nvidia-api-key",
)

generation_model = Model(
    url="https://integrate.api.nvidia.com/v1/chat/completions",
    name="nvidia/llama-3.3-nemotron-super-49b-v1",
    api_key_secret="nvidia-api-key",
)
```

Use offline rows when your RAG pipeline has already produced responses:

```python
offline_rows = [
    {
        "user_input": "What is the capital of France?",
        "retrieved_contexts": ["Paris is the capital and largest city of France."],
        "response": "The capital of France is Paris.",
        "reference": "Paris is the capital of France.",
    }
]
```

Use an online dataset (without `response`), a prompt template, and an online run configuration when the evaluator should generate the response first:

```python
online_dataset = [
    {
        "user_input": "What is the capital of France?",
        "retrieved_contexts": ["Paris is the capital and largest city of France."],
        "reference": "Paris is the capital of France.",
    }
]
online_prompt_template = {
    "messages": [
        {
            "role": "user",
            "content": "Context:\n{{item.retrieved_contexts | join('\n\n')}}\n\nQuestion: {{item.user_input}}\n\nAnswer:",
        }
    ]
}
online_config = RunConfigOnlineModel(
    parallelism=8,
    inference=InferenceParams(temperature=0.2, max_tokens=1024),
)
```

***

## Context Recall

Measures the fraction of relevant content retrieved compared to the total relevant content in the reference.

* Score name: `context_recall`
* Score range: 0 to 1, with higher scores indicating better recall.
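As a rough illustration of the definition above, assume the judge splits the reference into claims and marks each one as supported or unsupported by the retrieved contexts. The claims and verdicts below are made up, and `context_recall_from_verdicts` is a hypothetical helper, not part of the SDK or RAGAS.

```python
def context_recall_from_verdicts(verdicts):
    """Fraction of reference claims that the retrieved contexts support."""
    if not verdicts:
        return float("nan")  # nothing to score
    return sum(1 for supported in verdicts if supported) / len(verdicts)


# (claim from the reference, is it supported by the retrieved contexts?)
reference_claims = [
    ("Paris is the capital of France.", True),
    ("Paris hosted the 2024 Summer Olympics.", False),
]
score = context_recall_from_verdicts([supported for _, supported in reference_claims])
print(score)  # 0.5
```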

### Data Format

```json
{
  "user_input": "What is the capital of France?",
  "retrieved_contexts": [
    "Paris is the capital and largest city of France."
  ],
  "reference": "Paris is the capital of France."
}
```

```python
# Local synchronous run on the offline rows.
metric = ContextRecallMetric(judge_model=judge_model)

result = evaluator.run(metric=metric, dataset=offline_rows, config=RunConfig(parallelism=8))

for score in result.aggregate_scores.scores:
    print(f"{score.name}: mean={score.mean}")
```

```python
# Durable remote job with the same metric and rows.
metric = ContextRecallMetric(judge_model=judge_model)

job = evaluator.submit(metric=metric, dataset=offline_rows, config=RunConfig(parallelism=8))
job.wait_until_done()
result = job.get_result()
```

```json
{
  "scores": [
    {
      "count": 1,
      "histogram": {},
      "name": "context_recall",
      "nan_count": 0,
      "max": 1.0,
      "mean": 1.0,
      "min": 1.0,
      "percentiles": {},
      "score_type": "range",
      "std_dev": 0.0,
      "sum": 1.0,
      "variance": 0.0
    }
  ]
}
```

***

## Context Precision

Measures the proportion of relevant chunks in the retrieved contexts (precision\@k).

* Score name: `context_precision`
* Score range: 0 to 1, with higher scores indicating better precision.
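RAGAS documents context precision as precision\@k averaged over the ranks that hold a relevant chunk. Assuming this metric follows that definition, the arithmetic can be sketched as follows. The relevance flags would come from the judge LLM; here they are hard-coded, and `context_precision_at_k` is a hypothetical helper.

```python
def context_precision_at_k(relevance):
    """Average precision@k over the ranks k whose chunk is relevant."""
    hits = 0
    weighted_sum = 0.0
    for rank, is_relevant in enumerate(relevance, start=1):
        if is_relevant:
            hits += 1
            weighted_sum += hits / rank  # precision@k at this rank
    return weighted_sum / hits if hits else 0.0


# Chunks 1 and 3 are relevant, chunk 2 is not: (1/1 + 2/3) / 2
print(round(context_precision_at_k([True, False, True]), 4))  # 0.8333
```

Under this definition, ranking relevant chunks earlier raises the score even when the set of retrieved chunks stays the same.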

### Data Format

```json
{
  "user_input": "What is the capital of France?",
  "retrieved_contexts": [
    "Paris is the capital and largest city of France."
  ],
  "reference": "Paris"
}
```

```python
# Local synchronous run on the offline rows.
metric = ContextPrecisionMetric(judge_model=judge_model)

result = evaluator.run(metric=metric, dataset=offline_rows, config=RunConfig(parallelism=8))

for score in result.aggregate_scores.scores:
    print(f"{score.name}: mean={score.mean}")
```

```python
# Durable remote job with the same metric and rows.
metric = ContextPrecisionMetric(judge_model=judge_model)

job = evaluator.submit(metric=metric, dataset=offline_rows, config=RunConfig(parallelism=8))
job.wait_until_done()
result = job.get_result()
```

```json
{
  "scores": [
    {
      "count": 1,
      "histogram": {},
      "name": "context_precision",
      "nan_count": 0,
      "max": 1.0,
      "mean": 1.0,
      "min": 1.0
    }
  ]
}
```

***

## Context Relevance

Measures how relevant the retrieved contexts are to the user input.

* Score name: `context_relevance`
* Score range: 0 to 1, with higher scores indicating better relevance.

### Data Format

```json
{
  "user_input": "What is the capital of France?",
  "retrieved_contexts": [
    "Paris is the capital and largest city of France."
  ]
}
```

```python
# Local synchronous run; this metric needs only user_input and retrieved_contexts.
metric = ContextRelevanceMetric(judge_model=judge_model)

result = evaluator.run(
    metric=metric,
    dataset=[
        {
            "user_input": "What is the capital of France?",
            "retrieved_contexts": ["Paris is the capital and largest city of France."],
        }
    ],
    config=RunConfig(parallelism=8),
)

for score in result.aggregate_scores.scores:
    print(f"{score.name}: mean={score.mean}")
```

```python
# Durable remote job with the same metric and rows.
metric = ContextRelevanceMetric(judge_model=judge_model)

job = evaluator.submit(
    metric=metric,
    dataset=[
        {
            "user_input": "What is the capital of France?",
            "retrieved_contexts": ["Paris is the capital and largest city of France."],
        }
    ],
    config=RunConfig(parallelism=8),
)
job.wait_until_done()
result = job.get_result()
```

***

## Context Entity Recall

Measures how many important entities from the reference are present in the retrieved contexts.

* Score name: `context_entity_recall`
* Score range: 0 to 1, with higher scores indicating better entity recall.
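The score can be read as a set ratio: entities found in both the reference and the contexts, divided by entities in the reference. In the real metric the judge LLM extracts the entities; the sets below are hand-written, and `entity_recall` is a hypothetical helper shown only to make the ratio concrete.

```python
def entity_recall(reference_entities, context_entities):
    """Share of reference entities that also appear in the retrieved contexts."""
    if not reference_entities:
        return float("nan")  # no entities to recall
    return len(reference_entities & context_entities) / len(reference_entities)


reference_entities = {"Paris", "France"}
print(entity_recall(reference_entities, {"Paris", "France", "Seine"}))  # 1.0
print(entity_recall(reference_entities, {"Paris"}))  # 0.5
```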

### Data Format

```json
{
  "retrieved_contexts": [
    "Paris is the capital and largest city of France."
  ],
  "reference": "Paris is the capital of France."
}
```

```python
# Local synchronous run; this metric needs only retrieved_contexts and reference.
metric = ContextEntityRecallMetric(judge_model=judge_model)

result = evaluator.run(
    metric=metric,
    dataset=[
        {
            "retrieved_contexts": ["Paris is the capital and largest city of France."],
            "reference": "Paris is the capital of France.",
        }
    ],
    config=RunConfig(parallelism=8),
)

for score in result.aggregate_scores.scores:
    print(f"{score.name}: mean={score.mean}")
```

```python
# Durable remote job with the same metric and rows.
metric = ContextEntityRecallMetric(judge_model=judge_model)

job = evaluator.submit(
    metric=metric,
    dataset=[
        {
            "retrieved_contexts": ["Paris is the capital and largest city of France."],
            "reference": "Paris is the capital of France.",
        }
    ],
    config=RunConfig(parallelism=8),
)
job.wait_until_done()
result = job.get_result()
```

***

## Faithfulness

Measures factual consistency of the response with the retrieved context.

* Score name: `faithfulness`
* Score range: 0 to 1, with higher scores indicating the response is more faithful to the context.

### Data Format

```json
{
  "user_input": "What is the capital of France?",
  "retrieved_contexts": [
    "Paris is the capital and largest city of France."
  ],
  "response": "The capital of France is Paris."
}
```

```python
# Offline, local run: responses come from the dataset.
metric = FaithfulnessMetric(judge_model=judge_model)

result = evaluator.run(metric=metric, dataset=offline_rows, config=RunConfig(parallelism=8))

for score in result.aggregate_scores.scores:
    print(f"{score.name}: mean={score.mean}")
```

```python
# Online, local run: generation_model produces each response before scoring.
metric = FaithfulnessMetric(judge_model=judge_model)

result = evaluator.run(
    metric=metric,
    dataset=online_dataset,
    config=online_config,
    target=generation_model,
    prompt_template=online_prompt_template,
)

for score in result.aggregate_scores.scores:
    print(f"{score.name}: mean={score.mean}")
```

```python
# Online, remote job: same arguments, submitted for durable execution.
metric = FaithfulnessMetric(judge_model=judge_model)

job = evaluator.submit(
    metric=metric,
    dataset=online_dataset,
    config=online_config,
    target=generation_model,
    prompt_template=online_prompt_template,
)
job.wait_until_done()
result = job.get_result()
```

***

## Response Groundedness

Evaluates whether the response is grounded in the retrieved context without hallucinations.

* Score name: `response_groundedness`
* Score range: 0 to 1, with higher scores indicating stronger grounding.

### Data Format

```json
{
  "retrieved_contexts": [
    "Paris is the capital and largest city of France."
  ],
  "response": "The capital of France is Paris."
}
```

```python
# Offline, local run: responses come from the dataset.
metric = ResponseGroundednessMetric(judge_model=judge_model)

result = evaluator.run(metric=metric, dataset=offline_rows, config=RunConfig(parallelism=8))

for score in result.aggregate_scores.scores:
    print(f"{score.name}: mean={score.mean}")
```

```python
# Online, local run: generation_model produces each response before scoring.
metric = ResponseGroundednessMetric(judge_model=judge_model)

result = evaluator.run(
    metric=metric,
    dataset=online_dataset,
    config=online_config,
    target=generation_model,
    prompt_template=online_prompt_template,
)

for score in result.aggregate_scores.scores:
    print(f"{score.name}: mean={score.mean}")
```

```python
# Online, remote job: same arguments, submitted for durable execution.
metric = ResponseGroundednessMetric(judge_model=judge_model)

job = evaluator.submit(
    metric=metric,
    dataset=online_dataset,
    config=online_config,
    target=generation_model,
    prompt_template=online_prompt_template,
)
job.wait_until_done()
result = job.get_result()
```

***

## Noise Sensitivity

Measures robustness when retrieved contexts contain noisy or irrelevant information.

* Score name: `noise_sensitivity`
* Score range: 0 to 1, with lower scores indicating the response is less affected by noise. Unlike the other metrics on this page, lower is better.

### Data Format

```json
{
  "user_input": "What is the capital of France?",
  "retrieved_contexts": [
    "Paris is the capital and largest city of France.",
    "Berlin is the capital of Germany."
  ],
  "response": "The capital of France is Paris.",
  "reference": "Paris is the capital of France."
}
```

```python
# Local synchronous run; the second context is deliberate noise.
metric = NoiseSensitivityMetric(judge_model=judge_model)

result = evaluator.run(
    metric=metric,
    dataset=[
        {
            "user_input": "What is the capital of France?",
            "retrieved_contexts": [
                "Paris is the capital and largest city of France.",
                "Berlin is the capital of Germany.",
            ],
            "response": "The capital of France is Paris.",
            "reference": "Paris is the capital of France.",
        }
    ],
    config=RunConfig(parallelism=8),
)

for score in result.aggregate_scores.scores:
    print(f"{score.name}: mean={score.mean}")
```

```python
# Durable remote job with the same metric and rows.
metric = NoiseSensitivityMetric(judge_model=judge_model)

job = evaluator.submit(
    metric=metric,
    dataset=[
        {
            "user_input": "What is the capital of France?",
            "retrieved_contexts": [
                "Paris is the capital and largest city of France.",
                "Berlin is the capital of Germany.",
            ],
            "response": "The capital of France is Paris.",
            "reference": "Paris is the capital of France.",
        }
    ],
    config=RunConfig(parallelism=8),
)
job.wait_until_done()
result = job.get_result()
```

***

## Response Relevancy

Measures how relevant a response is to the user input using generated questions and embedding similarity.

* Score name: `response_relevancy`
* Score range: 0 to 1, with higher scores indicating better relevancy.

### Data Format

```json
{
  "user_input": "What is the capital of France?",
  "retrieved_contexts": [
    "Paris is the capital and largest city of France."
  ],
  "response": "The capital of France is Paris."
}
```

### Configuration Options

| Parameter    | Type | Default | Description                                               |
| ------------ | ---- | ------- | --------------------------------------------------------- |
| `strictness` | int  | `1`     | Number of parallel questions generated. NIM supports `1`. |
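To make the role of the embeddings model concrete, the sketch below assumes the metric follows the RAGAS approach: the judge generates `strictness` questions from the response, and the score is the mean cosine similarity between each generated question's embedding and the embedding of `user_input`. The two-dimensional vectors are toy values, and both helpers are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def mean_question_similarity(user_input_embedding, generated_question_embeddings):
    """Average similarity between the user input and each generated question."""
    similarities = [
        cosine_similarity(user_input_embedding, embedding)
        for embedding in generated_question_embeddings
    ]
    return sum(similarities) / len(similarities)


user_input_embedding = [1.0, 0.0]
# With strictness=1 there is a single generated question to compare against.
print(mean_question_similarity(user_input_embedding, [[1.0, 0.0]]))  # 1.0
```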

```python
# Offline, local run: responses come from the dataset.
metric = ResponseRelevancyMetric(
    judge_model=judge_model,
    embeddings_model=embeddings_model,
    strictness=1,
)

result = evaluator.run(metric=metric, dataset=offline_rows, config=RunConfig(parallelism=8))

for score in result.aggregate_scores.scores:
    print(f"{score.name}: mean={score.mean}")
```

```python
# Online, local run: generation_model produces each response before scoring.
metric = ResponseRelevancyMetric(
    judge_model=judge_model,
    embeddings_model=embeddings_model,
    strictness=1,
)

result = evaluator.run(
    metric=metric,
    dataset=online_dataset,
    config=online_config,
    target=generation_model,
    prompt_template=online_prompt_template,
)

for score in result.aggregate_scores.scores:
    print(f"{score.name}: mean={score.mean}")
```

```python
# Online, remote job: same arguments, submitted for durable execution.
metric = ResponseRelevancyMetric(
    judge_model=judge_model,
    embeddings_model=embeddings_model,
    strictness=1,
)

job = evaluator.submit(
    metric=metric,
    dataset=online_dataset,
    config=online_config,
    target=generation_model,
    prompt_template=online_prompt_template,
)
job.wait_until_done()
result = job.get_result()
```

***

## Dataset Format

RAGAS metrics use specific column names:

| Field                | Type          | Required     | Description                                                                                                                   |
| -------------------- | ------------- | ------------ | ----------------------------------------------------------------------------------------------------------------------------- |
| `user_input`         | string        | Most metrics | User question or input                                                                                                        |
| `retrieved_contexts` | list\[string] | Yes          | List of context passages                                                                                                      |
| `response`           | string        | Some metrics | Generated answer. Required for offline response-quality metrics; generated as `sample.output_text` for online model requests. |
| `reference`          | string        | Some metrics | Reference answer or ground truth                                                                                              |

Different metrics require different columns. See the Required Columns in [Supported RAGAS Metrics](#supported-ragas-metrics) and the Data Format of each metric for specific requirements.

### Example Dataset

```json
{
  "user_input": "What is the capital of France?",
  "retrieved_contexts": [
    "Paris is the capital and largest city of France."
  ],
  "response": "The capital of France is Paris.",
  "reference": "Paris"
}
```

***

## Response Format

All evaluation responses follow this structure:

```json
{
  "metric": {
    "type": "faithfulness",
    "judge_model": {
      "url": "...",
      "name": "..."
    }
  },
  "aggregate_scores": {
    "scores": [
      {
        "name": "faithfulness",
        "count": 1,
        "mean": 0.95,
        "min": 0.95,
        "max": 0.95,
        "sum": 0.95
      }
    ]
  },
  "row_scores": [
    {
      "row_index": 0,
      "metrics": {
        "faithfulness": [
          {"name": "faithfulness", "value": 0.95}
        ]
      },
      "error": null
    }
  ]
}
```

## Working with Results

```python
# Access aggregate scores
for score in result.aggregate_scores.scores:
    print(f"{score.name}: mean={score.mean}, count={score.count}")

# Access per-row scores
for row in result.row_scores:
    if row.metrics:
        print(f"Row {row.row_index}: {row.metrics}")
    elif row.error:
        print(f"Row {row.row_index} failed: {row.error}")
```
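If you have a result as a plain dictionary shaped like the Response Format above, the per-row scores can be flattened for further analysis. The `flatten_row_scores` helper and the sample dictionary are illustrative; how you obtain such a dictionary from the SDK result object is not shown here.

```python
def flatten_row_scores(result):
    """Split rows into (row_index, score_name, value) records and (row_index, error) failures."""
    records, failures = [], []
    for row in result["row_scores"]:
        if row.get("error"):
            failures.append((row["row_index"], row["error"]))
            continue
        for scores in (row.get("metrics") or {}).values():
            for score in scores:
                records.append((row["row_index"], score["name"], score["value"]))
    return records, failures


# Made-up sample in the documented shape; the error text is a placeholder.
sample_result = {
    "row_scores": [
        {
            "row_index": 0,
            "metrics": {"faithfulness": [{"name": "faithfulness", "value": 0.95}]},
            "error": None,
        },
        {"row_index": 1, "metrics": {}, "error": "judge request failed"},
    ]
}
records, failures = flatten_row_scores(sample_result)
print(records)   # [(0, 'faithfulness', 0.95)]
print(failures)  # [(1, 'judge request failed')]
```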

***

## Managing Secrets for Authenticated Endpoints

Store API keys as secrets for secure authentication:

```python
client.secrets.create(name="judge-api-key", value="<your-judge-key>")
client.secrets.create(name="embedding-api-key", value="<your-embedding-key>")
```

Reference secrets by name in your metric configuration. For local `run` versus remote `submit` behavior, see [Model API Authentication](/documentation/evaluate-models/metrics/model-configuration#model-api-authentication).

```python
judge_model = Model(
    url="https://integrate.api.nvidia.com/v1/chat/completions",
    name="meta/llama-3.1-70b-instruct",
    api_key_secret="judge-api-key",
)
```

***

## Job Management

For durable remote execution, submit the same metric and dataset that you tested locally:

```python
job = evaluator.submit(metric=metric, dataset=dataset)
job.wait_until_done()
artifacts_dir = job.download_artifacts(path="evaluation_artifacts")
print(f"Saved artifacts under {artifacts_dir}")
```

***

## Troubleshooting

### Common Errors

| Error                             | Cause                                                                                                                                                    | Solution                                                                                                                                                           |
| --------------------------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
| `judge_model` is required         | Missing judge LLM config for metric                                                                                                                      | Add `judge_model` to metric configuration                                                                                                                          |
| `embeddings_model` is required    | Using `response_relevancy` without embeddings                                                                                                            | Add `embeddings_model` to metric configuration                                                                                                                     |
| Job stuck in "pending"            | Model endpoint not accessible                                                                                                                            | Verify endpoint URLs and API key secrets. See [Model API Authentication](/documentation/evaluate-models/metrics/model-configuration#model-api-authentication)      |
| Authentication failed             | Invalid or missing API key                                                                                                                               | Check `api_key_secret` for the execution mode. See [Model API Authentication](/documentation/evaluate-models/metrics/model-configuration#model-api-authentication) |
| `nan_count > 0` and `mean = null` | Judge/model call failures, such as auth, endpoint, quota, or timeout. Some RAGAS metrics are known to return `NaN` instead of raising on these failures. | Inspect row-level `error`; verify API key, endpoint, and model access                                                                                              |
| Low faithfulness scores           | Context doesn't support the response                                                                                                                     | Improve retrieval or response generation                                                                                                                           |

If you see `nan_count > 0` with `mean = null`, validate judge model authentication first: for some RAGAS metrics, auth failures are converted to `NaN` scores instead of surfacing as a hard error.
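Because these failures can pass silently, it helps to check aggregate scores before trusting them. The sketch below works on a plain dictionary with the `scores` shape shown in the example results; `scores_with_nans` is a hypothetical helper and the sample values are made up.

```python
def scores_with_nans(aggregate_scores):
    """Return the names of scores that have NaN rows or no usable mean."""
    flagged = []
    for score in aggregate_scores["scores"]:
        if score.get("nan_count", 0) > 0 or score.get("mean") is None:
            flagged.append(score["name"])
    return flagged


aggregate_scores = {
    "scores": [
        {"name": "faithfulness", "count": 0, "nan_count": 2, "mean": None},
        {"name": "context_recall", "count": 1, "nan_count": 0, "mean": 1.0},
    ]
}
print(scores_with_nans(aggregate_scores))  # ['faithfulness']
```

Any flagged score is a cue to inspect the row-level `error` fields and the judge model's credentials before rerunning.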

### Tips for Better Results

* **Use larger judge models** (70B+) for more consistent scoring.
* **Start with inline datasets** to test your configuration before large evaluations.
* **Set appropriate timeouts** - judge LLM calls can take time with large contexts.
* **Use parallelism wisely** - increase `parallelism` for faster evaluation, but respect rate limits.
* **Column names matter** - RAGAS metrics use `user_input`, `retrieved_contexts`, `response`, and `reference`.

***

## Important Notes

1. **Secret Management**: Reference API keys through `api_key_secret`. Its behavior differs between local `run` and remote `submit`; see [Model API Authentication](/documentation/evaluate-models/metrics/model-configuration#model-api-authentication). Never pass API keys directly in the request.
2. **Column Names**: RAGAS metrics use specific column names:
   * `user_input` (not `question`)
   * `response` (not `answer`)
   * `retrieved_contexts` (not `contexts`)
   * `reference` (not `ground_truth`)
3. **Embeddings Model**: Only `response_relevancy` requires an embeddings model. All other metrics use only the judge LLM.
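If an existing dataset uses the older names listed above, a small rename step brings it into the expected shape. `COLUMN_RENAMES` and `to_ragas_columns` are hypothetical names for this sketch, not SDK functions.

```python
# Legacy column name -> column name expected by the RAGAS metrics.
COLUMN_RENAMES = {
    "question": "user_input",
    "answer": "response",
    "contexts": "retrieved_contexts",
    "ground_truth": "reference",
}


def to_ragas_columns(row):
    """Rename legacy keys; keys that are already correct pass through unchanged."""
    return {COLUMN_RENAMES.get(key, key): value for key, value in row.items()}


legacy_row = {
    "question": "What is the capital of France?",
    "contexts": ["Paris is the capital and largest city of France."],
    "answer": "The capital of France is Paris.",
    "ground_truth": "Paris is the capital of France.",
}
print(sorted(to_ragas_columns(legacy_row)))
# ['reference', 'response', 'retrieved_contexts', 'user_input']
```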

## Limitations

1. **Judge Model Quality**: Evaluation quality depends on the judge model's ability to follow instructions. Larger models (70B+) typically produce more consistent results.

2. **Dataset Format**: RAGAS metrics use specific column names (`user_input`, `retrieved_contexts`, `response`, `reference`). Ensure your data matches this structure.

## Related Topics

* [LLM-as-a-Judge](/documentation/evaluate-models/metrics/llm-as-a-judge) - Custom judge-based evaluation
* [Agentic Metrics](/documentation/evaluate-models/metrics/agentic-metrics) - Evaluate agent workflows