RAG Evaluation Metrics
Retrieval-Augmented Generation (RAG) metrics evaluate the quality of RAG pipelines by measuring both retrieval and answer generation performance. These metrics use RAGAS to assess how well retrieved contexts support generated answers.
Overview
RAGAS metrics support both offline and online evaluation modes:
- Offline evaluation: Uses pre-generated responses from your dataset
- Online evaluation: Responses are generated automatically using a model and prompt template before evaluation
  - Job’s model and prompt_template are used to generate responses
  - Generated response (in sample["output_text"]) is automatically used as response in RAGAS evaluation
  - RAG context variables can be included in the job’s prompt_template:
    - {{user_input}} - User question/input from dataset
    - {{retrieved_contexts}} - Retrieved context passages from dataset
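To make the placeholder substitution concrete, here is a minimal sketch of how a prompt_template containing {{user_input}} and {{retrieved_contexts}} might be filled from a dataset row. The render_prompt helper, the plain string replacement, and the way multiple contexts are joined are all assumptions for illustration; the template engine actually used by the evaluation job is not described here.

```python
def render_prompt(template: str, row: dict) -> str:
    """Fill {{user_input}} and {{retrieved_contexts}} from a dataset row.

    Hypothetical helper: plain string replacement stands in for whatever
    template engine the evaluation job actually uses.
    """
    # Joining contexts with blank lines is an assumption for readability.
    contexts = "\n\n".join(row["retrieved_contexts"])
    return (
        template
        .replace("{{user_input}}", row["user_input"])
        .replace("{{retrieved_contexts}}", contexts)
    )


template = (
    "Answer using only the context.\n"
    "Context:\n{{retrieved_contexts}}\n"
    "Question: {{user_input}}"
)
row = {
    "user_input": "What is the capital of France?",
    "retrieved_contexts": ["Paris is the capital of France."],
}
prompt = render_prompt(template, row)
print(prompt)
```

In online mode, the text produced from a prompt like this would become the response that the metric scores.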
RAGAS metrics require:
- Judge LLM: An LLM to evaluate answer quality (required for most metrics)
- Judge Embeddings (optional): Required for some metrics like response_relevancy
- Data: Questions, contexts, and either pre-generated responses (offline) or a model to generate them (online)
Prerequisites
Before running RAG evaluations:
- Workspace: Have a workspace created. All resources (metrics, secrets, jobs) are scoped to a workspace.
- Model Endpoints: Access to judge LLM endpoints (and an embeddings model endpoint for some metrics)
- API Keys (if required): Create secrets for any endpoints requiring authentication
- Initialize the SDK:
Use evaluator.run(metric=metric, dataset=dataset) for a local synchronous evaluation. Use evaluator.submit(metric=metric, dataset=dataset) when you need a durable remote job:
Creating a Secret for API Keys
If using external endpoints that require authentication, such as NVIDIA Build endpoints, create a secret first:
Reference secrets by name in your model configuration. For local run versus remote submit behavior, see Model API Authentication.
RAGAS metrics accept inline model definitions for judge_model and, where required, embeddings_model.
See Model Configuration for details.
Supported RAGAS Metrics
* Required Columns: Dataset columns that must be present for the metric to be evaluated.
** Requires embeddings_model in addition to judge_model.
Shared Example Setup
The metric examples below use these inline values:
For local run versus remote submit behavior of api_key_secret, see Model API Authentication.
Use offline rows when your RAG pipeline has already produced responses:
Use online arguments when the evaluator should generate the response first:
Context Recall
Measures the fraction of relevant content retrieved compared to the total relevant content in the reference.
- Score name: context_recall
- Score range: 0 to 1, with higher scores indicating better recall.
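As a rough sketch of the arithmetic, context recall can be read as the share of reference claims that the retrieved contexts support. In practice the judge LLM decides, claim by claim, whether each reference statement is attributable to the contexts. The function below assumes those verdicts are already available as booleans, and its name is illustrative only.

```python
def context_recall(claim_supported: list[bool]) -> float:
    """Fraction of reference claims that the retrieved contexts support.

    claim_supported holds one verdict per claim in the reference answer
    (True if the claim can be attributed to the retrieved contexts).
    """
    if not claim_supported:
        raise ValueError("the reference must contain at least one claim")
    return sum(claim_supported) / len(claim_supported)


# Three of the four reference claims are covered by the retrieved contexts.
score = context_recall([True, True, False, True])
print(score)
```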
Data Format
Local Evaluation
Remote Job
Result
Context Precision
Measures the proportion of relevant chunks in the retrieved contexts (precision@k).
- Score name: context_precision
- Score range: 0 to 1, with higher scores indicating better precision.
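One common way to turn per-chunk relevance verdicts into a single precision@k-style score is to average precision@k over the ranks that hold a relevant chunk, which rewards placing relevant chunks early. The sketch below assumes the judge has already labeled each retrieved chunk as relevant or not; the function name is hypothetical and the service's exact computation may differ.

```python
def context_precision(relevant: list[bool]) -> float:
    """Mean of precision@k taken at each rank k that holds a relevant chunk.

    relevant[i] is the verdict for the chunk at rank i + 1 in retrieval order.
    """
    hits = 0
    precision_sum = 0.0
    for rank, is_relevant in enumerate(relevant, start=1):
        if is_relevant:
            hits += 1
            precision_sum += hits / rank  # precision@k at this rank
    # With no relevant chunk at all, the score is defined here as 0.0.
    return precision_sum / hits if hits else 0.0


# Relevant chunks at ranks 1 and 3: (1/1 + 2/3) / 2
score = context_precision([True, False, True])
print(round(score, 4))
```

Moving the second relevant chunk from rank 3 to rank 2 would raise the score to 1.0, which is why ordering matters for this metric.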
Data Format
Local Evaluation
Remote Job
Result
Context Relevance
Measures how relevant the retrieved contexts are to the user input.
- Score name: context_relevance
- Score range: 0 to 1, with higher scores indicating better relevance.
Data Format
Local Evaluation
Remote Job
Context Entity Recall
Measures how many important entities from the reference are present in the retrieved contexts.
- Score name: context_entity_recall
- Score range: 0 to 1, with higher scores indicating better entity recall.
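The idea can be sketched as a set overlap: of the entities mentioned in the reference, how many also appear in the retrieved contexts? Entity extraction is normally done by the judge LLM. The example below assumes both entity sets are already extracted, and the function name and sample entities are illustrative.

```python
def context_entity_recall(reference_entities: set[str], context_entities: set[str]) -> float:
    """Share of reference entities that also appear in the retrieved contexts."""
    if not reference_entities:
        raise ValueError("the reference must contain at least one entity")
    found = reference_entities & context_entities
    return len(found) / len(reference_entities)


reference_entities = {"Eiffel Tower", "Paris", "1889"}
context_entities = {"Eiffel Tower", "Paris", "Gustave Eiffel"}
score = context_entity_recall(reference_entities, context_entities)
print(round(score, 2))
```

Entities that appear only in the contexts (here "Gustave Eiffel") do not affect the score, because the denominator counts reference entities only.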
Data Format
Local Evaluation
Remote Job
Faithfulness
Measures factual consistency of the response with the retrieved context.
- Score name: faithfulness
- Score range: 0 to 1, with higher scores indicating the response is more faithful to the context.
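A simplified view of the score is the fraction of claims in the response that the retrieved context supports. The sketch below assumes the judge LLM has already split the response into claims and produced a verdict for each; the function name and the sample claims are hypothetical.

```python
def faithfulness(verdicts: dict[str, bool]) -> tuple[float, list[str]]:
    """Return the supported-claim ratio and the claims that lack support.

    verdicts maps each claim extracted from the response to whether the
    retrieved context supports it.
    """
    if not verdicts:
        raise ValueError("the response must contain at least one claim")
    unsupported = [claim for claim, ok in verdicts.items() if not ok]
    score = (len(verdicts) - len(unsupported)) / len(verdicts)
    return score, unsupported


verdicts = {
    "Einstein was born in Germany.": True,
    "Einstein was born on 20 March 1879.": False,  # context says 14 March
}
score, unsupported = faithfulness(verdicts)
print(score, unsupported)
```

Keeping the list of unsupported claims alongside the score makes it easier to see why a response scored low.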
Data Format
Local Evaluation
Online Evaluation
Remote Job
Response Groundedness
Evaluates whether the response is grounded in the retrieved context without hallucinations.
- Score name: response_groundedness
- Score range: 0 to 1, with higher scores indicating stronger grounding.
Data Format
Local Evaluation
Online Evaluation
Remote Job
Noise Sensitivity
Measures robustness when retrieved contexts contain noisy or irrelevant information.
- Score name: noise_sensitivity
- Score range: 0 to 1. Lower scores usually indicate the response is less sensitive to noise.
Data Format
Local Evaluation
Remote Job
Response Relevancy
Measures how relevant a response is to the user input using generated questions and embedding similarity.
- Score name: response_relevancy
- Score range: 0 to 1, with higher scores indicating better relevancy.
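The embedding-similarity step can be sketched as follows: the judge LLM generates questions that the response would answer, each question and the original user input are embedded, and the similarities are averaged. The code below assumes the embeddings are already computed and uses toy two-dimensional vectors. The function names are hypothetical, and any further adjustments the library applies are left out.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two non-zero embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def response_relevancy(input_emb: list[float], question_embs: list[list[float]]) -> float:
    """Mean similarity between the user input and the generated questions."""
    sims = [cosine_similarity(input_emb, q) for q in question_embs]
    return sum(sims) / len(sims)


# Toy 2-D embeddings; real embeddings come from the embeddings model.
user_input_emb = [1.0, 0.0]
generated_question_embs = [[1.0, 0.0], [0.6, 0.8]]
score = response_relevancy(user_input_emb, generated_question_embs)
print(round(score, 2))
```

This is why the metric needs an embeddings_model in addition to the judge_model: the judge writes the questions, and the embeddings model compares them with the user input.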
Data Format
Configuration Options
Local Evaluation
Online Evaluation
Remote Job
Dataset Format
RAGAS metrics use specific column names:
Different metrics require different columns. Check the metric documentation for specific requirements.
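Because required columns vary by metric, a quick pre-flight check can catch missing fields before an evaluation starts. The helper below is hypothetical and takes the required column names as an argument. The set used in the example is only illustrative, so substitute the columns listed for the metric you are running.

```python
def missing_columns(rows: list[dict], required: set[str]) -> dict[int, list[str]]:
    """Map row index -> sorted list of required columns absent from that row."""
    problems = {}
    for index, row in enumerate(rows):
        missing = sorted(required - set(row))
        if missing:
            problems[index] = missing
    return problems


rows = [
    {
        "user_input": "Who wrote Hamlet?",
        "retrieved_contexts": ["Hamlet is a play by William Shakespeare."],
        "response": "William Shakespeare",
    },
    {"user_input": "Who wrote Hamlet?", "response": "William Shakespeare"},
]
# Illustrative requirement set, not taken from the metric table.
required = {"user_input", "retrieved_contexts", "response"}
problems = missing_columns(rows, required)
print(problems)
```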
Example Dataset
Response Format
All evaluation responses follow this structure:
Working with Results
Managing Secrets for Authenticated Endpoints
Store API keys as secrets for secure authentication:
Reference secrets by name in your metric configuration. For local run versus remote submit behavior, see Model API Authentication.
Job Management
For durable remote execution, submit the same metric and dataset that you tested locally:
Troubleshooting
Common Errors
If you see nan_count > 0 with mean = null, first validate judge model authentication.
For some RAGAS metrics, auth failures can be converted to NaN scores instead of surfacing as a hard error.
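To see how this symptom can arise, consider a minimal sketch of an aggregation step that skips NaN rows. The nan_count and mean field names come from the text above; the summarize function, the count field, and the aggregation logic are assumptions, not the service's implementation.

```python
import math


def summarize(scores: list[float]) -> dict:
    """Aggregate per-row scores, counting NaN rows separately."""
    valid = [s for s in scores if not math.isnan(s)]
    return {
        "count": len(scores),
        "nan_count": len(scores) - len(valid),
        # With no valid score there is nothing to average, so mean is None
        # (serialized as null in JSON).
        "mean": sum(valid) / len(valid) if valid else None,
    }


all_failed = summarize([float("nan")] * 3)
partial = summarize([0.5, float("nan"), 1.0])
print(all_failed)
print(partial)
```

Under these assumptions, a null mean means every row failed to score, which points to a systemic cause such as authentication. A non-null mean with a small nan_count suggests isolated row-level failures.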
Tips for Better Results
- Use larger judge models (70B+) for more consistent scoring.
- Start with inline datasets to test your configuration before large evaluations.
- Set appropriate timeouts - judge LLM calls can take time with large contexts.
- Use parallelism wisely - increase parallelism for faster evaluation, but respect rate limits.
- Column names matter - RAGAS metrics use user_input, retrieved_contexts, response, and reference.
Important Notes
- Secret Management: API keys should be referenced through api_key_secret, with different local run and remote submit behavior. See Model API Authentication. Never pass API keys directly in the request.
- Column Names: RAGAS metrics use specific column names:
  - user_input (not question)
  - response (not answer)
  - retrieved_contexts (not contexts)
  - reference (not ground_truth)
- Embeddings Model: Only response_relevancy requires an embeddings model. All other metrics use only the judge LLM.
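If an existing dataset uses the older names, the column mapping listed above can be applied with a small conversion step before evaluation. This is a sketch only: the rename_columns helper and the sample row are hypothetical, and the mapping covers just the four names mentioned in the notes.

```python
# Legacy column names mapped to the names the RAGAS metrics expect.
COLUMN_RENAMES = {
    "question": "user_input",
    "answer": "response",
    "contexts": "retrieved_contexts",
    "ground_truth": "reference",
}


def rename_columns(row: dict) -> dict:
    """Return a copy of row with legacy keys renamed; other keys pass through."""
    return {COLUMN_RENAMES.get(key, key): value for key, value in row.items()}


legacy_row = {
    "question": "What is RAG?",
    "answer": "Retrieval-augmented generation.",
    "contexts": ["RAG combines retrieval with generation."],
    "ground_truth": "Retrieval-augmented generation.",
}
row = rename_columns(legacy_row)
print(sorted(row))
```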
Limitations
- Judge Model Quality: Evaluation quality depends on the judge model’s ability to follow instructions. Larger models (70B+) typically produce more consistent results.
- Dataset Format: RAGAS metrics use specific column names (user_input, retrieved_contexts, response, reference). Ensure your data matches this structure.
Related Topics
- LLM-as-a-Judge - Custom judge-based evaluation
- Agentic Metrics - Evaluate agent workflows