# Similarity Metrics

<a id="eval-metrics-similarity" />

NeMo Platform offers [built-in metrics](/documentation/evaluate-models/metrics) that you can configure to evaluate your own data. Similarity metrics compare generated or precomputed text against references, labels, or expected numeric or string values. They support Jinja templates, so you can map your dataset columns to the values each metric evaluates.

Templates let you evaluate your models on proprietary, domain-specific, or novel tasks. You can bring your own datasets, define your own prompts and templates using [Jinja](https://github.com/pallets/jinja), and select the metrics that matter most for your use case. This approach is a good fit when:

* You want to evaluate on tasks, data, or formats not covered by industry benchmarks or built-in metrics.
* You need to measure model performance using custom or business-specific criteria.
* You want to experiment with new evaluation methodologies, metrics, or workflows.
* You need to create custom prompts and templates for specific use cases.

## Setup

```python
import os

from nemo_evaluator_sdk import Evaluator
from nemo_platform import NeMoPlatform

sdk = NeMoPlatform(
    base_url=os.environ.get("NMP_BASE_URL", "http://localhost:8080"),
    workspace="default",
)
evaluator: Evaluator = sdk.evaluator  # this object is an Evaluator resource
```

Use `evaluator.run(metric=metric, dataset=dataset)` for a local synchronous evaluation. Use `evaluator.submit(metric=metric, dataset=dataset)` when you need a durable remote job:

```python
job = evaluator.submit(metric=metric, dataset=dataset)
job.wait_until_done()
result = job.get_result()
```

## Template Variables

All similarity metrics support Jinja templating with these variables and filters:

* `{{item}}` - Access dataset columns (e.g., `{{item.question}}`, `{{item.answer}}`)
* `{{sample.output_text}}` - The model's generated output for online runs
* Jinja filters: `lower`, `upper`, `trim`, `replace`, etc.

Use Jinja filters to normalize text before comparison:

```python
from nemo_evaluator_sdk import ExactMatchMetric

metric = ExactMatchMetric(
    reference="{{item.expected | lower | trim}}",
    candidate="{{item.output | lower | trim}}",
)
```
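
For intuition about what the filter chain does, the sketch below mimics `lower | trim` in plain Python, without the SDK or Jinja. `FILTERS` and `apply_filters` are hypothetical helpers written for this illustration, and the mapping to `str` methods is an assumed equivalent of the built-in Jinja filters, not the platform's rendering code.

```python
# Plain-Python stand-ins for a few built-in Jinja filters (assumed equivalents).
FILTERS = {
    "lower": str.lower,
    "upper": str.upper,
    "trim": str.strip,
}


def apply_filters(value: str, chain: str) -> str:
    """Apply a pipe-separated filter chain such as "lower | trim"."""
    for name in (part.strip() for part in chain.split("|")):
        value = FILTERS[name](value)
    return value


raw_reference = "London"
raw_candidate = " london "

normalized_reference = apply_filters(raw_reference, "lower | trim")
normalized_candidate = apply_filters(raw_candidate, "lower | trim")

# Without normalization the strings differ; with it they compare equal.
print(raw_reference == raw_candidate)
print(normalized_reference == normalized_candidate)
```

Applying the same chain to both `reference` and `candidate` keeps the comparison symmetric, so casing or stray whitespace on either side does not cause a mismatch.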

## BLEU Metric

BLEU (Bilingual Evaluation Understudy) measures the similarity between machine-generated text and reference translations by comparing n-gram overlap. It's commonly used for evaluating machine translation and text generation tasks.

**Use BLEU when:**

* Evaluating machine translation quality
* Measuring text generation similarity to references
* Scoring a candidate against multiple reference texts

**Metric Output:** A score between 0 and 100, where 100 indicates a perfect match with the references.

```python
from nemo_evaluator_sdk import BLEUMetric

metric = BLEUMetric(
    references=["{{item.reference_1}}", "{{item.reference_2}}"],
    candidate="{{item.model_output}}",
)

result = evaluator.run(
    metric=metric,
    dataset=[
        {
            "reference_1": "The cat sits on the mat.",
            "reference_2": "A cat is sitting on the mat.",
            "model_output": "The cat is on the mat.",
        },
        {
            "reference_1": "Hello world!",
            "reference_2": "Hi world!",
            "model_output": "Hello world!",
        },
    ],
)

for score in result.aggregate_scores.scores:
    print(f"{score.name}: mean={score.mean}")
```

```python
from nemo_evaluator_sdk import BLEUMetric

metric = BLEUMetric(
    references=["{{item.reference_1}}", "{{item.reference_2}}"],
    candidate="{{item.model_output}}",
    description="BLEU score for translation quality",
)

job = evaluator.submit(
    metric=metric,
    dataset=[
        {
            "reference_1": "The cat sits on the mat.",
            "reference_2": "A cat is sitting on the mat.",
            "model_output": "The cat is on the mat.",
        },
        {
            "reference_1": "Hello world!",
            "reference_2": "Hi world!",
            "model_output": "Hello world!",
        },
    ],
)
job.wait_until_done()
result = job.get_result()
```

```json
{
  "scores": [
    {
      "name": "sentence",
      "count": 2,
      "mean": 76.86,
      "min": 53.73,
      "max": 100.0
    },
    {
      "name": "corpus",
      "count": 1,
      "mean": 53.895
    }
  ]
}
```
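
To make "n-gram overlap" concrete, here is a minimal sketch of the clipped (modified) n-gram precision that BLEU builds on. It assumes lowercased whitespace tokenization and is not the platform's implementation: full BLEU also combines several n-gram orders with a geometric mean and applies a brevity penalty, so these numbers are not expected to reproduce the scores above. `ngrams` and `clipped_precision` are hypothetical helper names.

```python
from collections import Counter


def ngrams(tokens: list[str], n: int) -> Counter:
    """Count the n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def clipped_precision(candidate: str, references: list[str], n: int) -> float:
    """Modified n-gram precision: each candidate n-gram is credited at most
    as many times as it appears in any single reference."""
    cand_counts = ngrams(candidate.lower().split(), n)
    if not cand_counts:
        return 0.0
    max_ref: Counter = Counter()
    for ref in references:
        for gram, count in ngrams(ref.lower().split(), n).items():
            max_ref[gram] = max(max_ref[gram], count)
    matched = sum(min(count, max_ref[gram]) for gram, count in cand_counts.items())
    return matched / sum(cand_counts.values())


refs = ["The cat sits on the mat.", "A cat is sitting on the mat."]
cand = "The cat is on the mat."

p1 = clipped_precision(cand, refs, 1)  # unigram precision
p2 = clipped_precision(cand, refs, 2)  # bigram precision
print(p1, p2)
```

Under these assumptions, every candidate unigram is covered by one of the two references, while the bigram "is on" appears in neither, which is why supplying more than one reference can raise the score for a valid paraphrase.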

## Exact Match Metric

Exact Match compares the candidate text with the reference text for perfect equality. This metric returns 1 if the strings match exactly and 0 otherwise.

**Use Exact Match when:**

* Evaluating classification tasks with discrete labels
* Checking for exact answer correctness
* Validating structured output formats

**Metric Output:** Binary score (0 or 1).

```python
from nemo_evaluator_sdk import ExactMatchMetric

metric = ExactMatchMetric(
    reference="{{item.correct_answer | lower | trim}}",
    candidate="{{item.model_answer | lower | trim}}",
    description="Exact match for question answering",
)

result = evaluator.run(
    metric=metric,
    dataset=[
        {"correct_answer": "Paris", "model_answer": "Paris"},
        {"correct_answer": "London", "model_answer": "london "},
        {"correct_answer": "Berlin", "model_answer": "Munich"},
    ],
)

for score in result.aggregate_scores.scores:
    print(f"{score.name}: mean={score.mean}")
```

```python
from nemo_evaluator_sdk import ExactMatchMetric

metric = ExactMatchMetric(
    reference="{{item.correct_answer | lower | trim}}",
    candidate="{{item.model_answer | lower | trim}}",
)

job = evaluator.submit(
    metric=metric,
    dataset=[
        {"correct_answer": "Paris", "model_answer": "Paris"},
        {"correct_answer": "London", "model_answer": "london "},
        {"correct_answer": "Berlin", "model_answer": "Munich"},
    ],
)
job.wait_until_done()
result = job.get_result()
```

```json
{
  "scores": [
    {
      "name": "exact-match",
      "count": 3,
      "mean": 0.667,
      "min": 0.0,
      "max": 1.0
    }
  ]
}
```
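
As a rough check on the aggregate above, the following sketch scores the same three rows by hand and summarizes them. It assumes that `lower | trim` behaves like `str.lower().strip()` and that the aggregate is a plain mean, minimum, and maximum over row scores. `exact_match` is a hypothetical helper, not an SDK function.

```python
rows = [
    {"correct_answer": "Paris", "model_answer": "Paris"},
    {"correct_answer": "London", "model_answer": "london "},
    {"correct_answer": "Berlin", "model_answer": "Munich"},
]


def exact_match(reference: str, candidate: str) -> float:
    """Return 1.0 when the normalized strings are identical, else 0.0."""
    return float(reference.lower().strip() == candidate.lower().strip())


scores = [exact_match(r["correct_answer"], r["model_answer"]) for r in rows]

# Summarize row scores the way the aggregate output is laid out.
summary = {
    "count": len(scores),
    "mean": round(sum(scores) / len(scores), 3),
    "min": min(scores),
    "max": max(scores),
}
print(summary)
```

The second row only matches because both sides are normalized; without the filters, `"London"` and `"london "` would score 0.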

## F1 Metric

F1 measures token-level overlap between candidate and reference text. It balances precision and recall, making it useful when there are multiple acceptable ways to phrase a response.

**Use F1 when:**

* Evaluating extractive question answering
* Comparing short free-form answers
* Measuring partial matches where exact match is too strict

**Metric Output:** A score between 0 and 1.

```python
from nemo_evaluator_sdk import F1Metric

metric = F1Metric(
    reference="{{item.reference}}",
    candidate="{{item.answer}}",
)

result = evaluator.run(
    metric=metric,
    dataset=[
        {
            "reference": "the capital of France is Paris",
            "answer": "Paris is the capital of France",
        },
        {"reference": "a red apple", "answer": "red apple"},
    ],
)

for score in result.aggregate_scores.scores:
    print(f"{score.name}: mean={score.mean}")
```

```python
from nemo_evaluator_sdk import F1Metric

metric = F1Metric(
    reference="{{item.reference}}",
    candidate="{{item.answer}}",
)

job = evaluator.submit(
    metric=metric,
    dataset=[
        {
            "reference": "the capital of France is Paris",
            "answer": "Paris is the capital of France",
        },
        {"reference": "a red apple", "answer": "red apple"},
    ],
)
job.wait_until_done()
result = job.get_result()
```

```json
{
  "scores": [
    {
      "name": "f1",
      "count": 2,
      "mean": 0.75,
      "min": 0.5,
      "max": 1.0
    }
  ]
}
```
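
The sketch below illustrates how a token-level F1 is typically computed from precision and recall. It assumes lowercased whitespace tokens with no further normalization; the platform's tokenizer and normalization rules may differ, so treat `token_f1` as a hypothetical illustration. The example strings are made up for this sketch and are not taken from the dataset above.

```python
from collections import Counter


def token_f1(reference: str, candidate: str) -> float:
    """Harmonic mean of token precision and recall (bag-of-tokens overlap)."""
    ref_tokens = reference.lower().split()
    cand_tokens = candidate.lower().split()
    # Multiset intersection counts each shared token at most min(count) times.
    overlap = sum((Counter(ref_tokens) & Counter(cand_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(cand_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


# Same tokens in a different order: overlap is complete.
reordered = token_f1("the capital of France is Paris", "Paris is the capital of France")

# Two of three candidate tokens match, and two of four reference tokens are covered.
partial = token_f1("the quick brown fox", "quick brown dog")
print(reordered, round(partial, 3))
```

Because the overlap ignores token order, a reworded answer that uses the same words still scores 1.0, while a partially correct answer receives partial credit instead of the 0 that Exact Match would give.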

## Number Check Metric

Number Check performs numerical comparisons and operations on extracted values. It supports equality, inequality, comparison operators, and absolute difference calculations.

**Use Number Check when:**

* Validating numerical outputs (calculations, counts, scores)
* Checking value ranges or thresholds
* Comparing predicted vs expected numbers

**Metric Output:** 1 if the condition is true, 0 otherwise. If either value cannot be parsed as a number, the row score is `NaN`.

### Supported Operations

* Equality: `"equals"`, `"=="`
* Inequality: `"!="`, `"<>"`, `"not equals"`
* Comparisons: `">"`, `"gt"`, `">="`, `"gte"`, `"<"`, `"lt"`, `"<="`, `"lte"`
* Absolute difference: `"absolute difference"` (requires the `epsilon` parameter)

```python
from nemo_evaluator_sdk import NumberCheckMetric

metric = NumberCheckMetric(
    operation="absolute difference",
    epsilon=0.5,
    left_template="{{item.expected}}",
    right_template="{{item.predicted}}",
    description="Check if values match within tolerance",
)

result = evaluator.run(
    metric=metric,
    dataset=[
        {"expected": "100", "predicted": "100"},
        {"expected": "42.5", "predicted": "42.3"},
        {"expected": "99", "predicted": "101"},
    ],
)

for score in result.aggregate_scores.scores:
    print(f"{score.name}: mean={score.mean}")
```

```python
from nemo_evaluator_sdk import NumberCheckMetric

metric = NumberCheckMetric(
    operation=">",
    left_template="{{item.predicted}}",
    right_template="0.5",
    description="Score must be greater than 0.5",
)

job = evaluator.submit(
    metric=metric,
    dataset=[
        {"predicted": "1"},
        {"predicted": "0.75"},
        {"predicted": "0.5"},
        {"predicted": "0.1"},
    ],
)
job.wait_until_done()
result = job.get_result()
```

The following output corresponds to the absolute-difference example, where two of the three rows are within `epsilon`:

```json
{
  "scores": [
    {
      "name": "number-check",
      "count": 3,
      "mean": 0.6667,
      "min": 0.0,
      "max": 1.0
    }
  ]
}
```
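
The operation names listed above can be read as a lookup from name to comparison. The sketch below models that in plain Python under two assumptions that the source does not confirm: the absolute-difference check passes when `|left - right| <= epsilon`, and values are parsed with `float()`. `COMPARATORS` and `number_check` are hypothetical names.

```python
import math
import operator
from typing import Optional

# Operation aliases taken from the "Supported Operations" list.
COMPARATORS = {
    "equals": operator.eq, "==": operator.eq,
    "!=": operator.ne, "<>": operator.ne, "not equals": operator.ne,
    ">": operator.gt, "gt": operator.gt,
    ">=": operator.ge, "gte": operator.ge,
    "<": operator.lt, "lt": operator.lt,
    "<=": operator.le, "lte": operator.le,
}


def number_check(left: str, right: str, operation: str,
                 epsilon: Optional[float] = None) -> float:
    """Return 1.0 or 0.0 for the condition, or NaN if a value is not numeric."""
    try:
        a, b = float(left), float(right)
    except ValueError:
        return math.nan
    if operation == "absolute difference":
        if epsilon is None:
            raise ValueError("epsilon is required for 'absolute difference'")
        # Assumption: the tolerance bound is inclusive.
        return float(abs(a - b) <= epsilon)
    return float(COMPARATORS[operation](a, b))


abs_diff_scores = [
    number_check(expected, predicted, "absolute difference", epsilon=0.5)
    for expected, predicted in [("100", "100"), ("42.5", "42.3"), ("99", "101")]
]
gt_scores = [number_check(p, "0.5", ">") for p in ["1", "0.75", "0.5", "0.1"]]
print(abs_diff_scores, gt_scores)
```

In this sketch the tolerance example yields two passing rows out of three, and the threshold example yields two out of four, because `0.5 > 0.5` is false. Use `">="` if the boundary value should pass.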

## ROUGE Metric

ROUGE (Recall-Oriented Understudy for Gisting Evaluation) measures overlap between generated text and reference text. It is commonly used for summarization and long-form generation quality checks.

**Use ROUGE when:**

* Evaluating summarization quality
* Measuring overlap with reference passages
* Comparing generated text against longer expected answers

**Metric Output:** ROUGE-1, ROUGE-2, ROUGE-3, and ROUGE-L F1 scores between 0 and 1.

```python
from nemo_evaluator_sdk import ROUGEMetric

metric = ROUGEMetric(
    reference="{{item.reference_summary}}",
    candidate="{{item.model_summary}}",
)

result = evaluator.run(
    metric=metric,
    dataset=[
        {
            "reference_summary": "The cat sat on the mat and looked out the window.",
            "model_summary": "A cat sat on a mat near the window.",
        },
        {
            "reference_summary": "The launch was postponed because of high winds.",
            "model_summary": "High winds delayed the launch.",
        },
    ],
)

for score in result.aggregate_scores.scores:
    print(f"{score.name}: mean={score.mean}")
```

```python
from nemo_evaluator_sdk import ROUGEMetric

metric = ROUGEMetric(
    reference="{{item.reference_summary}}",
    candidate="{{item.model_summary}}",
)

job = evaluator.submit(
    metric=metric,
    dataset=[
        {
            "reference_summary": "The cat sat on the mat and looked out the window.",
            "model_summary": "A cat sat on a mat near the window.",
        },
        {
            "reference_summary": "The launch was postponed because of high winds.",
            "model_summary": "High winds delayed the launch.",
        },
    ],
)
job.wait_until_done()
result = job.get_result()
```

```json
{
  "scores": [
    {
      "name": "rouge_1_score",
      "count": 2,
      "mean": 0.72
    },
    {
      "name": "rouge_2_score",
      "count": 2,
      "mean": 0.43
    },
    {
      "name": "rouge_3_score",
      "count": 2,
      "mean": 0.31
    },
    {
      "name": "rouge_L_score",
      "count": 2,
      "mean": 0.67
    }
  ]
}
```
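
To show how the ROUGE variants differ, here is a minimal sketch of ROUGE-N (n-gram overlap) and ROUGE-L (longest common subsequence) F1 scores. It assumes lowercased whitespace tokens with no stemming, so it is not expected to reproduce the platform's numbers. The helper names are hypothetical, and the example strings omit punctuation to keep the tokens clean.

```python
from collections import Counter


def _f1(overlap: int, cand_total: int, ref_total: int) -> float:
    """Combine overlap counts into an F1 score."""
    if overlap == 0:
        return 0.0
    precision, recall = overlap / cand_total, overlap / ref_total
    return 2 * precision * recall / (precision + recall)


def rouge_n_f1(reference: str, candidate: str, n: int) -> float:
    """F1 over shared n-grams (order matters only within each n-gram)."""
    ref, cand = reference.lower().split(), candidate.lower().split()
    ref_grams = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
    cand_grams = Counter(tuple(cand[i:i + n]) for i in range(len(cand) - n + 1))
    overlap = sum((ref_grams & cand_grams).values())
    return _f1(overlap, sum(cand_grams.values()), sum(ref_grams.values()))


def lcs_length(a: list[str], b: list[str]) -> int:
    """Length of the longest common subsequence, via row-by-row DP."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(reference: str, candidate: str) -> float:
    """F1 based on the longest in-order token subsequence."""
    ref, cand = reference.lower().split(), candidate.lower().split()
    return _f1(lcs_length(ref, cand), len(cand), len(ref))


reference = "the launch was postponed because of high winds"
candidate = "high winds delayed the launch"

r1 = rouge_n_f1(reference, candidate, 1)
r2 = rouge_n_f1(reference, candidate, 2)
rl = rouge_l_f1(reference, candidate)
print(round(r1, 3), round(r2, 3), round(rl, 3))
```

In this example the candidate reuses four reference words but reverses the order of the two phrases, so the unigram score is the highest, the bigram score is lower, and ROUGE-L is lowest because only one phrase can be matched in order.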

## String Check Metric

String Check performs string operations and comparisons. It supports equality, containment, and prefix/suffix checks.

**Use String Check when:**

* Validating text format or structure
* Checking for keyword presence
* Matching a fixed prefix or suffix in generated text
* String-based classification

**Metric Output:** Binary score (1 if condition is true, 0 otherwise).

### Supported Operations

* Equality: `"equals"`, `"=="`
* Inequality: `"!="`, `"<>"`, `"not equals"`
* Containment: `"contains"`, `"not contains"`
* Prefix/suffix: `"startswith"`, `"endswith"`

```python
from nemo_evaluator_sdk import StringCheckMetric

metric = StringCheckMetric(
    operation="contains",
    left_template="{{item.output | trim}}",
    right_template="{{item.must_contain}}",
)

result = evaluator.run(
    metric=metric,
    dataset=[
        {"output": "The answer is: 42", "must_contain": "answer"},
        {"output": "Result: Success", "must_contain": "Success"},
        {"output": "Error occurred", "must_contain": "Success"},
    ],
)

for score in result.aggregate_scores.scores:
    print(f"{score.name}: mean={score.mean}")
```

```python
from nemo_evaluator_sdk import StringCheckMetric

metric = StringCheckMetric(
    operation="startswith",
    left_template="{{item.output}}",
    right_template="Answer:",
    description="Check if output starts with 'Answer:'",
)

job = evaluator.submit(
    metric=metric,
    dataset=[
        {"output": "Answer: 42"},
        {"output": "Answer: Success"},
        {"output": "Error occurred"},
    ],
)
job.wait_until_done()
result = job.get_result()
```

```json
{
  "scores": [
    {
      "name": "string-check",
      "count": 3,
      "mean": 0.667,
      "min": 0.0,
      "max": 1.0
    }
  ]
}
```
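
The operations above can be sketched as a lookup from operation name to a string predicate. This is an illustration under assumptions rather than the SDK's code: comparisons are taken to be case-sensitive, and `"contains"` is read as "the left value contains the right value", which matches how the examples pass the output as `left_template`. `STRING_OPS` and `string_check` are hypothetical names.

```python
from typing import Callable

# Operation aliases taken from the "Supported Operations" list.
STRING_OPS: dict[str, Callable[[str, str], bool]] = {
    "equals": lambda left, right: left == right,
    "==": lambda left, right: left == right,
    "!=": lambda left, right: left != right,
    "<>": lambda left, right: left != right,
    "not equals": lambda left, right: left != right,
    "contains": lambda left, right: right in left,
    "not contains": lambda left, right: right not in left,
    "startswith": lambda left, right: left.startswith(right),
    "endswith": lambda left, right: left.endswith(right),
}


def string_check(left: str, right: str, operation: str) -> float:
    """Return 1.0 if the named condition holds for (left, right), else 0.0."""
    return float(STRING_OPS[operation](left, right))


contains_scores = [
    string_check(output.strip(), needle, "contains")
    for output, needle in [
        ("The answer is: 42", "answer"),
        ("Result: Success", "Success"),
        ("Error occurred", "Success"),
    ]
]
prefix_scores = [
    string_check(output, "Answer:", "startswith")
    for output in ["Answer: 42", "Answer: Success", "Error occurred"]
]
print(contains_scores, prefix_scores)
```

Both example datasets produce two passing rows out of three in this sketch. If casing should not matter, normalize both templates with the `lower` filter before the check.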

<a id="eval-custom-dataset-format" />

## Dataset Format

The examples on this page use inline dataset rows with `dataset=[...]`. Template fields determine the columns required by each metric:

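* `reference`, `references`, `left_template`, and `right_template` read from `item` fields in the dataset. A template can also be a literal value, such as `"0.5"` or `"Answer:"` in the examples on this page.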
* `candidate` reads from an `item` field for offline rows when configured.
* If `candidate` is omitted for BLEU, Exact Match, F1, or ROUGE, the metric uses `sample.output_text`, which is populated during online evaluations.

Keep field names consistent between the dataset rows and the templates you configure. For example, `{{item.expected}}` requires each row to include an `expected` field.
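
One way to catch mismatched field names before running an evaluation is to extract the `item.<field>` references from your templates and compare them with each row's keys. The sketch below does this with a regular expression. `required_fields` and `missing_fields` are hypothetical helpers, and the pattern only handles simple `{{item.name ...}}` references like the ones used on this page.

```python
import re

# Matches the field name in simple references such as {{item.expected | lower}}.
ITEM_FIELD = re.compile(r"\{\{\s*item\.([A-Za-z_][A-Za-z0-9_]*)")


def required_fields(*templates: str) -> set[str]:
    """Collect every dataset column referenced as item.<name> in the templates."""
    return {name for template in templates for name in ITEM_FIELD.findall(template)}


def missing_fields(rows: list[dict], *templates: str) -> dict[int, list[str]]:
    """Map row index to the referenced fields that the row does not provide."""
    needed = required_fields(*templates)
    report = {}
    for index, row in enumerate(rows):
        absent = sorted(needed - set(row))
        if absent:
            report[index] = absent
    return report


templates = ("{{item.expected | lower | trim}}", "{{ item.output }}", "0.5")
rows = [
    {"expected": "a", "output": "b"},
    {"expected": "a"},  # missing the "output" column
]

print(required_fields(*templates))
print(missing_fields(rows, *templates))
```

Literal templates such as `"0.5"` contribute no required fields, so only rows that lack a referenced column are reported.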

## Related Topics

* [Metric Overview](/documentation/evaluate-models/metrics)
* [Model Configuration](/documentation/evaluate-models/metrics/model-configuration)
* [RAG Evaluation Metrics](/documentation/evaluate-models/metrics/rag-metrics)