Similarity Metrics

NeMo Platform offers built-in metrics that you can configure to evaluate your custom data. Similarity metrics compare generated or precomputed text against references, labels, or numeric/string expectations. They support Jinja templates so you can map your dataset columns to the values each metric evaluates.

Templates let you evaluate your models on proprietary, domain-specific, or novel tasks. You can bring your own datasets, define your own prompts and templates using Jinja, and select the metrics that matter most for your use case. This approach is ideal when:

  • You want to evaluate on tasks, data, or formats not covered by industry benchmarks or built-in metrics.
  • You need to measure model performance using custom or business-specific criteria.
  • You want to experiment with new evaluation methodologies, metrics, or workflows.
  • You need to create custom prompts and templates for specific use cases.

Setup

import os

from nemo_evaluator.sdk import Evaluator
from nemo_platform import NeMoPlatform

sdk = NeMoPlatform(
    base_url=os.environ.get("NMP_BASE_URL", "http://localhost:8080"),
    workspace="default",
)
evaluator: Evaluator = sdk.evaluator  # this object is an Evaluator resource

Use evaluator.run(metric=metric, dataset=dataset) for a local synchronous evaluation. Use evaluator.submit(metric=metric, dataset=dataset) when you need a durable remote job:

job = evaluator.submit(metric=metric, dataset=dataset)
job.wait_until_done()
result = job.get_result()

Template Variables

All similarity metrics support Jinja templating with these variables:

  • {{item}} - Access dataset columns (e.g., {{item.question}}, {{item.answer}})
  • {{sample.output_text}} - The model’s generated output for online runs
  • Jinja filters: lower, upper, trim, replace, etc.

Use Jinja filters to normalize text before comparison:

from nemo_evaluator_sdk import ExactMatchMetric

metric = ExactMatchMetric(
    reference="{{item.expected | lower | trim}}",
    candidate="{{item.output | lower | trim}}",
)
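To see why normalization matters, the following minimal sketch mimics the `lower | trim` filter chain with plain Python string methods. It does not call Jinja or the SDK; `normalize` is a hypothetical helper introduced only for illustration, and the rows are made-up examples.

```python
def normalize(text: str) -> str:
    """Mimic the Jinja filter chain `lower | trim` with str methods."""
    # Jinja's `lower` lowercases the string and `trim` strips
    # leading and trailing whitespace.
    return text.lower().strip()


# Hypothetical rows: the first differs only by case and trailing space.
rows = [
    {"expected": "London", "output": "london "},
    {"expected": "Berlin", "output": "Munich"},
]

for row in rows:
    raw_match = row["expected"] == row["output"]
    normalized_match = normalize(row["expected"]) == normalize(row["output"])
    print(
        f"{row['expected']!r} vs {row['output']!r}: "
        f"raw={raw_match}, normalized={normalized_match}"
    )
```

Without the filters, the first row would count as a mismatch even though the answer is correct.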

BLEU Metric

BLEU (Bilingual Evaluation Understudy) measures the similarity between machine-generated text and reference translations by comparing n-gram overlap. It’s commonly used for evaluating machine translation and text generation tasks.

Use BLEU when:

  • Evaluating machine translation quality
  • Measuring text generation similarity to references
  • Comparing multiple reference texts

Metric Output: A score between 0 and 100, where 100 indicates a perfect match with the references.

from nemo_evaluator_sdk import BLEUMetric

metric = BLEUMetric(
    references=["{{item.reference_1}}", "{{item.reference_2}}"],
    candidate="{{item.model_output}}",
)

result = evaluator.run(
    metric=metric,
    dataset=[
        {
            "reference_1": "The cat sits on the mat.",
            "reference_2": "A cat is sitting on the mat.",
            "model_output": "The cat is on the mat.",
        },
        {
            "reference_1": "Hello world!",
            "reference_2": "Hi world!",
            "model_output": "Hello world!",
        },
    ],
)

for score in result.aggregate_scores.scores:
    print(f"{score.name}: mean={score.mean}")
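The n-gram overlap behind BLEU can be sketched in plain Python. The block below computes only the clipped n-gram precision against multiple references, which is one ingredient of BLEU; a full BLEU score also combines several n-gram orders and applies a brevity penalty. The function names and the whitespace tokenization are assumptions made for illustration, so the numbers will not necessarily match what `BLEUMetric` reports.

```python
from collections import Counter


def ngrams(tokens: list, n: int) -> Counter:
    """Count every contiguous n-gram in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def clipped_precision(candidate: str, references: list, n: int) -> float:
    """Fraction of candidate n-grams found in the references, with clipping.

    Assumption: tokens are split on whitespace and compared case-sensitively.
    """
    cand = ngrams(candidate.split(), n)
    if not cand:
        return 0.0
    # For each n-gram, keep the highest count seen in any single reference.
    max_ref = Counter()
    for ref in references:
        for gram, count in ngrams(ref.split(), n).items():
            max_ref[gram] = max(max_ref[gram], count)
    # A candidate n-gram is credited at most as often as it appears in a reference.
    clipped = sum(min(count, max_ref[gram]) for gram, count in cand.items())
    return clipped / sum(cand.values())


references = ["The cat sits on the mat.", "A cat is sitting on the mat."]
candidate = "The cat is on the mat."
for n in (1, 2):
    print(f"{n}-gram precision: {clipped_precision(candidate, references, n)}")
```

Clipping against the maximum count in any one reference is what lets several references contribute: an n-gram only needs to appear in one of them to be credited.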

Exact Match Metric

Exact Match compares the candidate text with the reference text for perfect equality. This metric returns 1 if the strings match exactly and 0 otherwise.

Use Exact Match when:

  • Evaluating classification tasks with discrete labels
  • Checking for exact answer correctness
  • Validating structured output formats

Metric Output: Binary score (0 or 1).

from nemo_evaluator_sdk import ExactMatchMetric

metric = ExactMatchMetric(
    reference="{{item.correct_answer | lower | trim}}",
    candidate="{{item.model_answer | lower | trim}}",
    description="Exact match for question answering",
)

result = evaluator.run(
    metric=metric,
    dataset=[
        {"correct_answer": "Paris", "model_answer": "Paris"},
        {"correct_answer": "London", "model_answer": "london "},
        {"correct_answer": "Berlin", "model_answer": "Munich"},
    ],
)

for score in result.aggregate_scores.scores:
    print(f"{score.name}: mean={score.mean}")
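As a rough cross-check of the example above, the per-row scores and their mean can be reproduced by hand. This sketch assumes that the rendered templates are compared with strict string equality and that the aggregate is a simple arithmetic mean; `exact_match` and `normalize` are hypothetical helpers, not SDK functions.

```python
def normalize(text: str) -> str:
    """Stand-in for the `lower | trim` filters used in the templates."""
    return text.lower().strip()


def exact_match(reference: str, candidate: str) -> int:
    """Return 1 when the two strings are identical, otherwise 0."""
    return int(reference == candidate)


rows = [
    {"correct_answer": "Paris", "model_answer": "Paris"},
    {"correct_answer": "London", "model_answer": "london "},
    {"correct_answer": "Berlin", "model_answer": "Munich"},
]

scores = [
    exact_match(normalize(row["correct_answer"]), normalize(row["model_answer"]))
    for row in rows
]
mean_score = sum(scores) / len(scores)
print(f"row scores={scores}, mean={mean_score:.3f}")
```

Under these assumptions, two of the three rows match after normalization, so the mean is about 0.667.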

F1 Metric

F1 measures token-level overlap between candidate and reference text. It balances precision and recall, making it useful when there are multiple acceptable ways to phrase a response.

Use F1 when:

  • Evaluating extractive question answering
  • Comparing short free-form answers
  • Measuring partial matches where exact match is too strict

Metric Output: A score between 0 and 1.

from nemo_evaluator_sdk import F1Metric

metric = F1Metric(
    reference="{{item.reference}}",
    candidate="{{item.answer}}",
)

result = evaluator.run(
    metric=metric,
    dataset=[
        {
            "reference": "the capital of France is Paris",
            "answer": "Paris is the capital of France",
        },
        {"reference": "a red apple", "answer": "red apple"},
    ],
)

for score in result.aggregate_scores.scores:
    print(f"{score.name}: mean={score.mean}")
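Token-level F1 is commonly computed as the harmonic mean of precision and recall over the tokens shared by the two strings. The sketch below shows that calculation under the assumption of case-sensitive whitespace tokenization; `token_f1` is a hypothetical name, and the SDK may tokenize or normalize differently.

```python
from collections import Counter


def token_f1(reference: str, candidate: str) -> float:
    """Harmonic mean of token precision and recall (bag-of-words overlap)."""
    ref_tokens = reference.split()
    cand_tokens = candidate.split()
    # Multiset intersection: each shared token counts as often as it
    # appears in both strings.
    overlap = sum((Counter(ref_tokens) & Counter(cand_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(cand_tokens)  # share of candidate tokens that are correct
    recall = overlap / len(ref_tokens)      # share of reference tokens that were found
    return 2 * precision * recall / (precision + recall)


pairs = [
    ("the capital of France is Paris", "Paris is the capital of France"),
    ("a red apple", "red apple"),
]
for reference, answer in pairs:
    print(f"{answer!r}: F1={token_f1(reference, answer):.2f}")
```

Because the overlap ignores word order, the first pair scores 1.0 even though the phrasing differs. The second pair misses one reference token, so recall is 2/3 and the F1 is 0.8.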

Number Check Metric

Number Check performs numerical comparisons on extracted values. It supports equality, inequality, ordering comparisons, and absolute-difference checks.

Use Number Check when:

  • Validating numerical outputs (calculations, counts, scores)
  • Checking value ranges or thresholds
  • Comparing predicted vs expected numbers

Metric Output: 1 if the condition is true, 0 otherwise. If either value cannot be parsed as a number, the row score is NaN.

Supported Operations

  • Equality: "equals", "=="
  • Inequality: "!=", "<>", "not equals"
  • Comparisons: ">", "gt", ">=", "gte", "<", "lt", "<=", "lte"
  • Absolute difference: "absolute difference" (requires epsilon parameter)

from nemo_evaluator_sdk import NumberCheckMetric

metric = NumberCheckMetric(
    operation="absolute difference",
    epsilon=0.5,
    left_template="{{item.expected}}",
    right_template="{{item.predicted}}",
    description="Check if values match within tolerance",
)

result = evaluator.run(
    metric=metric,
    dataset=[
        {"expected": "100", "predicted": "100"},
        {"expected": "42.5", "predicted": "42.3"},
        {"expected": "99", "predicted": "101"},
    ],
)

for score in result.aggregate_scores.scores:
    print(f"{score.name}: mean={score.mean}")
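The operation names listed above can be pictured as a lookup table of comparison functions. The following sketch is illustrative only: `number_check` and `COMPARATORS` are hypothetical names, and treating the epsilon tolerance as inclusive (`<=`) is an assumption that the page does not confirm.

```python
import math
import operator
from typing import Optional

# Map each documented operation string to a standard comparison function.
COMPARATORS = {
    "equals": operator.eq, "==": operator.eq,
    "!=": operator.ne, "<>": operator.ne, "not equals": operator.ne,
    ">": operator.gt, "gt": operator.gt,
    ">=": operator.ge, "gte": operator.ge,
    "<": operator.lt, "lt": operator.lt,
    "<=": operator.le, "lte": operator.le,
}


def number_check(left: str, right: str, operation: str,
                 epsilon: Optional[float] = None) -> float:
    """Return 1.0 or 0.0 for the comparison, or NaN if a value is not numeric."""
    try:
        a, b = float(left), float(right)
    except ValueError:
        return math.nan  # mirrors the documented NaN row score
    if operation == "absolute difference":
        if epsilon is None:
            raise ValueError("'absolute difference' requires epsilon")
        # Assumption: a difference exactly equal to epsilon still passes.
        return float(abs(a - b) <= epsilon)
    return float(COMPARATORS[operation](a, b))


rows = [("100", "100"), ("42.5", "42.3"), ("99", "101"), ("n/a", "7")]
for expected, predicted in rows:
    score = number_check(expected, predicted, "absolute difference", epsilon=0.5)
    print(f"{expected} vs {predicted}: {score}")
```

With a tolerance of 0.5, the first two rows pass, the third fails because the values differ by 2, and the last row yields NaN because "n/a" cannot be parsed.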

ROUGE Metric

ROUGE (Recall-Oriented Understudy for Gisting Evaluation) measures overlap between generated text and reference text. It is commonly used for summarization and long-form generation quality checks.

Use ROUGE when:

  • Evaluating summarization quality
  • Measuring overlap with reference passages
  • Comparing generated text against longer expected answers

Metric Output: ROUGE-1, ROUGE-2, ROUGE-3, and ROUGE-L F1 scores between 0 and 1.

from nemo_evaluator_sdk import ROUGEMetric

metric = ROUGEMetric(
    reference="{{item.reference_summary}}",
    candidate="{{item.model_summary}}",
)

result = evaluator.run(
    metric=metric,
    dataset=[
        {
            "reference_summary": "The cat sat on the mat and looked out the window.",
            "model_summary": "A cat sat on a mat near the window.",
        },
        {
            "reference_summary": "The launch was postponed because of high winds.",
            "model_summary": "High winds delayed the launch.",
        },
    ],
)

for score in result.aggregate_scores.scores:
    print(f"{score.name}: mean={score.mean}")
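To illustrate how the ROUGE variants differ, here is a minimal pure-Python sketch: ROUGE-N counts shared n-grams, while ROUGE-L uses the longest common subsequence of tokens. The function names, the lowercasing, and the whitespace tokenization are assumptions made for this example; the SDK's implementation may apply stemming or other preprocessing and produce different values.

```python
from collections import Counter


def _f1(overlap: int, ref_total: int, cand_total: int) -> float:
    """F1 from an overlap count and the two totals."""
    if overlap == 0:
        return 0.0
    precision = overlap / cand_total
    recall = overlap / ref_total
    return 2 * precision * recall / (precision + recall)


def rouge_n(reference: str, candidate: str, n: int) -> float:
    """F1 over shared n-grams (lowercased, whitespace-tokenized)."""
    def grams(text: str) -> Counter:
        tokens = text.lower().split()
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    ref, cand = grams(reference), grams(candidate)
    overlap = sum((ref & cand).values())
    return _f1(overlap, sum(ref.values()), sum(cand.values()))


def rouge_l(reference: str, candidate: str) -> float:
    """F1 based on the longest common subsequence of tokens."""
    ref, cand = reference.lower().split(), candidate.lower().split()
    # Dynamic-programming table: table[i][j] is the LCS length of
    # the first i reference tokens and the first j candidate tokens.
    table = [[0] * (len(cand) + 1) for _ in range(len(ref) + 1)]
    for i, r in enumerate(ref, 1):
        for j, c in enumerate(cand, 1):
            if r == c:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    return _f1(table[-1][-1], len(ref), len(cand))


reference = "The launch was postponed because of high winds."
summary = "High winds delayed the launch."
print(f"ROUGE-1={rouge_n(reference, summary, 1):.3f}")
print(f"ROUGE-2={rouge_n(reference, summary, 2):.3f}")
print(f"ROUGE-L={rouge_l(reference, summary):.3f}")
```

ROUGE-L rewards tokens that appear in the same order without requiring them to be adjacent, which is why it can differ from ROUGE-2 on reordered summaries like the one above.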

String Check Metric

String Check compares two rendered strings using a configurable operation. It supports equality, inequality, containment, and prefix/suffix checks.

Use String Check when:

  • Validating text format or structure
  • Checking for keyword presence
  • Pattern matching in generated text
  • String-based classification

Metric Output: Binary score (1 if condition is true, 0 otherwise).

Supported Operations

  • Equality: "equals", "=="
  • Inequality: "!=", "<>", "not equals"
  • Containment: "contains", "not contains"
  • Pattern: "startswith", "endswith"

from nemo_evaluator_sdk import StringCheckMetric

metric = StringCheckMetric(
    operation="contains",
    left_template="{{item.output | trim}}",
    right_template="{{item.must_contain}}",
)

result = evaluator.run(
    metric=metric,
    dataset=[
        {"output": "The answer is: 42", "must_contain": "answer"},
        {"output": "Result: Success", "must_contain": "Success"},
        {"output": "Error occurred", "must_contain": "Success"},
    ],
)

for score in result.aggregate_scores.scores:
    print(f"{score.name}: mean={score.mean}")
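The supported operations map naturally onto Python string methods. The sketch below is a hypothetical reimplementation for illustration, not the SDK's code; it assumes, based on the example above, that the left value is the text being searched and the right value is the substring, prefix, or suffix to look for.

```python
# Each operation takes (left, right) and returns a bool.
STRING_OPERATIONS = {
    "equals": lambda left, right: left == right,
    "==": lambda left, right: left == right,
    "!=": lambda left, right: left != right,
    "<>": lambda left, right: left != right,
    "not equals": lambda left, right: left != right,
    "contains": lambda left, right: right in left,
    "not contains": lambda left, right: right not in left,
    "startswith": lambda left, right: left.startswith(right),
    "endswith": lambda left, right: left.endswith(right),
}


def string_check(left: str, right: str, operation: str) -> int:
    """Return 1 if the operation holds for (left, right), otherwise 0."""
    return int(STRING_OPERATIONS[operation](left, right))


rows = [
    {"output": "The answer is: 42", "must_contain": "answer"},
    {"output": "Result: Success", "must_contain": "Success"},
    {"output": "Error occurred", "must_contain": "Success"},
]
# `.strip()` stands in for the `trim` filter on the left template.
scores = [
    string_check(row["output"].strip(), row["must_contain"], "contains")
    for row in rows
]
print(f"row scores={scores}, mean={sum(scores) / len(scores):.3f}")
```

Under these assumptions the first two rows score 1 and the third scores 0. The checks are case-sensitive unless you add a filter such as `lower` to both templates.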

Dataset Format

The examples on this page use inline dataset rows with dataset=[...]. Template fields determine the columns required by each metric:

  • reference, references, left_template, and right_template read from item fields in the dataset.
  • candidate, when configured, reads from an item field. Use this for offline rows that already contain precomputed outputs.
  • If candidate is omitted for BLEU, Exact Match, F1, or ROUGE, the metric uses sample.output_text, which is populated during online evaluations.

Keep field names consistent between the dataset rows and the templates you configure. For example, {{item.expected}} requires each row to include an expected field.
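A quick pre-flight check can catch mismatched field names before you run an evaluation. The sketch below is a hypothetical helper, not part of the SDK: it uses a regular expression to find simple `item.<field>` references in template strings and reports which rows lack those fields. It does not parse Jinja fully, so bracket access such as `item['field']` would be missed.

```python
import re

# Matches simple dotted references such as `item.expected`.
ITEM_FIELD = re.compile(r"\bitem\.([A-Za-z_][A-Za-z0-9_]*)")


def required_fields(*templates: str) -> set:
    """Collect the dataset fields referenced by the given templates."""
    fields = set()
    for template in templates:
        fields.update(ITEM_FIELD.findall(template))
    return fields


def missing_fields(rows: list, *templates: str) -> dict:
    """Map row index to the sorted list of referenced fields that row lacks."""
    needed = required_fields(*templates)
    problems = {}
    for index, row in enumerate(rows):
        missing = sorted(needed - set(row))
        if missing:
            problems[index] = missing
    return problems


templates = ("{{item.expected | lower | trim}}", "{{item.output | lower | trim}}")
rows = [
    {"expected": "Paris", "output": "paris"},
    {"expected": "London"},  # no "output" field
]
print(required_fields(*templates))
print(missing_fields(rows, *templates))
```

Here the second row is flagged because it has no `output` field, while templates that reference only `sample.output_text` require no dataset columns at all.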