Similarity Metrics | NVIDIA NeMo Platform

NeMo Platform offers built-in metrics that you can configure to evaluate your own data. Similarity metrics compare generated or precomputed text against references, labels, or numeric/string expectations. They support Jinja templates so you can map your dataset columns to the values each metric evaluates.

Templating lets you evaluate your models on proprietary, domain-specific, or novel tasks. You can bring your own datasets, define your own prompts and templates using Jinja, and select the metrics that matter most for your use case. This approach is ideal when:

You want to evaluate on tasks, data, or formats not covered by industry benchmarks or built-in metrics.
You need to measure model performance using custom or business-specific criteria.
You want to experiment with new evaluation methodologies, metrics, or workflows.
You need to create custom prompts and templates for specific use cases.

Setup

import os

from nemo_evaluator.sdk import Evaluator
from nemo_platform import NeMoPlatform

sdk = NeMoPlatform(
    base_url=os.environ.get("NMP_BASE_URL", "http://localhost:8080"),
    workspace="default",
)
evaluator: Evaluator = sdk.evaluator  # this object is an Evaluator resource

Use evaluator.run(metric=metric, dataset=dataset) for a local synchronous evaluation. Use evaluator.submit(metric=metric, dataset=dataset) when you need a durable remote job:

job = evaluator.submit(metric=metric, dataset=dataset)
job.wait_until_done()
result = job.get_result()

Template Variables

All similarity metrics support Jinja templating with these variables:

{{item}} - Access dataset columns (e.g., {{item.question}}, {{item.answer}})
{{sample.output_text}} - The model’s generated output for online runs
Jinja filters: lower, upper, trim, replace, etc.

Use Jinja filters to normalize text before comparison:

from nemo_evaluator_sdk import ExactMatchMetric

metric = ExactMatchMetric(
    reference="{{item.expected | lower | trim}}",
    candidate="{{item.output | lower | trim}}",
)
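For intuition about what the filters do, the sketch below reproduces two filter chains with plain Python string methods. It is not SDK code: it relies on the Jinja `lower`, `trim`, and `replace` filters behaving like `str.lower`, `str.strip`, and `str.replace`, and the sample string is made up for illustration.

```python
# Plain-Python equivalents of common Jinja filters (illustrative, not SDK code).
raw = "  The Answer: PARIS \n"  # hypothetical model output

# Equivalent of {{ item.output | lower | trim }}
lowered = raw.lower().strip()

# Equivalent of {{ item.output | replace("The Answer: ", "") | trim }}
stripped_prefix = raw.replace("The Answer: ", "").strip()

print(repr(lowered))          # 'the answer: paris'
print(repr(stripped_prefix))  # 'PARIS'
```

Applying the same filter chain to both the reference and the candidate keeps the comparison symmetric.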

BLEU Metric

BLEU (Bilingual Evaluation Understudy) measures the similarity between machine-generated text and reference translations by comparing n-gram overlap. It’s commonly used for evaluating machine translation and text generation tasks.

Use BLEU when:

Evaluating machine translation quality
Measuring text generation similarity to references
Comparing output against multiple reference texts

Metric Output: A score between 0 and 100, where 100 indicates a perfect match with the references.
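The n-gram overlap at the core of BLEU can be sketched in a few lines. The code below is a simplified illustration, not the platform's implementation: it computes only the clipped (modified) n-gram precision against multiple references, and it assumes lowercase whitespace tokenization. A full BLEU score additionally combines several n-gram orders with a geometric mean and applies a brevity penalty, so the numbers here will not equal the metric's 0 to 100 output.

```python
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def modified_precision(candidate, references, n):
    """Clipped n-gram precision of a candidate against several references."""
    cand_counts = ngrams(candidate.lower().split(), n)
    if not cand_counts:
        return 0.0
    # For each n-gram, keep the highest count seen in any single reference.
    max_ref_counts = Counter()
    for ref in references:
        for gram, count in ngrams(ref.lower().split(), n).items():
            max_ref_counts[gram] = max(max_ref_counts[gram], count)
    # A candidate n-gram is credited at most as often as it appears in a reference.
    clipped = sum(min(count, max_ref_counts[gram]) for gram, count in cand_counts.items())
    return clipped / sum(cand_counts.values())


candidate = "The cat is on the mat."
references = ["The cat sits on the mat.", "A cat is sitting on the mat."]

print(modified_precision(candidate, references, 1))  # 1.0
print(modified_precision(candidate, references, 2))  # 0.8
```

Every unigram in the candidate appears in at least one reference, but the bigram "is on" appears in neither, which is why the bigram precision drops to 4 out of 5.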

Local Evaluation

from nemo_evaluator_sdk import BLEUMetric

metric = BLEUMetric(
    references=["{{item.reference_1}}", "{{item.reference_2}}"],
    candidate="{{item.model_output}}",
)

result = evaluator.run(
    metric=metric,
    dataset=[
        {
            "reference_1": "The cat sits on the mat.",
            "reference_2": "A cat is sitting on the mat.",
            "model_output": "The cat is on the mat.",
        },
        {
            "reference_1": "Hello world!",
            "reference_2": "Hi world!",
            "model_output": "Hello world!",
        },
    ],
)

for score in result.aggregate_scores.scores:
    print(f"{score.name}: mean={score.mean}")

Exact Match Metric

Exact Match compares the candidate text with the reference text for perfect equality. This metric returns 1 if the strings match exactly and 0 otherwise.

Use Exact Match when:

Evaluating classification tasks with discrete labels
Checking for exact answer correctness
Validating structured output formats

Metric Output: Binary score (0 or 1).
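To make the binary scoring concrete, here is a minimal plain-Python sketch of an exact-match check with `lower | trim` style normalization applied to both sides. The `exact_match` helper is hypothetical and is not part of the SDK; the rows mirror the example dataset used in this section.

```python
def exact_match(reference: str, candidate: str) -> int:
    """Return 1 when both strings are equal after lowercasing and trimming."""
    return 1 if reference.lower().strip() == candidate.lower().strip() else 0


rows = [
    ("Paris", "Paris"),
    ("London", "london "),  # differs only by case and trailing space
    ("Berlin", "Munich"),
]

scores = [exact_match(reference, candidate) for reference, candidate in rows]
print(scores)  # [1, 1, 0]
print(round(sum(scores) / len(scores), 3))  # 0.667
```

Without the normalization step, the second row would score 0, which is why the filters matter for label-style comparisons.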

Local Evaluation

from nemo_evaluator_sdk import ExactMatchMetric

metric = ExactMatchMetric(
    reference="{{item.correct_answer | lower | trim}}",
    candidate="{{item.model_answer | lower | trim}}",
    description="Exact match for question answering",
)

result = evaluator.run(
    metric=metric,
    dataset=[
        {"correct_answer": "Paris", "model_answer": "Paris"},
        {"correct_answer": "London", "model_answer": "london "},
        {"correct_answer": "Berlin", "model_answer": "Munich"},
    ],
)

for score in result.aggregate_scores.scores:
    print(f"{score.name}: mean={score.mean}")

F1 Metric

F1 measures token-level overlap between candidate and reference text. It balances precision and recall, making it useful when there are multiple acceptable ways to phrase a response.

Use F1 when:

Evaluating extractive question answering
Comparing short free-form answers
Measuring partial matches where exact match is too strict

Metric Output: A score between 0 and 1.
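As a rough illustration of how precision and recall combine, the sketch below computes a token-level F1 in plain Python. It assumes lowercase whitespace tokenization and bag-of-words overlap; the platform's tokenization and normalization may differ, so treat the `token_f1` helper as hypothetical rather than as the metric's definition.

```python
from collections import Counter


def token_f1(reference: str, candidate: str) -> float:
    """Harmonic mean of token precision and recall (bag-of-words overlap)."""
    ref_tokens = reference.lower().split()
    cand_tokens = candidate.lower().split()
    # Counter intersection keeps the minimum count of each shared token.
    overlap = sum((Counter(ref_tokens) & Counter(cand_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(cand_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


# Same tokens in a different order: full credit.
print(token_f1("the capital of France is Paris", "Paris is the capital of France"))  # 1.0

# Candidate omits one reference token: precision 1.0, recall 2/3, F1 about 0.8.
print(round(token_f1("a red apple", "red apple"), 3))  # 0.8
```

Because the overlap ignores word order, a reordered but otherwise identical answer still scores 1.0, which is the behavior that makes F1 more forgiving than Exact Match.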

Local Evaluation

from nemo_evaluator_sdk import F1Metric

metric = F1Metric(
    reference="{{item.reference}}",
    candidate="{{item.answer}}",
)

result = evaluator.run(
    metric=metric,
    dataset=[
        {
            "reference": "the capital of France is Paris",
            "answer": "Paris is the capital of France",
        },
        {"reference": "a red apple", "answer": "red apple"},
    ],
)

for score in result.aggregate_scores.scores:
    print(f"{score.name}: mean={score.mean}")

Number Check Metric

Number Check compares two numeric values extracted from your templates. It supports equality, inequality, ordering comparisons, and an absolute-difference check against a tolerance.

Use Number Check when:

Validating numerical outputs (calculations, counts, scores)
Checking value ranges or thresholds
Comparing predicted vs expected numbers

Metric Output: 1 if the condition is true, 0 otherwise. If either value cannot be parsed as a number, the row score is NaN.

Supported Operations

Equality: "equals", "=="
Inequality: "!=", "<>", "not equals"
Comparisons: ">", "gt", ">=", "gte", "<", "lt", "<=", "lte"
Absolute difference: "absolute difference" (requires epsilon parameter)
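The operations listed above can be pictured as a lookup from operation name to comparison function. The following plain-Python sketch is an assumption about the semantics, not the platform's code: the `number_check` helper is hypothetical, and it assumes the absolute-difference tolerance is inclusive (`<=`), which the text above does not specify.

```python
import math
import operator

# Operation names from the list above, mapped to Python comparison functions.
COMPARATORS = {
    "equals": operator.eq, "==": operator.eq,
    "!=": operator.ne, "<>": operator.ne, "not equals": operator.ne,
    ">": operator.gt, "gt": operator.gt,
    ">=": operator.ge, "gte": operator.ge,
    "<": operator.lt, "lt": operator.lt,
    "<=": operator.le, "lte": operator.le,
}


def number_check(left, right, operation, epsilon=None):
    """Return 1.0 or 0.0 for the comparison, or NaN if a value is not numeric."""
    try:
        a, b = float(left), float(right)
    except (TypeError, ValueError):
        return math.nan  # unparseable value: no 0/1 verdict for this row
    if operation == "absolute difference":
        if epsilon is None:
            raise ValueError("'absolute difference' requires epsilon")
        # Assumption: the tolerance is inclusive.
        return 1.0 if abs(a - b) <= epsilon else 0.0
    return 1.0 if COMPARATORS[operation](a, b) else 0.0


rows = [("100", "100"), ("42.5", "42.3"), ("99", "101")]
scores = [number_check(e, p, "absolute difference", epsilon=0.5) for e, p in rows]
print(scores)  # [1.0, 1.0, 0.0]
print(math.isnan(number_check("n/a", "3", "==")))  # True
```

The rows match the example dataset in this section: the first two pairs fall within the 0.5 tolerance and the third does not.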

Local Evaluation

from nemo_evaluator_sdk import NumberCheckMetric

metric = NumberCheckMetric(
    operation="absolute difference",
    epsilon=0.5,
    left_template="{{item.expected}}",
    right_template="{{item.predicted}}",
    description="Check if values match within tolerance",
)

result = evaluator.run(
    metric=metric,
    dataset=[
        {"expected": "100", "predicted": "100"},
        {"expected": "42.5", "predicted": "42.3"},
        {"expected": "99", "predicted": "101"},
    ],
)

for score in result.aggregate_scores.scores:
    print(f"{score.name}: mean={score.mean}")

ROUGE Metric

ROUGE (Recall-Oriented Understudy for Gisting Evaluation) measures overlap between generated text and reference text. It is commonly used for summarization and long-form generation quality checks.

Use ROUGE when:

Evaluating summarization quality
Measuring overlap with reference passages
Comparing generated text against longer expected answers

Metric Output: ROUGE-1, ROUGE-2, ROUGE-3, and ROUGE-L F1 scores between 0 and 1.
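To show how the variants differ, the sketch below computes ROUGE-N (n-gram overlap) and ROUGE-L (longest common subsequence) F1 scores in plain Python. It is a simplified illustration under stated assumptions: tokens are lowercase word characters with punctuation dropped, and no stemming is applied. The helper names are hypothetical and the platform's scores may differ.

```python
import re
from collections import Counter


def tokenize(text):
    # Assumption: lowercase word tokens, punctuation dropped, no stemming.
    return re.findall(r"\w+", text.lower())


def f1(overlap, cand_len, ref_len):
    if overlap == 0:
        return 0.0
    precision, recall = overlap / cand_len, overlap / ref_len
    return 2 * precision * recall / (precision + recall)


def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n_f1(reference, candidate, n):
    ref, cand = ngrams(tokenize(reference), n), ngrams(tokenize(candidate), n)
    overlap = sum((ref & cand).values())  # shared n-grams, clipped by count
    return f1(overlap, sum(cand.values()), sum(ref.values()))


def lcs_length(a, b):
    # Dynamic programming over one row at a time.
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(reference, candidate):
    ref, cand = tokenize(reference), tokenize(candidate)
    return f1(lcs_length(ref, cand), len(cand), len(ref))


reference = "The launch was postponed because of high winds."
candidate = "High winds delayed the launch."
print(round(rouge_n_f1(reference, candidate, 1), 3))  # 0.615
print(round(rouge_l_f1(reference, candidate), 3))     # 0.308
```

The candidate shares four words with the reference, so ROUGE-1 is fairly high, but the shared words appear in a different order, so the longest common subsequence is only two tokens and ROUGE-L is lower.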

Local Evaluation

from nemo_evaluator_sdk import ROUGEMetric

metric = ROUGEMetric(
    reference="{{item.reference_summary}}",
    candidate="{{item.model_summary}}",
)

result = evaluator.run(
    metric=metric,
    dataset=[
        {
            "reference_summary": "The cat sat on the mat and looked out the window.",
            "model_summary": "A cat sat on a mat near the window.",
        },
        {
            "reference_summary": "The launch was postponed because of high winds.",
            "model_summary": "High winds delayed the launch.",
        },
    ],
)

for score in result.aggregate_scores.scores:
    print(f"{score.name}: mean={score.mean}")

String Check Metric

String Check compares two rendered strings using a configurable operation. It supports equality, inequality, containment, and prefix/suffix checks.

Use String Check when:

Validating text format or structure
Checking for keyword presence
Prefix or suffix matching in generated text
String-based classification

Metric Output: Binary score (1 if condition is true, 0 otherwise).

Supported Operations

Equality: "equals", "=="
Inequality: "!=", "<>", "not equals"
Containment: "contains", "not contains"
Pattern: "startswith", "endswith"
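A minimal plain-Python sketch of these operations follows. It is an illustration under an assumption rather than the platform's implementation: the left value is treated as the text being searched and the right value as the expected substring, prefix, or suffix, which matches how the example in this section wires `left_template` and `right_template`. The `string_check` helper is hypothetical.

```python
# Operation names from the list above, mapped to Python string predicates.
STRING_OPS = {
    "equals": lambda left, right: left == right,
    "==": lambda left, right: left == right,
    "!=": lambda left, right: left != right,
    "<>": lambda left, right: left != right,
    "not equals": lambda left, right: left != right,
    "contains": lambda left, right: right in left,
    "not contains": lambda left, right: right not in left,
    "startswith": lambda left, right: left.startswith(right),
    "endswith": lambda left, right: left.endswith(right),
}


def string_check(left: str, right: str, operation: str) -> int:
    """Return 1 if the operation holds for (left, right), otherwise 0."""
    return 1 if STRING_OPS[operation](left, right) else 0


rows = [
    ("The answer is: 42", "answer"),
    ("Result: Success", "Success"),
    ("Error occurred", "Success"),
]
print([string_check(output, needle, "contains") for output, needle in rows])  # [1, 1, 0]
print(string_check("Result: Success", "Result:", "startswith"))  # 1
```

These checks are case-sensitive in the sketch, so normalizing both templates with the same filters is the simplest way to get case-insensitive behavior.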

Local Evaluation

from nemo_evaluator_sdk import StringCheckMetric

metric = StringCheckMetric(
    operation="contains",
    left_template="{{item.output | trim}}",
    right_template="{{item.must_contain}}",
)

result = evaluator.run(
    metric=metric,
    dataset=[
        {"output": "The answer is: 42", "must_contain": "answer"},
        {"output": "Result: Success", "must_contain": "Success"},
        {"output": "Error occurred", "must_contain": "Success"},
    ],
)

for score in result.aggregate_scores.scores:
    print(f"{score.name}: mean={score.mean}")

Dataset Format

The examples on this page use inline dataset rows with dataset=[...]. Template fields determine the columns required by each metric:

reference, references, left_template, and right_template read from item fields in the dataset.
candidate, when configured, reads from an item field. Use this for offline rows that already contain the text to score.
If candidate is omitted for BLEU, Exact Match, F1, or ROUGE, the metric uses sample.output_text, which is populated during online evaluations.

Keep field names consistent between the dataset rows and the templates you configure. For example, {{item.expected}} requires each row to include an expected field.
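One way to catch mismatched field names before running an evaluation is a small pre-flight check. The sketch below is a hypothetical helper, not part of the SDK: it uses a regular expression to pull `item.<field>` names out of template strings and reports rows that lack them. It only recognizes a field reference that directly follows the opening `{{`, so more complex Jinja expressions would need a real template parser.

```python
import re

# Matches the field name in expressions such as {{item.expected | lower}}.
ITEM_FIELD = re.compile(r"\{\{\s*item\.(\w+)")


def missing_fields(templates, rows):
    """Return {row_index: [missing field names]} for rows lacking a templated field."""
    required = {name for template in templates for name in ITEM_FIELD.findall(template)}
    report = {}
    for index, row in enumerate(rows):
        missing = sorted(required - set(row))
        if missing:
            report[index] = missing
    return report


templates = ["{{item.expected | lower | trim}}", "{{ item.output }}"]
rows = [
    {"expected": "Paris", "output": "paris"},
    {"expected": "Berlin"},  # no "output" field
]
print(missing_fields(templates, rows))  # {1: ['output']}
```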
