Similarity Metrics
NeMo Platform offers built-in metrics that can be configured to evaluate on your custom data. Similarity metrics compare generated or precomputed text against references, labels, or numeric/string expectations. They support Jinja templates so you can map your dataset columns to the values each metric evaluates.
Template functionality provides maximum flexibility for evaluating your models on proprietary, domain-specific, or novel tasks. You can bring your own datasets, define your own prompts and templates using Jinja, and select the metrics that matter most for your use case. This approach is ideal when:
- You want to evaluate on tasks, data, or formats not covered by industry benchmarks or built-in metrics.
- You need to measure model performance using custom or business-specific criteria.
- You want to experiment with new evaluation methodologies, metrics, or workflows.
- You need to create custom prompts and templates for specific use cases.
Setup
Use evaluator.run(metric=metric, dataset=dataset) for a local synchronous evaluation. Use evaluator.submit(metric=metric, dataset=dataset) when you need a durable remote job.
Template Variables
All similarity metrics support Jinja templating with these variables:
- {{item}}: Access dataset columns (e.g., {{item.question}}, {{item.answer}})
- {{sample.output_text}}: The model’s generated output for online runs
- Jinja filters: lower, upper, trim, replace, etc.
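Use Jinja filters to normalize text before comparison. For example, {{ item.answer | lower | trim }} lowercases the answer column and strips surrounding whitespace.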
BLEU Metric
BLEU (Bilingual Evaluation Understudy) measures the similarity between machine-generated text and reference translations by comparing n-gram overlap. It’s commonly used for evaluating machine translation and text generation tasks.
Use BLEU when:
- Evaluating machine translation quality
- Measuring text generation similarity to references
- Comparing output against multiple reference texts
Metric Output: A score between 0 and 100, where 100 indicates a perfect match with the references.
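To make the n-gram overlap idea concrete, the following is a minimal sentence-level sketch in plain Python. It is not the platform's implementation: it assumes whitespace tokenization, no smoothing, and n-grams up to length 4, and the name simple_bleu is hypothetical. Scores from the built-in metric may differ because of tokenization and corpus-level aggregation.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return a Counter of the n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def simple_bleu(candidate, references, max_n=4):
    """Sentence-level BLEU on a 0-100 scale (illustrative, unsmoothed)."""
    cand = candidate.split()  # assumption: whitespace tokenization
    refs = [r.split() for r in references]
    if not cand or not refs:
        return 0.0
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts = ngrams(cand, n)
        total = sum(cand_counts.values())
        # Clip each candidate n-gram count by its highest count in any reference.
        max_ref = Counter()
        for ref in refs:
            for gram, count in ngrams(ref, n).items():
                max_ref[gram] = max(max_ref[gram], count)
        clipped = sum(min(count, max_ref[gram]) for gram, count in cand_counts.items())
        if total == 0 or clipped == 0:
            return 0.0  # without smoothing, any empty n-gram order zeroes the score
        log_precisions.append(math.log(clipped / total))
    # Brevity penalty uses the reference length closest to the candidate length.
    ref_len = min((len(r) for r in refs), key=lambda length: (abs(length - len(cand)), length))
    bp = 1.0 if len(cand) > ref_len else math.exp(1 - ref_len / len(cand))
    return 100.0 * bp * math.exp(sum(log_precisions) / max_n)
```

Passing several strings in references shows why BLEU suits tasks with more than one acceptable output: an n-gram counts as matched if it appears in any of the references.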
Local Evaluation
Remote Job
Example Result
Exact Match Metric
Exact Match compares the candidate text with the reference text for perfect equality. This metric returns 1 if the strings match exactly and 0 otherwise.
Use Exact Match when:
- Evaluating classification tasks with discrete labels
- Checking for exact answer correctness
- Validating structured output formats
Metric Output: Binary score (0 or 1).
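The scoring rule can be sketched in a few lines of plain Python. This is an illustration rather than the platform's code, and both function names are hypothetical. The second function shows the effect of normalizing both sides first, which corresponds to applying the lower and trim Jinja filters in the templates.

```python
def exact_match(candidate: str, reference: str) -> int:
    """Return 1 when the two strings are identical, otherwise 0."""
    return int(candidate == reference)


def normalized_exact_match(candidate: str, reference: str) -> int:
    """Same check after lowercasing and trimming surrounding whitespace."""
    return exact_match(candidate.strip().lower(), reference.strip().lower())


print(exact_match("Paris", "Paris"))               # 1
print(exact_match("Paris", " paris"))              # 0: case and whitespace differ
print(normalized_exact_match("Paris", " paris"))   # 1
```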
Local Evaluation
Remote Job
Example Result
F1 Metric
F1 measures token-level overlap between candidate and reference text. It balances precision and recall, making it useful when there are multiple acceptable ways to phrase a response.
Use F1 when:
- Evaluating extractive question answering
- Comparing short free-form answers
- Measuring partial matches where exact match is too strict
Metric Output: A score between 0 and 1.
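As a rough sketch of how token-level F1 is commonly computed, the example below counts shared tokens and combines precision and recall. It assumes whitespace tokenization with no normalization, and the name token_f1 is hypothetical; the built-in metric may tokenize or normalize differently.

```python
from collections import Counter


def token_f1(candidate: str, reference: str) -> float:
    """Harmonic mean of token precision and recall, in the range 0 to 1."""
    cand_tokens = candidate.split()  # assumption: whitespace tokenization
    ref_tokens = reference.split()
    if not cand_tokens or not ref_tokens:
        # Assumption: two empty strings count as a match; one empty side scores 0.
        return float(cand_tokens == ref_tokens)
    # Multiset intersection counts each shared token at most as often as it
    # appears on both sides.
    overlap = sum((Counter(cand_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(cand_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)
```

For the candidate "the capital is Paris" and the reference "Paris", precision is 1/4 and recall is 1, giving an F1 of 0.4, whereas Exact Match would score the same pair as 0.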
Local Evaluation
Remote Job
Example Result
Number Check Metric
Number Check performs numerical comparisons on extracted values. It supports equality, inequality, comparison operators, and absolute difference calculations.
Use Number Check when:
- Validating numerical outputs (calculations, counts, scores)
- Checking value ranges or thresholds
- Comparing predicted vs expected numbers
Metric Output: 1 if the condition is true, 0 otherwise. If either value cannot be parsed as a number, the row score is NaN.
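The sketch below illustrates how the operation names this metric accepts could map onto comparisons, including the NaN result for unparseable values. It is an assumption-laden illustration, not the platform's code: the function name number_check is hypothetical, and the "absolute difference" check is assumed to pass when |left - right| <= epsilon.

```python
import math


def number_check(left, right, operation, epsilon=None):
    """Return 1.0 or 0.0 for the comparison, or NaN if a value is not numeric."""
    try:
        a, b = float(left), float(right)
    except (TypeError, ValueError):
        return math.nan
    if math.isnan(a) or math.isnan(b):
        return math.nan
    if operation in ("equals", "=="):
        return float(a == b)
    if operation in ("!=", "<>", "not equals"):
        return float(a != b)
    if operation in (">", "gt"):
        return float(a > b)
    if operation in (">=", "gte"):
        return float(a >= b)
    if operation in ("<", "lt"):
        return float(a < b)
    if operation in ("<=", "lte"):
        return float(a <= b)
    if operation == "absolute difference":
        if epsilon is None:
            raise ValueError("'absolute difference' requires epsilon")
        # Assumption: the check passes when the values are within epsilon.
        return float(abs(a - b) <= epsilon)
    raise ValueError(f"unsupported operation: {operation!r}")
```

Returning NaN rather than 0 for unparseable rows keeps extraction failures distinguishable from genuinely wrong answers when scores are aggregated.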
Supported Operations
- Equality: "equals", "=="
- Inequality: "!=", "<>", "not equals"
- Comparisons: ">", "gt", ">=", "gte", "<", "lt", "<=", "lte"
- Absolute difference: "absolute difference" (requires epsilon parameter)
Local Evaluation
Remote Job
Example Result
ROUGE Metric
ROUGE (Recall-Oriented Understudy for Gisting Evaluation) measures overlap between generated text and reference text. It is commonly used for summarization and long-form generation quality checks.
Use ROUGE when:
- Evaluating summarization quality
- Measuring overlap with reference passages
- Comparing generated text against longer expected answers
Metric Output: ROUGE-1, ROUGE-2, ROUGE-3, and ROUGE-L F1 scores between 0 and 1.
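To show what these scores measure, here is a minimal sketch of ROUGE-N and ROUGE-L F1 in plain Python. It assumes whitespace tokenization and a single reference, and the function names are hypothetical; the built-in metric may apply stemming or other preprocessing and report slightly different values.

```python
from collections import Counter


def f1(overlap, cand_total, ref_total):
    """Combine overlap counts into an F1 score in the range 0 to 1."""
    if overlap == 0 or cand_total == 0 or ref_total == 0:
        return 0.0
    precision, recall = overlap / cand_total, overlap / ref_total
    return 2 * precision * recall / (precision + recall)


def rouge_n(candidate, reference, n):
    """F1 over shared n-grams (n=1, 2, 3 for ROUGE-1, ROUGE-2, ROUGE-3)."""
    cand, ref = candidate.split(), reference.split()
    cand_grams = Counter(tuple(cand[i:i + n]) for i in range(len(cand) - n + 1))
    ref_grams = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
    overlap = sum((cand_grams & ref_grams).values())
    return f1(overlap, sum(cand_grams.values()), sum(ref_grams.values()))


def rouge_l(candidate, reference):
    """F1 based on the longest common subsequence of tokens."""
    cand, ref = candidate.split(), reference.split()
    prev = [0] * (len(ref) + 1)  # LCS lengths for the previous candidate token
    for c_tok in cand:
        curr = [0]
        for j, r_tok in enumerate(ref, start=1):
            curr.append(prev[j - 1] + 1 if c_tok == r_tok else max(prev[j], curr[j - 1]))
        prev = curr
    return f1(prev[-1], len(cand), len(ref))
```

For the candidate "the cat sat on the mat" and the reference "the cat lay on the mat", this sketch gives roughly 0.83 for ROUGE-1, 0.6 for ROUGE-2, 0.25 for ROUGE-3, and 0.83 for ROUGE-L, which shows how higher-order n-grams penalize a single substituted word more heavily.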
Local Evaluation
Remote Job
Example Result
String Check Metric
String Check performs string comparisons between two values. It supports equality, inequality, containment, and prefix/suffix checks.
Use String Check when:
- Validating text format or structure
- Checking for keyword presence
- Pattern matching in generated text
- String-based classification
Metric Output: Binary score (1 if condition is true, 0 otherwise).
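The following sketch shows one way the operation names this metric accepts could be mapped to Python string comparisons. It is illustrative only: the names STRING_OPERATIONS and string_check are hypothetical, and it assumes the left value is the text being tested and the right value is the expected string or substring.

```python
import operator

# Assumption: `left` is the text under test, `right` is the expected value.
STRING_OPERATIONS = {
    "equals": operator.eq,
    "==": operator.eq,
    "!=": operator.ne,
    "<>": operator.ne,
    "not equals": operator.ne,
    "contains": lambda left, right: right in left,
    "not contains": lambda left, right: right not in left,
    "startswith": str.startswith,
    "endswith": str.endswith,
}


def string_check(left: str, right: str, operation: str) -> int:
    """Return 1 if the named check holds for the two strings, otherwise 0."""
    try:
        check = STRING_OPERATIONS[operation]
    except KeyError:
        raise ValueError(f"unsupported operation: {operation!r}") from None
    return int(check(left, right))


print(string_check("The answer is yes", "yes", "contains"))   # 1
print(string_check("hello world", "hello", "endswith"))       # 0
```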
Supported Operations
- Equality: "equals", "=="
- Inequality: "!=", "<>", "not equals"
- Containment: "contains", "not contains"
- Pattern: "startswith", "endswith"
Local Evaluation
Remote Job
Example Result
Dataset Format
The examples on this page use inline dataset rows with dataset=[...]. Template fields determine the columns required by each metric:
- reference, references, left_template, and right_template read from item fields in the dataset.
- candidate reads from an item field for offline rows when configured.
- If candidate is omitted for BLEU, Exact Match, F1, or ROUGE, the metric uses sample.output_text, which is populated during online evaluations.
Keep field names consistent between the dataset rows and the templates you configure. For example, {{item.expected}} requires each row to include an expected field.
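A quick pre-flight check can catch mismatched field names before an evaluation runs. The sketch below is a hypothetical helper, not part of the platform: it uses a simple regular expression that only recognizes expressions beginning with item.<field>, so more complex Jinja expressions would need a real template parser.

```python
import re

# Matches the field name in expressions such as {{item.expected}} or
# {{ item.answer | lower }}. Simplification: `item.` must start the expression.
ITEM_FIELD = re.compile(r"\{\{\s*item\.([A-Za-z_]\w*)")


def missing_fields(templates, rows):
    """Map each row index to the template fields that the row lacks."""
    required = {name for template in templates for name in ITEM_FIELD.findall(template)}
    problems = {}
    for index, row in enumerate(rows):
        missing = sorted(required - set(row))
        if missing:
            problems[index] = missing
    return problems


templates = ["{{item.expected}}", "{{ item.answer | lower | trim }}"]
rows = [
    {"expected": "paris", "answer": "Paris"},
    {"answer": "London"},  # no "expected" field
]
print(missing_fields(templates, rows))  # {1: ['expected']}
```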