Metrics

When evaluating your teacher model or trained SLM, distil labs uses different metrics depending on your task type. This guide explains how to interpret each metric.

Text generation metrics

These metrics are used for question answering, classification, and other text generation tasks. For most use cases, LLM-as-a-Judge is the recommended metric as it best captures semantic correctness regardless of exact phrasing.

LLM-as-a-Judge

If we let a large language model act as a human grader, does it say this answer is good? Scores reflect semantic quality even when wording differs, making it useful when many valid answers are possible. See also this arXiv survey for research behind this approach.
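
As a rough sketch of the idea (not distil labs’ actual judge prompt or scoring), the snippet below asks a grader model for a yes/no verdict. Here call_llm is a hypothetical stand-in for whichever LLM client you use.

  from typing import Callable

  def llm_judge(question: str, prediction: str, reference: str,
                call_llm: Callable[[str], str]) -> int:
      # call_llm is a hypothetical helper that sends a prompt to an LLM and
      # returns its text reply; the real judge prompt and scale may differ.
      prompt = (
          "You are grading an answer.\n"
          f"Question: {question}\n"
          f"Reference answer: {reference}\n"
          f"Candidate answer: {prediction}\n"
          "Is the candidate answer semantically equivalent to the reference? "
          "Reply YES or NO."
      )
      return int(call_llm(prompt).strip().upper().startswith("YES"))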

Exact-Match (Binary)

Did the model give exactly the same words as the reference answer? Returns 1 for a perfect match, 0 for anything else. Great for facts that have one correct phrasing, but harsh on synonyms.
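
A minimal sketch of the comparison, assuming only surrounding whitespace is stripped before comparing; the exact normalisation applied in practice (casing, punctuation) may differ.

  def exact_match(prediction: str, reference: str) -> int:
      # Assumption: only surrounding whitespace is stripped; the actual
      # normalisation of casing or punctuation may differ.
      return int(prediction.strip() == reference.strip())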

ROUGE-L

How much word-overlap is there between the answer and reference? Measures the longest common subsequence between the two texts. Higher scores indicate more shared wording; the metric favours longer answers that reuse reference phrases. Widely used in text-summarisation benchmarks.
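
A minimal sketch of the computation, assuming whitespace tokenisation and the common F1 formulation (the original ROUGE-L definition uses a weighted harmonic mean); production implementations differ in tokenisation and smoothing.

  def rouge_l(prediction: str, reference: str) -> float:
      # Word-level longest common subsequence (LCS) via dynamic programming.
      pred, ref = prediction.split(), reference.split()
      dp = [[0] * (len(ref) + 1) for _ in range(len(pred) + 1)]
      for i, p_word in enumerate(pred, 1):
          for j, r_word in enumerate(ref, 1):
              if p_word == r_word:
                  dp[i][j] = dp[i - 1][j - 1] + 1
              else:
                  dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
      lcs = dp[len(pred)][len(ref)]
      if lcs == 0:
          return 0.0
      precision, recall = lcs / len(pred), lcs / len(ref)
      return 2 * precision * recall / (precision + recall)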

METEOR

Do the two answers share words or close synonyms/stems, and is the wording fluent? Balances precision and recall, rewards correct synonyms, and penalises word-salad. Often tracks human judgements better than pure overlap metrics.
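
If you want to reproduce a METEOR score offline, NLTK ships an implementation; the whitespace tokenisation below is a simplification, and the parameters used by distil labs’ scorer may differ.

  import nltk
  from nltk.translate.meteor_score import meteor_score

  # One-time downloads for METEOR's stem and synonym matching.
  nltk.download("wordnet")
  nltk.download("omw-1.4")

  reference = "The cat sat on the mat"
  prediction = "A cat was sitting on the mat"
  score = meteor_score([reference.split()], prediction.split())
  print(round(score, 3))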

How to interpret a scorecard

  • If Exact-Match is low but LLM-as-a-Judge is high, the answers are probably correct but paraphrased—consider adding those paraphrases to your reference set.
  • If all four numbers are low, revisit your task description or give the model more context; the task may be under-specified.

Tool calling metrics

These metrics are used for tool calling tasks. For most use cases, tool_call_equivalence is the recommended metric.

tool_call_equivalence

Compares the prediction and reference with intelligent handling of default values: parameters that weren’t explicitly set are treated as having their default values. This better reflects real-world correctness, where unset parameters fall back to defaults. Returns 1 if the tool calls are equivalent, 0 otherwise.
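
A hypothetical illustration of the idea, assuming a tool call is represented as a dict with "name" and "arguments" keys and that a mapping of parameter defaults is available; the actual representation and comparison logic may differ.

  def tool_call_equivalence(prediction: dict, reference: dict,
                            defaults: dict) -> int:
      # Assumed shape: {"name": "search", "arguments": {"query": "cats"}}.
      if prediction.get("name") != reference.get("name"):
          return 0

      def fill_defaults(arguments: dict) -> dict:
          # Unset parameters fall back to their default values.
          return {**defaults, **arguments}

      return int(fill_defaults(prediction.get("arguments", {}))
                 == fill_defaults(reference.get("arguments", {})))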

binary_tool_call

Compares the prediction and reference as Python dictionaries for exact equivalence. All keys must be present with identical values—the order of keys doesn’t matter. Unlike tool_call_equivalence, this metric does not account for default parameter values. Returns 1 if the tool calls are exactly equivalent, 0 otherwise.
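
A minimal sketch, assuming both calls are already parsed into Python dictionaries; dict equality ignores key order but requires every key and value to match.

  def binary_tool_call(prediction: dict, reference: dict) -> int:
      # Exact structural equality: no default values are filled in.
      return int(prediction == reference)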

staged_tool_call

Evaluates predictions incrementally across four stages, making it useful during development to understand where your model is failing. Each stage that passes adds 0.25 to the score (cumulative totals shown in parentheses):

  1. Valid JSON (0.25): Does the prediction contain valid JSON?
  2. Correct function name (0.50): Does prediction["name"] match reference["name"]?
  3. Correct parameter keys (0.75): Are the parameter keys identical between prediction and reference?
  4. Exact match (1.0): Is the entire prediction equivalent to the reference?

A score of 0.5 tells you the model called the right function but got the parameters wrong. A score of 0.25 means it produced valid JSON but called the wrong function.
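
A sketch of the staging logic, assuming the prediction arrives as raw text and parameters live under an "arguments" key (that key name is an assumption); the production metric may differ in details.

  import json

  def staged_tool_call(prediction_text: str, reference: dict) -> float:
      score = 0.0

      # Stage 1: valid JSON.
      try:
          prediction = json.loads(prediction_text)
      except (json.JSONDecodeError, TypeError):
          return score
      score += 0.25

      # Stage 2: correct function name.
      if not isinstance(prediction, dict) or \
              prediction.get("name") != reference.get("name"):
          return score
      score += 0.25

      # Stage 3: correct parameter keys ("arguments" key name is an assumption).
      if set(prediction.get("arguments", {})) != set(reference.get("arguments", {})):
          return score
      score += 0.25

      # Stage 4: exact match with the reference.
      if prediction == reference:
          score += 0.25
      return score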