Test Runs are scored using qualitative evaluation results that are attached to each test-run request. In the Portal, you’ll see this scoring in two main places:
  • Run-level score distribution (1–5): used for the score distribution visualization and “Pass Rate”.
  • Request-level criteria breakdown: used to compute a normalized score and show which criteria passed/failed.

Run-level scoring (1–5)

On a completed Test Run, the Results section shows a Score Distribution broken into 5 buckets:
  • 1 — Poor
  • 2 — Fair
  • 3 — Good
  • 4 — Great
  • 5 — Perfect
The Portal uses these buckets for two important workflows:
  • Pass Rate: computed as “share of completed requests with score ≥ 3” (see the sketch after this list).
  • Filtering: clicking a score bucket filters the run’s requests list to that score.
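
The following is a minimal sketch of the Pass Rate rule, not the Portal’s actual implementation; the RequestScore type, field names, and status values are assumptions made for illustration only.

```typescript
// Sketch: Pass Rate = share of completed requests with a run-level score ≥ 3.
// Type and field names below are illustrative assumptions, not the Portal's API.

type RequestScore = {
  score: 1 | 2 | 3 | 4 | 5;                   // run-level bucket (1 = Poor … 5 = Perfect)
  status: "completed" | "failed" | "pending"; // hypothetical request status
};

function passRate(requests: RequestScore[]): number {
  const completed = requests.filter((r) => r.status === "completed");
  if (completed.length === 0) return 0;
  const passing = completed.filter((r) => r.score >= 3).length;
  return (passing / completed.length) * 100; // percentage of completed requests scoring ≥ 3
}

// Example: completed requests scored 2, 3, and 5 → 2 of 3 pass → ~66.7%
```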

Request-level criteria (1–5 stars per criterion)

Each request in a Test Run can include a list of qualitative evaluation results. Each result contains:
  • Criteria name (a human-readable label)
  • Score value (1–5)
  • Description (optional explanatory text)
In the requests table, you can hover over a request’s score to see the criteria breakdown and the per-criterion star values.
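
As a rough sketch, a request’s qualitative evaluation results can be modeled as the shape below; the type and field names are assumptions for illustration and do not reflect the Portal’s actual schema.

```typescript
// Illustrative shape of a request's qualitative evaluation results.
// All names here are assumptions for this sketch.

type QualitativeEvaluationResult = {
  criteriaName: string;          // human-readable label
  scoreValue: 1 | 2 | 3 | 4 | 5; // per-criterion star value
  description?: string;          // optional explanatory text
};

type TestRunRequest = {
  id: string;
  qualitativeEvaluationResults: QualitativeEvaluationResult[];
};

// Example: a request evaluated against two criteria
const exampleRequest: TestRunRequest = {
  id: "req-123",
  qualitativeEvaluationResults: [
    { criteriaName: "Accuracy", scoreValue: 4, description: "Minor factual slip" },
    { criteriaName: "Tone", scoreValue: 5 },
  ],
};
```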

Normalized “percent score” (0–100)

In the requests table, the Portal derives a normalized percent score for each request by aggregating all criteria:
normalized_score = round((sum of score_values) / (5 × criteria_count) × 100)
This normalized score is primarily a UI convenience for comparing requests within a run. The score distribution and pass rate still use the 1–5 bucket values at the run level.
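
The formula above translates directly into code. This is a minimal sketch under the assumption that each criterion’s score value is available as a number from 1 to 5; the function name and input shape are illustrative only.

```typescript
// Sketch of the normalized percent score formula:
// normalized_score = round((sum of score_values) / (5 × criteria_count) × 100)

function normalizedScore(scoreValues: number[]): number | null {
  if (scoreValues.length === 0) return null; // no criteria → nothing to normalize
  const sum = scoreValues.reduce((acc, v) => acc + v, 0);
  return Math.round((sum / (5 * scoreValues.length)) * 100);
}

// Example: two criteria scored 4 and 5 → round(9 / 10 × 100) = 90
console.log(normalizedScore([4, 5])); // 90
```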

How to use rubrics when debugging

  • Start with the distribution: click into low-scoring buckets to focus your debugging.
  • Use criteria tooltips: the criteria list tells you why a request scored the way it did.
  • Compare responses: for any request, use the response comparison tools to see the original vs. the run output side-by-side (see Interpreting Results).