Test Runs are scored using qualitative evaluation results that are attached to each test-run request. In the Portal, you’ll see this scoring in two main places:
  • Run-level score distribution (1–5): used for the score distribution visualization and “Pass Rate”.
  • Request-level criteria breakdown: used to compute a normalized score and show which criteria passed/failed.

Run-level scoring (1–5)

On a completed Test Run, the Results section shows a Score Distribution broken into 5 buckets:
  • 1 — Poor
  • 2 — Fair
  • 3 — Good
  • 4 — Great
  • 5 — Perfect
The Portal uses these buckets for two important workflows:
  • Pass Rate: computed as “share of completed requests with score ≥ 3” (see the sketch after this list).
  • Filtering: clicking a score bucket filters the run’s requests list to that score.
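
The following is a minimal sketch of the Pass Rate rule, not the Portal’s actual implementation; the RequestScore type, field names, and status values are assumptions made for illustration only.

```typescript
// Sketch: Pass Rate = share of completed requests with a run-level score ≥ 3.
// Type and field names below are illustrative assumptions, not the Portal's API.

type RequestScore = {
  score: 1 | 2 | 3 | 4 | 5;                   // run-level bucket (1 = Poor … 5 = Perfect)
  status: "completed" | "failed" | "pending"; // hypothetical request status
};

function passRate(requests: RequestScore[]): number {
  const completed = requests.filter((r) => r.status === "completed");
  if (completed.length === 0) return 0;
  const passing = completed.filter((r) => r.score >= 3).length;
  return (passing / completed.length) * 100; // percentage of completed requests scoring ≥ 3
}

// Example: completed requests scored 2, 3, and 5 → 2 of 3 pass → ~66.7%
```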

Request-level criteria (1–5 stars per criterion)

Each request in a Test Run can include a list of qualitative evaluation results. Each result contains:
  • Criteria name (a human-readable label)
  • Score value (1–5)
  • Description (optional explanatory text)
In the requests table, you can hover over a request’s score to see the criteria breakdown and the per-criterion star values.
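
As a rough sketch, a request’s qualitative evaluation results can be modeled as the shape below; the type and field names are assumptions for illustration and do not reflect the Portal’s actual schema.

```typescript
// Illustrative shape of a request's qualitative evaluation results.
// All names here are assumptions for this sketch.

type QualitativeEvaluationResult = {
  criteriaName: string;          // human-readable label
  scoreValue: 1 | 2 | 3 | 4 | 5; // per-criterion star value
  description?: string;          // optional explanatory text
};

type TestRunRequest = {
  id: string;
  qualitativeEvaluationResults: QualitativeEvaluationResult[];
};

// Example: a request evaluated against two criteria
const exampleRequest: TestRunRequest = {
  id: "req-123",
  qualitativeEvaluationResults: [
    { criteriaName: "Accuracy", scoreValue: 4, description: "Minor factual slip" },
    { criteriaName: "Tone", scoreValue: 5 },
  ],
};
```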

Normalized “percent score” (0–100)

In the requests table, the Portal derives a normalized percent score for each request by aggregating all criteria:
normalized_score = round((sum of score_values) / (5 × criteria_count) × 100)
This normalized score is primarily a UI convenience for comparing requests within a run. The score distribution and pass rate still use the 1–5 bucket values at the run level.
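
The formula above translates directly into code. This is a minimal sketch under the assumption that each criterion’s score value is available as a number from 1 to 5; the function name and input shape are illustrative only.

```typescript
// Sketch of the normalized percent score formula:
// normalized_score = round((sum of score_values) / (5 × criteria_count) × 100)

function normalizedScore(scoreValues: number[]): number | null {
  if (scoreValues.length === 0) return null; // no criteria → nothing to normalize
  const sum = scoreValues.reduce((acc, v) => acc + v, 0);
  return Math.round((sum / (5 * scoreValues.length)) * 100);
}

// Example: two criteria scored 4 and 5 → round(9 / 10 × 100) = 90
console.log(normalizedScore([4, 5])); // 90
```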

How to use rubrics when debugging

  • Start with the distribution: click into low-scoring buckets to focus your debugging.
  • Use criteria tooltips: the criteria list tells you why a request scored the way it did.
  • Compare responses: for any request, use the response comparison tools to see the original vs. the run output side-by-side (see Interpreting Results).