Regression monitoring is the practice of running Test Runs repeatedly as your configuration and models evolve, then comparing the results to catch quality regressions early.
A practical regression workflow
- Build one “golden” set per Intent Group
  - Keep it focused: a smaller, high-signal set is easier to maintain and reason about.
  - Mark it as Golden when appropriate.
- Create a Test Run for every meaningful change
  - Examples: a new system prompt revision, a model swap, a temperature change, or a completed fine-tune.
  - Use descriptions that make comparisons easy later (e.g. “prompt v4”, “temp 0.0”, “model X”, “post-finetune run”).
- Compare runs instead of trusting a single metric
  - Use Compare Runs on the Test Set to see request-by-request score shifts across runs (a minimal offline version of this diff is sketched after this list).
  - Use Compare on a single request to track how that specific scenario changed over time.
- Focus attention where it matters
  - Click into low buckets in the score distribution (e.g. “Poor”, “Fair”) to quickly identify regressions.
  - Use the criteria breakdown tooltips to understand which criterion degraded.
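
The Compare Runs view does this diffing for you; if you pull run results out of the platform (for example as a JSON or CSV export), the same request-by-request triage can be reproduced offline. The sketch below is a minimal illustration, not Maitai’s API: the 0–100 score scale, the field layout, and the Poor/Fair/Good/Excellent thresholds are all assumptions.

```python
# Minimal offline diff of two Test Runs.
# The 0-100 score scale and bucket thresholds are illustrative
# assumptions, not Maitai's actual export schema.

def bucket(score: float) -> str:
    """Map a 0-100 score to a coarse quality bucket (assumed thresholds)."""
    if score >= 90:
        return "Excellent"
    if score >= 75:
        return "Good"
    if score >= 50:
        return "Fair"
    return "Poor"

def diff_runs(baseline: dict[str, float], candidate: dict[str, float]) -> None:
    """Print per-request score drops, worst regressions first."""
    shared = sorted(baseline.keys() & candidate.keys(),
                    key=lambda rid: candidate[rid] - baseline[rid])
    for rid in shared:
        delta = candidate[rid] - baseline[rid]
        if delta < 0:
            print(f"{rid}: {baseline[rid]:.0f} -> {candidate[rid]:.0f} "
                  f"({delta:+.0f}, now {bucket(candidate[rid])})")

# Scores keyed by request id, e.g. loaded from two run exports.
prompt_v3 = {"req-001": 92, "req-002": 81, "req-003": 77}
prompt_v4 = {"req-001": 94, "req-002": 55, "req-003": 48}
diff_runs(prompt_v3, prompt_v4)
# req-003: 77 -> 48 (-29, now Poor)
# req-002: 81 -> 55 (-26, now Fair)
```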
Common “gotchas” when monitoring regressions
- Error % matters: a run with a higher pass rate but also a higher error rate can still be a net regression once errors are counted against it (see the sketch after this list).
- Latency tradeoffs: track response-time percentiles (e.g. p50/p95) when comparing configurations.
- Coverage drift: as your product evolves, add new real-world failure cases to the Test Set so regressions don’t hide in untested corners.
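
To make the first two gotchas concrete, here is a hedged sketch that folds errors into an effective pass rate and computes latency percentiles with the standard library. The counts, and the choice to treat errored requests as failures, are illustrative assumptions rather than a prescribed Maitai metric.

```python
import statistics

def effective_pass_rate(passed: int, failed: int, errored: int) -> float:
    """Pass rate over ALL requests, counting errored requests as failures."""
    total = passed + failed + errored
    return passed / total if total else 0.0

# Run A: ~90% pass rate over scored requests, 2% errors.
# Run B: ~94% pass rate over scored requests, 12% errors.
run_a = effective_pass_rate(passed=88, failed=10, errored=2)   # 0.88
run_b = effective_pass_rate(passed=83, failed=5, errored=12)   # 0.83
print(f"A: {run_a:.0%}  B: {run_b:.0%}")  # B regressed despite its higher raw pass rate

def latency_percentiles(latencies_ms: list[float]) -> tuple[float, float]:
    """p50 and p95 response times via statistics.quantiles."""
    qs = statistics.quantiles(latencies_ms, n=100)  # qs[49] is p50, qs[94] is p95
    return qs[49], qs[94]

p50, p95 = latency_percentiles([120, 135, 150, 180, 900])  # one slow outlier
print(f"p50={p50:.0f}ms p95={p95:.0f}ms")  # p50=150ms p95=684ms
```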
Keeping Test Sets healthy
- Continuously add new failure modes discovered in Sessions/Requests via “Add to Test Set”.
- Tag requests consistently so you can quickly spot which categories are regressing (see the sketch below).
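
Once requests carry consistent tags, grouping score deltas by tag turns a pile of per-request regressions into a category-level signal. A minimal sketch, assuming hypothetical tags and the per-request deltas from the diff example above:

```python
from collections import Counter

def regressions_by_tag(deltas: dict[str, float],
                       tags: dict[str, list[str]]) -> Counter:
    """Count regressed requests (negative score delta) per tag."""
    counts: Counter = Counter()
    for rid, delta in deltas.items():
        if delta < 0:
            counts.update(tags.get(rid, ["untagged"]))
    return counts

# Hypothetical tags assigned when requests were added to the Test Set.
tags = {"req-002": ["billing", "tool-use"], "req-003": ["billing"]}
deltas = {"req-001": +2, "req-002": -26, "req-003": -29}
print(regressions_by_tag(deltas, tags).most_common())
# [('billing', 2), ('tool-use', 1)]
```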