Regression monitoring is the practice of running Test Runs repeatedly as your configuration and models evolve, then comparing the results to catch quality regressions early.
A practical regression workflow
- Build one “golden” set per Intent Group
  - Keep it focused: a smaller, high-signal set is easier to maintain and reason about.
  - Mark it as Golden when appropriate.
- Create a Test Run for every meaningful change
  - Examples: a new system prompt revision, a model swap, a temperature change, or a completed fine-tune.
  - Use descriptions that make comparisons easy later (e.g. “prompt v4”, “temp 0.0”, “model X”, “post-finetune run”).
- Compare runs instead of trusting a single metric
  - Use Compare Runs on the Test Set to see request-by-request score shifts across runs (a minimal offline version of this diff is sketched after this list).
  - Use Compare on a single request to track how that specific scenario changed over time.
- Focus attention where it matters
  - Click into low buckets in the score distribution (e.g. “Poor”, “Fair”) to quickly identify regressions.
  - Use the criteria breakdown tooltips to understand which criterion degraded.
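
The Compare Runs view does this diffing for you; if you pull run results out of the platform (for example as a JSON or CSV export), the same request-by-request triage can be reproduced offline. The sketch below is a minimal illustration, not Maitai’s API: the 0–100 score scale, the field layout, and the Poor/Fair/Good/Excellent thresholds are all assumptions.

```python
# Minimal offline diff of two Test Runs.
# The 0-100 score scale and bucket thresholds are illustrative
# assumptions, not Maitai's actual export schema.

def bucket(score: float) -> str:
    """Map a 0-100 score to a coarse quality bucket (assumed thresholds)."""
    if score >= 90:
        return "Excellent"
    if score >= 75:
        return "Good"
    if score >= 50:
        return "Fair"
    return "Poor"

def diff_runs(baseline: dict[str, float], candidate: dict[str, float]) -> None:
    """Print per-request score drops, worst regressions first."""
    shared = sorted(baseline.keys() & candidate.keys(),
                    key=lambda rid: candidate[rid] - baseline[rid])
    for rid in shared:
        delta = candidate[rid] - baseline[rid]
        if delta < 0:
            print(f"{rid}: {baseline[rid]:.0f} -> {candidate[rid]:.0f} "
                  f"({delta:+.0f}, now {bucket(candidate[rid])})")

# Scores keyed by request id, e.g. loaded from two run exports.
prompt_v3 = {"req-001": 92, "req-002": 81, "req-003": 77}
prompt_v4 = {"req-001": 94, "req-002": 55, "req-003": 48}
diff_runs(prompt_v3, prompt_v4)
# req-003: 77 -> 48 (-29, now Poor)
# req-002: 81 -> 55 (-26, now Fair)
```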
Common “gotchas” when monitoring regressions
- Error % matters: a run with a higher pass rate but also a higher error rate can still be a net regression once errors are counted against it (see the sketch after this list).
- Latency tradeoffs: track response-time percentiles (e.g. p50/p95) when comparing configurations.
- Coverage drift: as your product evolves, add new real-world failure cases to the Test Set so regressions don’t hide in untested corners.
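
To make the first two gotchas concrete, here is a hedged sketch that folds errors into an effective pass rate and computes latency percentiles with the standard library. The counts, and the choice to treat errored requests as failures, are illustrative assumptions rather than a prescribed Maitai metric.

```python
import statistics

def effective_pass_rate(passed: int, failed: int, errored: int) -> float:
    """Pass rate over ALL requests, counting errored requests as failures."""
    total = passed + failed + errored
    return passed / total if total else 0.0

# Run A: ~90% pass rate over scored requests, 2% errors.
# Run B: ~94% pass rate over scored requests, 12% errors.
run_a = effective_pass_rate(passed=88, failed=10, errored=2)   # 0.88
run_b = effective_pass_rate(passed=83, failed=5, errored=12)   # 0.83
print(f"A: {run_a:.0%}  B: {run_b:.0%}")  # B regressed despite its higher raw pass rate

def latency_percentiles(latencies_ms: list[float]) -> tuple[float, float]:
    """p50 and p95 response times via statistics.quantiles."""
    qs = statistics.quantiles(latencies_ms, n=100)  # qs[49] is p50, qs[94] is p95
    return qs[49], qs[94]

p50, p95 = latency_percentiles([120, 135, 150, 180, 900])  # one slow outlier
print(f"p50={p50:.0f}ms p95={p95:.0f}ms")  # p50=150ms p95=684ms
```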
Keeping Test Sets healthy
- Continuously add new failure modes discovered in Sessions/Requests via “Add to Test Set”.
- Tag requests consistently so you can quickly spot which categories are regressing (see the sketch below).
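
Once requests carry consistent tags, grouping score deltas by tag turns a pile of per-request regressions into a category-level signal. A minimal sketch, assuming hypothetical tags and the per-request deltas from the diff example above:

```python
from collections import Counter

def regressions_by_tag(deltas: dict[str, float],
                       tags: dict[str, list[str]]) -> Counter:
    """Count regressed requests (negative score delta) per tag."""
    counts: Counter = Counter()
    for rid, delta in deltas.items():
        if delta < 0:
            counts.update(tags.get(rid, ["untagged"]))
    return counts

# Hypothetical tags assigned when requests were added to the Test Set.
tags = {"req-002": ["billing", "tool-use"], "req-003": ["billing"]}
deltas = {"req-001": +2, "req-002": -26, "req-003": -29}
print(regressions_by_tag(deltas, tags).most_common())
# [('billing', 2), ('tool-use', 1)]
```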