The pattern
Passfail_threshold to suite.run(). If the pass rate drops below it, the process exits with code 1 — blocking the deployment.
GitHub Actions
Speed up CI with parallel workers
Async runner for async model functions
Save and view results
Run a specific eval file
Tips
- Keep a golden dataset — a small set of cases (20-50) that cover your most critical behaviors. Run these on every PR.
- Separate eval sets — use tags to split fast deterministic evals (run on every commit) from slower LLM-judge evals (run nightly or on main only).
- Store results as artifacts — track score trends over time by archiving
results.jsonper run.

