Why single-run scores lie
LLMs are non-deterministic. Even with `temperature=0`, hosted APIs introduce variance through hardware parallelism and batching (ACL 2025). A 91% pass rate on 50 cases has a 95% confidence interval of roughly 80–97%; rerun the same cases and the score could plausibly land anywhere in that range.
The fix: run more cases, run each case multiple times, and use confidence intervals to understand what your score actually means.
Confidence intervals with wilson_interval
The Wilson score interval is a robust CI for binomial proportions: it handles small n and extreme pass rates much better than the normal approximation.
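For reference, here is a minimal, self-contained sketch of the Wilson score computation. The library's `wilson_interval` helper is presumably equivalent, but its exact signature isn't shown here, so the function below is illustrative:

```python
from math import sqrt

def wilson_interval(successes: int, total: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 ~ 95% CI)."""
    if total == 0:
        return (0.0, 1.0)
    p = successes / total
    denom = 1 + z ** 2 / total
    center = (p + z ** 2 / (2 * total)) / denom
    half = (z / denom) * sqrt(p * (1 - p) / total + z ** 2 / (4 * total ** 2))
    return (max(0.0, center - half), min(1.0, center + half))

# 46/50 passes looks like 92%, but the interval is wide at this sample size
lo, hi = wilson_interval(46, 50)
print(f"92% pass rate, 95% CI [{lo:.0%}, {hi:.0%}]")  # roughly [81%, 97%]
```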
`experiment.compare()` shows these intervals automatically:
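The output below is illustrative only (the exact fields and layout of `experiment.compare()` aren't specified here); the point is that every pass rate carries its interval:

```text
candidate A: 84/100 passed   (84.0%, 95% CI [75.6%, 89.9%])
candidate B: 91/100 passed   (91.0%, 95% CI [83.8%, 95.2%])
delta: +7.0%, p = 0.13  ->  not significant at p < 0.05 (intervals overlap)
```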
Know how many cases you need
Before running an eval, calculate whether your test suite is large enough to detect the improvement you care about.

| Effect size | Min cases needed |
|---|---|
| 15% improvement | ~118 |
| 10% improvement | ~291 |
| 5% improvement | ~1,248 |
| 2% improvement | ~7,700 |
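The figures above depend on assumptions about the baseline pass rate, significance level, and statistical power that aren't spelled out here, so treat them as order-of-magnitude guidance. A sketch of the standard two-proportion z-test sample-size formula (the `cases_needed` name is made up for this example) lets you plug in your own baseline:

```python
from math import ceil, sqrt
from scipy.stats import norm

def cases_needed(p_base: float, delta: float, alpha: float = 0.05, power: float = 0.8) -> int:
    """Cases per variant to detect an absolute improvement `delta` over a
    baseline pass rate `p_base`, using a two-sided two-proportion z-test."""
    z_a = norm.ppf(1 - alpha / 2)   # significance threshold
    z_b = norm.ppf(power)           # desired power
    p1, p2 = p_base, p_base + delta
    p_bar = (p1 + p2) / 2
    n = (z_a * sqrt(2 * p_bar * (1 - p_bar))
         + z_b * sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2 / delta ** 2
    return ceil(n)

print(cases_needed(0.80, 0.10))  # cases per variant to detect an 80% -> 90% jump
```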
Power hints in compare()
When `compare()` finds a difference that doesn’t reach p < 0.05, it tells you how many more cases you’d need:
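The exact wording of the hint isn't documented here, so the following only illustrates the shape of the message:

```text
delta: +4.0%, p = 0.21  ->  not significant at p < 0.05
hint: at this effect size, roughly 950 more cases would be needed to reach 80% power
```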
Multi-run flakiness detection
Combine `runs=N` with statistical rigor for per-case stability analysis:
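A minimal sketch of the idea (the `results` structure and the flakiness rule here are assumptions, not the library's internals): a case is flaky if it both passes and fails across its N runs.

```python
# Per-case outcomes from runs=5; in practice these come from the eval harness.
results = {
    "case_01": [True, True, True, True, True],
    "case_02": [True, False, True, False, True],    # unstable across runs
    "case_03": [False, False, False, False, False],
}

flaky = {
    case: f"{sum(outcomes)}/{len(outcomes)} passes"
    for case, outcomes in results.items()
    if 0 < sum(outcomes) < len(outcomes)   # passed sometimes, failed sometimes
}
print(f"{len(flaky)} flaky case(s): {flaky}")   # -> {'case_02': '3/5 passes'}
```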
Recommended defaults
| Scenario | Setting |
|---|---|
| Quick iteration | runs=1, 20–50 cases (fast, coarse) |
| Pre-ship check | runs=3, 100+ cases |
| Regression gate | runs=5, 200+ cases, fail_threshold=0.85 |
| Significance test | runs=1, ≥291 cases for 10% delta detection |
| Flakiness audit | runs=10, any case count |
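As a concrete reading of the regression-gate row, here is a sketch of the gate semantics; how the harness wires this up is assumed, and only `runs` and `fail_threshold` come from the table above.

```python
def regression_gate(case_passed: list[bool], fail_threshold: float = 0.85) -> bool:
    """Fail the gate when the overall pass rate drops below fail_threshold."""
    pass_rate = sum(case_passed) / len(case_passed)
    return pass_rate >= fail_threshold

# 184/200 cases pass -> 92%, which clears a 0.85 gate
print(regression_gate([True] * 184 + [False] * 16))   # True
```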
Interpretation checklist
Before trusting an eval result, ask:

- Is the improvement statistically significant? (`exp.compare()` shows the p-value.)
- Are the confidence intervals non-overlapping? If CI(before) and CI(after) overlap, the difference is inconclusive (see the overlap sketch below).
- Do I have enough cases? Use `runs_needed()` to check before running.
- Are there flaky cases inflating the variance? Check `report.flaky_count`.
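The overlap check from the second item can be made mechanical by reusing the `wilson_interval` sketch from earlier; overlapping CIs are a conservative signal, not a replacement for the p-value.

```python
def cis_overlap(passed_a: int, total_a: int, passed_b: int, total_b: int) -> bool:
    """True when the two Wilson intervals overlap, i.e. the comparison is
    inconclusive at this sample size (uses wilson_interval defined above)."""
    lo_a, hi_a = wilson_interval(passed_a, total_a)
    lo_b, hi_b = wilson_interval(passed_b, total_b)
    return max(lo_a, lo_b) <= min(hi_a, hi_b)

print(cis_overlap(84, 100, 91, 100))   # True -> intervals overlap, inconclusive
```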

