# Run each case multiple times

Set `runs > 1` and every case is executed that many times. For each case you then get:
- Score: mean across runs
- Pass rate: fraction of runs that passed
- Stability: whether the case passes or fails consistently across runs
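
For concreteness, here is a hand computation of those three numbers from one case's repeated runs. This is an illustration, not framework code; the values are made up:

```python
from statistics import mean

# Hypothetical outcomes for a single case with runs=5.
run_scores = [0.9, 0.7, 0.8, 0.9, 0.6]
run_passed = [True, False, True, True, True]

score = mean(run_scores)                       # 0.78: mean across runs
pass_rate = sum(run_passed) / len(run_passed)  # 0.8: fraction of runs that passed
stable = len(set(run_passed)) == 1             # False: outcomes were mixed
```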
## Reading the results
### Terminal output
The reporter adds pass rate and stability columns automatically when `runs > 1`:
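
For illustration only (the exact column names and layout depend on the reporter), the table might look like:

```
Case        Score   Pass rate   Stability
checkout    0.91    3/3         stable
refund      0.64    2/3         flaky
```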
## Combine with parallel execution
Run cases in parallel while also running each case multiple times:
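A minimal sketch of the pattern using the standard library; it stands in for the framework's own `runs`/`workers` options, and `run_once` is a hypothetical stub:

```python
from concurrent.futures import ThreadPoolExecutor

RUNS = 3      # execute each case this many times
WORKERS = 8   # evaluate this many cases concurrently

def run_once(case: str) -> bool:
    """Hypothetical stub: execute the case once and report pass/fail."""
    return True  # replace with your real case execution

def run_case(case: str) -> list[bool]:
    # All repeats of one case stay on one worker, so WORKERS bounds
    # concurrent cases, not concurrent runs.
    return [run_once(case) for _ in range(RUNS)]

cases = ["checkout", "refund", "search"]
with ThreadPoolExecutor(max_workers=WORKERS) as pool:
    results = dict(zip(cases, pool.map(run_case, cases)))
```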
## Statistical significance in experiment comparison

When comparing two runs, `exp.compare()` now shows whether the difference is real or sampling noise:
- `p < 0.01` (✦✦): highly significant, very unlikely to be noise
- `p < 0.05` (✦): significant at the standard threshold
- `p < 0.10`: marginal, treat with caution
- `p ≥ 0.10`: not significant, likely sampling noise
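
For intuition, here is one common way such a p-value can be computed: a paired t-test over per-case scores. This is an illustration, not necessarily the exact test `exp.compare()` applies, and the numbers are made up:

```python
from scipy import stats

# Per-case scores from two experiments, aligned by case.
scores_a = [0.82, 0.75, 0.91, 0.64, 0.88]
scores_b = [0.79, 0.70, 0.85, 0.60, 0.84]

t_stat, p_value = stats.ttest_rel(scores_a, scores_b)
if p_value < 0.01:
    verdict = "highly significant (✦✦)"
elif p_value < 0.05:
    verdict = "significant (✦)"
elif p_value < 0.10:
    verdict = "marginal, treat with caution"
else:
    verdict = "likely sampling noise"
print(f"p = {p_value:.3f}: {verdict}")
```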
## CI/CD: fail on instability
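
To gate a pipeline on stability, fail the job when any case is flaky. A minimal sketch, assuming per-case pass/fail lists are available (the data shape here is hypothetical):

```python
import sys

# Hypothetical shape: case name -> per-run pass/fail outcomes.
results = {
    "checkout": [True, True, True],
    "refund": [True, False, True],  # flaky: mixed outcomes
}

flaky = [
    case for case, runs in results.items()
    if 0 < sum(runs) < len(runs)  # at least one pass and one fail
]
if flaky:
    print(f"Unstable cases: {', '.join(flaky)}")
    sys.exit(1)  # non-zero exit fails the CI job
```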
## Recommended defaults
| Use case | `runs` | `workers` |
|---|---|---|
| Quick CI check | 1 | 4–8 |
| Nightly regression | 3 | 8 |
| Flakiness audit | 5–10 | 4 |
| Agent evaluation | 5 | 2–4 |
Use `runs=3` for most pipelines: it is the smallest count that yields a meaningful pass rate and an unambiguous majority vote.
## How scores are aggregated
For each case across N runs:

- Score: mean of per-run scores
- Passed: majority vote; the case passes if more than half of its runs passed
- Flaky: `0 < pass_count < N` (at least one pass and one fail)
- Latency: mean across runs
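
A sketch of these aggregation rules, assuming each run is recorded as a small result object (the field names are illustrative, not the framework's):

```python
from dataclasses import dataclass
from statistics import mean

@dataclass
class RunResult:
    score: float
    passed: bool
    latency_ms: float

def aggregate(runs: list[RunResult]) -> dict:
    n = len(runs)
    pass_count = sum(r.passed for r in runs)
    return {
        "score": mean(r.score for r in runs),           # mean of per-run scores
        "passed": pass_count > n / 2,                   # majority vote
        "flaky": 0 < pass_count < n,                    # at least one pass and one fail
        "latency_ms": mean(r.latency_ms for r in runs), # mean across runs
    }

# Two passes and one failure: passes overall, but flagged flaky.
print(aggregate([
    RunResult(0.9, True, 120.0),
    RunResult(0.2, False, 180.0),
    RunResult(0.8, True, 140.0),
]))
```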

