Experiment records every suite.run() to disk so you can compare runs across model versions, prompt changes, or time. No cloud or account required — runs are stored as JSONL under ~/.multivon/experiments/ (override with the MULTIVON_HOME environment variable).
Recording a run
Experiment(name)
| Parameter | Type | Description |
|---|---|---|
| name | string | Experiment name. Used as the JSONL filename under ~/.multivon/experiments/. |
Experiment.record(report, tags=None, run_id=None)
| Parameter | Type | Default | Description |
|---|---|---|---|
| report | EvalReport | required | The report returned by suite.run(). |
| tags | dict[str, str] | None | Optional metadata such as model name, prompt version, or commit hash. |
| run_id | string | auto | Explicit run ID. If omitted, an 8-character hex ID is generated. |
Returns the run_id string. Keep it around to compare against later runs.
Each recorded run captures:
| Field | Description |
|---|---|
| run_id | Unique ID for the run. |
| suite_name | The suite name from the report. |
| model_id | Model identifier reported by the suite. |
| timestamp | ISO 8601 UTC timestamp. |
| pass_rate | Fraction of cases that passed (0–1). |
| avg_score | Mean score across all evaluators. |
| total / passed / failed | Case counts. |
| scores_by_evaluator | Per-evaluator average scores. |
| tags | Whatever you passed to record(). |
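To make the record shape concrete, here is a minimal sketch that builds a dictionary with the fields in the table above and appends it to a JSONL file. It is a stand-in written with the standard library only, not the library's own code: the helper names `make_run_record` and `append_record` are hypothetical, and computing `avg_score` as the mean of the per-evaluator averages is an assumption.

```python
import json
import uuid
from datetime import datetime, timezone
from pathlib import Path


def make_run_record(suite_name, model_id, results, scores_by_evaluator,
                    tags=None, run_id=None):
    """Build a run record with the documented fields (illustrative stand-in).

    `results` is a list of booleans, one per case (True means the case passed).
    """
    total = len(results)
    passed = sum(1 for ok in results if ok)
    scores = list(scores_by_evaluator.values())
    return {
        # 8-character hex ID when none is supplied, as the docs describe.
        "run_id": run_id or uuid.uuid4().hex[:8],
        "suite_name": suite_name,
        "model_id": model_id,
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "pass_rate": passed / total if total else 0.0,
        # Assumption: mean of the per-evaluator averages.
        "avg_score": sum(scores) / len(scores) if scores else 0.0,
        "total": total,
        "passed": passed,
        "failed": total - passed,
        "scores_by_evaluator": dict(scores_by_evaluator),
        "tags": dict(tags or {}),
    }


def append_record(path, record):
    """Append one record as a single JSON line, creating parent directories."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(record) + "\n")
```

Because each run is one appended line, a history file can be inspected or trimmed with ordinary text tools.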
Listing history
| Method | Returns | Description |
|---|---|---|
| history(n=20) | list[RunRecord] | Last n runs, newest first. |
| print_history(n=10) | None | Print the last n runs as a table. |
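The "last n runs, newest first" behaviour can be illustrated by reading the JSONL file directly. This is a sketch under the assumption that records are appended in chronological order, so the tail of the file holds the newest runs. `read_history` is a hypothetical helper that returns plain dictionaries rather than the library's RunRecord objects.

```python
import json
from pathlib import Path


def read_history(path, n=20):
    """Return the last n records from a JSONL file, newest first."""
    path = Path(path)
    if n <= 0 or not path.exists():
        return []
    with path.open(encoding="utf-8") as fh:
        # Skip blank lines so a trailing newline does not break parsing.
        records = [json.loads(line) for line in fh if line.strip()]
    # Take the tail of the file, then reverse it so the newest run comes first.
    return records[-n:][::-1]
```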
Comparing two runs
The comparison reports a verdict of IMPROVED, REGRESSION, or No meaningful change, based on the pass-rate delta.
| Parameter | Type | Description |
|---|---|---|
| run_id_a | string | Baseline run ID. |
| run_id_b | string | New run ID. Deltas are reported as b - a. |
Raises ValueError if either run ID is missing from the experiment.
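The comparison logic described above (deltas reported as b - a, a verdict derived from the pass-rate delta, and a ValueError for unknown run IDs) can be sketched as follows. The function name `compare_runs` and the `min_delta` cutoff for a "meaningful" change are assumptions made for illustration, since the source does not state the threshold the library uses.

```python
def compare_runs(records, run_id_a, run_id_b, min_delta=0.01):
    """Compare two run records; deltas are new (b) minus baseline (a).

    `records` is a list of dicts with at least run_id, pass_rate, avg_score.
    `min_delta` is a hypothetical cutoff, not a documented library value.
    """
    by_id = {r["run_id"]: r for r in records}
    for rid in (run_id_a, run_id_b):
        if rid not in by_id:
            raise ValueError(f"run {rid!r} not found in experiment")

    a, b = by_id[run_id_a], by_id[run_id_b]
    pass_delta = b["pass_rate"] - a["pass_rate"]

    if pass_delta > min_delta:
        verdict = "IMPROVED"
    elif pass_delta < -min_delta:
        verdict = "REGRESSION"
    else:
        verdict = "No meaningful change"

    return {
        "pass_rate_delta": pass_delta,
        "avg_score_delta": b["avg_score"] - a["avg_score"],
        "verdict": verdict,
    }
```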
Listing all experiments
Storage location
By default, runs are written to ~/.multivon/experiments/<name>.jsonl. Override the base directory with the MULTIVON_HOME environment variable.
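One plausible reading of the override is sketched below. It assumes that MULTIVON_HOME replaces the ~/.multivon base directory and that the experiments/ subdirectory stays beneath it; the source does not spell this out, so check the actual layout on disk. `experiment_path` is a hypothetical helper, not part of the library.

```python
import os
from pathlib import Path


def experiment_path(name, env=None):
    """Resolve the JSONL path for an experiment (assumed layout, see above)."""
    env = os.environ if env is None else env
    override = env.get("MULTIVON_HOME")
    # Assumption: the override replaces ~/.multivon, not the experiments/ folder.
    base = Path(override) if override else Path.home() / ".multivon"
    return base / "experiments" / f"{name}.jsonl"
```

Passing `env` explicitly keeps the helper easy to test without mutating the process environment.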
CLI
| Subcommand | Arguments | Description |
|---|---|---|
| list | (none) | Print every experiment with a recent timestamp. |
| history | <name> [--n N] | Show the last N runs (default 10) for an experiment. |
| compare | <name> <run_a> <run_b> | Print a side-by-side diff between two runs. |
Tracking regressions in CI
Pair experiment recording with fail_threshold to catch regressions automatically.
Save ~/.multivon/experiments/ as a CI artifact (or sync it to a shared volume) to keep history across runs.
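As a rough illustration of a CI gate built on recorded history, the sketch below fails the job when the newest run falls under a pass-rate floor or drops relative to the previous run. It operates on plain record dictionaries ordered newest first and does not call the library. The name `ci_exit_code` and the `max_drop` parameter are hypothetical, and `fail_threshold` here only mirrors the idea of the library option, not its actual signature or placement.

```python
def ci_exit_code(history, fail_threshold=0.8, max_drop=0.0):
    """Return 0 if the newest run is acceptable, 1 otherwise.

    `history` is a list of run records ordered newest first, each with a
    pass_rate field. All parameter names here are illustrative.
    """
    if not history:
        return 0  # nothing recorded yet, so there is nothing to gate on

    latest = history[0]
    if latest["pass_rate"] < fail_threshold:
        return 1  # below the absolute floor

    if len(history) > 1:
        drop = history[1]["pass_rate"] - latest["pass_rate"]
        if drop > max_drop:
            return 1  # regressed relative to the previous run

    return 0
```

In a pipeline, the result would typically be passed to `sys.exit()` so that a non-zero value marks the job as failed.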
