Experiment records suite.run() reports to disk so you can compare runs across model versions, prompt changes, or time. No cloud service or account is required; runs are stored as JSONL under ~/.multivon/experiments/ (override with the MULTIVON_HOME environment variable).

Recording a run

```python
from multivon_eval import EvalSuite, Experiment

suite = EvalSuite("rag-pipeline")
# ... add cases and evaluators ...
report = suite.run(my_model)

exp = Experiment("rag-pipeline")
run_id = exp.record(report, tags={"model": "gpt-4o", "prompt_v": "3"})
```

Experiment(name)

| Parameter | Type | Description |
| --- | --- | --- |
| `name` | string | Experiment name. Used as the JSONL filename under ~/.multivon/experiments/. |

Experiment.record(report, tags=None, run_id=None)

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| `report` | EvalReport | required | The report returned by suite.run(). |
| `tags` | dict[str, str] | None | Optional metadata such as model name, prompt version, or commit hash. |
| `run_id` | string | auto | Explicit run ID. If omitted, an 8-character hex ID is generated. |

Returns the run_id string; keep it around to compare against later runs. Each recorded run captures:
| Field | Description |
| --- | --- |
| `run_id` | Unique ID for the run. |
| `suite_name` | The suite name from the report. |
| `model_id` | Model identifier reported by the suite. |
| `timestamp` | ISO 8601 UTC timestamp. |
| `pass_rate` | Fraction of cases that passed (0–1). |
| `avg_score` | Mean score across all evaluators. |
| `total` / `passed` / `failed` | Case counts. |
| `scores_by_evaluator` | Per-evaluator average scores. |
| `tags` | Whatever you passed to record(). |
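Concretely, a stored line is a JSON object carrying the fields above. The sketch below builds one such record with illustrative values; note that `secrets.token_hex(4)` is just one plausible way to produce an 8-character hex ID, not necessarily the library's actual scheme.

```python
import json
import secrets

# One way an 8-character hex run ID could be generated (assumption:
# the library's real scheme is not documented here).
run_id = secrets.token_hex(4)

# An illustrative record with the documented fields; all values are made up.
line = json.dumps({
    "run_id": run_id,
    "suite_name": "rag-pipeline",
    "model_id": "gpt-4o",
    "timestamp": "2024-06-01T12:00:00+00:00",
    "pass_rate": 0.9,
    "avg_score": 0.87,
    "total": 20, "passed": 18, "failed": 2,
    "scores_by_evaluator": {"faithfulness": 0.91, "relevance": 0.83},
    "tags": {"model": "gpt-4o", "prompt_v": "3"},
})

# Each JSONL line round-trips as a plain dict.
record = json.loads(line)
```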

Listing history

```python
runs = exp.history(n=20)    # newest first
exp.print_history(n=10)     # pretty table
```
| Method | Returns | Description |
| --- | --- | --- |
| `history(n=20)` | list[RunRecord] | Last n runs, newest first. |
| `print_history(n=10)` | None | Print the last n runs as a table. |

Comparing two runs

```python
exp.compare(old_run_id, new_run_id)
```
Prints a side-by-side diff: pass rate, average score, evaluator-level scores, and tags. The verdict line summarizes the change as IMPROVED, REGRESSION, or No meaningful change based on the pass-rate delta.
| Parameter | Type | Description |
| --- | --- | --- |
| `run_id_a` | string | Baseline run ID. |
| `run_id_b` | string | New run ID. Deltas are reported as b - a. |

Raises ValueError if either run ID is missing from the experiment.
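The verdict thresholds are not documented here; a plausible classification rule might look like the following, where the ±0.02 cutoff is an assumption rather than the library's actual value.

```python
def verdict(pass_rate_a: float, pass_rate_b: float, eps: float = 0.02) -> str:
    """Classify the change from run a to run b by pass-rate delta.

    The eps cutoff is hypothetical; the real threshold used by
    exp.compare() is not documented here.
    """
    delta = pass_rate_b - pass_rate_a
    if delta > eps:
        return "IMPROVED"
    if delta < -eps:
        return "REGRESSION"
    return "No meaningful change"

print(verdict(0.80, 0.90))  # clear improvement
```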

Listing all experiments

```python
from multivon_eval import list_experiments

names = list_experiments()
# ["rag-pipeline", "summarizer-v2", ...]
```
Returns the names of every experiment with at least one recorded run.

Storage location

By default, runs are written to ~/.multivon/experiments/<name>.jsonl. Override the base directory with the MULTIVON_HOME environment variable:
```bash
export MULTIVON_HOME=/path/to/eval-history
```
Each line is a JSON object with the fields listed above — easy to grep, ship to S3, or load into a notebook.
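Because the files are plain JSONL, loading a history for analysis takes only the standard library. The path and records below are illustrative; in practice you would read ~/.multivon/experiments/&lt;name&gt;.jsonl (or the MULTIVON_HOME equivalent).

```python
import json
import os
import tempfile

# Illustrative records standing in for a real experiment file.
records = [
    {"run_id": "a1f3c5d7", "timestamp": "2024-06-01T12:00:00+00:00", "pass_rate": 0.85},
    {"run_id": "b2e4a6c8", "timestamp": "2024-06-02T12:00:00+00:00", "pass_rate": 0.90},
]

with tempfile.TemporaryDirectory() as d:
    path = os.path.join(d, "rag-pipeline.jsonl")
    with open(path, "w") as f:
        for r in records:
            f.write(json.dumps(r) + "\n")

    # One JSON object per line: parse each line independently.
    with open(path) as f:
        runs = [json.loads(line) for line in f]

best = max(runs, key=lambda r: r["pass_rate"])
print(best["run_id"])
```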

CLI

```bash
# List experiments that have at least one recorded run
multivon-eval experiments list

# Show recent runs for a specific experiment
multivon-eval experiments history rag-pipeline --n 20

# Compare two runs
multivon-eval experiments compare rag-pipeline <run_a> <run_b>
```
| Subcommand | Arguments | Description |
| --- | --- | --- |
| `list` | (none) | Print every experiment with its most recent timestamp. |
| `history` | `<name> [--n N]` | Show the last N runs (default 10) for an experiment. |
| `compare` | `<name> <run_a> <run_b>` | Print a side-by-side diff between two runs. |

Tracking regressions in CI

Pair experiment recording with fail_threshold to catch regressions automatically:
```python
import os

report = suite.run(my_model, fail_threshold=0.85)

exp = Experiment("rag-pipeline")
exp.record(report, tags={"commit": os.environ["GITHUB_SHA"]})
```
Archive ~/.multivon/experiments/ as a CI artifact (or sync it to a shared volume) to keep history across runs.
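A CI job can also fail the build on its own pass-rate check against the previous recorded run. The sketch below uses plain dicts as stand-ins for the last two history entries (an assumption, since the exact RunRecord shape is not shown here), and the 0.02 tolerance is a hypothetical choice to tune per suite.

```python
# Stand-ins for the two newest history entries, newest first; real
# RunRecord objects may expose these as attributes rather than dict keys.
history = [
    {"run_id": "b2e4a6c8", "pass_rate": 0.82},  # current run
    {"run_id": "a1f3c5d7", "pass_rate": 0.90},  # previous run
]

current, previous = history[0], history[1]
delta = current["pass_rate"] - previous["pass_rate"]

# Flag a regression if the pass rate dropped by more than 2 points
# (hypothetical tolerance; pick one that matches your suite's noise).
regressed = delta < -0.02
if regressed:
    print(f"regression: pass rate fell {delta:+.2f} vs run {previous['run_id']}")
```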