Why multivon-eval
Most eval tools force you into one mode: either fast-but-shallow (string matching) or thorough-but-expensive (LLM judge everywhere). multivon-eval gives you four tiers and lets you mix them freely.
Deterministic
String matching, regex, JSON schema, BLEU, ROUGE, latency. Instant, free, no LLM.
LLM-as-judge
QAG scoring — yes/no questions instead of unreliable 1-10 ratings. Faithfulness, hallucination, relevance, and more.
Agent trace
Tool call accuracy, plan quality, step faithfulness, task completion. Framework-agnostic.
Conversation
Knowledge retention, relevance, consistency, and completeness across multi-turn sessions.
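To make the QAG idea from the LLM-as-judge tier concrete, here is a minimal sketch of the scoring arithmetic: ask binary questions, then score by the fraction answered as expected. This is not multivon-eval's API. The names `qag_score` and `toy_judge` are hypothetical, and the judge is a keyword stub standing in for a real LLM call.

```python
from typing import Callable, Sequence


def qag_score(
    questions: Sequence[str],
    expected: Sequence[bool],
    answer_fn: Callable[[str], bool],
) -> float:
    """Return the fraction of yes/no questions answered as expected."""
    if len(questions) != len(expected):
        raise ValueError("questions and expected must have the same length")
    if not questions:
        raise ValueError("at least one question is required")
    # Each question contributes 1 if the judge's answer matches, else 0.
    correct = sum(answer_fn(q) == want for q, want in zip(questions, expected))
    return correct / len(questions)


def toy_judge(question: str) -> bool:
    """Hypothetical stand-in for an LLM judge: answers 'yes' on a keyword."""
    return "Paris" in question


# Illustrative questions about some model output (made up for this sketch).
questions = [
    "Does the answer name Paris as the capital?",
    "Does the answer mention a population figure?",
    "Does the answer contradict the source text?",
]
expected = [True, True, False]

score = qag_score(questions, expected, toy_judge)
print(f"QAG score: {score:.2f}")
```

Because every question is binary, each one can be inspected on its own when a score looks wrong, which is harder to do with a single 1-10 rating.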
Key concepts
QAG scoring: Instead of asking a judge “rate this 1-10”, we generate binary yes/no questions about the output and score by fraction answered correctly. More reliable, auditable, and cheaper.
Tiers: Use deterministic checks for things you can verify mechanically. Reach for LLM judges only where it matters. Pay for compute selectively.
CI/CD first: One line exits with code 1 if your pass rate drops below a threshold. Evals that don’t run in CI catch nothing.
Install
Set ANTHROPIC_API_KEY and/or OPENAI_API_KEY to use the LLM-judge evaluators.
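The CI/CD gate described under Key concepts comes down to a pass-rate comparison that is turned into a process exit code. The sketch below shows that logic in plain Python. It is not multivon-eval's own one-liner. `pass_rate` and `gate_exit_code` are hypothetical helper names, and the results list is made up.

```python
from typing import Sequence


def pass_rate(results: Sequence[bool]) -> float:
    """Fraction of eval cases that passed."""
    if not results:
        raise ValueError("no results to evaluate")
    return sum(results) / len(results)


def gate_exit_code(results: Sequence[bool], threshold: float) -> int:
    """Return 1 if the pass rate is below the threshold, otherwise 0."""
    return 1 if pass_rate(results) < threshold else 0


# Made-up outcomes for four eval cases: three passed, one failed.
results = [True, True, False, True]
exit_code = gate_exit_code(results, threshold=0.7)
print(f"pass rate={pass_rate(results):.2f}, exit code={exit_code}")

# In a CI script you would end with: raise SystemExit(exit_code)
```

A non-zero exit code is what makes most CI systems mark the job as failed, so the threshold works as a regression gate.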
Quickstart
Up and running in 5 minutes

