Evaluate multi-turn chat quality across an entire session.
Conversation evaluators assess quality across a full conversation, not just a single response. They use case.conversation — a list of {"role", "content"} message dicts.
from multivon_eval import EvalCasecase = EvalCase( input="Help me plan a trip to Japan", conversation=[ {"role": "user", "content": "I want to visit Japan in April"}, {"role": "assistant", "content": "April is perfect for cherry blossoms. What cities interest you?"}, {"role": "user", "content": "Tokyo and Kyoto"}, {"role": "assistant", "content": "Great choices. Tokyo for 4 days, Kyoto for 3 — here's an itinerary..."}, {"role": "user", "content": "What's my budget for this?"}, {"role": "assistant", "content": "For 7 days in Japan, budget around $150-250/day..."}, ],)