Production targets let you point an EvalSuite at a system that’s actually serving users. Each target is just a callable that takes a string and returns a string, so you can pass it directly to suite.run() anywhere a model_fn is accepted.

Install the extras you need:
pip install multivon-eval                  # core only
pip install 'multivon-eval[requests]'      # DeployedAPITarget, MultiTurnAPITarget
pip install 'multivon-eval[browser]'       # BrowserTarget (Playwright)
pip install 'multivon-eval[all]'           # everything
After installing the browser extra, also run playwright install chromium once.
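
Because a target only needs to be a callable from string to string, a plain function can stand in for one while a suite is still being developed. The sketch below uses a hypothetical canned_target function (not part of multivon-eval) to illustrate that contract. The suite.run() call is left commented out because it assumes an EvalSuite instance named suite that you have already built.

```python
def canned_target(prompt: str) -> str:
    """Hypothetical stand-in target: takes the input string, returns a response string."""
    # A real target would call your deployed system here.
    return f"echo: {prompt.strip()}"


print(canned_target("  hello  "))

# Assuming `suite` is an existing EvalSuite, the function is passed like any other target:
# report = suite.run(canned_target)
```

Swapping this function for a DeployedAPITarget or BrowserTarget later should not require changes to the suite itself, since all of them share the same call shape.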

DeployedAPITarget

Wraps a deployed REST endpoint as an eval target. Handles auth, retries, rate limiting, and response extraction from nested JSON.
import os
from multivon_eval import DeployedAPITarget, BearerAuth

target = DeployedAPITarget(
    url="https://api.yourapp.com/v1/chat",
    auth=BearerAuth(os.getenv("API_KEY")),
    output_path="choices.0.message.content",
)
# `suite` is an EvalSuite you have already built.
report = suite.run(target, runs=3)

Parameters

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| url | str | | Full endpoint URL. |
| method | str | "POST" | HTTP method. |
| auth | BearerAuth \| APIKeyAuth \| None | None | Auth helper attached to every request. |
| input_key | str | "message" | Key in the request body that receives the input string. |
| output_path | str | "response" | Dot-notation path to extract the response from the JSON body. List indices are supported, e.g. "choices.0.message.content". |
| extra_body | dict[str, Any] \| None | None | Additional fields merged into every request body. |
| headers | dict[str, str] \| None | None | Additional HTTP headers. |
| timeout | int | 30 | Per-request timeout in seconds. |
| retries | int | 2 | Number of retry attempts on 429 and 5xx responses. |
| rate_limit | float \| None | None | Max requests per second. None disables limiting. |
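
To make the roles of input_key and extra_body concrete, here is a minimal sketch of how a request body could be assembled from them. The build_body helper is hypothetical and is not the library's code. The merge order, where the input string wins over a clashing extra_body field, is an assumption the table above does not confirm.

```python
from typing import Any, Optional


def build_body(
    text: str,
    input_key: str = "message",
    extra_body: Optional[dict[str, Any]] = None,
) -> dict[str, Any]:
    """Hypothetical helper showing how the request body might be assembled."""
    # Assumption: extra_body fields are applied first, so the input string
    # wins if extra_body happens to reuse input_key.
    body = dict(extra_body or {})
    body[input_key] = text
    return body


print(build_body("Hello", input_key="prompt", extra_body={"model": "prod-v2", "stream": False}))
```

If your endpoint expects the prompt under a different key, changing input_key is likely all that is needed. Fixed fields such as a model name belong in extra_body.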

Behavior

  • Auth. BearerAuth(token) sends Authorization: Bearer <token>. APIKeyAuth(key, header="X-API-Key") sends a custom header. Pass either one to the auth argument; their headers are merged with headers.
  • Retries. 429 and 5xx responses are retried with exponential backoff using (2 ** attempt) * 0.5 seconds between attempts. After all retries are exhausted, a RuntimeError is raised with the last status code and attempt count, e.g. DeployedAPITarget failed after 3 attempt(s): HTTP 503 after 3 attempt(s).
  • Missing dependency. If the requests package isn’t installed, the constructor raises ImportError immediately rather than failing on the first call.
  • Response extraction. output_path walks the JSON response. Each segment is treated as a list index when the current value is a list, otherwise as a dict key. Missing keys return an empty string.
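
The extraction and backoff rules above can be sketched in a few lines. This is an illustration of the described behavior, not the library's implementation. extract_path and backoff_delays are hypothetical names. Returning an empty string for a bad list index or a scalar mid-path, and counting attempt from 0, are assumptions.

```python
from typing import Any


def extract_path(data: Any, path: str) -> Any:
    """Walk a dot-notation path through parsed JSON (illustrative only)."""
    current = data
    for segment in path.split("."):
        if isinstance(current, list):
            # Lists are indexed by the integer segment. Returning "" for a bad or
            # out-of-range index is an assumption that mirrors the missing-key rule.
            try:
                current = current[int(segment)]
            except (ValueError, IndexError):
                return ""
        elif isinstance(current, dict):
            if segment not in current:
                return ""  # missing keys yield an empty string
            current = current[segment]
        else:
            return ""  # cannot descend into a scalar (assumption)
    return current


def backoff_delays(retries: int) -> list[float]:
    """Delays from (2 ** attempt) * 0.5, assuming attempt counts from 0."""
    return [(2 ** attempt) * 0.5 for attempt in range(retries)]


payload = {"choices": [{"message": {"content": "Hello!"}}]}
print(extract_path(payload, "choices.0.message.content"))
print(repr(extract_path(payload, "choices.0.text")))
print(backoff_delays(2))
```

Under these assumptions, the default retries=2 would wait 0.5 s and then 1.0 s before the final attempt.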

MultiTurnAPITarget

Session-aware target for evaluating multi-turn conversations. Initializes a session (optional), sends the running history on each turn, and supports EvalCase.conversation.
import os
from multivon_eval import MultiTurnAPITarget, BearerAuth

target = MultiTurnAPITarget(
    url="https://api.yourapp.com/v1/chat",
    auth=BearerAuth(os.getenv("API_KEY")),
    session_init_url="https://api.yourapp.com/v1/sessions",
    session_id_path="session_id",
    session_header="X-Session-ID",
    output_path="response",
)

final, _ = target.run_conversation([
    {"role": "user", "content": "Hi, I need to cancel my subscription."},
    {"role": "assistant", "content": "Sure — what's your account email?"},
    {"role": "user", "content": "alex@example.com"},
])

Parameters

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| url | str | | Per-turn endpoint URL. |
| auth | BearerAuth \| APIKeyAuth \| None | None | Auth helper. |
| session_init_url | str \| None | None | Optional URL to POST once at the start of a conversation to create a session. |
| session_id_path | str | "session_id" | Dot-notation path to extract the session ID from the init response. |
| session_header | str | "X-Session-ID" | Header name used to send the session ID on subsequent requests. |
| history_key | str | "messages" | Key in the request body that carries the conversation history so far. |
| input_key | str | "message" | Key in the request body for the current user message. |
| output_path | str | "response" | Dot-notation path to extract the response from each turn’s JSON body. |
| timeout | int | 30 | Per-request timeout in seconds. |
| retries | int | 2 | Retry attempts on errors. |

Behavior

  • Calling target(input) is a single-turn shortcut — it wraps run_conversation for suite.run() compatibility.
  • run_conversation(turns, evaluators=None) returns (final_response, eval_results). Each user turn is sent with the running history; assistant turns in the input are appended directly without making a request.
  • On error after all retries, the turn’s response is set to the literal string "[API ERROR]" and the conversation continues.
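
The turn-handling rules above can be pictured with a small loop. This is a sketch under stated assumptions, not the library's implementation: run_turns and the send callable are hypothetical, and evaluators are left out. In particular, the choice to let a scripted assistant turn take the place of the live reply in the history is an assumption.

```python
from typing import Callable

Turn = dict[str, str]


def run_turns(turns: list[Turn], send: Callable[[str, list[Turn]], str]) -> tuple[str, list[Turn]]:
    """Illustrative turn loop; `send` is a hypothetical per-turn request function."""
    history: list[Turn] = []
    final = ""
    for i, turn in enumerate(turns):
        if turn["role"] != "user":
            history.append(turn)  # scripted assistant turn: no request is made
            continue
        try:
            # The current message is sent together with the history so far.
            final = send(turn["content"], list(history))
        except Exception:
            final = "[API ERROR]"  # documented fallback; the conversation continues
        history.append(turn)
        # Assumption: a scripted assistant turn that follows replaces the live reply.
        scripted_next = i + 1 < len(turns) and turns[i + 1]["role"] == "assistant"
        if not scripted_next:
            history.append({"role": "assistant", "content": final})
    return final, history


calls: list[tuple[str, int]] = []


def fake_send(message: str, history: list[Turn]) -> str:
    """Stand-in for an HTTP call; fails on demand to exercise the fallback."""
    calls.append((message, len(history)))
    if "fail" in message:
        raise RuntimeError("HTTP 503")
    return f"reply to: {message}"


final, history = run_turns(
    [
        {"role": "user", "content": "Hi"},
        {"role": "assistant", "content": "What's your account email?"},
        {"role": "user", "content": "alex@example.com"},
    ],
    fake_send,
)
print(final)
```

Because errors surface as the "[API ERROR]" string rather than an exception, it is worth checking for that sentinel before trusting a turn's score.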

BrowserTarget

Experimental. API and behavior may change. Known limitations:
  • No page state reset between eval cases. The page stays open across calls; a chat UI that accumulates history will work, but anything with per-session state will not.
  • Login uses hard-coded selectors (input[type='email'], input[type='password']). OAuth, SSO, and CAPTCHA are not supported.
  • wait_for_load_state("networkidle") is unreliable for SPAs with long-polling or WebSocket connections. Pass a wait_for= selector to wait on a specific response element instead.
  • No context manager support. Call close() explicitly or wrap usage in try/finally to avoid leaking browser processes on failure.
Playwright-based target for browser-rendered AI applications. Opens a real browser, optionally logs in, submits input via a CSS selector, waits for the response, and extracts the response text.
import os
from multivon_eval import BrowserTarget

target = BrowserTarget(
    url="https://chat.yourapp.com",
    input_selector="textarea[name='prompt']",
    submit_selector="button[type='submit']",
    response_selector=".message.assistant:last-child",
    wait_for=".message.assistant:last-child",
    login={"email": os.getenv("APP_USER"), "password": os.getenv("APP_PASS")},
    headless=True,
)
try:
    report = suite.run(target)
finally:
    target.close()

Parameters

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| url | str | | URL of the web app. |
| input_selector | str | "textarea" | CSS selector for the input field. |
| submit_selector | str | "button[type='submit']" | CSS selector for the submit button. |
| response_selector | str | ".response" | CSS selector for the response element. |
| wait_for | str \| None | None | CSS selector to wait for after submit. Recommended over the default networkidle strategy for SPAs. |
| login | dict[str, str] \| None | None | Optional {"email": ..., "password": ...} for the login flow. |
| headless | bool | True | Run the browser headlessly. |
| timeout | int | 30000 | Page load and response wait timeout in ms. |
| screenshot_on_fail | bool | True | Save a screenshot to multivon-fail-<timestamp>.png on failure. |
When a call fails, the target returns the literal string "[BROWSER ERROR: <message>]" so that the eval continues. Always call target.close() when finished.
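
Since the target exposes close() but no context manager, the standard library's contextlib.closing is one way to get with-statement cleanup without writing try/finally by hand. The sketch below demonstrates the pattern with a hypothetical FakeBrowserTarget stand-in, so it runs without Playwright. It relies only on the documented close() method.

```python
from contextlib import closing


class FakeBrowserTarget:
    """Hypothetical stand-in with the same call/close surface as a browser target."""

    def __init__(self) -> None:
        self.closed = False

    def __call__(self, prompt: str) -> str:
        return f"response to {prompt}"

    def close(self) -> None:
        self.closed = True


fake = FakeBrowserTarget()
with closing(fake) as target:
    # With a real BrowserTarget this line would be: report = suite.run(target)
    output = target("hello")

# closing() calls close() on exit, even if the body raises.
print(output, fake.closed)
```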

simulate_users

Generate synthetic adversarial and edge-case user personas, run each one against any target, and evaluate the responses.
import os
from multivon_eval import simulate_users, DeployedAPITarget, BearerAuth
from multivon_eval import Faithfulness, PIIEvaluator, TaskCompletion

target = DeployedAPITarget(
    url="https://api.yourapp.com/v1/chat",
    auth=BearerAuth(os.getenv("API_KEY")),
    output_path="response",
)

results = simulate_users(
    target=target,
    system_prompt="You are a customer support bot for a billing SaaS.",
    n_personas=10,
    evaluators=[Faithfulness(), PIIEvaluator(), TaskCompletion()],
)

Parameters

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| target | Callable[[str], str] | | Any callable target: a DeployedAPITarget, BrowserTarget, or your own function. |
| system_prompt | str | | Description of your AI system. Used to generate relevant personas. |
| n_personas | int | 10 | Total number of personas to simulate. |
| evaluators | list \| None | [NotEmpty(), TaskCompletion()] | Evaluators run on each persona response. |
| persona_types | list[str] \| None | All five types | Persona categories to include. |
| verbose | bool | True | Print per-persona progress and a final summary. |

Persona types

| Type | Description |
| --- | --- |
| confused_user | Well-meaning but unclear about their problem; vague language. |
| power_user | Knows what they want; precise questions; tests edge cases. |
| angry_user | Frustrated; tests patience and de-escalation. |
| adversarial | Tries to extract the system prompt, bypass restrictions, or cause unexpected behavior. |
| edge_case | Off-topic or boundary-testing questions the system wasn’t designed for. |

Return value

A list of dicts, one per persona:
{
    "persona": "Frustrated Frank",
    "type": "angry_user",
    "description": "Long-time customer angry about a billing error.",
    "input": "Your billing system charged me twice and nobody is helping.",
    "output": "I'm sorry to hear that. Let me look into this for you...",
    "scores": [
        {"evaluator": "Faithfulness", "score": 0.92, "passed": True, "reason": "..."},
        {"evaluator": "PIIEvaluator", "score": 1.0,  "passed": True, "reason": "..."},
    ],
    "passed": True,
}
If target invocation raises, output is set to "[TARGET ERROR: <message>]" and evaluation continues.
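
Because the return value is a plain list of dicts, it can be aggregated without any library support. The sketch below counts personas, passes, and error outputs per persona type. The summarize helper and the sample rows are illustrative only. The error check relies on the sentinel prefixes documented on this page.

```python
from collections import defaultdict

# Sentinel prefixes that the targets and simulate_users use in place of raising.
ERROR_PREFIXES = ("[TARGET ERROR", "[API ERROR", "[BROWSER ERROR")


def summarize(results: list[dict]) -> dict[str, dict[str, int]]:
    """Count personas, passes, and error outputs per persona type."""
    summary: dict[str, dict[str, int]] = defaultdict(
        lambda: {"total": 0, "passed": 0, "errors": 0}
    )
    for row in results:
        bucket = summary[row["type"]]
        bucket["total"] += 1
        bucket["passed"] += int(bool(row["passed"]))
        # Failed calls surface as sentinel strings rather than exceptions.
        bucket["errors"] += int(str(row["output"]).startswith(ERROR_PREFIXES))
    return dict(summary)


# Made-up rows in the documented shape, trimmed to the fields used here.
sample = [
    {"type": "angry_user", "output": "I'm sorry to hear that.", "passed": True},
    {"type": "angry_user", "output": "[TARGET ERROR: timeout]", "passed": False},
    {"type": "adversarial", "output": "I can't share that.", "passed": True},
]
summary = summarize(sample)
print(summary)
```

Separating errors from genuine failures helps distinguish an outage in the target from a response that an evaluator actually rejected.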

Auth helpers

DeployedAPITarget and MultiTurnAPITarget accept the same auth helpers.

| Class | Headers sent |
| --- | --- |
| BearerAuth(token) | Authorization: Bearer <token> |
| APIKeyAuth(key, header="X-API-Key") | <header>: <key> |
You can also implement your own — anything with a headers() -> dict[str, str] method works.
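
As an illustration of that contract, here is a minimal sketch of a custom helper for HTTP Basic authentication. BasicAuth and the AuthHelper protocol are hypothetical names that do not ship with multivon-eval. The only thing taken from the documentation is the headers() -> dict[str, str] method shape.

```python
import base64
from typing import Protocol


class AuthHelper(Protocol):
    """Structural type for the documented contract: a headers() method returning a dict."""

    def headers(self) -> dict[str, str]: ...


class BasicAuth:
    """Hypothetical custom helper that sends an HTTP Basic Authorization header."""

    def __init__(self, username: str, password: str) -> None:
        self._username = username
        self._password = password

    def headers(self) -> dict[str, str]:
        # Basic auth is base64("username:password") prefixed with "Basic ".
        raw = f"{self._username}:{self._password}".encode()
        return {"Authorization": f"Basic {base64.b64encode(raw).decode()}"}


auth: AuthHelper = BasicAuth("alex", "s3cret")
print(auth.headers())

# It would then be passed like the built-in helpers, e.g. auth=BasicAuth(...).
```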