

The integrations layer lets you connect multivon-eval to whatever framework your agent runs on. It provides three base abstractions you can extend or use directly:
  • AgentTracer — wrap any agent to capture execution traces
  • CaseImporter — pull pre-existing runs from an observability platform
  • ManualTracer — instrument agents that don’t use a callback framework
Install the extras you need:
pip install multivon-eval                     # core only
pip install multivon-eval langchain-core      # LangChain tracing
pip install multivon-eval langsmith langchain-core  # LangSmith tracing + import

ManualTracer

For any agent that doesn’t use a callback framework. You call the tracer explicitly from inside your agent code.
from multivon_eval.integrations import ManualTracer

tracer = ManualTracer()

def my_agent(input_text: str) -> str:
    with tracer.step("I need to look this up") as step:
        result = search_tool(input_text)
        step.record_tool_call("search", {"query": input_text}, result)

    with tracer.step("Now I can answer") as step:
        answer = llm_call(result)
        step.set_output(answer)

    return answer

report = suite.run(my_agent, tracer=tracer)
tracer.step(thought) returns a context manager. Inside it, call record_tool_call(name, arguments, result) for each tool the agent uses and set_output(text) for the final answer. You can also record calls at the top level without a step context:
tracer.record_tool_call("search", {"query": q}, result)
tracer.record_output("The answer is...")

LangChainTracer

For LangChain/LangGraph agents. Hooks into BaseCallbackHandler to capture tool calls and agent steps automatically. Your agent must accept and forward **kwargs so the callback handler reaches the underlying chain:
from multivon_eval.integrations import LangChainTracer

tracer = LangChainTracer()

def my_langchain_agent(input_text: str, **kwargs) -> str:
    return agent_executor.invoke(
        {"input": input_text},
        config={"callbacks": kwargs.get("callbacks", [])},
    )

report = suite.run(my_langchain_agent, tracer=tracer)
The tracer captures:
  • on_agent_action — agent’s reasoning thought and which tool to call
  • on_tool_start / on_tool_end — tool inputs and results
  • on_tool_error — errors captured as [ERROR: ...] in the result
  • on_agent_finish — final output
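The same pattern works for LangGraph. A compiled graph is a Runnable, so the callbacks pass through config exactly as above. In the sketch below, the graph construction, message format, and state shape are assumptions about your application:
from multivon_eval.integrations import LangChainTracer

tracer = LangChainTracer()

def my_langgraph_agent(input_text: str, **kwargs) -> str:
    # `graph` is assumed to be your compiled LangGraph StateGraph.
    result = graph.invoke(
        {"messages": [("user", input_text)]},
        config={"callbacks": kwargs.get("callbacks", [])},
    )
    # How you extract the final answer depends on your graph's state shape.
    return result["messages"][-1].content

report = suite.run(my_langgraph_agent, tracer=tracer)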

LangSmithTracer

Same as LangChainTracer but also logs runs to LangSmith for observability. Teams already using LangSmith get both the eval trace and the LangSmith run record with no extra work.
import os

from multivon_eval.integrations import LangSmithTracer

tracer = LangSmithTracer(
    project_name="my-agent-evals",
    api_key=os.environ["LANGSMITH_API_KEY"],  # optional: defaults to the LANGSMITH_API_KEY env var
)

report = suite.run(my_langchain_agent, tracer=tracer)
If langchain-core or langsmith is not installed, it silently falls back to trace-only mode (no LangSmith upload).

LangSmithImporter

Pull existing LangSmith runs as eval cases — no need to re-run your agent.
import os

from multivon_eval.integrations import LangSmithImporter

importer = LangSmithImporter(
    project_name="production-agent",
    api_key=os.environ["LANGSMITH_API_KEY"],
)

cases = importer.load(limit=100)

suite.add_cases(cases)
report = suite.run(importer.as_model_fn(cases))
as_model_fn(cases) builds a replay function that returns each run's original output in order; no model calls are needed.
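Conceptually, the replay function behaves like this simplified sketch (the real implementation lives in CaseImporter and may differ in detail):
def as_model_fn(cases):
    outputs = iter(case.metadata["_output"] for case in cases)

    def replay(input_text: str) -> str:
        # Ignore the input and return the next stored output in case order.
        return next(outputs)

    return replay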

Filtering

# Only successful runs
cases = importer.load(filter='and(eq(error, ""), gt(latency, 0))')

# Runs with a specific tag
cases = importer.load(filter='has(tags, "production")')

# LLM runs instead of chain runs
cases = importer.load(run_type="llm", limit=50)
Each imported EvalCase has:
  • input — extracted from run.inputs (auto-detected or set via input_key=)
  • agent_trace — populated from child runs (tool calls, LLM steps)
  • metadata["_output"] — original run output (used by as_model_fn)
  • metadata["_run_id"] — LangSmith run ID
  • metadata["_project"] — project name
  • metadata["_error"] — error string if the run failed

How tracers wire into EvalSuite

report = suite.run(model_fn, tracer=tracer)
The suite calls tracer.reset() before each case, runs the (instrumented) model function, then calls tracer.get_trace() and attaches the result to the case. The trace is then available to all agent evaluators. Tracers require workers=1 (the default): tracers are stateful, and running cases in parallel would mix up their traces.
# This raises ValueError:
suite.run(model_fn, tracer=tracer, workers=4)
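The per-case wiring is roughly equivalent to this sketch (simplified; the real EvalSuite also handles evaluators, scoring, and error handling):
instrumented = tracer.instrument(model_fn)

for case in cases:
    tracer.reset()                           # clear state left over from the previous case
    output = instrumented(case.input)        # run the instrumented model function
    case.agent_trace = tracer.get_trace()    # attach the captured steps to the case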

Building a custom tracer

Extend AgentTracer to integrate with any framework:
from multivon_eval.integrations import AgentTracer
from multivon_eval import AgentStep, ToolCall

class MyFrameworkTracer(AgentTracer):
    def instrument(self, fn):
        tracer = self

        def wrapped(input_text: str) -> str:
            tracer.reset()
            output = fn(input_text)
            # populate tracer._steps from your framework's execution record
            return output

        return wrapped

    def get_trace(self) -> list[AgentStep]:
        return list(self._steps)
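What "populate tracer._steps" looks like depends on your framework's execution record. A rough sketch follows; the AgentStep and ToolCall field names are assumptions inferred from the ManualTracer API, so check their definitions in your installed version:
# Hypothetical conversion of a framework run record into AgentStep objects.
# Field names (thought, tool_calls, output / name, arguments, result) are assumptions.
for record in framework_run.steps:
    tracer._steps.append(
        AgentStep(
            thought=record.reasoning,
            tool_calls=[
                ToolCall(name=c.tool, arguments=c.args, result=c.output)
                for c in record.calls
            ],
            output=record.final_text,
        )
    )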
For callback-style frameworks, extend CallbackTracer instead — it implements instrument() for you:
from multivon_eval.integrations import CallbackTracer

class MyCallbackTracer(CallbackTracer):
    def _build_handler(self):
        # return a framework-specific callback object
        # instrument() will inject it as kwargs["callbacks"]
        return MyFrameworkCallbackHandler(self._steps)

Building a custom importer

Extend CaseImporter to pull runs from any observability platform:
from multivon_eval.integrations import CaseImporter
from multivon_eval import EvalCase

class MyPlatformImporter(CaseImporter):
    def __init__(self, project: str, api_key: str):
        self._project = project
        self._client = MyPlatformClient(api_key)

    def load(self, *, limit: int = 100, **kwargs) -> list[EvalCase]:
        runs = self._client.list_runs(project=self._project, limit=limit)
        return [
            EvalCase(
                input=run.input_text,
                metadata={
                    "_output": run.output_text,
                    "_run_id": run.id,
                },
            )
            for run in runs
        ]
as_model_fn() is provided by the base class — no need to implement it.
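Usage then mirrors the LangSmith importer (the MYPLATFORM_API_KEY variable name is hypothetical):
import os

importer = MyPlatformImporter(
    project="production-agent",
    api_key=os.environ["MYPLATFORM_API_KEY"],
)

cases = importer.load(limit=50)
suite.add_cases(cases)
report = suite.run(importer.as_model_fn(cases))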