LLM judge evaluators use a secondary model to assess output quality. multivon-eval uses QAG scoring (Question-Answer Generation) — generating binary yes/no questions about the output instead of asking for a numeric 1-10 rating. Why QAG? Binary questions are easier for LLMs to get right, fully auditable (you see which questions passed), and cheaper (shorter prompts).
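As a rough illustration of the scoring model (a sketch, not multivon-eval's internal code): each binary question contributes equally, and the score is the fraction of questions where the judge's answer matches the expected one.

```python
def qag_score(judge_answers: list[bool], expected_answers: list[bool]) -> float:
    """Fraction of binary judge answers that match the expected answers."""
    matches = sum(a == e for a, e in zip(judge_answers, expected_answers))
    return matches / len(expected_answers)

# Judge answered three of four questions as expected.
print(qag_score([True, True, False, True], [True] * 4))  # 0.75
```

Because each question is scored independently, a failing case tells you exactly which criterion missed, rather than leaving you to interpret a bare numeric rating.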

Configuration

Set your judge model via environment variables:
export ANTHROPIC_API_KEY=sk-ant-...
export JUDGE_MODEL=claude-sonnet-4-6       # default
export JUDGE_PROVIDER=anthropic            # anthropic or openai
The model under test and the judge model can be different providers.
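A hypothetical sketch of how those settings might be read at runtime (multivon-eval's actual config loading may differ):

```python
import os

# Fall back to the documented defaults when the variables are unset.
judge_model = os.environ.get("JUDGE_MODEL", "claude-sonnet-4-6")
judge_provider = os.environ.get("JUDGE_PROVIDER", "anthropic")
```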

Faithfulness

Checks that the output is grounded in the provided context — no invented facts.
from multivon_eval import Faithfulness

Faithfulness()
Faithfulness(threshold=0.8)
Requires context on the EvalCase:
EvalCase(
    input="What does the report say about Q3 revenue?",
    context="Q3 revenue was $4.2M, up 18% YoY...",
)

Hallucination

Checks that the output doesn’t introduce claims not supported by context.
from multivon_eval import Hallucination

Hallucination()
Requires context. Complements Faithfulness: Faithfulness checks that what is in the answer is supported by context; Hallucination checks that nothing unsupported was added.

Relevance

Checks that the output actually addresses the input question.
from multivon_eval import Relevance

Relevance()
Relevance(threshold=0.7)
No context required.

Coherence

Checks that the output is clear, well-structured, and logically sound.
from multivon_eval import Coherence

Coherence()

Toxicity

Checks that the output is safe, non-harmful, and appropriate.
from multivon_eval import Toxicity

Toxicity()

Bias

Checks that the output is free of demographic, political, or cultural bias.
from multivon_eval import Bias

Bias()

Summarization

Checks that a summary captures the key points of the source faithfully.
from multivon_eval import Summarization

Summarization()
Requires context (the source document).

AnswerAccuracy

Checks factual correctness of the output against expected_output. Requires expected_output on the EvalCase.
from multivon_eval import AnswerAccuracy

AnswerAccuracy()

ContextPrecision

For RAG systems: checks that retrieved context is actually relevant to the question.
from multivon_eval import ContextPrecision

ContextPrecision()

ContextRecall

For RAG systems: checks that all information needed to answer the question was retrieved.
from multivon_eval import ContextRecall

ContextRecall()
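Conceptually, the two metrics mirror classic precision and recall over pieces of information. A simplified set-based sketch (the real evaluators judge relevance with the LLM, not exact matching):

```python
# Statements actually retrieved vs. statements needed to answer the question.
retrieved = {"Q3 revenue was $4.2M", "Q3 revenue grew 18% YoY", "office moved to Austin"}
needed = {"Q3 revenue was $4.2M", "Q3 revenue grew 18% YoY"}

relevant = retrieved & needed
context_precision = len(relevant) / len(retrieved)  # how much of what was retrieved is relevant
context_recall = len(relevant) / len(needed)        # how much of what is needed was retrieved
```

Here the irrelevant office-move statement drags precision down to 2/3 while recall stays at 1.0, since everything needed was retrieved.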

CustomRubric

Define your own yes/no criteria. Each criterion is a (question, expected_answer) tuple.
from multivon_eval import CustomRubric

CustomRubric(
    name="support_quality",
    criteria=[
        ("Does the response acknowledge the customer's problem?", True),
        ("Does the response provide a concrete next step?", True),
        ("Does the response use apologetic or defensive language?", False),
        ("Is the response under 150 words?", True),
    ],
    threshold=0.75,
)
Score = fraction of criteria where the judge's answer matches expected_answer.
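A hypothetical sketch of that aggregation (the real evaluator asks the judge model each question; here the judge's answers are stubbed):

```python
criteria = [
    ("Does the response acknowledge the customer's problem?", True),
    ("Does the response provide a concrete next step?", True),
    ("Does the response use apologetic or defensive language?", False),
    ("Is the response under 150 words?", True),
]
judge_answers = [True, True, True, True]  # stubbed judge output

# Fraction of criteria where the judge's answer matches expected_answer.
score = sum(
    ans == expected for ans, (_, expected) in zip(judge_answers, criteria)
) / len(criteria)
passed = score >= 0.75  # compare against the configured threshold
print(score, passed)  # 0.75 True
```

The third criterion mismatches (the judge said yes where False was expected), so three of four criteria match and the case passes exactly at the 0.75 threshold.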

GEval

Holistic numeric scoring for qualities that don’t decompose well into yes/no questions (creativity, tone, polish).
from multivon_eval import GEval

GEval(
    name="writing_quality",
    criteria="The response is engaging, concise, and professionally written.",
    threshold=0.7,
)
GEval is the only evaluator that uses a 0.0–1.0 numeric score directly from the judge rather than QAG aggregation. Use it sparingly.
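Assuming threshold semantics consistent with the other evaluators (a sketch, not the library's code), a GEval case passes when the judge's numeric score meets the threshold:

```python
def geval_passes(judge_score: float, threshold: float = 0.7) -> bool:
    """Pass when the judge's 0.0-1.0 score meets the threshold."""
    return judge_score >= threshold

print(geval_passes(0.82))  # True
print(geval_passes(0.55))  # False
```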