

SelfConsistency

Detects hallucinations by checking whether stochastic samples of the same prompt agree with each other. If a model truly knows something, its outputs converge; if it hallucinated, the samples contradict one another. Based on SelfCheckGPT (Manakul et al., EMNLP 2023) with production improvements: adaptive sample count, free-sample reuse, and a local NLI backend.

When to use: any generation task where you have no reference document to check against: knowledge QA, open-ended explanation, conversational responses. Pair with Faithfulness when a reference is available.

Key limitation: consistency is not factuality. A model that consistently repeats the same popular misconception will still score as consistent. This evaluator catches inconsistent hallucinations, i.e. fabrications that vary across samples.
from multivon_eval import SelfConsistency

SelfConsistency(model_fn=my_pipeline)
SelfConsistency(model_fn=my_pipeline, n=10, backend="nli")
SelfConsistency(model_fn=my_pipeline, threshold=0.8)
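The scoring idea inherited from SelfCheckGPT can be sketched as follows. This is an illustration, not the library's internals; `contradiction_prob` is a hypothetical stand-in for whatever the NLI or LLM-judge backend supplies as a per-pair contradiction probability:

```python
from typing import Callable, List

def consistency_score(
    output_sentences: List[str],
    samples: List[str],
    contradiction_prob: Callable[[str, str], float],
) -> float:
    """Mean consistency of each output sentence against each sample.

    contradiction_prob(sample, sentence) -> probability in [0, 1] that
    the sample contradicts the sentence (e.g. from an NLI model).
    """
    per_sentence = []
    for sent in output_sentences:
        probs = [contradiction_prob(sample, sent) for sample in samples]
        # Consistency of a sentence = 1 - mean contradiction probability
        per_sentence.append(1.0 - sum(probs) / len(probs))
    # Overall score = mean over sentences
    return sum(per_sentence) / len(per_sentence)

# Toy stand-in: "contradiction" if the sample lacks the sentence's first word
def fake_nli(sample: str, sent: str) -> float:
    return 0.0 if sent.split()[0] in sample else 1.0

score = consistency_score(
    ["aurora is caused by solar wind"],
    ["aurora comes from charged particles", "aurora appears near the poles"],
    fake_nli,
)
# score == 1.0: every sample mentions "aurora", so nothing contradicts
```

A score near 1.0 means the samples agree with the output; a score near 0.0 means they contradict it, which is the hallucination signal.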

Backends

| Backend | How it works | Cost | Requirements |
|---|---|---|---|
| `"nli"` | Local DeBERTa cross-encoder measures contradiction probability per sentence | Free, ~150 ms/check | `pip install transformers torch` |
| `"llm"` | LLM judge answers "Consistent or Contradicts?" per sentence-sample pair | API quota | Any configured judge |
| `"auto"` | Uses `nli` if transformers is installed, else falls back to `llm` | | |
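The `"auto"` rule can be mirrored with a small import probe. This is a sketch of the documented behavior; `resolve_backend` is a hypothetical helper, not part of the library's public API:

```python
import importlib.util

def resolve_backend(requested: str = "auto") -> str:
    """Mirror the documented "auto" rule: prefer "nli" when the
    transformers package is importable, else fall back to "llm"."""
    if requested != "auto":
        return requested  # explicit choice wins
    has_transformers = importlib.util.find_spec("transformers") is not None
    return "nli" if has_transformers else "llm"
```

`find_spec` checks importability without actually importing the package, so the probe itself is cheap.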

Adaptive N

With adaptive=True (default), the evaluator starts at n samples. If the score is borderline (0.3–0.7), it escalates up to max_n samples automatically. Most clear cases resolve at n=5. Ambiguous cases get up to n=20.
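The escalation loop can be sketched as below. Both `score_at` and the doubling schedule are illustrative assumptions, not the library's actual internals:

```python
from typing import Callable, Tuple

def adaptive_samples(
    score_at: Callable[[int], float],
    n: int = 5,
    max_n: int = 20,
    low: float = 0.3,
    high: float = 0.7,
) -> Tuple[float, int]:
    """Escalate the sample count while the score stays borderline.

    score_at(k) returns the consistency score computed from k samples;
    here it is a caller-supplied stand-in for the evaluator's sampling.
    """
    k = n
    score = score_at(k)
    while low <= score <= high and k < max_n:
        k = min(k * 2, max_n)  # assumption: doubling schedule, for illustration
        score = score_at(k)
    return score, k

# A clear case resolves immediately at n=5; a borderline one climbs to max_n
clear_score, clear_n = adaptive_samples(lambda k: 0.95)
# clear_score == 0.95, clear_n == 5
```

The point of the escalation is that extra samples are only spent where they can change the verdict: scores outside the 0.3–0.7 band are already decisive at `n`.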

Free-sample integration

Pass samples= to reuse outputs you've already generated, with no extra API calls:

# Generate samples yourself and reuse at zero cost
samples = [my_pipeline(case.input) for _ in range(5)]
evaluator.evaluate(case, output, samples=samples)

This is the recommended approach when running multiple evaluators on the same cases.

Parameters

| Parameter | Type | Default | Description |
|---|---|---|---|
| `model_fn` | `Callable[[str], str]` | `None` | The same callable used in `suite.run()`. Required unless passing `samples=` directly |
| `n` | `int` | `5` | Starting sample count |
| `max_n` | `int` | `20` | Sample ceiling when `adaptive=True` |
| `adaptive` | `bool` | `True` | Escalate sample count when the score is borderline |
| `backend` | `str` | `"auto"` | `"nli"`, `"llm"`, or `"auto"` |
| `nli_model` | `str` | `"cross-encoder/nli-deberta-v3-small"` | HuggingFace model for the NLI backend |
| `judge` | `JudgeConfig` | `None` | Override judge for the LLM backend |
| `threshold` | `float` | `0.7` | Minimum score to pass |

Example

from multivon_eval import EvalSuite, EvalCase, SelfConsistency

def my_pipeline(prompt: str) -> str:
    ...

suite = EvalSuite("Knowledge QA")
suite.add_cases([
    EvalCase(input="What causes the northern lights?"),
    EvalCase(input="How does mRNA vaccine technology work?"),
])
suite.add_evaluators(
    SelfConsistency(model_fn=my_pipeline, n=5, backend="auto"),
)
report = suite.run(my_pipeline)

With the NLI backend, consistency scoring makes no API calls. With backend="llm", each evaluation makes up to one judge call per sentence-sample pair; for example, an output with 6 sentences checked against 5 samples costs up to 30 calls.
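The worst-case judge-call count is just the sentence-sample product. A hypothetical helper makes the arithmetic explicit (not a library function):

```python
def llm_judge_calls(n_sentences: int, n_samples: int) -> int:
    """With backend="llm", one judge call per (sentence, sample) pair."""
    return n_sentences * n_samples

calls = llm_judge_calls(6, 5)
# calls == 30, matching the figure quoted above
```

With adaptive mode, substitute `max_n` for the sample count to budget for the worst case.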

Score interpretation

| Score | Meaning |
|---|---|
| 1.0 | All samples consistent; high confidence the output is factual |
| 0.7–1.0 | Mostly consistent; passes the default threshold |
| 0.3–0.7 | Borderline; adaptive mode generates more samples |
| 0.0–0.3 | High contradiction; the model is likely hallucinating |