The factory constructors below build EvalSuite instances pre-configured with the right evaluators for common use cases. One line to get started; fully customizable from there.
Every factory returns a standard EvalSuite — you can add more evaluators, override thresholds, or call .run() with runs=N, workers=N, or fail_threshold=0.85 exactly as you would with a manually configured suite.
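For example, a minimal run might look like the sketch below. Only the factory name and the .run() options come from this page; the import path `evalsuite` and the `EvalCase` field names are assumptions for illustration.

```python
from evalsuite import EvalSuite, EvalCase  # import path assumed

suite = EvalSuite.for_rag()  # one line: evaluators and thresholds pre-wired

cases = [
    EvalCase(  # field names assumed for illustration
        input="What is our refund window?",
        output="Refunds are accepted within 30 days of purchase.",
        context=["Policy: refunds accepted within 30 days."],
    ),
]

# Same knobs as a hand-built suite: repeat runs, parallel workers, pass bar.
results = suite.run(cases, runs=3, workers=4, fail_threshold=0.85)
```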
Available suites
| Factory | Evaluators included | Best for |
|---|---|---|
| `for_rag()` | NotEmpty, Faithfulness, Hallucination, ContextPrecision, ContextRecall, Relevance | RAG pipelines, QA systems |
| `for_agents()` | ToolCallAccuracy, ToolCallNecessity, TrajectoryEfficiency, PlanQuality, TaskCompletion | LLM agents, tool-use systems |
| `for_support_bot()` | NotEmpty, Faithfulness, Relevance, Coherence, Toxicity | Customer support, help desks |
| `for_summarization()` | NotEmpty, Faithfulness, Coherence, Relevance, Summarization | Document summarizers, digests |
| `for_document_intelligence()` | NotEmpty, Faithfulness, AnswerAccuracy (+ SchemaEvaluator if schema provided) | Data extraction, structured output |
| `for_regulated()` | PIIEvaluator, NotEmpty, Faithfulness, Relevance (+ SchemaEvaluator if schema provided) | Healthcare, finance, legal, government |
| `for_chatbot()` | ConversationRelevance, KnowledgeRetention, TurnConsistency, ConversationCompleteness | Multi-turn chatbots, dialogue systems |
| `for_classification()` | NotEmpty, ExactMatch, AnswerAccuracy | Intent classification, tagging, routing |
| `for_coding()` | NotEmpty, ExactMatch, AnswerAccuracy, ROUGE | Code generation, function completion, test generation |
| `for_medical()` | PIIEvaluator, NotEmpty, Faithfulness, AnswerAccuracy, Hallucination | Clinical decision support, medical Q&A, patient chatbots |
| `for_legal()` | NotEmpty, Faithfulness, Hallucination, AnswerAccuracy, Bias | Contract review, legal Q&A, regulatory guidance |
| `for_financial()` | NotEmpty, Faithfulness, Hallucination, AnswerAccuracy, PIIEvaluator | Financial advice bots, earnings summarizers |
EvalSuite.for_rag()
- `name` — suite name (default `"RAG Eval"`)
- `threshold` — pass threshold applied to all evaluators (default `0.85`)
EvalSuite.for_agents()
- `name` — suite name (default `"Agent Eval"`)
- `require_order` — passed to `ToolCallAccuracy`; if `True`, tool call order must match expected (default `False`)
EvalSuite.for_support_bot()
EvalSuite.for_summarization()
Faithfulness here checks whether the summary introduces claims not in the source — for long documents, use with cases that include the source document as context.
EvalSuite.for_document_intelligence()
- `name` — suite name (default `"Document Intelligence Eval"`)
- `schema` — optional Pydantic model class or JSON Schema dict. If provided, `SchemaEvaluator` is added as the first evaluator — structure is checked before content.
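A sketch of passing a Pydantic model as the schema. The factory signature is documented above; the model itself and the import path are illustrative assumptions.

```python
from pydantic import BaseModel
from evalsuite import EvalSuite  # import path assumed

class Invoice(BaseModel):  # example schema, not from the docs
    vendor: str
    total: float
    currency: str

# SchemaEvaluator runs first: a malformed payload fails on structure
# before Faithfulness or AnswerAccuracy ever score its content.
suite = EvalSuite.for_document_intelligence(schema=Invoice)
```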
The schema parameter is the key differentiator: it separates parse failures from semantic failures.
EvalSuite.for_regulated()
- `name` — suite name (default `"Regulated AI Eval"`)
- `jurisdiction` — passed to `PIIEvaluator`: `"hipaa"`, `"gdpr"`, `"ccpa"`, `"pipeda"`, or `"all"` (default `"hipaa"`)
- `schema` — optional Pydantic model or JSON Schema. If provided, `SchemaEvaluator(strict=True)` is added.
PIIEvaluator runs locally — no data leaves your environment. Pair with ComplianceReporter to generate audit trails.
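A hedged sketch of that pairing. The page states only that ComplianceReporter generates audit trails from suite results; its constructor, method names, and module path here are assumptions.

```python
from evalsuite import EvalSuite
from evalsuite.reporting import ComplianceReporter  # module path assumed

suite = EvalSuite.for_regulated(jurisdiction="gdpr")
results = suite.run(cases)  # `cases` built elsewhere

# ComplianceReporter API is an assumption for illustration.
reporter = ComplianceReporter(results)
reporter.write("audit/gdpr-run.json")
```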
EvalSuite.for_chatbot()
Cases for this suite use an EvalCase with a conversation field (a list of prior turns) rather than a single input.
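A sketch of such a case. The `conversation` field is documented above; the role/content turn shape and the other field names are assumptions.

```python
from evalsuite import EvalSuite, EvalCase  # import path assumed

suite = EvalSuite.for_chatbot()

case = EvalCase(
    conversation=[  # prior turns; the dict shape is an assumption
        {"role": "user", "content": "I ordered a blue kettle last week."},
        {"role": "assistant", "content": "Got it: a blue kettle."},
        {"role": "user", "content": "What color was it again?"},
    ],
    output="You ordered a blue kettle.",  # KnowledgeRetention scores recall of earlier turns
)
```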
EvalSuite.for_classification()
Set each case's expected_output to the correct label.
Customizing a factory suite
All factory methods return a standard EvalSuite. You can add evaluators, adjust existing ones, or swap the name.
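A minimal sketch of that customization. The page guarantees only that factories return a standard EvalSuite; the `add_evaluator` method, the `name` attribute, and the evaluator import path are assumptions here.

```python
from evalsuite import EvalSuite
from evalsuite.evaluators import Toxicity  # module path assumed

suite = EvalSuite.for_rag(threshold=0.9)

# Method and attribute names are assumptions about the EvalSuite API.
suite.add_evaluator(Toxicity(threshold=0.95))  # layer a safety check on top
suite.name = "RAG Eval + safety"
```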
EvalSuite.for_coding()
The language parameter is reserved for future language-specific evaluators.
EvalSuite.for_medical()
- `name` — suite name (default `"Medical AI Eval"`)
- `jurisdiction` — passed to `PIIEvaluator`: `"hipaa"`, `"gdpr"`, `"ccpa"`, `"pipeda"`, or `"all"` (default `"hipaa"`)
Pair with ComplianceReporter to produce tamper-evident audit trails.
EvalSuite.for_legal()
EvalSuite.for_financial()
This suite uses PIIEvaluator(jurisdiction="all") to catch PII across all supported frameworks. Pair with ComplianceReporter for regulatory audit trails (SEC, FINRA, MiFID II).
Future improvements
- Thresholds calibrated by industry benchmarks
- Multi-modal evaluation (vision + text)
- `for_coding()` language-specific evaluators (syntax checking, test execution)

