For regulated industries (healthcare, finance, legal, government), your eval traces can’t leave your environment. multivon-eval’s compliance tools run entirely locally: no cloud, no LLM calls for PII detection.

PII Detection

PIIEvaluator scans LLM outputs for personally identifiable information using regex patterns. Zero API calls — suitable for air-gapped environments.

Basic usage

```python
from multivon_eval import EvalSuite, PIIEvaluator

suite = EvalSuite("Patient Intake Bot Eval")
suite.add_evaluators(PIIEvaluator())

report = suite.run(model_fn)
```

A case fails if any PII is detected in the output. The failure reason lists each PII type and example matches.
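Under the hood, detection is plain regex matching over the output string. A minimal sketch of the idea — the patterns below are illustrative, not the library's actual pattern set:

```python
import re

# Illustrative patterns -- not multivon-eval's actual ones.
PATTERNS = {
    "email": r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}",
    "ssn": r"\b\d{3}-\d{2}-\d{4}\b",
}

def scan(text: str) -> dict[str, list[str]]:
    """Return {pii_type: [matches]} for every pattern that fires."""
    hits = {}
    for name, pattern in PATTERNS.items():
        matches = re.findall(pattern, text)
        if matches:
            hits[name] = matches
    return hits

hits = scan("Contact jane@acme.com, SSN 123-45-6789.")
# hits == {"email": ["jane@acme.com"], "ssn": ["123-45-6789"]}
```

Because everything is local regex work, throughput scales with string length, not with any network round-trip.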

Jurisdiction-specific patterns

```python
# All patterns (default)
PIIEvaluator()

# GDPR (EU) — adds EU VAT numbers
PIIEvaluator(jurisdiction="gdpr")

# CCPA (California) — adds bank account numbers
PIIEvaluator(jurisdiction="ccpa")

# PIPEDA (Canada) — base patterns
PIIEvaluator(jurisdiction="pipeda")

# HIPAA — adds MRN, health plan numbers, VINs, fax numbers,
#          admission/discharge dates, device IDs, NPI/DEA numbers, URLs
PIIEvaluator(jurisdiction="hipaa")
```

HIPAA coverage note: This evaluator detects 13 of 18 HIPAA Safe Harbor PHI identifiers via regex. The remaining 5 (patient names, geographic subdivisions below state, photographs, biometric data, and arbitrary unique identifiers) cannot be reliably detected from text output and require de-identification before the text reaches the evaluator. For full HIPAA Safe Harbor compliance, combine PIIEvaluator(jurisdiction="hipaa") with an upstream de-identification step.
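The shape of such an upstream step can be very simple when you control the source data. The function below is a hypothetical pre-pass (not part of multivon-eval) that masks a known list of patient names before the text is evaluated:

```python
import re

def deidentify(text: str, known_names: list[str]) -> str:
    """Hypothetical pre-pass: mask known patient names before text
    reaches the evaluator. Real deployments typically use a dedicated
    de-identification service or a curated name registry instead."""
    for name in known_names:
        text = re.sub(re.escape(name), "[NAME]", text, flags=re.IGNORECASE)
    return text

masked = deidentify("Jane Doe was admitted on 03/02.", ["Jane Doe"])
# masked == "[NAME] was admitted on 03/02."
```

A name-list pass covers identifiers that no regex can infer from text alone; the regex evaluator then handles the structured identifiers (MRNs, dates, phone numbers) downstream.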

Custom patterns

```python
PIIEvaluator(patterns={
    "employee_id": r"EMP-\d{6}",
    "case_number": r"CASE-[A-Z]{2}\d{8}",
})
```
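Custom patterns are ordinary Python regexes, so it is worth sanity-checking them in isolation before wiring them into the evaluator:

```python
import re

# Same pattern as in the example above.
employee_id = re.compile(r"EMP-\d{6}")

assert employee_id.search("Your ID is EMP-004217.")
assert not employee_id.search("EMP-42")  # too few digits, no match

# Note: the pattern is unanchored, so it also fires inside longer
# digit runs like "EMP-0042175" -- add \b boundaries if that matters.
```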

Redacting PII from reports

By default, matched PII is shown in the reason field. To mask it in audit logs:

```python
PIIEvaluator(redact=True)
# reason: PII detected (2 type(s)):
#   email: "[REDACTED-EMAIL]"
#   phone_us: "[REDACTED-PHONE_US]"
```

What’s detected

| Pattern | Examples |
|---|---|
| email | user@company.com |
| phone_us | 555-123-4567, (800) 555-0100 |
| phone_intl | +44 7911 123456 |
| ssn | 123-45-6789 |
| credit_card | 4111 1111 1111 1111 |
| iban | DE89370400440532013000 |
| ip_address | 192.168.1.1 |
| date_of_birth | DOB: 12/05/1985 |
| passport | AB1234567 |
| address | 123 Main Street |
| eu_vat (GDPR) | DE123456789 |
| bank_account (CCPA) | 12345678901234 |

Structured Output Validation

SchemaEvaluator validates that LLM outputs conform to a defined structure. Works with Pydantic models and JSON Schema dicts. Reports per-field failures — not just valid/invalid. StructEval (2025) found GPT-4 fails complex structured extraction ~12% of the time. This evaluator catches those failures in your specific pipeline.

Pydantic model

```python
from pydantic import BaseModel
from multivon_eval import SchemaEvaluator

class InvoiceExtraction(BaseModel):
    vendor: str
    amount: float
    currency: str
    invoice_date: str
    line_items: list[str]

suite.add_evaluators(SchemaEvaluator(InvoiceExtraction))
```

Supports Pydantic v1 and v2. Field-level error messages:

```
Schema validation failed:
  amount: Input should be a valid number, unable to parse string as a number
  currency: Field required
```

JSON Schema

```python
suite.add_evaluators(SchemaEvaluator({
    "type": "object",
    "required": ["title", "score", "category"],
    "properties": {
        "title": {"type": "string", "maxLength": 100},
        "score": {"type": "number", "minimum": 0, "maximum": 1},
        "category": {"type": "string", "enum": ["positive", "negative", "neutral"]},
    }
}))
```
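Per-field reporting means each constraint in the schema produces its own failure line. As a rough illustration of the kinds of checks involved — in plain Python, not the library's implementation, and far less complete than a real JSON Schema validator — the schema above enforces:

```python
def check(output: dict) -> list[str]:
    """Rough illustration of per-field checks for the schema above.
    A real JSON Schema validator covers many more keywords."""
    errors = []
    for field in ("title", "score", "category"):
        if field not in output:
            errors.append(f"{field}: field required")
    if "score" in output and not 0 <= output["score"] <= 1:
        errors.append("score: must be between 0 and 1")
    if output.get("category") not in (None, "positive", "negative", "neutral"):
        errors.append("category: not an allowed value")
    return errors

errors = check({"title": "Great", "score": 1.4})
# errors == ["category: field required", "score: must be between 0 and 1"]
```

Collecting every violation in one pass, rather than stopping at the first, is what makes the per-field report possible.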

Handling markdown code fences

SchemaEvaluator automatically strips markdown code fences from outputs:
```json
{"title": "Great product", "score": 0.9, "category": "positive"}
```
This is valid — the schema evaluator strips the fence before parsing.
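The stripping behaviour is roughly equivalent to the helper below — an illustration of the idea, not the library's exact code:

```python
import re

_FENCE = "`" * 3  # three backticks, built this way to keep the doc's own fences intact

def strip_fence(text: str) -> str:
    """Remove a wrapping markdown code fence (with optional language
    tag) from the text, if present; otherwise return text unchanged."""
    pattern = rf"^{_FENCE}[a-zA-Z]*\n(.*?)\n?{_FENCE}\s*$"
    m = re.match(pattern, text.strip(), re.DOTALL)
    return m.group(1) if m else text
```

Stripping happens before JSON parsing, so a model that insists on wrapping its JSON in a fence is not penalised for it.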

Compliance Audit Trail

ComplianceReporter writes a tamper-evident NDJSON log of every eval run, with SHA-256 hashing and regulatory control annotations.

Basic usage

```python
from multivon_eval import EvalSuite, ComplianceReporter

suite = EvalSuite("HR Bot Eval")
reporter = ComplianceReporter(
    output_dir="./audit-logs",
    framework="eu-ai-act",
)

report = suite.run(model_fn)
record_id = reporter.record(report, tags={"version": "2.1", "env": "staging"})
# [compliance] audit record → a3f9b2c1  (hr_bot_eval.audit.ndjson)
# [compliance] framework: eu-ai-act
```

Framework mappings

```python
# EU AI Act Article 9 annotations
ComplianceReporter(framework="eu-ai-act")

# NIST AI RMF annotations
ComplianceReporter(framework="nist-ai-rmf")

# No framework — raw scores only
ComplianceReporter(framework="none")
```

EU AI Act Article 9 mappings:
| Evaluator | Control |
|---|---|
| faithfulness, hallucination | Article 9(4)(a) — Accuracy & reliability |
| pii_detection | Article 9(4)(b) — Privacy & data governance |
| schema_compliance, not_empty | Article 9(4)(c) — Robustness & output consistency |
| toxicity, bias | Article 9(6) — Bias & discrimination monitoring |
| task_completion, tool_call_accuracy | Article 9(5) — Task performance logging |

Verifying integrity

```python
# Verify all records in the audit log are intact
ok = reporter.verify("HR Bot Eval")
#   OK  a3f9b2c1  2025-11-14T09:23:11
#   OK  b7d1e4f2  2025-11-15T14:07:42
#   Verification: PASS — all records intact
```

Audit record format

Each NDJSON line:

```json
{
  "record_id": "a3f9b2c1ef20",
  "suite_name": "HR Bot Eval",
  "model_id": "claude-sonnet-4-5",
  "timestamp": "2025-11-14T09:23:11.821Z",
  "framework": "eu-ai-act",
  "summary": {
    "total": 50,
    "passed": 46,
    "pass_rate": 0.92,
    "tags": {"version": "2.1", "env": "staging"}
  },
  "evaluator_results": [
    {
      "evaluator": "faithfulness",
      "avg_score": 0.89,
      "pass_rate": 0.88,
      "control": "Article 9(4)(a) — Accuracy & reliability"
    }
  ],
  "record_hash": "sha256:e3b0c44298fc1c149afb..."
}
```
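The tamper-evidence property is easy to reason about: the hash is recomputable from the record's own contents, so any edit to a stored record invalidates it. A sketch of the idea — the exact canonicalisation multivon-eval uses is an assumption here:

```python
import hashlib
import json

def hash_record(record: dict) -> str:
    """Hash every field except record_hash itself, over a canonical
    JSON encoding (sorted keys, fixed separators) so that the hash
    does not depend on key order or whitespace."""
    body = {k: v for k, v in record.items() if k != "record_hash"}
    canonical = json.dumps(body, sort_keys=True, separators=(",", ":"))
    return "sha256:" + hashlib.sha256(canonical.encode()).hexdigest()

record = {"record_id": "a3f9b2c1ef20", "summary": {"passed": 46}}
record["record_hash"] = hash_record(record)

# Any later modification breaks verification:
tampered = dict(record, summary={"passed": 50})
assert hash_record(tampered) != record["record_hash"]
assert hash_record(record) == record["record_hash"]
```

This is what `verify()` relies on: it recomputes each record's hash from the stored fields and compares it to the stored `record_hash`.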

Full compliance pipeline

```python
from multivon_eval import (
    EvalSuite, EvalCase,
    Faithfulness, PIIEvaluator, SchemaEvaluator,
    ComplianceReporter,
)
from pydantic import BaseModel

class ClinicalSummary(BaseModel):
    diagnosis: str
    recommended_action: str
    urgency: str

suite = EvalSuite("Clinical AI Eval")
# load() stands for your own helper that parses the JSONL file
# into EvalCase objects.
suite.add_cases(load("tests/clinical_cases.jsonl"))
suite.add_evaluators(
    Faithfulness(),
    PIIEvaluator(jurisdiction="gdpr", redact=True),
    SchemaEvaluator(ClinicalSummary),
)

reporter = ComplianceReporter("./audit-logs", framework="eu-ai-act")
report = suite.run(model_fn)
reporter.record(report, tags={"regulatory_period": "Q4-2025"})

# Fail CI if any case failed any evaluator (PII detected,
# schema invalid, or faithfulness below threshold)
if report.pass_rate < 1.0:
    raise SystemExit(f"Compliance check failed: {report.failed} case(s) failed")
```

CI/CD Integration

```yaml
# .github/workflows/compliance-eval.yml
jobs:
  compliance:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: pip install multivon-eval
      - run: python evals/compliance_check.py
        # No API key needed for PIIEvaluator + SchemaEvaluator
      - uses: actions/upload-artifact@v4
        with:
          name: audit-logs
          path: ./audit-logs/
```

The audit logs in ./audit-logs/ are the compliance artifacts — store them alongside your release artifacts.