Evals

The evals module provides a comprehensive evaluation framework for assessing LLM outputs across multiple dimensions: correctness, safety, reliability, and performance.

Overview

Evaluators are useful for:

  • Correctness: Verify outputs match expected patterns, schemas, or ground truth

  • Safety: Detect PII leakage and sensitive information exposure

  • Reliability: Check latency consistency and performance stability

  • Quality Assurance: Automated testing of LLM responses

Status Overview

  • Correctness: Rule-based assertions (regex/schema)

  • Correctness: Ground-truth comparison

  • Correctness: Hallucination detection (LLM-as-judge)

  • Safety: PII leakage detection

  • Reliability: Latency consistency checks

  • Domain-Specific: Extraction (schema accuracy)

Quick Start

from aiobs.evals import EvalInput, RegexAssertion, PIIDetectionEval

# Create input
eval_input = EvalInput(
    user_input="What is the capital of France?",
    model_output="The capital of France is Paris.",
    system_prompt="You are a geography expert."
)

# Run regex evaluation
regex_eval = RegexAssertion.from_patterns(patterns=[r"Paris"])
result = regex_eval(eval_input)
print(f"Status: {result.status.value}")  # "passed"

# Check for PII
pii_eval = PIIDetectionEval.default()
result = pii_eval(eval_input)
print(f"PII found: {result.details['pii_count']}")  # 0

Core Models

EvalInput

The standard input model for all evaluators:

  • user_input (str): The user’s input/query to the model (required)

  • model_output (str): The model’s generated response (required)

  • system_prompt (Optional[str]): The system prompt provided to the model

  • expected_output (Optional[str]): Expected/ground-truth output for comparison evals

  • context (Optional[Dict]): Additional context (e.g., retrieved docs)

  • metadata (Optional[Dict]): Additional metadata (e.g., latency, token counts)

  • tags (Optional[List[str]]): Tags for categorizing eval inputs
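
The optional fields carry whatever extra signal an evaluator needs, such as retrieved documents for grounding checks or timing data for reliability checks. A short sketch using the with_expected and with_metadata helpers documented in the API reference below:

from aiobs.evals import EvalInput

eval_input = EvalInput(
    user_input="Summarize the attached report.",
    model_output="The report covers Q3 revenue growth.",
    context={"documents": ["Q3 revenue grew 12% year over year."]},
    tags=["summarization"],
)

# Both helpers return a new copy instead of mutating the original
with_truth = eval_input.with_expected("Q3 revenue grew 12%.")
with_stats = eval_input.with_metadata(latency_ms=230, output_tokens=52)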

EvalResult

Result model returned by all evaluators:

  • status (EvalStatus): The evaluation status: PASSED, FAILED, ERROR, or SKIPPED

  • score (float): Numeric score between 0 (worst) and 1 (best)

  • eval_name (str): Name of the evaluator that produced this result

  • message (Optional[str]): Human-readable message explaining the result

  • details (Optional[Dict]): Detailed information about the evaluation

  • assertions (Optional[List[AssertionDetail]]): Individual assertion results (for multi-assertion evals)

  • duration_ms (Optional[float]): Time taken to run the evaluation in milliseconds

  • evaluated_at (datetime): Timestamp when evaluation was performed
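
Results also expose passed and failed convenience properties (see the API reference). A minimal sketch, reusing regex_eval and eval_input from the Quick Start:

result = regex_eval(eval_input)

if result.passed:
    print(f"{result.eval_name} passed with score {result.score:.2f}")
else:
    print(f"{result.eval_name} failed: {result.message}")
    for assertion in result.assertions or []:
        print(f"  {assertion.name}: expected={assertion.expected}, actual={assertion.actual}")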

EvalStatus

Enum representing evaluation outcomes:

  • EvalStatus.PASSED - Evaluation passed all checks

  • EvalStatus.FAILED - Evaluation failed one or more checks

  • EvalStatus.ERROR - An error occurred during evaluation

  • EvalStatus.SKIPPED - Evaluation was skipped

Correctness Evaluators

RegexAssertion

Asserts that model output matches (or doesn’t match) regex patterns.

from aiobs.evals import RegexAssertion, EvalInput

# Patterns that MUST match
evaluator = RegexAssertion.from_patterns(
    patterns=[r"Paris", r"\d+"],
    match_mode="all",  # All patterns must match (or "any")
    case_sensitive=False,
)

result = evaluator(EvalInput(
    user_input="Population of Paris?",
    model_output="Paris has about 2.1 million people."
))
print(result.status.value)  # "passed"

# Patterns that must NOT match (negative patterns)
no_apology = RegexAssertion.from_patterns(
    negative_patterns=[r"\b(sorry|cannot|unable)\b"],
    case_sensitive=False,
)

Configuration options:

  • patterns (default: []): Patterns that output must match

  • negative_patterns (default: []): Patterns that output must NOT match

  • case_sensitive (default: True): Whether matching is case-sensitive

  • match_mode (default: "any"): "any" (at least one pattern must match) or "all" (all must match)

SchemaAssertion

Validates that model output is valid JSON matching a JSON Schema.

from aiobs.evals import SchemaAssertion, EvalInput

schema = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "age": {"type": "integer", "minimum": 0}
    },
    "required": ["name", "age"]
}

evaluator = SchemaAssertion.from_schema(schema)

result = evaluator(EvalInput(
    user_input="Extract person info",
    model_output='{"name": "John", "age": 30}'
))
print(result.status.value)  # "passed"

# Also extracts JSON from markdown code blocks
result = evaluator(EvalInput(
    user_input="Give me JSON",
    model_output='Here is the data:\n```json\n{"name": "Alice", "age": 25}\n```'
))
print(result.status.value)  # "passed"

Note

Full JSON Schema validation requires the jsonschema package. Install with: pip install jsonschema
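
from_schema also accepts strict and extract_json flags (see the API reference): strict controls whether additional properties cause a failure, and extract_json controls whether JSON is pulled out of surrounding markdown. A brief sketch reusing the schema defined above:

from aiobs.evals import SchemaAssertion, EvalInput

# Tolerate extra properties and skip markdown extraction
lenient = SchemaAssertion.from_schema(schema, strict=False, extract_json=False)

if SchemaAssertion.is_available():  # requires the jsonschema package
    result = lenient(EvalInput(
        user_input="Extract person info",
        model_output='{"name": "Bob", "age": 41, "city": "Lyon"}'
    ))
    print(result.status.value)  # expected to pass since strict=False tolerates the extra "city" field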

GroundTruthEval

Compares model output against expected ground truth.

from aiobs.evals import GroundTruthEval, EvalInput

# Exact match
exact_eval = GroundTruthEval.exact(case_sensitive=False)
result = exact_eval(EvalInput(
    user_input="What is 2+2?",
    model_output="4",
    expected_output="4"
))

# Contains match
contains_eval = GroundTruthEval.contains()
result = contains_eval(EvalInput(
    user_input="Capital of France?",
    model_output="The capital is Paris.",
    expected_output="Paris"
))

# Normalized match (whitespace/case normalized)
normalized_eval = GroundTruthEval.normalized(
    case_sensitive=False,
    strip_punctuation=True
)

Match modes:

  • exact - Exact string match

  • contains - Output contains expected string

  • normalized - Whitespace/case normalized comparison
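
When expected_output is not set on the input, the expected value can also be passed directly to evaluate() via the expected keyword (see GroundTruthEval.evaluate in the API reference). A brief sketch:

from aiobs.evals import GroundTruthEval, EvalInput

evaluator = GroundTruthEval.contains()
result = evaluator.evaluate(
    EvalInput(user_input="Capital of France?", model_output="It is Paris."),
    expected="Paris",
)
print(result.status.value)  # "passed"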

HallucinationDetectionEval

Detects hallucinations in model outputs using an LLM-as-judge approach: a second LLM analyzes whether the output contains fabricated, false, or unsupported information.

from openai import OpenAI
from aiobs.evals import HallucinationDetectionEval, EvalInput

# Initialize with your LLM client
client = OpenAI()
evaluator = HallucinationDetectionEval(
    client=client,
    model="gpt-4o-mini",  # Judge model
)

# Evaluate with context (for RAG use cases)
result = evaluator(EvalInput(
    user_input="What is the capital of France?",
    model_output="Paris is the capital of France. It was founded by Julius Caesar in 250 BC.",
    context={
        "documents": ["Paris is the capital and largest city of France."]
    }
))

print(result.status.value)  # "failed" (hallucination detected)
print(result.score)  # 0.3
print(result.details["hallucinations"])  # List of detected hallucinations

# Evaluate without context (general factuality check)
result = evaluator(EvalInput(
    user_input="Who is the CEO of Apple?",
    model_output="Tim Cook is the CEO of Apple.",
))
print(result.status.value)  # "passed"

Factory methods for different providers:

# OpenAI
evaluator = HallucinationDetectionEval.with_openai(
    client=OpenAI(),
    model="gpt-4o",
)

# Gemini
from google import genai
evaluator = HallucinationDetectionEval.with_gemini(
    client=genai.Client(),
    model="gemini-2.0-flash",
)

# Anthropic
from anthropic import Anthropic
evaluator = HallucinationDetectionEval.with_anthropic(
    client=Anthropic(),
    model="claude-3-sonnet-20240229",
)

Configuration options:

  • model (default: None): Model name for logging purposes

  • temperature (default: 0.0): Temperature for the judge LLM

  • strict (default: False): Fail on any hallucination (regardless of score)

  • hallucination_threshold (default: 0.5): Score below which output is considered hallucinated

  • check_against_context (default: True): Check if output is grounded in provided context

  • extract_claims (default: True): Extract and evaluate individual claims

  • max_claims (default: 10): Maximum claims to extract and evaluate
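
The same options can be set explicitly by building a HallucinationDetectionConfig and passing it to the constructor (signature in the API reference). A sketch, assuming an OpenAI API key is configured:

from openai import OpenAI
from aiobs.evals import HallucinationDetectionEval, HallucinationDetectionConfig

config = HallucinationDetectionConfig(
    strict=True,                  # fail on any detected hallucination
    hallucination_threshold=0.7,  # stricter score cutoff
    max_claims=5,                 # cap the number of extracted claims
)
evaluator = HallucinationDetectionEval(
    client=OpenAI(),
    model="gpt-4o-mini",
    config=config,
)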

Result details:

  • score: Hallucination score (1.0 = no hallucinations, 0.0 = completely hallucinated)

  • has_hallucinations: Boolean indicating if hallucinations were detected

  • hallucination_count: Number of hallucinations found

  • hallucinations: List of detected hallucinations with claim, reason, and severity

  • analysis: Overall analysis from the judge LLM

  • judge_model: Model used for evaluation
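
These details are convenient for reporting or CI gating. A sketch that lists each flagged claim, reusing the evaluator and eval_input defined above; the key names inside each hallucination entry (claim, reason, severity) are taken from the table above, so adjust if the actual payload differs:

result = evaluator(eval_input)

details = result.details or {}
if details.get("has_hallucinations"):
    for item in details.get("hallucinations", []):
        print(f"- {item['claim']} ({item['severity']}): {item['reason']}")  # key names per the table above
    print(details.get("analysis"))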

Safety Evaluators

PIIDetectionEval

Detects personally identifiable information (PII) in model outputs.

from aiobs.evals import PIIDetectionEval, PIIType, EvalInput

# Default detector (email, phone, SSN, credit card)
evaluator = PIIDetectionEval.default()

result = evaluator(EvalInput(
    user_input="Contact info?",
    model_output="Email me at john@example.com"
))
print(result.status.value)  # "failed" (PII detected)
print(result.details["pii_types_found"])  # ["email"]

# Scan and redact PII
matches = evaluator.scan("Call 555-123-4567")
redacted = evaluator.redact("Call 555-123-4567")
print(redacted)  # "Call [PHONE REDACTED]"

# Custom patterns
custom_eval = PIIDetectionEval.with_custom_patterns({
    "employee_id": r"EMP-\d{6}",
})

# Strict mode (checks input and system prompt too)
strict_eval = PIIDetectionEval.strict()

Supported PII types:

  • EMAIL: user@example.com

  • PHONE: 555-123-4567, (555) 123-4567

  • SSN: 123-45-6789

  • CREDIT_CARD: 4111111111111111

  • IP_ADDRESS: 192.168.1.100

  • DATE_OF_BIRTH: 01/15/1990
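
To detect only a subset of PII types, or to also scan the user input, build a PIIDetectionConfig directly (fields are listed in the API reference). A sketch:

from aiobs.evals import PIIDetectionEval, PIIDetectionConfig, PIIType, EvalInput

config = PIIDetectionConfig(
    detect_types=[PIIType.EMAIL, PIIType.IP_ADDRESS],
    check_input=True,        # scan user_input as well as model_output
    fail_on_detection=True,
)
evaluator = PIIDetectionEval(config)

result = evaluator(EvalInput(
    user_input="My server is at 192.168.1.100",
    model_output="Noted, I will use that address.",
))
print(result.status.value)  # expected to fail because the IP appears in the scanned input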

Reliability Evaluators

LatencyConsistencyEval

Checks latency statistics across multiple runs.

from aiobs.evals import LatencyConsistencyEval, EvalInput

evaluator = LatencyConsistencyEval.with_thresholds(
    max_latency_ms=1000,        # Max single latency
    max_p95_ms=800,             # 95th percentile threshold
    cv_threshold=0.3,           # Coefficient of variation
)

result = evaluator(EvalInput(
    user_input="test",
    model_output="response",
    metadata={"latencies": [100, 120, 95, 110, 105]}
))

print(result.status.value)  # "passed"
print(result.details["mean"])  # 106.0
print(result.details["p95"])   # 118.0
print(result.details["cv"])    # 0.09

Configuration options:

  • max_latency_ms: Maximum acceptable latency in ms

  • max_std_dev_ms: Maximum acceptable standard deviation

  • max_p95_ms: Maximum acceptable 95th percentile latency

  • max_p99_ms: Maximum acceptable 99th percentile latency

  • coefficient_of_variation_threshold: Maximum CV (std_dev / mean)

Statistics returned:

  • count, mean, min, max, median

  • std_dev, variance, cv

  • p50, p90, p95, p99
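
Latency samples can also be passed directly to evaluate() instead of through metadata (see LatencyConsistencyEval.evaluate in the API reference). A short sketch:

from aiobs.evals import LatencyConsistencyEval, EvalInput

evaluator = LatencyConsistencyEval.with_thresholds(max_p95_ms=800)
result = evaluator.evaluate(
    EvalInput(user_input="test", model_output="response"),
    latencies=[100, 120, 95, 110, 105],
)
print(result.details["p95"])  # 118.0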

Batch Evaluation

All evaluators support batch evaluation:

from aiobs.evals import RegexAssertion, EvalInput

evaluator = RegexAssertion.from_patterns(patterns=[r"\d+"])

inputs = [
    EvalInput(user_input="q1", model_output="Answer: 123"),
    EvalInput(user_input="q2", model_output="No numbers"),
    EvalInput(user_input="q3", model_output="Result: 456"),
]

results = evaluator.evaluate_batch(inputs)

for inp, result in zip(inputs, results):
    print(f"{inp.model_output[:20]}... → {result.status.value}")

Custom Evaluators

Create custom evaluators by extending BaseEval:

from aiobs.evals import BaseEval, EvalInput, EvalResult, EvalStatus
from typing import Any

class LengthEval(BaseEval):
    """Evaluates if output length is within bounds."""

    name = "length_eval"
    description = "Checks output length"

    def __init__(self, min_length: int = 0, max_length: int = 1000):
        super().__init__()
        self.min_length = min_length
        self.max_length = max_length

    def evaluate(self, eval_input: EvalInput, **kwargs: Any) -> EvalResult:
        length = len(eval_input.model_output)
        passed = self.min_length <= length <= self.max_length

        return EvalResult(
            status=EvalStatus.PASSED if passed else EvalStatus.FAILED,
            score=1.0 if passed else 0.0,
            eval_name=self.eval_name,
            message=f"Length {length} {'within' if passed else 'outside'} [{self.min_length}, {self.max_length}]",
            details={"length": length},
        )

# Usage
length_eval = LengthEval(min_length=10, max_length=500)
result = length_eval(EvalInput(user_input="q", model_output="Short"))

Examples

The repository includes eval examples at example/evals/:

  • regex_assertion_example.py - Pattern matching examples

  • schema_assertion_example.py - JSON schema validation

  • ground_truth_example.py - Ground truth comparison

  • hallucination_detection_example.py - Hallucination detection with LLM-as-judge

  • latency_consistency_example.py - Latency statistics

  • pii_detection_example.py - PII detection and redaction

Run the examples:

cd example/evals
PYTHONPATH=../.. python regex_assertion_example.py

API Reference

Evaluation framework for aiobs.

This module provides a comprehensive evaluation framework for assessing LLM outputs across multiple dimensions: correctness, safety, reliability, and performance.

Usage:

from aiobs.evals import RegexAssertion, EvalInput

# Create an evaluator
evaluator = RegexAssertion.from_patterns(
    patterns=[r".*Paris.*"],
    case_sensitive=False,
)

# Create input and evaluate
eval_input = EvalInput(
    user_input="What is the capital of France?",
    model_output="The capital of France is Paris.",
)

result = evaluator(eval_input)
print(result.status)  # EvalStatus.PASSED

Available Evaluators:
  • RegexAssertion: Check output against regex patterns

  • SchemaAssertion: Validate JSON output against JSON Schema

  • GroundTruthEval: Compare output to expected ground truth

  • HallucinationDetectionEval: Detect hallucinations using LLM-as-judge

  • LatencyConsistencyEval: Check latency statistics

  • PIIDetectionEval: Detect personally identifiable information

class aiobs.evals.AssertionDetail(*, name: str, passed: bool, expected: Any | None = None, actual: Any | None = None, message: str | None = None)[source]

Bases: BaseModel

Detail for a single assertion within an evaluation.

actual: Any | None
expected: Any | None
message: str | None
model_config: ClassVar[ConfigDict] = {}

Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].

name: str
passed: bool
class aiobs.evals.BaseEval(config: ConfigT | None = None)[source]

Bases: ABC

Abstract base class for all evaluators.

Evaluators assess model outputs against various criteria such as correctness, safety, performance, and more.

Subclasses must implement:
  • evaluate(): Synchronous evaluation of a single input

Optionally override:
  • evaluate_async(): Asynchronous evaluation

  • evaluate_batch(): Batch evaluation

Example usage:

from aiobs.evals import RegexAssertion, RegexAssertionConfig

config = RegexAssertionConfig(
    patterns=[r".*Paris.*"],
    case_sensitive=False,
)
evaluator = RegexAssertion(config)

result = evaluator.evaluate(
    EvalInput(
        user_input="What is the capital of France?",
        model_output="The capital of France is Paris.",
    )
)
print(result.status)  # EvalStatus.PASSED

config_class

alias of BaseEvalConfig

description: str = 'Base evaluator'
property eval_name: str

Get the name to use in results.

abstractmethod evaluate(eval_input: EvalInput, **kwargs: Any) EvalResult[source]

Evaluate a model output synchronously.

Parameters:
  • eval_input – The input containing user_input, model_output, etc.

  • **kwargs – Additional arguments for the evaluator.

Returns:

EvalResult with status, score, and details.

async evaluate_async(eval_input: EvalInput, **kwargs: Any) EvalResult[source]

Evaluate a model output asynchronously.

Default implementation calls synchronous evaluate(). Override for truly async evaluators.

Parameters:
  • eval_input – The input containing user_input, model_output, etc.

  • **kwargs – Additional arguments for the evaluator.

Returns:

EvalResult with status, score, and details.

evaluate_batch(inputs: List[EvalInput], **kwargs: Any) List[EvalResult][source]

Evaluate multiple model outputs in batch.

Default implementation calls evaluate() for each input. Override for optimized batch processing.

Parameters:
  • inputs – List of EvalInput objects to evaluate.

  • **kwargs – Additional arguments for the evaluator.

Returns:

List of EvalResult objects, one per input.

async evaluate_batch_async(inputs: List[EvalInput], **kwargs: Any) List[EvalResult][source]

Evaluate multiple model outputs asynchronously in batch.

Default implementation calls evaluate_async() for each input. Override for optimized async batch processing.

Parameters:
  • inputs – List of EvalInput objects to evaluate.

  • **kwargs – Additional arguments for the evaluator.

Returns:

List of EvalResult objects, one per input.

classmethod is_available() bool[source]

Check if this evaluator can be used (dependencies present).

Returns:

True if all required dependencies are available.

name: str = 'base_eval'
class aiobs.evals.BaseEvalConfig(*, name: str | None = None, fail_fast: bool = False, include_details: bool = True)[source]

Bases: BaseModel

Base configuration for all evaluators.

fail_fast: bool
include_details: bool
model_config: ClassVar[ConfigDict] = {}

Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].

name: str | None
class aiobs.evals.EvalInput(*, user_input: str, model_output: str, system_prompt: str | None = None, expected_output: str | None = None, context: Dict[str, Any] | None = None, metadata: Dict[str, Any] | None = None, tags: List[str] | None = None)[source]

Bases: BaseModel

Standard input model for evaluations.

This is the core data structure that evaluators use to assess model outputs. It captures the full context of an LLM interaction.

Example

eval_input = EvalInput(
    user_input="What is the capital of France?",
    model_output="The capital of France is Paris.",
    system_prompt="You are a helpful geography assistant.",
)

context: Dict[str, Any] | None
expected_output: str | None
metadata: Dict[str, Any] | None
model_config: ClassVar[ConfigDict] = {}

Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].

model_output: str
system_prompt: str | None
tags: List[str] | None
user_input: str
with_expected(expected_output: str) EvalInput[source]

Return a copy with expected_output set.

Parameters:

expected_output – The expected/ground-truth output.

Returns:

New EvalInput with expected_output set.

with_metadata(**kwargs: Any) EvalInput[source]

Return a copy with additional metadata merged.

Parameters:

**kwargs – Key-value pairs to add to metadata.

Returns:

New EvalInput with merged metadata.

class aiobs.evals.EvalResult(*, status: ~aiobs.evals.models.eval_result.EvalStatus, score: ~typing.Annotated[float, ~annotated_types.Ge(ge=0.0), ~annotated_types.Le(le=1.0)], eval_name: str, message: str | None = None, details: ~typing.Dict[str, ~typing.Any] | None = None, assertions: ~typing.List[~aiobs.evals.models.eval_result.AssertionDetail] | None = None, duration_ms: float | None = None, evaluated_at: ~datetime.datetime = <factory>, metadata: ~typing.Dict[str, ~typing.Any] | None = None)[source]

Bases: BaseModel

Result model for evaluations.

Contains the evaluation outcome, score, and detailed information about what was evaluated and why it passed/failed.

Example

result = EvalResult(
    status=EvalStatus.PASSED,
    score=1.0,
    eval_name="regex_assertion",
    message="Output matches pattern: .*Paris.*",
)

assertions: List['AssertionDetail'] | None
details: Dict[str, Any] | None
duration_ms: float | None
classmethod error_result(eval_name: str, error: Exception, **kwargs: Any) EvalResult[source]

Create an error result.

Parameters:
  • eval_name – Name of the evaluator.

  • error – The exception that occurred.

  • **kwargs – Additional fields.

Returns:

EvalResult with ERROR status.

eval_name: str
evaluated_at: datetime
classmethod fail_result(eval_name: str, score: float = 0.0, message: str | None = None, **kwargs: Any) EvalResult[source]

Create a failing result.

Parameters:
  • eval_name – Name of the evaluator.

  • score – Score between 0 and 1 (default 0.0).

  • message – Optional message.

  • **kwargs – Additional fields.

Returns:

EvalResult with FAILED status.

property failed: bool

Check if the evaluation failed.

message: str | None
metadata: Dict[str, Any] | None
model_config: ClassVar[ConfigDict] = {}

Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].

classmethod pass_result(eval_name: str, score: float = 1.0, message: str | None = None, **kwargs: Any) EvalResult[source]

Create a passing result.

Parameters:
  • eval_name – Name of the evaluator.

  • score – Score between 0 and 1 (default 1.0).

  • message – Optional message.

  • **kwargs – Additional fields.

Returns:

EvalResult with PASSED status.

property passed: bool

Check if the evaluation passed.

score: float
status: EvalStatus
class aiobs.evals.EvalStatus(value)[source]

Bases: str, Enum

Status of an evaluation result.

ERROR = 'error'
FAILED = 'failed'
PASSED = 'passed'
SKIPPED = 'skipped'
class aiobs.evals.GroundTruthConfig(*, name: str | None = None, fail_fast: bool = False, include_details: bool = True, match_mode: GroundTruthMatchMode = GroundTruthMatchMode.NORMALIZED, case_sensitive: bool = False, normalize_whitespace: bool = True, strip_punctuation: bool = False, similarity_threshold: Annotated[float, Ge(ge=0.0), Le(le=1.0)] = 0.9)[source]

Bases: BaseEvalConfig

Configuration for ground truth comparison evaluator.

case_sensitive: bool
match_mode: GroundTruthMatchMode
model_config: ClassVar[ConfigDict] = {}

Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].

normalize_whitespace: bool
similarity_threshold: float
strip_punctuation: bool
class aiobs.evals.GroundTruthEval(config: GroundTruthConfig | None = None)[source]

Bases: BaseEval

Evaluator that compares model output against expected ground truth.

Supports multiple comparison modes:
  • exact: Exact string match

  • contains: Output contains expected

  • normalized: Whitespace/case normalized comparison

  • semantic: Placeholder for embedding-based comparison

Example

config = GroundTruthConfig(
    match_mode=GroundTruthMatchMode.NORMALIZED,
    case_sensitive=False,
)
evaluator = GroundTruthEval(config)

result = evaluator.evaluate(
    EvalInput(
        user_input="What is 2+2?",
        model_output="The answer is 4.",
        expected_output="4",
    )
)

config_class

alias of GroundTruthConfig

classmethod contains(case_sensitive: bool = False) GroundTruthEval[source]

Create evaluator for contains comparison.

Parameters:

case_sensitive – Whether comparison is case-sensitive.

Returns:

Configured GroundTruthEval instance.

description: str = 'Compares model output against expected ground truth'
evaluate(eval_input: EvalInput, **kwargs: Any) EvalResult[source]

Evaluate model output against ground truth.

Parameters:
  • eval_input – Input containing model_output and expected_output.

  • **kwargs – Can contain ‘expected’ to override eval_input.expected_output.

Returns:

EvalResult indicating pass/fail.

classmethod exact(case_sensitive: bool = True) GroundTruthEval[source]

Create evaluator for exact match comparison.

Parameters:

case_sensitive – Whether comparison is case-sensitive.

Returns:

Configured GroundTruthEval instance.

name: str = 'ground_truth'
classmethod normalized(case_sensitive: bool = False, strip_punctuation: bool = False) GroundTruthEval[source]

Create evaluator for normalized comparison.

Parameters:
  • case_sensitive – Whether comparison is case-sensitive.

  • strip_punctuation – Whether to strip punctuation.

Returns:

Configured GroundTruthEval instance.

class aiobs.evals.GroundTruthMatchMode(value)[source]

Bases: str, Enum

Match modes for ground truth comparison.

CONTAINS = 'contains'
EXACT = 'exact'
NORMALIZED = 'normalized'
SEMANTIC = 'semantic'
class aiobs.evals.HallucinationDetectionConfig(*, name: str | None = None, fail_fast: bool = False, include_details: bool = True, model: str | None = None, temperature: Annotated[float, Ge(ge=0.0), Le(le=2.0)] = 0.0, check_against_context: bool = True, check_against_input: bool = True, strict: bool = False, hallucination_threshold: Annotated[float, Ge(ge=0.0), Le(le=1.0)] = 0.5, extract_claims: bool = True, max_claims: Annotated[int, Ge(ge=1)] = 10)[source]

Bases: BaseEvalConfig

Configuration for hallucination detection evaluator.

Uses an LLM-as-judge approach to detect hallucinations in model outputs. The evaluator checks if the model output contains fabricated information that is not supported by the provided context.

check_against_context: bool
check_against_input: bool
extract_claims: bool
hallucination_threshold: float
max_claims: int
model: str | None
model_config: ClassVar[ConfigDict] = {}

Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].

strict: bool
temperature: float
class aiobs.evals.HallucinationDetectionEval(client: Any, model: str, config: HallucinationDetectionConfig | None = None, temperature: float = 0.0, max_tokens: int | None = None)[source]

Bases: BaseEval

Evaluator that detects hallucinations in model outputs using LLM-as-judge.

This evaluator uses another LLM to analyze model outputs and identify hallucinations - fabricated, false, or unsupported information.

Example

from openai import OpenAI
from aiobs.evals import HallucinationDetectionEval, EvalInput

# Create evaluator with OpenAI client
client = OpenAI()
evaluator = HallucinationDetectionEval(client=client, model="gpt-4o")

# Evaluate model output
result = evaluator.evaluate(
    EvalInput(
        user_input="What is the capital of France?",
        model_output="Paris is the capital of France. It was founded in 250 BC by Julius Caesar.",
        context={"documents": ["Paris is the capital and largest city of France."]},
    )
)

print(result.status)  # EvalStatus.FAILED (hallucination detected)
print(result.score)   # 0.5 (moderate hallucination)

config_class

alias of HallucinationDetectionConfig

description: str = 'Detects hallucinations in model outputs using LLM-as-judge'
evaluate(eval_input: EvalInput, **kwargs: Any) EvalResult[source]

Evaluate model output for hallucinations.

Parameters:
  • eval_input – Input containing model_output to check.

  • **kwargs – Additional arguments (unused).

Returns:

EvalResult indicating presence/absence of hallucinations.

async evaluate_async(eval_input: EvalInput, **kwargs: Any) EvalResult[source]

Evaluate model output for hallucinations asynchronously.

Parameters:
  • eval_input – Input containing model_output to check.

  • **kwargs – Additional arguments (unused).

Returns:

EvalResult indicating presence/absence of hallucinations.

name: str = 'hallucination_detection'
classmethod with_anthropic(client: Any, model: str = 'claude-3-sonnet-20240229', **kwargs: Any) HallucinationDetectionEval[source]

Create evaluator with an Anthropic client.

Parameters:
  • client – Anthropic client instance.

  • model – Model name (default: claude-3-sonnet-20240229).

  • **kwargs – Additional config options.

Returns:

Configured HallucinationDetectionEval instance.

classmethod with_gemini(client: Any, model: str = 'gemini-2.0-flash', **kwargs: Any) HallucinationDetectionEval[source]

Create evaluator with a Gemini client.

Parameters:
  • client – Google GenAI client instance.

  • model – Model name (default: gemini-2.0-flash).

  • **kwargs – Additional config options.

Returns:

Configured HallucinationDetectionEval instance.

classmethod with_openai(client: Any, model: str = 'gpt-4o-mini', **kwargs: Any) HallucinationDetectionEval[source]

Create evaluator with an OpenAI client.

Parameters:
  • client – OpenAI client instance.

  • model – Model name (default: gpt-4o-mini).

  • **kwargs – Additional config options.

Returns:

Configured HallucinationDetectionEval instance.

class aiobs.evals.LatencyConsistencyConfig(*, name: str | None = None, fail_fast: bool = False, include_details: bool = True, max_latency_ms: float | None = None, max_std_dev_ms: float | None = None, max_p95_ms: float | None = None, max_p99_ms: float | None = None, coefficient_of_variation_threshold: Annotated[float, Ge(ge=0)] = 0.5)[source]

Bases: BaseEvalConfig

Configuration for latency consistency evaluator.

coefficient_of_variation_threshold: float
max_latency_ms: float | None
max_p95_ms: float | None
max_p99_ms: float | None
max_std_dev_ms: float | None
model_config: ClassVar[ConfigDict] = {}

Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].

class aiobs.evals.LatencyConsistencyEval(config: LatencyConsistencyConfig | None = None)[source]

Bases: BaseEval

Evaluator that checks latency consistency across multiple runs.

This evaluator analyzes latency data to ensure:
  • Individual latencies are within acceptable bounds

  • Latency variation (std dev, CV) is acceptable

  • P95/P99 latencies are within bounds

The latency data should be provided in the eval_input.metadata dict under the key ‘latencies’ (list of floats in ms), or passed via kwargs.

Example

config = LatencyConsistencyConfig(
    max_latency_ms=5000,
    max_p95_ms=4000,
    coefficient_of_variation_threshold=0.3,
)
evaluator = LatencyConsistencyEval(config)

result = evaluator.evaluate(
    EvalInput(
        user_input="test query",
        model_output="test response",
        metadata={"latencies": [100, 120, 95, 110, 105]},
    )
)

config_class

alias of LatencyConsistencyConfig

description: str = 'Evaluates latency consistency across multiple runs'
evaluate(eval_input: EvalInput, **kwargs: Any) EvalResult[source]

Evaluate latency consistency.

Parameters:
  • eval_input – Input with latencies in metadata[‘latencies’].

  • **kwargs – Can contain ‘latencies’ list to override.

Returns:

EvalResult indicating pass/fail with latency statistics.

name: str = 'latency_consistency'
classmethod with_thresholds(max_latency_ms: float | None = None, max_p95_ms: float | None = None, max_p99_ms: float | None = None, cv_threshold: float = 0.5) LatencyConsistencyEval[source]

Create evaluator with specific thresholds.

Parameters:
  • max_latency_ms – Maximum acceptable latency.

  • max_p95_ms – Maximum acceptable P95 latency.

  • max_p99_ms – Maximum acceptable P99 latency.

  • cv_threshold – Maximum coefficient of variation.

Returns:

Configured LatencyConsistencyEval instance.

class aiobs.evals.PIIDetectionConfig(*, name: str | None = None, fail_fast: bool = False, include_details: bool = True, detect_types: ~typing.List[~aiobs.evals.models.configs.PIIType] = <factory>, custom_patterns: ~typing.Dict[str, str] = <factory>, redact: bool = False, fail_on_detection: bool = True, check_input: bool = False, check_system_prompt: bool = False)[source]

Bases: BaseEvalConfig

Configuration for PII detection evaluator.

check_input: bool
check_system_prompt: bool
custom_patterns: Dict[str, str]
detect_types: List[PIIType]
fail_on_detection: bool
model_config: ClassVar[ConfigDict] = {}

Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].

redact: bool
class aiobs.evals.PIIDetectionEval(config: PIIDetectionConfig | None = None)[source]

Bases: BaseEval

Evaluator that detects PII in model outputs.

Detects common PII patterns including:
  • Email addresses

  • Phone numbers (US format)

  • Social Security Numbers (SSN)

  • Credit card numbers

  • IP addresses

  • Custom patterns

Example

config = PIIDetectionConfig(
    detect_types=[PIIType.EMAIL, PIIType.PHONE, PIIType.SSN],
    fail_on_detection=True,
)
evaluator = PIIDetectionEval(config)

result = evaluator.evaluate(
    EvalInput(
        user_input="What's your email?",
        model_output="You can reach me at john@example.com",
    )
)
# result.failed == True (email detected)

DEFAULT_PATTERNS: Dict[PIIType, str] = {PIIType.CREDIT_CARD: '\\b(?:4[0-9]{12}(?:[0-9]{3})?|5[1-5][0-9]{14}|3[47][0-9]{13}|6(?:011|5[0-9]{2})[0-9]{12})\\b', PIIType.DATE_OF_BIRTH: '\\b(?:0?[1-9]|1[0-2])[/-](?:0?[1-9]|[12][0-9]|3[01])[/-](?:19|20)\\d{2}\\b', PIIType.EMAIL: '\\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\\.[A-Z|a-z]{2,}\\b', PIIType.IP_ADDRESS: '\\b(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\\b', PIIType.PHONE: '\\b(?:\\+?1[-.\\s]?)?(?:\\(?[0-9]{3}\\)?[-.\\s]?)?[0-9]{3}[-.\\s]?[0-9]{4}\\b', PIIType.SSN: '\\b(?!000|666|9\\d{2})\\d{3}[-\\s]?(?!00)\\d{2}[-\\s]?(?!0000)\\d{4}\\b'}
REDACTION_MASKS: Dict[PIIType, str] = {PIIType.ADDRESS: '[ADDRESS REDACTED]', PIIType.CREDIT_CARD: '[CREDIT CARD REDACTED]', PIIType.CUSTOM: '[PII REDACTED]', PIIType.DATE_OF_BIRTH: '[DOB REDACTED]', PIIType.EMAIL: '[EMAIL REDACTED]', PIIType.IP_ADDRESS: '[IP REDACTED]', PIIType.NAME: '[NAME REDACTED]', PIIType.PHONE: '[PHONE REDACTED]', PIIType.SSN: '[SSN REDACTED]'}
config_class

alias of PIIDetectionConfig

classmethod default(fail_on_detection: bool = True) PIIDetectionEval[source]

Create evaluator with default PII types.

Parameters:

fail_on_detection – Whether to fail if PII is found.

Returns:

Configured PIIDetectionEval instance.

description: str = 'Detects personally identifiable information in outputs'
evaluate(eval_input: EvalInput, **kwargs: Any) EvalResult[source]

Evaluate model output for PII.

Parameters:
  • eval_input – Input containing model_output to check.

  • **kwargs – Additional arguments (unused).

Returns:

EvalResult indicating pass (no PII) or fail (PII detected).

name: str = 'pii_detection'
redact(text: str) str[source]

Redact PII from text (convenience method).

Parameters:

text – Text to redact.

Returns:

Text with PII redacted.

scan(text: str) List[PIIMatch][source]

Scan text for PII (convenience method).

Parameters:

text – Text to scan.

Returns:

List of PIIMatch objects.

classmethod strict() PIIDetectionEval[source]

Create evaluator that checks all PII types.

Returns:

Configured PIIDetectionEval instance.

classmethod with_custom_patterns(patterns: Dict[str, str], fail_on_detection: bool = True) PIIDetectionEval[source]

Create evaluator with custom patterns.

Parameters:
  • patterns – Dictionary mapping names to regex patterns.

  • fail_on_detection – Whether to fail if PII is found.

Returns:

Configured PIIDetectionEval instance.

class aiobs.evals.PIIType(value)[source]

Bases: str, Enum

Types of PII to detect.

ADDRESS = 'address'
CREDIT_CARD = 'credit_card'
CUSTOM = 'custom'
DATE_OF_BIRTH = 'date_of_birth'
EMAIL = 'email'
IP_ADDRESS = 'ip_address'
NAME = 'name'
PHONE = 'phone'
SSN = 'ssn'
class aiobs.evals.RegexAssertion(config: RegexAssertionConfig | None = None)[source]

Bases: BaseEval

Evaluator that asserts model output matches regex patterns.

This evaluator checks if the model output matches specified regex patterns and does NOT match negative patterns.

Example

# Check that output contains an email
config = RegexAssertionConfig(
    patterns=[r"[\w.-]+@[\w.-]+\.\w+"],
    case_sensitive=False,
)
evaluator = RegexAssertion(config)

result = evaluator.evaluate(
    EvalInput(
        user_input="Give me an email",
        model_output="Contact us at support@example.com",
    )
)
assert result.passed

# Check that output does NOT contain certain words
config = RegexAssertionConfig(
    negative_patterns=[r"\b(sorry|cannot|unable)\b"],
    case_sensitive=False,
)

config_class

alias of RegexAssertionConfig

description: str = "Asserts that output matches/doesn't match regex patterns"
evaluate(eval_input: EvalInput, **kwargs: Any) EvalResult[source]

Evaluate if model output matches regex patterns.

Parameters:
  • eval_input – Input containing model_output to check.

  • **kwargs – Additional arguments (unused).

Returns:

EvalResult indicating pass/fail.

classmethod from_patterns(patterns: List[str] | None = None, negative_patterns: List[str] | None = None, case_sensitive: bool = True, match_mode: str = 'any') RegexAssertion[source]

Create evaluator from pattern lists.

Parameters:
  • patterns – Patterns that must match.

  • negative_patterns – Patterns that must NOT match.

  • case_sensitive – Whether matching is case-sensitive.

  • match_mode – ‘any’ or ‘all’ for positive patterns.

Returns:

Configured RegexAssertion instance.

name: str = 'regex_assertion'
class aiobs.evals.RegexAssertionConfig(*, name: str | None = None, fail_fast: bool = False, include_details: bool = True, patterns: ~typing.List[str] = <factory>, negative_patterns: ~typing.List[str] = <factory>, case_sensitive: bool = True, match_mode: str = 'any')[source]

Bases: BaseEvalConfig

Configuration for regex assertion evaluator.

case_sensitive: bool
get_compiled_negative_patterns() List[Pattern[str]][source]

Get compiled negative regex patterns.

Returns:

List of compiled regex Pattern objects.

get_compiled_patterns() List[Pattern[str]][source]

Get compiled regex patterns.

Returns:

List of compiled regex Pattern objects.

match_mode: str
model_config: ClassVar[ConfigDict] = {}

Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].

negative_patterns: List[str]
patterns: List[str]
classmethod validate_match_mode(v: str) str[source]

Validate match_mode is ‘any’ or ‘all’.

class aiobs.evals.SchemaAssertion(config: SchemaAssertionConfig)[source]

Bases: BaseEval

Evaluator that asserts model output matches a JSON schema.

This evaluator validates that the model output is valid JSON and conforms to the specified JSON Schema.

Example

schema = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "age": {"type": "integer", "minimum": 0}
    },
    "required": ["name", "age"]
}

config = SchemaAssertionConfig(schema=schema)
evaluator = SchemaAssertion(config)

result = evaluator.evaluate(
    EvalInput(
        user_input="Extract person info",
        model_output='{"name": "John", "age": 30}'
    )
)
assert result.passed

config_class

alias of SchemaAssertionConfig

description: str = 'Asserts that output is valid JSON matching a schema'
evaluate(eval_input: EvalInput, **kwargs: Any) EvalResult[source]

Evaluate if model output matches JSON schema.

Parameters:
  • eval_input – Input containing model_output to validate.

  • **kwargs – Additional arguments (unused).

Returns:

EvalResult indicating pass/fail.

classmethod from_schema(schema: Dict[str, Any], strict: bool = True, extract_json: bool = True) SchemaAssertion[source]

Create evaluator from a schema dict.

Parameters:
  • schema – JSON Schema dictionary.

  • strict – Whether to fail on additional properties.

  • extract_json – Whether to extract JSON from markdown.

Returns:

Configured SchemaAssertion instance.

classmethod is_available() bool[source]

Check if jsonschema is installed.

name: str = 'schema_assertion'
class aiobs.evals.SchemaAssertionConfig(*, name: str | None = None, fail_fast: bool = False, include_details: bool = True, json_schema: Dict[str, Any], strict: bool = True, parse_json: bool = True, extract_json: bool = True)[source]

Bases: BaseEvalConfig

Configuration for JSON schema assertion evaluator.

extract_json: bool
json_schema: Dict[str, Any]
model_config: ClassVar[ConfigDict] = {}

Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].

parse_json: bool
strict: bool