Evals¶
The evals module provides a comprehensive evaluation framework for assessing LLM outputs across multiple dimensions: correctness, safety, reliability, and performance.
Overview¶
Evaluators are useful for:
- Correctness: Verify outputs match expected patterns, schemas, or ground truth
- Safety: Detect PII leakage and sensitive information exposure
- Reliability: Check latency consistency and performance stability
- Quality Assurance: Automated testing of LLM responses
Status Overview¶
| Category | Eval Type | Status |
|---|---|---|
| Correctness | Rule-based assertions (regex/schema) | ✅ |
| Correctness | Ground-truth comparison | ✅ |
| Correctness | Hallucination detection (LLM-as-judge) | ✅ |
| Safety | PII leakage detection | ✅ |
| Reliability | Latency consistency checks | ✅ |
| Domain-Specific | Extraction: schema accuracy | ✅ |
Quick Start¶
from aiobs.evals import EvalInput, RegexAssertion, PIIDetectionEval
# Create input
eval_input = EvalInput(
user_input="What is the capital of France?",
model_output="The capital of France is Paris.",
system_prompt="You are a geography expert."
)
# Run regex evaluation
regex_eval = RegexAssertion.from_patterns(patterns=[r"Paris"])
result = regex_eval(eval_input)
print(f"Status: {result.status.value}") # "passed"
# Check for PII
pii_eval = PIIDetectionEval.default()
result = pii_eval(eval_input)
print(f"PII found: {result.details['pii_count']}") # 0
Core Models¶
EvalInput¶
The standard input model for all evaluators:
| Field | Type | Description |
|---|---|---|
| `user_input` | `str` | The user's input/query to the model (required) |
| `model_output` | `str` | The model's generated response (required) |
| `system_prompt` | `str \| None` | The system prompt provided to the model |
| `expected_output` | `str \| None` | Expected/ground-truth output for comparison evals |
| `context` | `Dict[str, Any] \| None` | Additional context (e.g., retrieved docs) |
| `metadata` | `Dict[str, Any] \| None` | Additional metadata (e.g., latency, token counts) |
| `tags` | `List[str] \| None` | Tags for categorizing eval inputs |
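For reference, here is an EvalInput that also populates the optional fields; the metadata keys shown (latency_ms, total_tokens) are purely illustrative, while the context key documents matches the hallucination-detection example later on:
from aiobs.evals import EvalInput
eval_input = EvalInput(
    user_input="What is the capital of France?",
    model_output="The capital of France is Paris.",
    system_prompt="You are a geography expert.",
    expected_output="Paris",  # consumed by ground-truth comparison evals
    context={"documents": ["Paris is the capital of France."]},  # e.g. retrieved docs
    metadata={"latency_ms": 120, "total_tokens": 42},  # illustrative metadata keys
    tags=["geography", "smoke-test"],
)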
EvalResult¶
Result model returned by all evaluators:
| Field | Type | Description |
|---|---|---|
| `status` | `EvalStatus` | The evaluation status: `passed`, `failed`, `error`, or `skipped` |
| `score` | `float` | Numeric score between 0 (worst) and 1 (best) |
| `eval_name` | `str` | Name of the evaluator that produced this result |
| `message` | `str \| None` | Human-readable message explaining the result |
| `details` | `Dict[str, Any] \| None` | Detailed information about the evaluation |
| `assertions` | `List[AssertionDetail] \| None` | Individual assertion results (for multi-assertion evals) |
| `duration_ms` | `float \| None` | Time taken to run the evaluation in milliseconds |
| `evaluated_at` | `datetime` | Timestamp when the evaluation was performed |
EvalStatus¶
Enum representing evaluation outcomes:
- `EvalStatus.PASSED` - Evaluation passed all checks
- `EvalStatus.FAILED` - Evaluation failed one or more checks
- `EvalStatus.ERROR` - An error occurred during evaluation
- `EvalStatus.SKIPPED` - Evaluation was skipped
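A typical way to branch on these outcomes, using the `passed` convenience property documented in the API reference below (a minimal sketch):
from aiobs.evals import EvalStatus, RegexAssertion, EvalInput
evaluator = RegexAssertion.from_patterns(patterns=[r"Paris"])
result = evaluator(EvalInput(user_input="q", model_output="Paris is the capital."))
if result.passed:  # equivalent to result.status == EvalStatus.PASSED
    print(f"OK (score={result.score})")
elif result.status == EvalStatus.ERROR:
    print(f"Evaluator error: {result.message}")
else:
    print(f"Failed: {result.message}")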
Correctness Evaluators¶
RegexAssertion¶
Asserts that model output matches (or doesn’t match) regex patterns.
from aiobs.evals import RegexAssertion, EvalInput
# Patterns that MUST match
evaluator = RegexAssertion.from_patterns(
patterns=[r"Paris", r"\d+"],
match_mode="all", # All patterns must match (or "any")
case_sensitive=False,
)
result = evaluator(EvalInput(
user_input="Population of Paris?",
model_output="Paris has about 2.1 million people."
))
print(result.status.value) # "passed"
# Patterns that must NOT match (negative patterns)
no_apology = RegexAssertion.from_patterns(
negative_patterns=[r"\b(sorry|cannot|unable)\b"],
case_sensitive=False,
)
Configuration options:
| Option | Default | Description |
|---|---|---|
| `patterns` | `[]` | Patterns that output must match |
| `negative_patterns` | `[]` | Patterns that output must NOT match |
| `case_sensitive` | `True` | Whether matching is case-sensitive |
| `match_mode` | `"any"` | Whether `"any"` or `"all"` of the positive patterns must match |
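The same options can be set through RegexAssertionConfig directly, which is convenient when combining positive and negative patterns in a single evaluator (a sketch based on the configuration class documented in the API reference below):
from aiobs.evals import RegexAssertion, RegexAssertionConfig, EvalInput
config = RegexAssertionConfig(
    patterns=[r"Paris", r"\d+"],                       # output must contain these
    negative_patterns=[r"\b(sorry|cannot|unable)\b"],  # ...and must not contain these
    case_sensitive=False,
    match_mode="all",
)
evaluator = RegexAssertion(config)
result = evaluator(EvalInput(
    user_input="Population of Paris?",
    model_output="Paris has about 2.1 million people.",
))
print(result.status.value)  # "passed"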
SchemaAssertion¶
Validates that model output is valid JSON matching a JSON Schema.
from aiobs.evals import SchemaAssertion, EvalInput
schema = {
"type": "object",
"properties": {
"name": {"type": "string"},
"age": {"type": "integer", "minimum": 0}
},
"required": ["name", "age"]
}
evaluator = SchemaAssertion.from_schema(schema)
result = evaluator(EvalInput(
user_input="Extract person info",
model_output='{"name": "John", "age": 30}'
))
print(result.status.value) # "passed"
# Also extracts JSON from markdown code blocks
result = evaluator(EvalInput(
user_input="Give me JSON",
model_output='Here is the data:\n```json\n{"name": "Alice", "age": 25}\n```'
))
print(result.status.value) # "passed"
Note
Full JSON Schema validation requires the jsonschema package.
Install with: pip install jsonschema
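The from_schema options cover the common variations: extract_json (on by default) pulls JSON out of markdown code blocks, and strict controls whether additional properties cause a failure. A small sketch reusing the schema above:
# Tolerate extra keys and skip markdown extraction
lenient = SchemaAssertion.from_schema(schema, strict=False, extract_json=False)
result = lenient(EvalInput(
    user_input="Extract person info",
    model_output='{"name": "John", "age": 30, "city": "Paris"}',
))
print(result.status.value)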
GroundTruthEval¶
Compares model output against expected ground truth.
from aiobs.evals import GroundTruthEval, EvalInput
# Exact match
exact_eval = GroundTruthEval.exact(case_sensitive=False)
result = exact_eval(EvalInput(
user_input="What is 2+2?",
model_output="4",
expected_output="4"
))
# Contains match
contains_eval = GroundTruthEval.contains()
result = contains_eval(EvalInput(
user_input="Capital of France?",
model_output="The capital is Paris.",
expected_output="Paris"
))
# Normalized match (whitespace/case normalized)
normalized_eval = GroundTruthEval.normalized(
case_sensitive=False,
strip_punctuation=True
)
Match modes:
- `exact` - Exact string match
- `contains` - Output contains the expected string
- `normalized` - Whitespace/case normalized comparison
HallucinationDetectionEval¶
Detects hallucinations in model outputs using an LLM-as-judge approach. This evaluator uses another LLM to analyze if the model output contains fabricated, false, or unsupported information.
from openai import OpenAI
from aiobs.evals import HallucinationDetectionEval, EvalInput
# Initialize with your LLM client
client = OpenAI()
evaluator = HallucinationDetectionEval(
client=client,
model="gpt-4o-mini", # Judge model
)
# Evaluate with context (for RAG use cases)
result = evaluator(EvalInput(
user_input="What is the capital of France?",
model_output="Paris is the capital of France. It was founded by Julius Caesar in 250 BC.",
context={
"documents": ["Paris is the capital and largest city of France."]
}
))
print(result.status.value) # "failed" (hallucination detected)
print(result.score) # 0.3
print(result.details["hallucinations"]) # List of detected hallucinations
# Evaluate without context (general factuality check)
result = evaluator(EvalInput(
user_input="Who is the CEO of Apple?",
model_output="Tim Cook is the CEO of Apple.",
))
print(result.status.value) # "passed"
Factory methods for different providers:
# OpenAI
evaluator = HallucinationDetectionEval.with_openai(
client=OpenAI(),
model="gpt-4o",
)
# Gemini
from google import genai
evaluator = HallucinationDetectionEval.with_gemini(
client=genai.Client(),
model="gemini-2.0-flash",
)
# Anthropic
from anthropic import Anthropic
evaluator = HallucinationDetectionEval.with_anthropic(
client=Anthropic(),
model="claude-3-sonnet-20240229",
)
Configuration options:
| Option | Default | Description |
|---|---|---|
| `model` | `None` | Model name for logging purposes |
| `temperature` | `0.0` | Temperature for the judge LLM |
| `strict` | `False` | Fail on any hallucination (regardless of score) |
| `hallucination_threshold` | `0.5` | Score below which output is considered hallucinated |
| `check_against_context` | `True` | Check if output is grounded in provided context |
| `extract_claims` | `True` | Extract and evaluate individual claims |
| `max_claims` | `10` | Maximum claims to extract and evaluate |
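These options can also be set explicitly via HallucinationDetectionConfig and passed through the evaluator's config argument (a sketch; tune the threshold to your own tolerance for unsupported claims):
from openai import OpenAI
from aiobs.evals import HallucinationDetectionEval, HallucinationDetectionConfig
config = HallucinationDetectionConfig(
    strict=True,                  # fail on any detected hallucination
    hallucination_threshold=0.7,  # otherwise fail when the score drops below 0.7
    extract_claims=True,
    max_claims=5,
)
evaluator = HallucinationDetectionEval(
    client=OpenAI(),
    model="gpt-4o-mini",
    config=config,
)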
Result details: the `details` dict of the returned result includes:
- the hallucination score (1.0 = no hallucinations, 0.0 = completely hallucinated)
- whether hallucinations were detected
- the number of hallucinations found
- the detected hallucinations (`details["hallucinations"]`), each with a claim, reason, and severity
- the overall analysis from the judge LLM
- the model used for evaluation
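A sketch of inspecting a failed result; the per-item keys (claim, reason, severity) follow the descriptions above but are assumptions about the exact dictionary layout:
if result.failed:
    for item in result.details["hallucinations"]:
        # key names assumed from the field descriptions above
        print(f"- {item['claim']}: {item['reason']} (severity: {item['severity']})")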
Safety Evaluators¶
PIIDetectionEval¶
Detects personally identifiable information (PII) in model outputs.
from aiobs.evals import PIIDetectionEval, PIIType, EvalInput
# Default detector (email, phone, SSN, credit card)
evaluator = PIIDetectionEval.default()
result = evaluator(EvalInput(
user_input="Contact info?",
model_output="Email me at john@example.com"
))
print(result.status.value) # "failed" (PII detected)
print(result.details["pii_types_found"]) # ["email"]
# Scan and redact PII
matches = evaluator.scan("Call 555-123-4567")
redacted = evaluator.redact("Call 555-123-4567")
print(redacted) # "Call [PHONE REDACTED]"
# Custom patterns
custom_eval = PIIDetectionEval.with_custom_patterns({
"employee_id": r"EMP-\d{6}",
})
# Strict mode (checks input and system prompt too)
strict_eval = PIIDetectionEval.strict()
Supported PII types:
| Type | Built-in pattern detects |
|---|---|
| `email` | Email addresses |
| `phone` | Phone numbers (US formats, optional +1 prefix) |
| `ssn` | US Social Security numbers |
| `credit_card` | Credit card numbers (Visa, Mastercard, Amex, Discover) |
| `ip_address` | IPv4 addresses |
| `date_of_birth` | Dates of birth (MM/DD/YYYY or MM-DD-YYYY) |
The exact regexes are listed under `PIIDetectionEval.DEFAULT_PATTERNS` in the API reference below.
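For finer control, build a PIIDetectionConfig directly, e.g. to limit detection to specific types and also scan the user input (a sketch using the configuration class from the API reference below):
from aiobs.evals import PIIDetectionEval, PIIDetectionConfig, PIIType
config = PIIDetectionConfig(
    detect_types=[PIIType.EMAIL, PIIType.PHONE, PIIType.SSN],
    check_input=True,        # also scan the user input
    fail_on_detection=True,
)
evaluator = PIIDetectionEval(config)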
Reliability Evaluators¶
LatencyConsistencyEval¶
Checks latency statistics across multiple runs.
from aiobs.evals import LatencyConsistencyEval, EvalInput
evaluator = LatencyConsistencyEval.with_thresholds(
max_latency_ms=1000, # Max single latency
max_p95_ms=800, # 95th percentile threshold
cv_threshold=0.3, # Coefficient of variation
)
result = evaluator(EvalInput(
user_input="test",
model_output="response",
metadata={"latencies": [100, 120, 95, 110, 105]}
))
print(result.status.value) # "passed"
print(result.details["mean"]) # 106.0
print(result.details["p95"]) # 118.0
print(result.details["cv"]) # 0.09
Configuration options:
| Option | Description |
|---|---|
| `max_latency_ms` | Maximum acceptable latency in ms |
| `max_std_dev_ms` | Maximum acceptable standard deviation |
| `max_p95_ms` | Maximum acceptable 95th percentile |
| `max_p99_ms` | Maximum acceptable 99th percentile |
| `coefficient_of_variation_threshold` | Maximum CV (std_dev / mean) |
Statistics returned:
`count`, `mean`, `min`, `max`, `median`, `std_dev`, `variance`, `cv`, `p50`, `p90`, `p95`, `p99`
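One way to collect the latency list is to time repeated calls yourself and pass the measurements in through metadata (a sketch; call_model is a placeholder for however you produce responses):
import time
from aiobs.evals import LatencyConsistencyEval, EvalInput
def call_model(prompt: str) -> str:
    ...  # placeholder for your actual LLM call
latencies = []
for _ in range(5):
    start = time.perf_counter()
    output = call_model("What is the capital of France?")
    latencies.append((time.perf_counter() - start) * 1000)  # convert to ms
evaluator = LatencyConsistencyEval.with_thresholds(max_latency_ms=5000, cv_threshold=0.3)
result = evaluator(EvalInput(
    user_input="What is the capital of France?",
    model_output=output,
    metadata={"latencies": latencies},
))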
Batch Evaluation¶
All evaluators support batch evaluation:
from aiobs.evals import RegexAssertion, EvalInput
evaluator = RegexAssertion.from_patterns(patterns=[r"\d+"])
inputs = [
EvalInput(user_input="q1", model_output="Answer: 123"),
EvalInput(user_input="q2", model_output="No numbers"),
EvalInput(user_input="q3", model_output="Result: 456"),
]
results = evaluator.evaluate_batch(inputs)
for inp, result in zip(inputs, results):
print(f"{inp.model_output[:20]}... → {result.status.value}")
Custom Evaluators¶
Create custom evaluators by extending BaseEval:
from aiobs.evals import BaseEval, EvalInput, EvalResult, EvalStatus
from typing import Any
class LengthEval(BaseEval):
"""Evaluates if output length is within bounds."""
name = "length_eval"
description = "Checks output length"
def __init__(self, min_length: int = 0, max_length: int = 1000):
super().__init__()
self.min_length = min_length
self.max_length = max_length
def evaluate(self, eval_input: EvalInput, **kwargs: Any) -> EvalResult:
length = len(eval_input.model_output)
passed = self.min_length <= length <= self.max_length
return EvalResult(
status=EvalStatus.PASSED if passed else EvalStatus.FAILED,
score=1.0 if passed else 0.0,
eval_name=self.eval_name,
message=f"Length {length} {'within' if passed else 'outside'} [{self.min_length}, {self.max_length}]",
details={"length": length},
)
# Usage
eval = LengthEval(min_length=10, max_length=500)
result = eval(EvalInput(user_input="q", model_output="Short"))
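If a check involves I/O (for example calling another service), you can also override evaluate_async; the default implementation simply delegates to the synchronous evaluate. A minimal sketch building on LengthEval above:
import asyncio
from typing import Any
class AsyncLengthEval(LengthEval):
    """Async variant of the LengthEval example above."""
    async def evaluate_async(self, eval_input: EvalInput, **kwargs: Any) -> EvalResult:
        await asyncio.sleep(0)  # stand-in for real async work (e.g. a remote call)
        return self.evaluate(eval_input, **kwargs)
result = asyncio.run(
    AsyncLengthEval(min_length=10, max_length=500).evaluate_async(
        EvalInput(user_input="q", model_output="A sufficiently long answer.")
    )
)
print(result.status.value)  # "passed"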
Examples¶
The repository includes eval examples at example/evals/:
- regex_assertion_example.py - Pattern matching examples
- schema_assertion_example.py - JSON schema validation
- ground_truth_example.py - Ground truth comparison
- hallucination_detection_example.py - Hallucination detection with LLM-as-judge
- latency_consistency_example.py - Latency statistics
- pii_detection_example.py - PII detection and redaction
Run the examples:
cd example/evals
PYTHONPATH=../.. python regex_assertion_example.py
API Reference¶
Evaluation framework for aiobs.
This module provides a comprehensive evaluation framework for assessing LLM outputs across multiple dimensions: correctness, safety, reliability, and performance.
- Usage:
from aiobs.evals import RegexAssertion, EvalInput

# Create an evaluator
evaluator = RegexAssertion.from_patterns(
    patterns=[r".*Paris.*"],
    case_sensitive=False,
)

# Create input and evaluate
eval_input = EvalInput(
    user_input="What is the capital of France?",
    model_output="The capital of France is Paris.",
)

result = evaluator(eval_input)
print(result.status)  # EvalStatus.PASSED
- Available Evaluators:
RegexAssertion: Check output against regex patterns
SchemaAssertion: Validate JSON output against JSON Schema
GroundTruthEval: Compare output to expected ground truth
HallucinationDetectionEval: Detect hallucinations using LLM-as-judge
LatencyConsistencyEval: Check latency statistics
PIIDetectionEval: Detect personally identifiable information
- class aiobs.evals.AssertionDetail(*, name: str, passed: bool, expected: Any | None = None, actual: Any | None = None, message: str | None = None)[source]¶
Bases: BaseModel
Detail for a single assertion within an evaluation.
- actual: Any | None¶
- expected: Any | None¶
- message: str | None¶
- model_config: ClassVar[ConfigDict] = {}¶
Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].
- name: str¶
- passed: bool¶
- class aiobs.evals.BaseEval(config: ConfigT | None = None)[source]¶
Bases: ABC
Abstract base class for all evaluators.
Evaluators assess model outputs against various criteria such as correctness, safety, performance, and more.
- Subclasses must implement:
evaluate(): Synchronous evaluation of a single input
- Optionally override:
evaluate_async(): Asynchronous evaluation
evaluate_batch(): Batch evaluation
- Example usage:
from aiobs.evals import RegexAssertion, RegexAssertionConfig, EvalInput

config = RegexAssertionConfig(
    patterns=[r".*Paris.*"],
    case_sensitive=False,
)
evaluator = RegexAssertion(config)

result = evaluator.evaluate(
    EvalInput(
        user_input="What is the capital of France?",
        model_output="The capital of France is Paris.",
    )
)
print(result.status)  # EvalStatus.PASSED
- config_class¶
alias of
BaseEvalConfig
- description: str = 'Base evaluator'¶
- property eval_name: str¶
Get the name to use in results.
- abstractmethod evaluate(eval_input: EvalInput, **kwargs: Any) EvalResult[source]¶
Evaluate a model output synchronously.
- Parameters:
eval_input – The input containing user_input, model_output, etc.
**kwargs – Additional arguments for the evaluator.
- Returns:
EvalResult with status, score, and details.
- async evaluate_async(eval_input: EvalInput, **kwargs: Any) EvalResult[source]¶
Evaluate a model output asynchronously.
Default implementation calls synchronous evaluate(). Override for truly async evaluators.
- Parameters:
eval_input – The input containing user_input, model_output, etc.
**kwargs – Additional arguments for the evaluator.
- Returns:
EvalResult with status, score, and details.
- evaluate_batch(inputs: List[EvalInput], **kwargs: Any) List[EvalResult][source]¶
Evaluate multiple model outputs in batch.
Default implementation calls evaluate() for each input. Override for optimized batch processing.
- Parameters:
inputs – List of EvalInput objects to evaluate.
**kwargs – Additional arguments for the evaluator.
- Returns:
List of EvalResult objects, one per input.
- async evaluate_batch_async(inputs: List[EvalInput], **kwargs: Any) List[EvalResult][source]¶
Evaluate multiple model outputs asynchronously in batch.
Default implementation calls evaluate_async() for each input. Override for optimized async batch processing.
- Parameters:
inputs – List of EvalInput objects to evaluate.
**kwargs – Additional arguments for the evaluator.
- Returns:
List of EvalResult objects, one per input.
- classmethod is_available() bool[source]¶
Check if this evaluator can be used (dependencies present).
- Returns:
True if all required dependencies are available.
- name: str = 'base_eval'¶
- class aiobs.evals.BaseEvalConfig(*, name: str | None = None, fail_fast: bool = False, include_details: bool = True)[source]¶
Bases: BaseModel
Base configuration for all evaluators.
- fail_fast: bool¶
- include_details: bool¶
- model_config: ClassVar[ConfigDict] = {}¶
Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].
- name: str | None¶
- class aiobs.evals.EvalInput(*, user_input: str, model_output: str, system_prompt: str | None = None, expected_output: str | None = None, context: Dict[str, Any] | None = None, metadata: Dict[str, Any] | None = None, tags: List[str] | None = None)[source]¶
Bases: BaseModel
Standard input model for evaluations.
This is the core data structure that evaluators use to assess model outputs. It captures the full context of an LLM interaction.
Example
eval_input = EvalInput(
    user_input="What is the capital of France?",
    model_output="The capital of France is Paris.",
    system_prompt="You are a helpful geography assistant.",
)
- context: Dict[str, Any] | None¶
- expected_output: str | None¶
- metadata: Dict[str, Any] | None¶
- model_config: ClassVar[ConfigDict] = {}¶
Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].
- model_output: str¶
- system_prompt: str | None¶
- tags: List[str] | None¶
- user_input: str¶
- class aiobs.evals.EvalResult(*, status: ~aiobs.evals.models.eval_result.EvalStatus, score: ~typing.Annotated[float, ~annotated_types.Ge(ge=0.0), ~annotated_types.Le(le=1.0)], eval_name: str, message: str | None = None, details: ~typing.Dict[str, ~typing.Any] | None = None, assertions: ~typing.List[~aiobs.evals.models.eval_result.AssertionDetail] | None = None, duration_ms: float | None = None, evaluated_at: ~datetime.datetime = <factory>, metadata: ~typing.Dict[str, ~typing.Any] | None = None)[source]¶
Bases: BaseModel
Result model for evaluations.
Contains the evaluation outcome, score, and detailed information about what was evaluated and why it passed/failed.
Example
result = EvalResult(
    status=EvalStatus.PASSED,
    score=1.0,
    eval_name="regex_assertion",
    message="Output matches pattern: .*Paris.*",
)
- assertions: List['AssertionDetail'] | None¶
- details: Dict[str, Any] | None¶
- duration_ms: float | None¶
- classmethod error_result(eval_name: str, error: Exception, **kwargs: Any) EvalResult[source]¶
Create an error result.
- Parameters:
eval_name – Name of the evaluator.
error – The exception that occurred.
**kwargs – Additional fields.
- Returns:
EvalResult with ERROR status.
- eval_name: str¶
- evaluated_at: datetime¶
- classmethod fail_result(eval_name: str, score: float = 0.0, message: str | None = None, **kwargs: Any) EvalResult[source]¶
Create a failing result.
- Parameters:
eval_name – Name of the evaluator.
score – Score between 0 and 1 (default 0.0).
message – Optional message.
**kwargs – Additional fields.
- Returns:
EvalResult with FAILED status.
- property failed: bool¶
Check if the evaluation failed.
- message: str | None¶
- metadata: Dict[str, Any] | None¶
- model_config: ClassVar[ConfigDict] = {}¶
Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].
- classmethod pass_result(eval_name: str, score: float = 1.0, message: str | None = None, **kwargs: Any) EvalResult[source]¶
Create a passing result.
- Parameters:
eval_name – Name of the evaluator.
score – Score between 0 and 1 (default 1.0).
message – Optional message.
**kwargs – Additional fields.
- Returns:
EvalResult with PASSED status.
- property passed: bool¶
Check if the evaluation passed.
- score: float¶
- status: EvalStatus¶
- class aiobs.evals.EvalStatus(value)[source]¶
Bases: str, Enum
Status of an evaluation result.
- ERROR = 'error'¶
- FAILED = 'failed'¶
- PASSED = 'passed'¶
- SKIPPED = 'skipped'¶
- class aiobs.evals.GroundTruthConfig(*, name: str | None = None, fail_fast: bool = False, include_details: bool = True, match_mode: GroundTruthMatchMode = GroundTruthMatchMode.NORMALIZED, case_sensitive: bool = False, normalize_whitespace: bool = True, strip_punctuation: bool = False, similarity_threshold: Annotated[float, Ge(ge=0.0), Le(le=1.0)] = 0.9)[source]¶
Bases: BaseEvalConfig
Configuration for ground truth comparison evaluator.
- case_sensitive: bool¶
- match_mode: GroundTruthMatchMode¶
- model_config: ClassVar[ConfigDict] = {}¶
Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].
- normalize_whitespace: bool¶
- similarity_threshold: float¶
- strip_punctuation: bool¶
- class aiobs.evals.GroundTruthEval(config: GroundTruthConfig | None = None)[source]¶
Bases: BaseEval
Evaluator that compares model output against expected ground truth.
Supports multiple comparison modes:
- exact: Exact string match
- contains: Output contains expected
- normalized: Whitespace/case normalized comparison
- semantic: Placeholder for embedding-based comparison
Example
config = GroundTruthConfig(
    match_mode=GroundTruthMatchMode.NORMALIZED,
    case_sensitive=False,
)
evaluator = GroundTruthEval(config)

result = evaluator.evaluate(
    EvalInput(
        user_input="What is 2+2?",
        model_output="The answer is 4.",
        expected_output="4",
    )
)
- config_class¶
alias of
GroundTruthConfig
- classmethod contains(case_sensitive: bool = False) GroundTruthEval[source]¶
Create evaluator for contains comparison.
- Parameters:
case_sensitive – Whether comparison is case-sensitive.
- Returns:
Configured GroundTruthEval instance.
- description: str = 'Compares model output against expected ground truth'¶
- evaluate(eval_input: EvalInput, **kwargs: Any) EvalResult[source]¶
Evaluate model output against ground truth.
- Parameters:
eval_input – Input containing model_output and expected_output.
**kwargs – Can contain 'expected' to override eval_input.expected_output.
- Returns:
EvalResult indicating pass/fail.
- classmethod exact(case_sensitive: bool = True) GroundTruthEval[source]¶
Create evaluator for exact match comparison.
- Parameters:
case_sensitive – Whether comparison is case-sensitive.
- Returns:
Configured GroundTruthEval instance.
- name: str = 'ground_truth'¶
- classmethod normalized(case_sensitive: bool = False, strip_punctuation: bool = False) GroundTruthEval[source]¶
Create evaluator for normalized comparison.
- Parameters:
case_sensitive – Whether comparison is case-sensitive.
strip_punctuation – Whether to strip punctuation.
- Returns:
Configured GroundTruthEval instance.
- class aiobs.evals.GroundTruthMatchMode(value)[source]¶
Bases: str, Enum
Match modes for ground truth comparison.
- CONTAINS = 'contains'¶
- EXACT = 'exact'¶
- NORMALIZED = 'normalized'¶
- SEMANTIC = 'semantic'¶
- class aiobs.evals.HallucinationDetectionConfig(*, name: str | None = None, fail_fast: bool = False, include_details: bool = True, model: str | None = None, temperature: Annotated[float, Ge(ge=0.0), Le(le=2.0)] = 0.0, check_against_context: bool = True, check_against_input: bool = True, strict: bool = False, hallucination_threshold: Annotated[float, Ge(ge=0.0), Le(le=1.0)] = 0.5, extract_claims: bool = True, max_claims: Annotated[int, Ge(ge=1)] = 10)[source]¶
Bases: BaseEvalConfig
Configuration for hallucination detection evaluator.
Uses an LLM-as-judge approach to detect hallucinations in model outputs. The evaluator checks if the model output contains fabricated information that is not supported by the provided context.
- check_against_context: bool¶
- check_against_input: bool¶
- extract_claims: bool¶
- hallucination_threshold: float¶
- max_claims: int¶
- model: str | None¶
- model_config: ClassVar[ConfigDict] = {}¶
Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].
- strict: bool¶
- temperature: float¶
- class aiobs.evals.HallucinationDetectionEval(client: Any, model: str, config: HallucinationDetectionConfig | None = None, temperature: float = 0.0, max_tokens: int | None = None)[source]¶
Bases: BaseEval
Evaluator that detects hallucinations in model outputs using LLM-as-judge.
This evaluator uses another LLM to analyze model outputs and identify hallucinations - fabricated, false, or unsupported information.
Example
from openai import OpenAI
from aiobs.evals import HallucinationDetectionEval, EvalInput

# Create evaluator with OpenAI client
client = OpenAI()
evaluator = HallucinationDetectionEval(client=client, model="gpt-4o")

# Evaluate model output
result = evaluator.evaluate(
    EvalInput(
        user_input="What is the capital of France?",
        model_output="Paris is the capital of France. It was founded in 250 BC by Julius Caesar.",
        context={"documents": ["Paris is the capital and largest city of France."]},
    )
)
print(result.status)  # EvalStatus.FAILED (hallucination detected)
print(result.score)   # 0.5 (moderate hallucination)
- config_class¶
alias of
HallucinationDetectionConfig
- description: str = 'Detects hallucinations in model outputs using LLM-as-judge'¶
- evaluate(eval_input: EvalInput, **kwargs: Any) EvalResult[source]¶
Evaluate model output for hallucinations.
- Parameters:
eval_input – Input containing model_output to check.
**kwargs – Additional arguments (unused).
- Returns:
EvalResult indicating presence/absence of hallucinations.
- async evaluate_async(eval_input: EvalInput, **kwargs: Any) EvalResult[source]¶
Evaluate model output for hallucinations asynchronously.
- Parameters:
eval_input – Input containing model_output to check.
**kwargs – Additional arguments (unused).
- Returns:
EvalResult indicating presence/absence of hallucinations.
- name: str = 'hallucination_detection'¶
- classmethod with_anthropic(client: Any, model: str = 'claude-3-sonnet-20240229', **kwargs: Any) HallucinationDetectionEval[source]¶
Create evaluator with an Anthropic client.
- Parameters:
client – Anthropic client instance.
model – Model name (default: claude-3-sonnet-20240229).
**kwargs – Additional config options.
- Returns:
Configured HallucinationDetectionEval instance.
- classmethod with_gemini(client: Any, model: str = 'gemini-2.0-flash', **kwargs: Any) HallucinationDetectionEval[source]¶
Create evaluator with a Gemini client.
- Parameters:
client – Google GenAI client instance.
model – Model name (default: gemini-2.0-flash).
**kwargs – Additional config options.
- Returns:
Configured HallucinationDetectionEval instance.
- classmethod with_openai(client: Any, model: str = 'gpt-4o-mini', **kwargs: Any) HallucinationDetectionEval[source]¶
Create evaluator with an OpenAI client.
- Parameters:
client – OpenAI client instance.
model – Model name (default: gpt-4o-mini).
**kwargs – Additional config options.
- Returns:
Configured HallucinationDetectionEval instance.
- class aiobs.evals.LatencyConsistencyConfig(*, name: str | None = None, fail_fast: bool = False, include_details: bool = True, max_latency_ms: float | None = None, max_std_dev_ms: float | None = None, max_p95_ms: float | None = None, max_p99_ms: float | None = None, coefficient_of_variation_threshold: Annotated[float, Ge(ge=0)] = 0.5)[source]¶
Bases: BaseEvalConfig
Configuration for latency consistency evaluator.
- coefficient_of_variation_threshold: float¶
- max_latency_ms: float | None¶
- max_p95_ms: float | None¶
- max_p99_ms: float | None¶
- max_std_dev_ms: float | None¶
- model_config: ClassVar[ConfigDict] = {}¶
Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].
- class aiobs.evals.LatencyConsistencyEval(config: LatencyConsistencyConfig | None = None)[source]¶
Bases: BaseEval
Evaluator that checks latency consistency across multiple runs.
This evaluator analyzes latency data to ensure:
- Individual latencies are within acceptable bounds
- Latency variation (std dev, CV) is acceptable
- P95/P99 latencies are within bounds
The latency data should be provided in the eval_input.metadata dict under the key 'latencies' (a list of floats in ms), or passed via kwargs.
Example
config = LatencyConsistencyConfig(
    max_latency_ms=5000,
    max_p95_ms=4000,
    coefficient_of_variation_threshold=0.3,
)
evaluator = LatencyConsistencyEval(config)

result = evaluator.evaluate(
    EvalInput(
        user_input="test query",
        model_output="test response",
        metadata={"latencies": [100, 120, 95, 110, 105]},
    )
)
- config_class¶
alias of
LatencyConsistencyConfig
- description: str = 'Evaluates latency consistency across multiple runs'¶
- evaluate(eval_input: EvalInput, **kwargs: Any) EvalResult[source]¶
Evaluate latency consistency.
- Parameters:
eval_input – Input with latencies in metadata['latencies'].
**kwargs – Can contain 'latencies' list to override.
- Returns:
EvalResult indicating pass/fail with latency statistics.
- name: str = 'latency_consistency'¶
- classmethod with_thresholds(max_latency_ms: float | None = None, max_p95_ms: float | None = None, max_p99_ms: float | None = None, cv_threshold: float = 0.5) LatencyConsistencyEval[source]¶
Create evaluator with specific thresholds.
- Parameters:
max_latency_ms – Maximum acceptable latency.
max_p95_ms – Maximum acceptable P95 latency.
max_p99_ms – Maximum acceptable P99 latency.
cv_threshold – Maximum coefficient of variation.
- Returns:
Configured LatencyConsistencyEval instance.
- class aiobs.evals.PIIDetectionConfig(*, name: str | None = None, fail_fast: bool = False, include_details: bool = True, detect_types: ~typing.List[~aiobs.evals.models.configs.PIIType] = <factory>, custom_patterns: ~typing.Dict[str, str] = <factory>, redact: bool = False, fail_on_detection: bool = True, check_input: bool = False, check_system_prompt: bool = False)[source]¶
Bases: BaseEvalConfig
Configuration for PII detection evaluator.
- check_input: bool¶
- check_system_prompt: bool¶
- custom_patterns: Dict[str, str]¶
- fail_on_detection: bool¶
- model_config: ClassVar[ConfigDict] = {}¶
Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].
- redact: bool¶
- class aiobs.evals.PIIDetectionEval(config: PIIDetectionConfig | None = None)[source]¶
Bases: BaseEval
Evaluator that detects PII in model outputs.
Detects common PII patterns including:
- Email addresses
- Phone numbers (US format)
- Social Security Numbers (SSN)
- Credit card numbers
- IP addresses
- Custom patterns
Example
config = PIIDetectionConfig(
    detect_types=[PIIType.EMAIL, PIIType.PHONE, PIIType.SSN],
    fail_on_detection=True,
)
evaluator = PIIDetectionEval(config)

result = evaluator.evaluate(
    EvalInput(
        user_input="What's your email?",
        model_output="You can reach me at john@example.com",
    )
)
# result.failed == True (email detected)
- DEFAULT_PATTERNS: Dict[PIIType, str] = {PIIType.CREDIT_CARD: '\\b(?:4[0-9]{12}(?:[0-9]{3})?|5[1-5][0-9]{14}|3[47][0-9]{13}|6(?:011|5[0-9]{2})[0-9]{12})\\b', PIIType.DATE_OF_BIRTH: '\\b(?:0?[1-9]|1[0-2])[/-](?:0?[1-9]|[12][0-9]|3[01])[/-](?:19|20)\\d{2}\\b', PIIType.EMAIL: '\\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\\.[A-Z|a-z]{2,}\\b', PIIType.IP_ADDRESS: '\\b(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\\b', PIIType.PHONE: '\\b(?:\\+?1[-.\\s]?)?(?:\\(?[0-9]{3}\\)?[-.\\s]?)?[0-9]{3}[-.\\s]?[0-9]{4}\\b', PIIType.SSN: '\\b(?!000|666|9\\d{2})\\d{3}[-\\s]?(?!00)\\d{2}[-\\s]?(?!0000)\\d{4}\\b'}¶
- REDACTION_MASKS: Dict[PIIType, str] = {PIIType.ADDRESS: '[ADDRESS REDACTED]', PIIType.CREDIT_CARD: '[CREDIT CARD REDACTED]', PIIType.CUSTOM: '[PII REDACTED]', PIIType.DATE_OF_BIRTH: '[DOB REDACTED]', PIIType.EMAIL: '[EMAIL REDACTED]', PIIType.IP_ADDRESS: '[IP REDACTED]', PIIType.NAME: '[NAME REDACTED]', PIIType.PHONE: '[PHONE REDACTED]', PIIType.SSN: '[SSN REDACTED]'}¶
- config_class¶
alias of
PIIDetectionConfig
- classmethod default(fail_on_detection: bool = True) PIIDetectionEval[source]¶
Create evaluator with default PII types.
- Parameters:
fail_on_detection – Whether to fail if PII is found.
- Returns:
Configured PIIDetectionEval instance.
- description: str = 'Detects personally identifiable information in outputs'¶
- evaluate(eval_input: EvalInput, **kwargs: Any) EvalResult[source]¶
Evaluate model output for PII.
- Parameters:
eval_input – Input containing model_output to check.
**kwargs – Additional arguments (unused).
- Returns:
EvalResult indicating pass (no PII) or fail (PII detected).
- name: str = 'pii_detection'¶
- redact(text: str) str[source]¶
Redact PII from text (convenience method).
- Parameters:
text – Text to redact.
- Returns:
Text with PII redacted.
- scan(text: str) List[PIIMatch][source]¶
Scan text for PII (convenience method).
- Parameters:
text – Text to scan.
- Returns:
List of PIIMatch objects.
- classmethod strict() PIIDetectionEval[source]¶
Create evaluator that checks all PII types.
- Returns:
Configured PIIDetectionEval instance.
- classmethod with_custom_patterns(patterns: Dict[str, str], fail_on_detection: bool = True) PIIDetectionEval[source]¶
Create evaluator with custom patterns.
- Parameters:
patterns – Dictionary mapping names to regex patterns.
fail_on_detection – Whether to fail if PII is found.
- Returns:
Configured PIIDetectionEval instance.
- class aiobs.evals.PIIType(value)[source]¶
Bases: str, Enum
Types of PII to detect.
- ADDRESS = 'address'¶
- CREDIT_CARD = 'credit_card'¶
- CUSTOM = 'custom'¶
- DATE_OF_BIRTH = 'date_of_birth'¶
- EMAIL = 'email'¶
- IP_ADDRESS = 'ip_address'¶
- NAME = 'name'¶
- PHONE = 'phone'¶
- SSN = 'ssn'¶
- class aiobs.evals.RegexAssertion(config: RegexAssertionConfig | None = None)[source]¶
Bases: BaseEval
Evaluator that asserts model output matches regex patterns.
This evaluator checks if the model output matches specified regex patterns and does NOT match negative patterns.
Example
# Check that output contains an email
config = RegexAssertionConfig(
    patterns=[r"[\w.-]+@[\w.-]+\.\w+"],
    case_sensitive=False,
)
evaluator = RegexAssertion(config)

result = evaluator.evaluate(
    EvalInput(
        user_input="Give me an email",
        model_output="Contact us at support@example.com",
    )
)
assert result.passed

# Check that output does NOT contain certain words
config = RegexAssertionConfig(
    negative_patterns=[r"\b(sorry|cannot|unable)\b"],
    case_sensitive=False,
)
- config_class¶
alias of
RegexAssertionConfig
- description: str = "Asserts that output matches/doesn't match regex patterns"¶
- evaluate(eval_input: EvalInput, **kwargs: Any) EvalResult[source]¶
Evaluate if model output matches regex patterns.
- Parameters:
eval_input – Input containing model_output to check.
**kwargs – Additional arguments (unused).
- Returns:
EvalResult indicating pass/fail.
- classmethod from_patterns(patterns: List[str] | None = None, negative_patterns: List[str] | None = None, case_sensitive: bool = True, match_mode: str = 'any') RegexAssertion[source]¶
Create evaluator from pattern lists.
- Parameters:
patterns – Patterns that must match.
negative_patterns – Patterns that must NOT match.
case_sensitive – Whether matching is case-sensitive.
match_mode – 'any' or 'all' for positive patterns.
- Returns:
Configured RegexAssertion instance.
- name: str = 'regex_assertion'¶
- class aiobs.evals.RegexAssertionConfig(*, name: str | None = None, fail_fast: bool = False, include_details: bool = True, patterns: ~typing.List[str] = <factory>, negative_patterns: ~typing.List[str] = <factory>, case_sensitive: bool = True, match_mode: str = 'any')[source]¶
Bases: BaseEvalConfig
Configuration for regex assertion evaluator.
- case_sensitive: bool¶
- get_compiled_negative_patterns() List[Pattern[str]][source]¶
Get compiled negative regex patterns.
- Returns:
List of compiled regex Pattern objects.
- get_compiled_patterns() List[Pattern[str]][source]¶
Get compiled regex patterns.
- Returns:
List of compiled regex Pattern objects.
- match_mode: str¶
- model_config: ClassVar[ConfigDict] = {}¶
Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].
- negative_patterns: List[str]¶
- patterns: List[str]¶
- class aiobs.evals.SchemaAssertion(config: SchemaAssertionConfig)[source]¶
Bases: BaseEval
Evaluator that asserts model output matches a JSON schema.
This evaluator validates that the model output is valid JSON and conforms to the specified JSON Schema.
Example
schema = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "age": {"type": "integer", "minimum": 0},
    },
    "required": ["name", "age"],
}
config = SchemaAssertionConfig(schema=schema)
evaluator = SchemaAssertion(config)

result = evaluator.evaluate(
    EvalInput(
        user_input="Extract person info",
        model_output='{"name": "John", "age": 30}',
    )
)
assert result.passed
- config_class¶
alias of
SchemaAssertionConfig
- description: str = 'Asserts that output is valid JSON matching a schema'¶
- evaluate(eval_input: EvalInput, **kwargs: Any) EvalResult[source]¶
Evaluate if model output matches JSON schema.
- Parameters:
eval_input – Input containing model_output to validate.
**kwargs – Additional arguments (unused).
- Returns:
EvalResult indicating pass/fail.
- classmethod from_schema(schema: Dict[str, Any], strict: bool = True, extract_json: bool = True) SchemaAssertion[source]¶
Create evaluator from a schema dict.
- Parameters:
schema – JSON Schema dictionary.
strict – Whether to fail on additional properties.
extract_json – Whether to extract JSON from markdown.
- Returns:
Configured SchemaAssertion instance.
- name: str = 'schema_assertion'¶
- class aiobs.evals.SchemaAssertionConfig(*, name: str | None = None, fail_fast: bool = False, include_details: bool = True, json_schema: Dict[str, Any], strict: bool = True, parse_json: bool = True, extract_json: bool = True)[source]¶
Bases: BaseEvalConfig
Configuration for JSON schema assertion evaluator.
- extract_json: bool¶
- json_schema: Dict[str, Any]¶
- model_config: ClassVar[ConfigDict] = {}¶
Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].
- parse_json: bool¶
- strict: bool¶