Laravel AI Agent Evaluation

Evaluate your Laravel AI SDK agents across multiple providers and models. Compare performance, accuracy, and cost.

Installation

composer require --dev casawatt/laravel-ai-agent-evaluation

Publish the config file:

php artisan vendor:publish --tag="laravel-ai-agent-evaluation-config"
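
The published file exposes the options documented in the sections below. A partial sketch (the defaults shown here are assumptions, not necessarily the package's actual defaults):

// config/ai-agent-evaluation.php (partial sketch; defaults are assumptions)
return [
    // Cost resolvers, tried in order (see "Cost Resolvers" below)
    'cost_resolvers' => [],

    // Cases to run concurrently; 1 means sequential (see "Parallel Execution")
    'parallel' => 1,

    // Where results are persisted (see "Storage")
    'storage' => [
        'driver' => 'sqlite', // sqlite, file
        'path' => storage_path('ai-agent-evaluation'),
    ],
];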

Quick Start

1. Create an evaluation

php artisan make:agent-evaluation SalesCoach

This creates:

agent-evaluations/
    SalesCoachEvaluation.php
    results/
        .gitignore

2. Define your evaluation

Edit agent-evaluations/SalesCoachEvaluation.php:

<?php

use Laravel\Ai\Enums\Lab;
use Casawatt\LaravelAiAgentEvaluation\Attributes\EvaluationCase;
use Casawatt\LaravelAiAgentEvaluation\Evaluation;

return new class extends Evaluation
{
    protected string $agent = \App\Ai\Agents\SalesCoach::class;

    public function setUp(): void
    {
        $this->variant(Lab::Mistral, 'mistral-small-3.2-24b-instruct');
        $this->variant(Lab::OpenRouter, 'google/gemma-3-27b-it');
        $this->variant(Lab::OpenRouter, 'openai/gpt-oss-120b');
        $this->variant('scaleway', 'gpt-oss-120b');
    }

    #[EvaluationCase]
    public function contains_hello(): void
    {
        $this->agent(prompt: 'Say hello to the user')
            ->assertContains('hello');
    }

    #[EvaluationCase(description: 'Handles file attachments')]
    public function with_attachments(): void
    {
        $this->agent(
            prompt: 'Summarize this document',
            attachments: [
                'path/to/document.pdf',
                'https://example.com/data.csv',
            ],
        )->assertNotEmpty()
          ->assertMinLength(50);
    }
};

3. Run evaluations

php artisan agent-evaluation

Each case is run against every variant. Results are displayed in the console and persisted to SQLite.

Variants

Variants define the configurations to test against. Configure them in setUp():

public function setUp(): void
{
    $this->variant(Lab::OpenAI, 'gpt-4o-mini');
    $this->variant(Lab::Anthropic, 'claude-sonnet-4-20250514')
        ->label('Sonnet');
}

Each variant can have a custom label (used in the results table) and a custom instruction (overrides the agent's default system prompt):

public function setUp(): void
{
    $this->variant(Lab::OpenAI, 'gpt-4o-mini')
        ->label('Default instructions');

    $this->variant(Lab::OpenAI, 'gpt-4o-mini')
        ->label('Strict coach')
        ->instruction('You are a strict sales coach. Never offer discounts.');

    $this->variant(Lab::OpenAI, 'gpt-4o-mini')
        ->label('Lenient coach')
        ->instruction('You are a lenient sales coach. Offer discounts freely.');
}

Instructions can also be loaded from a file using the file:// prefix — useful for long or complex system prompts:

$this->variant(Lab::OpenAI, 'gpt-4o-mini')
    ->label('Strict coach')
    ->instruction('file://prompts/strict-coach.md');

Relative paths are resolved from the evaluations directory (agent-evaluations/ by default). Absolute paths are used as-is.
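
As a concrete illustration (the absolute path below is hypothetical):

$this->variant(Lab::OpenAI, 'gpt-4o-mini')
    ->instruction('file://prompts/strict-coach.md');       // resolves to agent-evaluations/prompts/strict-coach.md

$this->variant(Lab::OpenAI, 'gpt-4o-mini')
    ->instruction('file:///opt/shared/prompts/coach.md');  // absolute path, used as-is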

This lets you compare the same model with different instructions — useful for prompt engineering and evaluating system prompt variations.

Cost Tracking

Add pricing to variants to track cost per evaluation. Pricing is defined in dollars per million tokens:

$this->variant(Lab::OpenAI, 'gpt-4o-mini')
    ->pricing(inputPerMillion: 0.15, outputPerMillion: 0.60);

$this->variant(Lab::Anthropic, 'claude-sonnet-4-20250514')
    ->pricing(inputPerMillion: 3.00, outputPerMillion: 15.00);

When at least one variant has pricing, the variant summary table shows a Cost column. Variants without pricing show a dash in that column.

Cost Resolvers

Instead of setting pricing on each variant, you can register cost resolvers that automatically look up pricing by provider and model. Resolvers are tried in order — the first non-null result wins. Explicit pricing() on a variant always takes precedence.

The package ships with two built-in resolvers:

| Resolver               | Source         | Scope                                           |
|------------------------|----------------|-------------------------------------------------|
| OpenRouterCostResolver | OpenRouter API | Lab::OpenRouter variants only                   |
| ModelsDevCostResolver  | models.dev     | Any provider (OpenAI, Anthropic, Mistral, etc.) |

Enable them in config/ai-agent-evaluation.php:

'cost_resolvers' => [
    \Casawatt\LaravelAiAgentEvaluation\CostResolvers\OpenRouterCostResolver::class,
    \Casawatt\LaravelAiAgentEvaluation\CostResolvers\ModelsDevCostResolver::class,
],

Each API is called once per run and the result is cached in memory for that run. Place more specific resolvers first, since the first non-null result wins.

You can create your own resolvers by implementing CostResolverInterface:

use Casawatt\LaravelAiAgentEvaluation\CostResolverInterface;
use Casawatt\LaravelAiAgentEvaluation\Price;
use Laravel\Ai\Enums\Lab;

class MyCostResolver implements CostResolverInterface
{
    public function resolve(Lab|string $provider, string $model): ?Price
    {
        // Return null if this resolver doesn't handle this provider
        // Return a Price with per-million-token costs otherwise
        return new Price(inputPerMillion: 0.15, outputPerMillion: 0.60);
    }
}

Custom providers (not in the Lab enum) work as long as they are configured in your config/ai.php.
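
For example, a hypothetical resolver for the custom 'scaleway' provider used in the Quick Start (the class name and prices are illustrative):

use Casawatt\LaravelAiAgentEvaluation\CostResolverInterface;
use Casawatt\LaravelAiAgentEvaluation\Price;
use Laravel\Ai\Enums\Lab;

class ScalewayCostResolver implements CostResolverInterface
{
    public function resolve(Lab|string $provider, string $model): ?Price
    {
        // Custom providers arrive as plain strings, not Lab cases
        if ($provider !== 'scaleway') {
            return null; // defer to the next resolver
        }

        // Illustrative prices, not real Scaleway pricing
        return match ($model) {
            'gpt-oss-120b' => new Price(inputPerMillion: 0.15, outputPerMillion: 0.60),
            default => null,
        };
    }
}

Register it before the built-in resolvers in cost_resolvers so it takes precedence for Scaleway models.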

Assertions

All assertions are chainable. The package returns an AssertableResponse for text agents and an AssertableStructuredResponse for agents implementing HasStructuredOutput — the correct type is detected automatically.

Text (AssertableResponse)

| Method                                      | Description                           |
|---------------------------------------------|---------------------------------------|
| assertContains(string $needle)              | Response contains the string          |
| assertNotContains(string $needle)           | Response does not contain the string  |
| assertContainsIgnoringCase(string $needle)  | Case-insensitive contains             |
| assertRegex(string $pattern)                | Response matches the regex            |
| assertNotRegex(string $pattern)             | Response does not match the regex     |
| assertStartsWith(string $prefix)            | Response starts with the string       |
| assertEndsWith(string $suffix)              | Response ends with the string         |
| assertExactly(string $expected)             | Response equals the string exactly    |
| assertEmpty()                               | Response is empty                     |
| assertNotEmpty()                            | Response is not empty                 |
| assertMinLength(int $min)                   | Response has at least $min characters |
| assertMaxLength(int $max)                   | Response has at most $max characters  |

Structured Data (AssertableStructuredResponse)

Available automatically when your agent implements HasStructuredOutput. These work directly on the parsed $response->structured array — no JSON parsing needed.

| Method                                           | Description                                            |
|--------------------------------------------------|--------------------------------------------------------|
| assertStructure(array $structure)                | Validates nested key structure (supports * wildcards)  |
| assertPath(string $path, mixed $expected)        | Value at dot-notation path equals expected             |
| assertPathContains(string $path, string $needle) | String value at path contains the needle               |
| assertHasKey(string $key)                        | Key exists (supports dot-notation)                     |
| assertMissingKey(string $key)                    | Key does not exist (supports dot-notation)             |
| assertCount(int $count)                          | Top-level array has N entries                          |
| assertWhere(string $path, callable $callback)    | Value at path satisfies callback                       |

$this->agent(prompt: 'Describe the product')
    ->assertStructure(['name', 'price', 'tags' => ['*' => ['label']]])
    ->assertPath('name', 'Widget')
    ->assertPathContains('name', 'Wid')
    ->assertHasKey('price')
    ->assertMissingKey('deleted_at')
    ->assertCount(3)
    ->assertWhere('price', fn ($v) => $v > 0 && $v < 1000);

Tool Calls

| Method                                              | Description                             |
|-----------------------------------------------------|-----------------------------------------|
| assertToolCalled(string $toolName)                  | The tool was called during the response |
| assertToolNotCalled(string $toolName)               | The tool was not called                 |
| assertToolCalledTimes(string $toolName, int $times) | The tool was called exactly N times     |
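
A sketch of how these chain in a case (the tool names are illustrative):

#[EvaluationCase]
public function searches_exactly_once(): void
{
    $this->agent(prompt: 'Find restaurants near the Eiffel Tower')
        ->assertToolCalled('search')          // the tool was invoked
        ->assertToolCalledTimes('search', 1)  // exactly once
        ->assertToolNotCalled('send_email');  // and nothing unexpected ran
}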

Performance

| Method                                | Description                                           |
|---------------------------------------|-------------------------------------------------------|
| assertLatencyBelow(float $maxSeconds) | Response latency is below the threshold               |
| assertTokensBelow(int $maxTokens)     | Total tokens (input + output) is below the threshold  |
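
Both chain like any other assertion; the thresholds below are arbitrary:

$this->agent(prompt: 'Reply with a short greeting')
    ->assertLatencyBelow(5.0)   // seconds
    ->assertTokensBelow(500);   // input + output combined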

Custom

Use assert() with a callback and a failure message for checks the built-in assertions don't cover:

->assert(fn ($response) => str_word_count($response->text) > 10, 'Expected more than 10 words')

Weighted Assertions

Every assertion has a weight parameter, a float between 0 and 1 that indicates the assertion's importance (default 1.0). Assertions never throw: failures are recorded and execution continues, so all assertions in a case always run. Typical weights:

  • 1.0 — full importance (default)
  • 0.5 — half as important
  • 0.1 — nice-to-have

#[EvaluationCase]
public function evaluates_response_quality(): void
{
    $this->agent(prompt: 'What is the capital of France?')
        ->assertContains('Paris')                       // weight: 1.0 (default)
        ->assertNotContains('London')                   // weight: 1.0 (default)
        ->assertMaxLength(200, weight: 0.3);            // less important
}

The variant summary table shows a Score column with weighted percentages:

| Variant     | Results     | Score           | Avg Latency | Tokens In | Tokens Out |
|-------------|-------------|-----------------|-------------|-----------|------------|
| openai/gpt4 | 3/4 (75.0%) | 2.3/3.3 (69.7%) | 320ms       | 200       | 3,000      |
| mistral/sm  | 2/4 (50.0%) | 1.3/3.3 (39.4%) | 210ms       | 180       | 2,500      |
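
For example, if a variant runs the three assertions above (weights 1.0 + 1.0 + 0.3 = 2.3) and only the 0.3-weighted length check fails, it would score 2.0/2.3 (87.0%), assuming the score is the sum of passed assertion weights over the total weight.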

Metrics

Every assertion accepts an optional metric tag to group assertions by quality dimension (e.g. accuracy, completeness, safety). Metrics aggregate scores across all cases within a suite, giving you a per-dimension breakdown by variant.

#[EvaluationCase]
public function knows_capital(): void
{
    $this->agent(prompt: 'What is the capital of France?')
        ->assertContains('Paris', metric: 'accuracy')
        ->assertMinLength(20, metric: 'completeness')
        ->assertMaxLength(200, metric: 'completeness');
}

#[EvaluationCase]
public function explains_concept(): void
{
    $this->agent(prompt: 'Explain gravity in simple terms')
        ->assertContains('force', metric: 'accuracy')
        ->assertMinLength(50, metric: 'completeness')
        ->assertNotContains('kill', metric: 'safety');
}

When any assertion has a metric, the console output includes an additional Metrics table:

| Metric       | openai/gpt-4o-mini | anthropic/claude-haiku |
|--------------|--------------------|------------------------|
| accuracy     | 2 / 2 (100.0%)     | 1 / 2 (50.0%)          |
| completeness | 3 / 3 (100.0%)     | 2 / 3 (66.7%)          |
| safety       | 1 / 1 (100.0%)     | 1 / 1 (100.0%)         |

Metrics are persisted to storage alongside each assertion result, so they are available for web reporting without re-running evaluations.

Data Providers

Use #[With('methodName')] to feed multiple data sets into a single case — like PHPUnit's #[DataProvider]. The method can load data from anywhere: arrays, models, files, APIs.

use Casawatt\LaravelAiAgentEvaluation\Attributes\With;

#[EvaluationCase]
#[With('capitalCities')]
public function knows_capital(string $country, string $capital): void
{
    $this->agent(prompt: "What is the capital of {$country}?")
        ->assertContains($capital);
}

public function capitalCities(): array
{
    return [
        'france'  => ['France', 'Paris'],
        'germany' => ['Germany', 'Berlin'],
        'japan'   => ['Japan', 'Tokyo'],
    ];
}

Each data set becomes a separate row in the results: knows_capital (france), knows_capital (germany), etc. The keys are used as labels.

The provider method can return any data — query a database, read a CSV, call an API:

public function customerQuestions(): array
{
    return Customer::query()
        ->where('type', 'test')
        ->get()
        ->mapWithKeys(fn ($c) => [$c->name => [$c->question, $c->expected_answer]])
        ->all();
}

Skipping Cases

Some providers or models may not support certain features (e.g. tool use). You can skip cases conditionally using skip() or skipWhen():

use Casawatt\LaravelAiAgentEvaluation\Variant;
use Laravel\Ai\Enums\Lab;

#[EvaluationCase]
public function uses_tools(): void
{
    $this->skipWhen(
        fn (Variant $v) => $v->provider === Lab::Mistral,
        'Mistral does not support tools',
    );

    $this->agent(prompt: 'Search for restaurants nearby')
        ->assertToolCalled('search');
}

skipWhen() accepts a boolean or a callable that receives the current Variant. You can also call skip() to unconditionally skip:

#[EvaluationCase]
public function not_ready_yet(): void
{
    $this->skip('Work in progress');
}
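
The boolean form is useful when the condition does not depend on the variant; a minimal sketch (the fixture path is hypothetical):

#[EvaluationCase]
public function summarizes_fixture(): void
{
    // Boolean form: evaluated once, no Variant inspection needed
    $this->skipWhen(! file_exists('fixtures/report.pdf'), 'Fixture file missing');

    $this->agent(prompt: 'Summarize this document', attachments: ['fixtures/report.pdf'])
        ->assertNotEmpty();
}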

Skipped cases show S in the progress output and SKIP in the test matrix. They do not count as failures.

Parallel Execution

By default, cases run sequentially. Use --parallel to run multiple cases concurrently using spatie/fork (requires the pcntl extension):

# Run 4 cases in parallel
php artisan agent-evaluation --parallel=4

Each case runs in its own forked process with a fresh evaluation instance — no shared state. Results are persisted to storage by the parent process after each child completes.

You can set a default in config/ai-agent-evaluation.php:

'parallel' => 4,

The --parallel CLI option overrides the config value. Set to 1 for sequential execution.

Command Options

# Run all evaluations
php artisan agent-evaluation

# Filter by evaluation name
php artisan agent-evaluation --filter=SalesCoach

# Filter by variant label
php artisan agent-evaluation --variant=openai

# Run cases in parallel (requires pcntl)
php artisan agent-evaluation --parallel=4

# Resume an interrupted run (skip already-run cases)
php artisan agent-evaluation --resume

# Retry only errors from the latest run (API failures, timeouts)
php artisan agent-evaluation --retry-errors

Storage

Results are persisted to a SQLite database after each case completes — if the process is killed, all completed results are preserved. The database is stored at storage/ai-agent-evaluation/evaluations.sqlite by default.

// config/ai-agent-evaluation.php
'storage' => [
    'driver' => 'sqlite', // sqlite, file
    'path' => storage_path('ai-agent-evaluation'),
],

The package uses raw PDO with WAL mode — no Laravel database connection required. Safe for concurrent writes.
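
For reference, enabling WAL on a raw PDO connection looks like the following (a sketch of the mechanism, not the package's actual code):

$pdo = new PDO('sqlite:' . storage_path('ai-agent-evaluation/evaluations.sqlite'));
$pdo->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
$pdo->exec('PRAGMA journal_mode=WAL;'); // readers and the writer no longer block each other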

Resume and Retry

--resume — Loads the latest run from storage and skips all case+variant combos that already have results. Only missing cases are executed. Useful when a run was interrupted.

--retry-errors — Like resume, but re-executes cases with error status (API failures, timeouts). Passed/failed/skipped results are kept as-is.

Each result has a status: passed, failed, skipped, or error. Assertion failures are failed; exceptions are error.

Results

Console output includes up to three tables:

Test Matrix — each case × each variant with PASS/FAIL/ERROR/SKIP, latency, and token count.

Variant Summary — aggregated pass rate, average latency, total tokens, and cost per variant.

Metrics (when assertions use metric:) — per-metric score breakdown by variant.

Testing

composer test

Credits

License

The MIT License (MIT). Please see License File for more information.