Testing AI with AI
How we built an evaluation framework that uses LLM-as-judge to score our agents and catch quality regressions.

January 2026. Dashboard v2 had 40+ agents, nine tool types, and a skills marketplace. Users were having conversations that branched into web searches, code execution, file management, and background task delegation. The question we couldn’t answer: were the agents getting better or worse?
You can’t unit test an AI agent the way you test a function. Given the same input, calculateTax(100, 0.08) always returns 8. Given the same prompt, an AI agent returns different text every time. The output is nondeterministic. Traditional assertions don’t work. So we built a system where a smarter model judges a working model.
The problem with testing nondeterminism
Consider a simple scenario: a user asks our analytics agent to explain their GA4 traffic data. A good response identifies trends, mentions specific numbers from the data, and suggests actionable next steps. A bad response is vague, misses key data points, or hallucinates metrics that don’t exist.
How do you write an automated test for that?
You can’t assert on exact string matches – the wording changes every run. You can’t assert on response length – a long response isn’t necessarily better. You can check for the presence of certain keywords, but that’s a proxy for quality, not quality itself. “The response mentions ‘bounce rate’” tells you nothing about whether the bounce rate analysis was correct.
We needed a judge that could evaluate semantic quality, not just string patterns. We needed an AI to test the AI.
Architecture of the eval framework
The evaluation framework has three layers: scenarios, execution, and scoring.
Scenarios define what to test. Each scenario specifies an agent, an input prompt, context data (like a simulated GA4 dataset), and a scoring rubric. The rubric is the critical piece – it tells the judge exactly what constitutes a good response for this specific scenario.
A GA4 analytics scenario might define its rubric as: “The response must (1) identify the top three traffic sources by session count, (2) note any week-over-week changes greater than 10%, (3) suggest at least one concrete action based on the data, (4) not reference any metrics not present in the provided dataset.”
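In sketch form, a scenario is just data: an agent name, a prompt, context, and a rubric. The TypeScript below is illustrative – the interface, field names, and fixture values are simplified stand-ins, not the production definitions.

```typescript
// Illustrative sketch of a scenario definition. The interface, field
// names, and fixture values are assumptions, not the framework's
// actual types.
interface EvalScenario {
  id: string;
  agent: string;      // which agent the framework exercises
  prompt: string;     // the user message sent through the real API
  context: unknown;   // simulated data handed to the agent as context
  rubric: string[];   // criteria the judge scores the response against
}

// Hypothetical simulated GA4 dataset; the real fixture is richer.
const ga4TrafficFixture = {
  week: '2026-01-05',
  sessionsBySource: { organic: 4210, direct: 1830, referral: 640 },
  weekOverWeekChange: { organic: 0.14, direct: -0.03, referral: 0.22 },
};

const ga4AnalyticsScenario: EvalScenario = {
  id: 'ga4-analytics-basic',
  agent: 'analytics',
  prompt: 'Explain what happened to my GA4 traffic last week.',
  context: ga4TrafficFixture,
  rubric: [
    'Identifies the top three traffic sources by session count',
    'Notes any week-over-week changes greater than 10%',
    'Suggests at least one concrete action based on the data',
    'Does not reference metrics absent from the provided dataset',
  ],
};
```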
Execution runs the scenario against the actual agent API. The framework sends the prompt to the agent through the same REST endpoints that real users hit. No mocks, no shortcuts. The agent processes the request, calls whatever tools it needs (web search, code execution, etc.), and returns a response. That response is captured for scoring.
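Building on the scenario sketch above, execution boils down to an HTTP call against the agent API. The endpoint path and payload shape here are assumptions for illustration.

```typescript
// Sketch of the execution layer: run a scenario against the live agent
// API, exactly as a real client would. Endpoint path, payload shape,
// and response field are illustrative assumptions.
async function runScenario(scenario: EvalScenario, baseUrl: string): Promise<string> {
  const res = await fetch(`${baseUrl}/api/agents/${scenario.agent}/chat`, {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({
      message: scenario.prompt,
      context: scenario.context, // e.g. the simulated GA4 dataset
    }),
  });
  if (!res.ok) {
    throw new Error(`Agent call failed: ${res.status} ${res.statusText}`);
  }
  const data = await res.json();
  return data.response as string; // captured for the scoring step
}
```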
Scoring sends the agent’s response to Claude Sonnet 4.5 via OpenRouter, along with the original prompt, the context data, and the scoring rubric. The judge model evaluates the response against the rubric and returns a structured score with explanations for each criterion.
We chose Claude Sonnet 4.5 as the judge because it’s consistently strong at analytical evaluation. OpenRouter as the provider gave us fallback routing and cost tracking through our existing Langfuse pipeline.
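Continuing the sketch, scoring is one more API call, this time to the judge through OpenRouter's OpenAI-compatible chat completions endpoint. The model slug, judging prompt, and response parsing below are simplified assumptions.

```typescript
// Sketch of the scoring layer: ask the judge model to grade a response
// against the rubric and return structured JSON per criterion.
interface CriterionScore {
  criterion: string;
  pass: boolean;
  explanation: string;
}

async function judgeResponse(
  scenario: EvalScenario,
  agentResponse: string,
  apiKey: string,
): Promise<CriterionScore[]> {
  const res = await fetch('https://openrouter.ai/api/v1/chat/completions', {
    method: 'POST',
    headers: {
      Authorization: `Bearer ${apiKey}`,
      'Content-Type': 'application/json',
    },
    body: JSON.stringify({
      model: 'anthropic/claude-sonnet-4.5', // assumed OpenRouter slug for the judge
      messages: [
        {
          role: 'user',
          content: [
            'You are grading an AI agent response against a rubric.',
            `Original prompt: ${scenario.prompt}`,
            `Context data: ${JSON.stringify(scenario.context)}`,
            `Agent response: ${agentResponse}`,
            `Rubric criteria:\n${scenario.rubric.map((c, i) => `${i + 1}. ${c}`).join('\n')}`,
            'Respond with only a JSON array of {"criterion", "pass", "explanation"} objects.',
          ].join('\n\n'),
        },
      ],
    }),
  });
  const data = await res.json();
  // Assumes the judge returns bare JSON; production parsing needs to be more robust.
  return JSON.parse(data.choices[0].message.content) as CriterionScore[];
}
```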
The five eval scenarios
We started with five scenarios, chosen to cover different agent capabilities:
GA4 analytics evaluation. The analytics agent receives a simulated GA4 dataset and must produce accurate, actionable analysis. This scenario caught an interesting problem early: the agent generated so much detailed analysis that it exceeded output token limits. The fix – incremental script guidance – is covered in the findings below.
Repository analysis evaluation. A technical agent receives a codebase summary and must identify architectural patterns, potential issues, and improvement opportunities. The rubric checks for accuracy (don’t invent dependencies that aren’t there) and specificity (don’t just say “consider adding tests” – identify which modules lack coverage).
VFS file sync evaluation. The file system agent creates, modifies, and deletes files in the virtual file system. The eval verifies that file operations complete correctly, that version history is maintained, and that the agent handles edge cases like duplicate filenames and path conflicts.
Intent detection evaluation. Given an ambiguous user message, the routing system must correctly identify which agent should handle it and which tools are likely needed. This eval doesn’t test agent output quality – it tests the classification layer that sits in front of agents.
Skill delegation evaluation. A compound scenario where the agent must recognize that a task requires a specific skill, delegate to it, and incorporate the output into its response. This tests the full chain: intent detection, skill matching, delegation, and integration.
The ci/env split
Not all evals can run on every pull request. The VFS file sync eval needs a running E2B sandbox. The repository analysis eval needs network access to clone repositories. The GA4 eval needs a simulated analytics dataset that takes 30 seconds to generate.
Running all evals on every PR would add 5-10 minutes to the CI pipeline. For a team shipping two releases per day, that’s not acceptable.
We split evals into two groups:
ci group evals are fast, self-contained, and run on every pull request in GitHub Actions. Intent detection and skill delegation evals fall here – they’re quick, don’t need external services, and catch the most common regressions (prompt changes that break routing logic).
env group evals require sandbox environments, network access, or expensive model calls. These run nightly on a schedule. GA4 analytics, repository analysis, and VFS file sync evals run here. Failures trigger alerts, not PR blocks.
The split means PRs stay fast (ci evals add about 90 seconds) while comprehensive evaluation still happens daily. If a nightly eval fails, the team investigates in the morning before the first release of the day.
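In code, the split can be as simple as a group tag per scenario that the runner filters on. The mapping below mirrors the groups described above; the runner shape and environment variable are assumptions.

```typescript
// Sketch of the ci/env split. Group assignments mirror the article;
// the scenario ids, runner, and EVAL_GROUP variable are illustrative.
type EvalGroup = 'ci' | 'env';

const scenarioGroups: Record<string, EvalGroup> = {
  'intent-detection': 'ci',     // fast, self-contained
  'skill-delegation': 'ci',
  'ga4-analytics-basic': 'env', // needs a generated dataset
  'repo-analysis': 'env',       // needs network access
  'vfs-file-sync': 'env',       // needs an E2B sandbox
};

// GitHub Actions would set EVAL_GROUP=ci on pull requests;
// the nightly schedule would set EVAL_GROUP=env.
const group = (process.env.EVAL_GROUP ?? 'ci') as EvalGroup;

const selectedIds = Object.entries(scenarioGroups)
  .filter(([, g]) => g === group)
  .map(([id]) => id);

console.log(`Running ${group} evals:`, selectedIds);
```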
The GitHub Pages dashboard
Eval results are only useful if they’re visible. A JSON file in a CI artifact that nobody downloads isn’t observability – it’s an obligation.
We built an eval-reports pipeline that publishes results to GitHub Pages. Every eval run produces a report with:
- Pass/fail status per scenario
- Judge scores per rubric criterion
- Score trends over time (is this scenario getting better or worse?)
- Full judge explanations for failed criteria
- Agent response text for manual review
The dashboard is static HTML generated from eval results. No backend, no database, no hosting costs beyond GitHub Pages. The trend charts show whether a change two weeks ago caused a gradual quality decline that individual evals didn’t flag.
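A rough sketch of the report step, assuming a minimal per-run result record – the real report also carries the per-criterion scores and judge explanations listed above:

```typescript
// Sketch of static report generation for GitHub Pages: fold run results
// into table rows and emit plain HTML. The result shape, output path,
// and loadHistory helper are illustrative assumptions.
import { writeFileSync } from 'node:fs';

interface RunResult {
  date: string;       // ISO date of the eval run
  scenarioId: string;
  passed: number;     // rubric criteria passed
  total: number;      // rubric criteria evaluated
}

function renderReport(history: RunResult[]): string {
  const rows = history
    .map((r) =>
      `<tr><td>${r.date}</td><td>${r.scenarioId}</td>` +
      `<td>${((r.passed / r.total) * 100).toFixed(0)}%</td></tr>`)
    .join('\n');
  return `<html><body><h1>Eval results</h1>
    <table><tr><th>Date</th><th>Scenario</th><th>Score</th></tr>${rows}</table>
    </body></html>`;
}

// Hypothetical usage: writeFileSync('docs/index.html', renderReport(loadHistory()));
```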
What the evals caught
The evaluation framework justified its existence within the first week.
Background task loop detection. An eval scenario revealed that when an agent delegated a task to a background worker, and the worker’s output triggered the agent to delegate again, the system entered an infinite loop. Each iteration accumulated token costs. The judge noted that the response referenced “delegating to background task” multiple times without a final answer. We added auto-retry loop detection as a direct result.
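One way to build that kind of guard, sketched below: cap how many times an agent may delegate within a single request and force a final answer once the cap is hit. The class and threshold are illustrative, not the production detector.

```typescript
// Illustrative delegation guard: bound background-task delegations per
// request so a delegate -> worker -> delegate cycle cannot run forever.
// The threshold and names are assumptions.
const MAX_DELEGATIONS_PER_REQUEST = 3;

class DelegationGuard {
  private count = 0;

  // Called each time the agent asks to spawn a background task.
  // Returns false once the cap is reached, forcing a final answer.
  allowDelegation(): boolean {
    this.count += 1;
    return this.count <= MAX_DELEGATIONS_PER_REQUEST;
  }
}
```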
Output token overflow. The GA4 analytics eval consistently scored poorly on completeness because the agent tried to analyze the entire dataset in one pass, hit the output token limit, and returned truncated analysis. The fix was incremental script guidance – system prompt additions that told the agent to analyze data in focused passes rather than all at once.
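To make "incremental script guidance" concrete, here is a hypothetical example of what such a system prompt addition can look like – the wording is illustrative, not the actual prompt:

```typescript
// Hypothetical incremental-pass guidance appended to the system prompt;
// not the production wording.
const INCREMENTAL_ANALYSIS_GUIDANCE = `
When analyzing a dataset, work in focused passes:
1. First pass: top-level traffic sources and totals only.
2. Second pass: week-over-week changes above 10%.
3. Final pass: recommendations, referencing only metrics already covered.
Never attempt to cover the entire dataset in a single response.
`;
```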
Skill delegation confusion. The intent detection eval revealed that certain phrasings caused the router to delegate to the wrong skill. “Help me write a blog post about data analysis” was being routed to the analytics agent instead of the writing agent because “data analysis” triggered the analytics classifier. Adjusting the intent detection prompts fixed the routing.
The philosophy
There’s a deeper principle behind LLM-as-judge evaluation: if you can’t measure agent quality, you can’t improve it.
Traditional software has deterministic tests as a quality baseline. You know the code works because the tests pass. AI agents don’t have that luxury. Without evals, quality assessment is vibes-based – someone manually reads agent responses and decides whether they feel right. That doesn’t scale, it doesn’t catch regressions, and it doesn’t provide data for improvement.
The eval framework gives us a quality signal. Not a perfect signal – the judge has its own biases and blind spots. But a consistent, automated, trend-trackable signal that turns agent quality from a subjective impression into a measurable metric.
Limitations and what’s next
The framework has real limitations. The judge model costs money – every eval run burns tokens on Claude Sonnet 4.5. The rubrics require human maintenance – when a feature changes, someone has to update the eval rubric. And the judge can be fooled by responses that are eloquently wrong – well-structured nonsense scores higher than poorly structured truth.
We’re exploring two directions: adversarial evals where the scenario deliberately tries to make the agent fail (prompt injection, contradictory instructions, impossible requests), and user-feedback-calibrated scoring where we compare judge scores against actual user satisfaction signals.
The lesson from building this framework is straightforward: testing AI with AI isn’t a hack. It’s the only scalable approach to quality assurance for nondeterministic systems. The alternative is hoping your agents work. Hope isn’t an engineering strategy.
Alexey Suvorov
CTO, AIWAYZ
10+ years in software engineering. CTO at Bewize and Fulldive. Master's in IT Security from ITMO University. Builds AI systems that run 100+ microservices with small teams.
LinkedIn