A perfect eval score means nothing if your agent fails in production.
I've seen this play out dozens of times. A team builds an AI system, runs it through their evaluation suite, gets 95% accuracy, deploys it—and watches it fail spectacularly in the real world. The evals were measuring the wrong things.
The problem isn't that evaluations are useless. It's that most teams build evals as an afterthought, measuring whatever's easiest to quantify rather than what actually matters for their specific use case.
Why Standard Evals Fail
The default approach is straightforward: create a test set, run your model against it, measure accuracy or F1 score. It's clean. It's quantifiable. It's also almost always wrong.
Here's why. Standard evals optimize for average-case performance. But production fails on edge cases. Your eval might cover a hundred typical requests and still miss the 1% of unusual inputs that break everything.
Worse, most evals measure isolated capability rather than system behavior. You test if your model can extract a phone number from text. You don't test if it extracts the right phone number when there are five numbers in the document. You don't test what happens when it's uncertain. You don't test how it interacts with the rest of your system.
I built a document classification agent that scored 92% on my test set. In production, it was failing 30% of the time. Why? Because my eval tested clean PDFs. Real documents came in as images, scanned files, mixed formats. The eval never saw those cases.
What Actually Matters in Production
Before you write a single test case, you need to define what "working" means for your specific system.
This isn't about accuracy. It's about business impact. Different systems have completely different failure modes that matter.
For a content moderation agent, false negatives (missing bad content) might be catastrophic while false positives (flagging good content) are annoying. Your evals should weight these differently.
For a customer support agent, response time might matter as much as correctness. A perfect answer in 30 seconds is useless if your customer needed it in 2 seconds.
For a data extraction agent, partial correctness might be fine—you can fix 80% of the data faster than you can fix 0%. But for a financial compliance agent, 80% is worthless.
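One lightweight way to encode that asymmetry in a metric is an F-beta score, which trades recall off against precision. A minimal sketch in TypeScript (the beta value and the numbers here are illustrative, not from any particular system):

  // Weighted harmonic mean of precision and recall.
  // beta > 1 favors recall (missing bad content hurts more);
  // beta < 1 favors precision (false flags hurt more).
  function fBeta(precision: number, recall: number, beta: number): number {
    const b2 = beta * beta;
    const denom = b2 * precision + recall;
    return denom === 0 ? 0 : ((1 + b2) * precision * recall) / denom;
  }

  // Content moderation: weight recall twice as heavily as precision.
  const moderationScore = fBeta(0.92, 0.78, 2);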
Define this first. Write it down. Then build evals that measure it.
Building Evals That Predict Production
Here's the framework I use:
1. Identify Failure Modes
What are the ways this system can fail? Not theoretically—what actually breaks in production?
For an agent I built that handles customer scheduling:
- Missing availability windows (customer gets wrong time)
- Booking outside business hours
- Double-booking the same slot
- Failing to handle special requests (accessibility needs, language preferences)
- Not escalating when uncertain
Each of these has different consequences. Missing availability is annoying. Double-booking is a disaster. Not escalating when uncertain is preventable.
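It helps to make this list executable rather than leaving it in a doc: a typed catalog the eval harness can iterate over. A minimal sketch for the scheduling agent (the ids and severity labels are illustrative):

  // Hypothetical catalog of failure modes for the scheduling agent.
  type Severity = "critical" | "severe" | "minor" | "acceptable";

  interface FailureMode {
    id: string;
    description: string;
    severity: Severity;
  }

  const failureModes: FailureMode[] = [
    { id: "missing-availability", description: "Customer gets the wrong time", severity: "minor" },
    { id: "outside-business-hours", description: "Booking outside business hours", severity: "severe" },
    { id: "double-booking", description: "Same slot booked twice", severity: "critical" },
    { id: "missed-special-request", description: "Accessibility or language needs ignored", severity: "severe" },
    { id: "no-escalation", description: "Did not escalate when uncertain", severity: "severe" },
  ];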
2. Weight Failures by Impact
Not all failures are equal. Create a simple scoring system.
- Critical (blocks entire feature): -100 points
- Severe (requires manual fix): -10 points
- Minor (user notices, annoying): -1 point
- Acceptable (edge case, low frequency): 0 points
Your eval should reflect this. A system that gets 99% accuracy but fails catastrophically 1% of the time should score lower than a system that gets 95% accuracy but fails gracefully.
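Concretely, building on the failure-mode catalog sketched in step 1, a weighted score might look like this (a sketch; the point values mirror the scale above, everything else is illustrative):

  // Points per severity level, mirroring the scale above.
  const severityPoints: Record<Severity, number> = {
    critical: -100,
    severe: -10,
    minor: -1,
    acceptable: 0,
  };

  // failuresByMode: number of failed test cases per failure-mode id.
  function weightedScore(failuresByMode: Map<string, number>): number {
    let score = 0;
    for (const mode of failureModes) {
      const count = failuresByMode.get(mode.id) ?? 0;
      score += count * severityPoints[mode.severity];
    }
    return score;
  }

  // One catastrophic failure outweighs several graceful ones:
  // a single double-booking scores -100; five missed availability windows score -5.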
3. Build Representative Test Sets
This is where most teams fail. They create test data that's too clean.
Your eval should include:
- Happy path cases (20% of tests) - Standard requests that should work
- Edge cases (40% of tests) - Unusual but realistic inputs
- Adversarial cases (20% of tests) - Deliberately tricky inputs
- Boundary cases (20% of tests) - Inputs at the limits of what your system handles
For the scheduling agent, this meant including:
- Requests with typos and grammatical errors
- Requests in different formats ("next Tuesday at 2pm" vs "2026-01-14 14:00")
- Requests with conflicting information
- Requests outside business hours
- Requests in other languages
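One way to keep the mix honest is to tag each test case with a category and check the distribution programmatically. A minimal sketch using the percentages above (the helper and its tolerance are illustrative):

  type CaseCategory = "happy-path" | "edge" | "adversarial" | "boundary";

  interface TestCase {
    input: string;
    expected: string;
    category: CaseCategory;
  }

  // Target mix from above; warn if the suite drifts toward happy paths.
  const targetMix: Record<CaseCategory, number> = {
    "happy-path": 0.2,
    edge: 0.4,
    adversarial: 0.2,
    boundary: 0.2,
  };

  function checkMix(cases: TestCase[]): void {
    for (const cat of Object.keys(targetMix) as CaseCategory[]) {
      const actual = cases.filter((c) => c.category === cat).length / cases.length;
      if (actual < targetMix[cat] - 0.05) {
        console.warn(`Under-represented: ${cat} is ${(actual * 100).toFixed(0)}%, target ${targetMix[cat] * 100}%`);
      }
    }
  }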
4. Test System Behavior, Not Just Output
Most evals test the model in isolation. Real systems have context.
Your eval should test:
- Integration behavior - Does your agent work with your actual tools and APIs?
- Error recovery - What happens when a tool fails?
- Escalation - Does it escalate appropriately when uncertain?
- State management - Does it handle multi-turn conversations correctly?
- Latency - Does it respond fast enough?
I built a test harness that runs agents against real (sandboxed) APIs rather than mocked responses. This caught integration bugs that never showed up in unit tests.
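The harness doesn't need to be elaborate. A stripped-down sketch of the idea, reusing the TestCase shape from step 3 and assuming a hypothetical agent function and sandboxed booking API (none of these signatures come from a real library):

  // Hypothetical sandbox API and agent signatures.
  interface SandboxApi {
    findSlots(date: string): Promise<string[]>;
    book(slot: string): Promise<{ ok: boolean }>;
  }

  interface AgentResult {
    reply: string;
    escalated: boolean;
    bookedSlot?: string;
  }

  type Agent = (input: string, api: SandboxApi) => Promise<AgentResult>;

  async function runCase(agent: Agent, api: SandboxApi, tc: TestCase) {
    const start = Date.now();
    try {
      const result = await agent(tc.input, api); // real (sandboxed) calls, not mocks
      return {
        latencyMs: Date.now() - start,
        escalated: result.escalated,
        output: result.bookedSlot ?? result.reply,
      };
    } catch (err) {
      // Tool/API failures are part of the eval, not a reason to skip the case.
      return { latencyMs: Date.now() - start, escalated: false, output: null, error: String(err) };
    }
  }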
5. Measure What Matters
Once you've defined failure modes and impact, measure the right metrics.
Don't just measure accuracy. Measure:
- Precision and recall by failure mode - How often do you miss critical failures vs. create false alarms?
- Cost of failures - What's the actual business impact?
- Confidence calibration - Does the model's confidence correlate with correctness?
- Performance distribution - Are failures concentrated in certain input types?
Concretely, that means tracking something like:

  interface EvalResult {
    overallScore: number;                 // Weighted score
    byFailureMode: {
      missingAvailability: { precision: number; recall: number };
      doubleBooking: { precision: number; recall: number };
      // ... etc
    };
    costOfFailures: number;               // Dollar impact
    confidenceCalibration: number;        // Does confidence match accuracy?
    performanceByInputType: Record<string, number>;
  }
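Confidence calibration is worth spelling out, because it's cheap to approximate with expected calibration error: bucket predictions by stated confidence and compare each bucket's average confidence to its actual accuracy. A minimal sketch:

  interface Prediction {
    confidence: number;  // model's stated confidence, 0..1
    correct: boolean;    // was the output actually right?
  }

  // Expected calibration error over `bins` equal-width confidence buckets.
  // 0 means confidence tracks accuracy; higher means over- or under-confidence.
  function expectedCalibrationError(preds: Prediction[], bins = 10): number {
    let ece = 0;
    for (let b = 0; b < bins; b++) {
      const lo = b / bins;
      const hi = (b + 1) / bins;
      const bucket = preds.filter(
        (p) => p.confidence >= lo && (p.confidence < hi || (b === bins - 1 && p.confidence <= 1))
      );
      if (bucket.length === 0) continue;
      const avgConf = bucket.reduce((s, p) => s + p.confidence, 0) / bucket.length;
      const accuracy = bucket.filter((p) => p.correct).length / bucket.length;
      ece += (bucket.length / preds.length) * Math.abs(avgConf - accuracy);
    }
    return ece;
  }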
The Iteration Loop
Here's what I've found works: tight feedback loops between evals and production.
Build your initial eval framework. Deploy. Monitor what actually fails. Add those failure cases to your eval. Iterate.
Your first eval will be wrong. That's fine. The goal is to get better at predicting production behavior.
I typically run this cycle every 2 weeks:
- Run eval suite
- Deploy to production
- Collect real failures
- Add 5-10 new test cases based on those failures
- Rerun eval suite
- Repeat
After a few cycles, your eval becomes a genuine predictor of production performance. The first eval catches maybe 40% of what will actually break. By cycle 5, it catches 85-90%.
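The mechanical part of that loop is small: every production failure becomes a regression case, deduplicated against what the suite already covers. A sketch, reusing the TestCase shape from step 3 (the ProductionFailure fields are illustrative, shaped by whatever your monitoring emits):

  interface ProductionFailure {
    input: string;
    expected: string;       // what a human says should have happened
    failureModeId: string;  // which failure mode it maps to
    seenAt: string;
  }

  // Append production failures to the suite, skipping inputs already covered.
  function addRegressionCases(suite: TestCase[], failures: ProductionFailure[]): TestCase[] {
    const known = new Set(suite.map((c) => c.input));
    const fresh = failures
      .filter((f) => !known.has(f.input))
      .map((f) => ({
        input: f.input,
        expected: f.expected,
        category: "edge" as const,  // most production surprises land here
      }));
    return [...suite, ...fresh];
  }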
The Hard Part: Honest Evals
The hardest part of building good evals isn't technical. It's honesty.
Most teams build evals they know their system will pass. They avoid edge cases. They test happy paths. They measure metrics that look good in presentations.
If your eval score is always going up but your production failures stay the same, your eval isn't measuring what matters.
I've started asking teams: "What would make you fail this eval? What edge cases are you avoiding?" If they can't articulate it, the eval is probably wrong.
Where to Start
If you're building AI systems right now:
- Stop measuring generic metrics. Accuracy means nothing without context.
- Define failure modes specific to your use case. What breaks? What's the cost?
- Build representative test sets. Include edge cases, not just happy paths.
- Test system behavior, not just model output. Integrate with real tools.
- Iterate based on production data. Your eval should get better every deployment.
The teams shipping reliable AI systems aren't the ones with the highest eval scores. They're the ones with eval frameworks that actually predict production behavior.
Read about building reliable AI tools for patterns that work in production.
Want to discuss how to build evals for your specific use case? Get in touch.