Had a conversation last week with a growth-stage company that had just signed a six-figure annual contract with an AI agent vendor. Six months in, the agent was handling maybe 15% of what they'd been promised. The demo had been flawless. The production reality was something else entirely.
They'd made the same mistake most companies make: they evaluated the agent on the vendor's terms, not their own. Demo environments are curated. Production isn't. The gap between those two things is where most AI agent investments quietly fail — not with a bang, but with a steady drip of underperformance that nobody has a framework to measure.
That's a solvable problem. You just need to know what questions to ask before you sign.
1 Why your demo is lying to you
Vendor demos have a specific structure: clean inputs, happy-path outputs, the best possible version of the agent. They show you the 20% that works brilliantly. They do not show you the 80% that determines whether the investment pays off.
The problems you'll face in production are almost never in the demo:
Dirty inputs. Real-world data is inconsistent, malformed, and full of things the agent wasn't trained to handle. Demo data is always clean. Ask the vendor: what happens when the input has missing fields, duplicate entries, or formatting that doesn't match their expected schema?
Edge cases. Vendors pick demo scenarios that show the agent at its best. Ask: what percentage of your production traffic falls outside the happy path? What does your error rate look like for inputs that are ambiguous, contradictory, or outside the training distribution?
Rate limits. The demo runs on a low-traffic environment. In production with real volume, rate limits hit. Ask: what happens to in-flight tasks when the API hits a limit? Does it retry with backoff, fail silently, or queue indefinitely?
Timeout behavior. Long-running tasks are where agents fail silently. Ask: what happens if a task takes longer than expected? Is there a timeout? What gets logged? What does the user see?
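The rate-limit and timeout questions above have a concrete shape worth pinning down before the vendor call. Here is a minimal sketch of the behavior you want confirmed: retries with exponential backoff plus jitter, and an explicit per-task deadline that fails loudly instead of hanging. `call_agent` and `RateLimitError` are hypothetical stand-ins for whatever the vendor's SDK actually exposes.

```python
import random
import time

class RateLimitError(Exception):
    """Hypothetical stand-in for a vendor SDK's rate-limit exception."""

def run_task_with_backoff(call_agent, payload, max_retries=5, timeout_s=120):
    """Retry rate-limited calls with exponential backoff; fail loudly past the deadline."""
    deadline = time.monotonic() + timeout_s
    for attempt in range(max_retries):
        try:
            return call_agent(payload)
        except RateLimitError:
            # Exponential backoff with jitter: ~1s, ~2s, ~4s, ...
            sleep_s = min(2 ** attempt + random.random(), deadline - time.monotonic())
            if sleep_s <= 0:
                raise TimeoutError(f"task exceeded {timeout_s}s after {attempt + 1} attempts")
            time.sleep(sleep_s)
    raise RuntimeError(f"task still rate-limited after {max_retries} attempts")
```

If the vendor can't tell you which of these branches their agent takes, assume it's the one where the task hangs.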
The demo tells you the ceiling. You need to understand the floor before you commit.
2 The 5 dimensions that actually matter
When I'm evaluating an agent for a client, I score it across five dimensions. Vendor demos optimize for none of them — which is exactly why you need to ask the questions yourself.
1. Reliability (completion rate)
What percentage of tasks does the agent complete end-to-end without human intervention? A 95% completion rate sounds great until you realize that the 5% of failures are your highest-stakes tasks — which is usually how it works. Ask for the actual completion rate on tasks similar to yours, not on the vendor's benchmark dataset.
2. Latency (task duration)
What's the 95th-percentile task duration, not the median? Medians hide tail failures. If your workflow has SLA requirements, the 95th percentile matters more than the average. Also ask: how does latency scale as load increases? Does the agent parallelize, or does everything queue through a single processing thread?
3. Cost per task
What's the actual cost to process a single task at your expected volume, including retries, error handling, and API overhead? Vendors quote list price. Real cost per task includes: token consumption at your volume tier, retry overhead on failures, and any additional infrastructure cost for monitoring or human-in-the-loop correction. Get a cost model, not a price sheet.
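A workable cost model can be a few lines. The sketch below folds retries and per-task overhead into one effective number; all figures in the example are illustrative assumptions, not vendor pricing.

```python
def cost_per_task(tokens_per_task, price_per_1k_tokens, retry_rate, overhead_per_task):
    """Effective cost of one completed task, including retries and overhead.

    retry_rate: fraction of tasks needing one retry (e.g. 0.08 = 8%).
    overhead_per_task: monitoring / human-review cost amortized per task, in dollars.
    """
    base = tokens_per_task / 1000 * price_per_1k_tokens
    return base * (1 + retry_rate) + overhead_per_task

# Illustrative numbers only: 6k tokens/task at $0.01/1k, 8% retries, $0.02 overhead
print(round(cost_per_task(6000, 0.01, 0.08, 0.02), 4))  # 0.0848
```

The point of the exercise is the structure, not the numbers: a vendor who can't fill in these four inputs for your volume tier doesn't have a cost model either.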
4. Error recovery
When the agent encounters something it can't handle, what happens? Good agents fail gracefully and log meaningfully. Bad agents fail silently and produce output that looks correct until someone notices three days later. Ask specifically: what error states can occur, how are they surfaced, and what's the recovery path for each?
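It helps to have a mental model of what "fails gracefully and logs meaningfully" looks like before asking the vendor. A hypothetical sketch: every failure is surfaced with context and a retryability flag, and no path returns output that merely looks correct.

```python
import logging

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("agent")

def process(task, handler):
    """Wrap a task so every failure is surfaced with context, never swallowed."""
    try:
        return {"status": "ok", "result": handler(task)}
    except ValueError as e:
        # Known, recoverable error state: log enough context to retry later
        log.warning("task %s rejected: %s", task["id"], e)
        return {"status": "rejected", "reason": str(e), "retryable": True}
    except Exception as e:
        # Unknown failure: flag it loudly rather than emit plausible-looking output
        log.error("task %s failed unexpectedly: %s", task["id"], e)
        return {"status": "failed", "reason": str(e), "retryable": False}
```

Ask the vendor to enumerate their equivalents of these three branches. If they can only describe the first one, that's your answer.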
5. Integration depth
How well does the agent connect to your existing systems? Most vendors show their API works. What they don't show is: how idempotent are the operations if a task runs twice? What's the rollback story if an integration call partially succeeds? How does the agent handle auth token expiration mid-task? Integration depth is where most agent deployments quietly break in production.
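One way to probe integration depth is to ask whether the agent's write operations accept an idempotency key, so a retried task can't apply twice. A minimal sketch of the pattern, with an in-memory dict standing in for what would be durable storage in production:

```python
_applied = {}  # in production this would be durable storage, not a dict

def apply_once(idempotency_key, operation, *args):
    """Run operation at most once per key; repeated calls return the cached result."""
    if idempotency_key in _applied:
        return _applied[idempotency_key]
    result = operation(*args)
    _applied[idempotency_key] = result
    return result

# Running the same task twice charges the account only once
balance = {"amount": 100}
def charge(amount):
    balance["amount"] -= amount
    return balance["amount"]

apply_once("task-42", charge, 30)
apply_once("task-42", charge, 30)  # no-op: returns the cached result
print(balance["amount"])  # 70
```

If the vendor's answer to "what happens when a task runs twice?" doesn't include something like this, budget for building it yourself.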
3 The internal scoring rubric
When I'm working with teams through vendor evaluation, I use a simplified rubric they can run internally without needing a PhD in ML. It distills the evaluation into a single score that makes procurement conversations straightforward.
| Dimension | Score: 2 (Good) | Score: 1 (OK) | Score: 0 (Fail) |
|---|---|---|---|
| Reliability | >90% completion, no human review needed | 75–90% completion, occasional review | <75%, or failure mode unknown |
| Latency | P95 within SLA; scales with load | P95 acceptable, scaling unclear | P95 exceeds SLA; no parallelization |
| Cost per task | At or below estimated human cost | Above human cost, but manageable | Cost model unavailable or prohibitively high |
| Error recovery | Graceful failure, logs explain why, retry with backoff | Fails with notification, partial retry logic | Silent failures; no visibility into error state |
| Integration depth | Idempotent ops, rollback story, auth handled | Works on happy path; edge cases risky | Requires custom middleware or significant workarounds |
Score each dimension 0–2. Maximum score is 10. Any vendor scoring below 6 should be rejected regardless of how impressive the demo looked. A score of 8–10 means the vendor has actually done the production engineering that makes agents reliable at scale.
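The rubric is simple enough to run in a spreadsheet, but a small helper keeps the decision rule explicit. The dimension names and thresholds below match the table above; everything else is a sketch.

```python
def evaluate_vendor(scores):
    """scores: dict of dimension -> 0, 1, or 2. Returns (total, verdict)."""
    dimensions = {"reliability", "latency", "cost", "error_recovery", "integration"}
    assert set(scores) == dimensions and all(s in (0, 1, 2) for s in scores.values())
    total = sum(scores.values())
    if total < 6:
        verdict = "reject"                        # below 6: reject regardless of demo
    elif total >= 8:
        verdict = "strong candidate"              # 8-10: production engineering is real
    else:
        verdict = "negotiate on weak dimensions"  # 6-7: usable leverage in negotiation
    return total, verdict

print(evaluate_vendor({"reliability": 2, "latency": 1, "cost": 2,
                       "error_recovery": 1, "integration": 1}))
# (7, 'negotiate on weak dimensions')
```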
The rubric also works as a negotiation tool. If a vendor scores a 1 on error recovery, ask specifically what their error-handling roadmap looks like. A vendor with a 0 or 1 on a critical dimension may be willing to improve it — but only if you ask, and only if you have the score to justify the request.
4 When to build, when to buy, when to orchestrate
Not every agent problem requires buying an agent. The evaluation framework should also answer the build-vs-buy question before you spend cycles on vendor comparison.
Buy when: the task is commoditized, the vendor has proven production depth, and your volume is high enough to justify the contract. Commoditized tasks — document ingestion, standard language model operations, basic data classification — are better bought than built. The tooling is mature enough that you're not buying a capability advantage; you're buying time.
Build when: the task is differentiated and the agent is a core part of your competitive advantage. A customer-facing support agent that has access to your proprietary knowledge base is not a commodity. Neither is a code-review agent trained on your specific codebase patterns. These tasks require custom development because the training data and domain logic are unique to you.
Orchestrate when: no single agent handles your workflow end-to-end. This is the pattern I see most often in production: a sequence of agents handling different stages of a process, with humans in the loop at the critical handoff points. Orchestration doesn't mean buying one vendor's agent stack. It means designing a workflow where specialized agents handle specialized tasks, and the orchestration layer manages the state between them.
The worst outcome is buying a monolithic agent platform and trying to force every task through it because that's what you paid for. The evaluation should tell you whether the vendor's agent can handle your specific workflows, not whether it can handle a general category of tasks that happens to include yours.
Most companies don't have a procurement framework for AI agents. They evaluate on demo quality, which is exactly backwards. The demo tells you what the vendor wants you to see. The five dimensions tell you what production will actually look like.
Run the rubric before you sign. It's a week of work that might save you from a six-figure contract that doesn't deliver.
Need help running the evaluation? I work with teams on exactly this — vendor assessment, scoring framework, and the build-vs-buy-vs-orchestrate decision. The conversation starts at the link below.
Related Reading
More on evaluating and deploying AI agents in production:
→ The AI Agent Evaluation Gap — Why most teams ship blind, and how eval infrastructure changes the picture.
→ Compound AI Systems — Why single-agent architectures hit walls and what orchestration looks like in practice.
→ The ROI of AI Agents — What to actually measure once you've deployed — and how to know if it's working.
→ Why Most AI Agent Projects Fail — The three things production teams do differently from day one.