There's a question I keep coming back to every time someone shows me a new AI agent demo: How do you know it's actually good?
Not "does it run without crashing": that's a low bar. Not "does it sound confident": that's worse than useless as a quality signal. The real question is: do you have a structured, repeatable way to score it, identify what's failing, and know when an iteration has actually made it better rather than just differently wrong?
If you're building or evaluating a customer-facing AI service agent, at some point "let's have people review some conversations" stops being enough. Different reviewers reach different conclusions about the same transcript. Feedback is vague. There's no shared definition of what "good" looks like. This post is about a rubric-based approach that I think solves that problem: 3 tiers, 24 measures, a hard safety gate, and a triage system that tells you not just that something failed, but why and what to fix.
The core design decision: tiers, not a flat checklist
The most important structural decision is to organize your rubric into tiers with different consequences for failure, rather than treating all measures as equally important.
An agent that gives a slightly robotic response is a different problem from an agent that discloses sensitive customer information. Treating those the same way in a scoring system masks the things that actually matter most. The rubric needs to reflect that some failures are catastrophic and others are just rough edges.
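To make the tier logic concrete, here's a minimal sketch of a hard safety gate: any Tier 1 measure scoring below a pass mark fails the whole evaluation, no matter how strong the average is. The pass mark of 3 and the overall threshold of 4 are illustrative assumptions, not values from the rubric itself.

```python
# Sketch of tiered evaluation with a Tier 1 hard safety gate.
# The pass mark (3) and overall bar (4) are illustrative assumptions.
def evaluate(measures: list[dict]) -> str:
    # Safety gate: any Tier 1 failure fails the run outright.
    if any(m["score"] < 3 for m in measures if m["tier"] == 1):
        return "FAIL"  # no amount of averaging can rescue this
    avg = sum(m["score"] for m in measures) / len(measures)
    return "PASS" if avg >= 4 else "NEEDS WORK"
```

The point of the structure is that a catastrophic failure (say, disclosing sensitive data) can never be washed out by high scores on tone and fluency.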
The scoring scale
A 0–5 weighted scale works well here, where each measure's score contributes to an overall weighted percentage. But the meaning of each score point needs to be defined precisely upfront; otherwise reviewers drift toward their own interpretations and your data becomes noise.
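The weighted-percentage arithmetic is simple enough to show directly. In this sketch the measure names and weights are made up for illustration; the only thing it demonstrates is how per-measure 0–5 scores roll up into one percentage.

```python
# Illustrative roll-up of per-measure 0-5 scores into a weighted percentage.
# Measure names and weights below are hypothetical examples.
def weighted_score(scores: dict[str, int], weights: dict[str, float]) -> float:
    """Return the weighted percentage of the maximum possible score (5 per measure)."""
    total_weight = sum(weights[m] for m in scores)
    earned = sum(scores[m] * weights[m] for m in scores)
    return 100 * earned / (5 * total_weight)

scores = {"confidence_calibration": 4, "escalation_judgment": 3, "semantic_stability": 5}
weights = {"confidence_calibration": 3.0, "escalation_judgment": 2.0, "semantic_stability": 1.0}
overall = weighted_score(scores, weights)
```

Heavier weights on the measures that matter most (Tier 1 and 2) keep the headline number honest.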
The most important principle to embed: score based on observable evidence only. For each measure, find something specific in the transcript or tool logs that justifies the score. If you can't point to it, score conservatively. "It probably checked the account" is not evidence. This sounds obvious but it's harder to enforce than it looks, especially under time pressure.
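One way to make the evidence rule hard to skip is to bake it into the score record itself: a score simply can't be saved without pointing at something concrete. This is a hypothetical data structure, not part of the rubric; it just shows the principle enforced in code.

```python
from dataclasses import dataclass

# Hypothetical evidence-first score record: recording a score requires a
# concrete pointer into the transcript or tool logs.
@dataclass
class MeasureScore:
    measure: str
    score: int      # 0-5 per the rubric's scale
    evidence: str   # transcript excerpt or tool-log reference

    def __post_init__(self):
        if not 0 <= self.score <= 5:
            raise ValueError("score must be on the 0-5 scale")
        if not self.evidence.strip():
            # "It probably checked the account" is not evidence.
            raise ValueError("no observable evidence; score conservatively instead")
```

A reviewer who can't fill in the `evidence` field is forced back to the transcript, which is exactly the behavior you want under time pressure.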
The measures worth watching most closely
Rather than walk through all 24, here are the ones I'd pay the most attention to early: the measures that tend to reveal gaps you wouldn't otherwise catch.
Confidence Calibration (Tier 1). This is the measure that separates agents that are confidently wrong from ones that know what they don't know. A well-calibrated agent signals uncertainty, escalates instead of guessing, and avoids the authoritative-sounding incorrect answer. In practice, this is one of the harder behaviors to get right: models are trained to be helpful, and "I'm not sure, let me escalate" reads as unhelpful to the model even when it's exactly the right answer.
Escalation Judgment (Tier 2). When should the agent hand off to a human? Too early and it's annoying. Too late, or not at all when it should, and it creates risk. A good scoring guide here defines five distinct scenarios that require escalation (out-of-scope requests, repeated confusion, emotional distress, action requests, and risk flags) and scores each conversation against each one explicitly.
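The five escalation triggers above lend themselves to a simple checklist. In this sketch the flag names mirror the post, but how each flag gets detected in a conversation is assumed to happen upstream.

```python
# The five escalation scenarios from the scoring guide, as a checklist.
# Detecting each flag in a transcript is assumed to happen elsewhere.
ESCALATION_TRIGGERS = {
    "out_of_scope",
    "repeated_confusion",
    "emotional_distress",
    "action_request",
    "risk_flag",
}

def should_escalate(conversation_flags: set[str]) -> bool:
    """Any single trigger present means the agent should hand off."""
    return bool(conversation_flags & ESCALATION_TRIGGERS)
```

Scoring then becomes a comparison: did the agent escalate when `should_escalate` says it should have, and did it avoid escalating when it shouldn't?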
Issue Detection Accuracy (Tier 2). Can the agent identify the actual problem the customer has, not just the surface symptom they expressed? A customer saying "my balance looks wrong" might be experiencing a posting delay, a fraud situation, a calculation error, or a misunderstanding of how holds work. These require very different responses. This measure scores how well the agent distinguishes between them, and it consistently reveals gaps in the underlying knowledge base.
Semantic Stability (Tier 3). An agent that says different things about the same policy in different turns of the same conversation is a serious problem: it erodes trust and can lead to contradictory guidance. This one is easy to miss in informal reviews but shows up clearly in structured scoring, and turns out to be a reliable indicator of prompt design quality.
A process for using it consistently, in 10 steps
The rubric is only useful if it's applied consistently. Here's the process I'd structure around it:
Deployment thresholds
One of the most practically useful additions is setting explicit go/no-go thresholds before testing starts, not after you've seen the scores. Without this, you end up in "but it's mostly good" conversations where everyone has a different idea of what "ready" means. Agreeing upfront removes that ambiguity.
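What an agreed-upon gate might look like in code, with the caveat that the specific numbers here (an 85% overall bar and zero Tier 1 failures) are example values you'd pick for your own context, not thresholds the rubric prescribes:

```python
# Illustrative go/no-go gate, agreed before testing starts.
# The 85% bar and zero-Tier-1-failures rule are example values only.
def deployment_decision(overall_pct: float, tier1_failures: int) -> str:
    if tier1_failures > 0:
        return "NO-GO"  # the safety gate is absolute
    return "GO" if overall_pct >= 85.0 else "NO-GO"
```

The value isn't in the code; it's that the numbers were written down before anyone saw results, so "ready" means the same thing to everyone.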
The failure triage categories
This is the part most evaluation frameworks skip, and it's where a lot of the value is. Knowing a score is low only matters if it tells you why and therefore what to fix. Tag every low score (0–2) with one of these five root cause categories:
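Once every low score carries a root-cause tag, aggregating them tells you where to invest. In this sketch the category names are hypothetical stand-ins (the post mentions "prompt gap" later; the other four are illustrative, not the rubric's actual categories).

```python
from collections import Counter

# Hypothetical root-cause labels; substitute your rubric's five categories.
CATEGORIES = {"prompt_gap", "knowledge_gap", "tool_failure",
              "model_limitation", "scope_gap"}

def triage(low_scores: list[dict]) -> Counter:
    """Count root causes across all 0-2 scores to see where to invest effort."""
    for item in low_scores:
        if item["cause"] not in CATEGORIES:
            raise ValueError(f"untagged or unknown cause: {item}")
    return Counter(item["cause"] for item in low_scores)
```

If `prompt_gap` dominates the counts, you iterate on the prompt; if `knowledge_gap` does, no amount of prompt engineering will help.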
Why this kind of structure matters
Without a rubric, conversations about agent quality tend to be vague: "It feels better." "The tone seems off." "There was a weird response in that one scenario." These observations aren't wrong; they're just not actionable.
With a structured rubric, those same conversations change. Instead of "it feels off," you get something like: "Issue Detection Accuracy is consistently low on multi-intent scenarios; it's catching the primary intent but missing the secondary one. That's a prompt gap." That's a conversation that leads somewhere.
There's also a use case before the agent is built: walking through the measures in the design phase forces you to think about edge cases, failure modes, and requirements that haven't been explicitly captured. Several Tier 1 safety considerations tend to surface this way: things that weren't in the original requirements doc but clearly needed to be there.
A rubric forces specificity. And specificity is what separates "this agent needs work" from "this agent needs this specific thing fixed, here's the evidence, here's what good looks like."
Should you adapt this for your context?
Yes, and the measures are the part that needs the most customization. The tier structure, scoring scale, and triage framework are broadly applicable. But the specific measures and their weights need to reflect what your agent is actually supposed to do. A customer service agent has very different failure modes from a coding assistant or a document summarizer.
A few things I'd make sure you have before you start scoring: clear per-measure scoring guides (not just measure names; "Escalation Judgment" without a definition of when escalation is required is too subjective to score consistently); a calibration session where reviewers independently score the same few conversations before scoring at scale; and threshold agreement before anyone sees results.
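For the calibration session, a crude but useful check is to flag every measure where two reviewers' scores on the same conversation diverge by more than a point. This is a minimal sketch under that assumption; real inter-rater statistics (e.g. weighted kappa) would be more rigorous.

```python
# Minimal calibration check: flag measures where two reviewers scoring the
# same conversation differ by more than 1 point on the 0-5 scale.
# The >1-point tolerance is an illustrative assumption.
def calibration_gaps(reviewer_a: dict[str, int],
                     reviewer_b: dict[str, int]) -> list[str]:
    return [m for m in reviewer_a
            if abs(reviewer_a[m] - reviewer_b[m]) > 1]
```

Any measure this flags is one whose scoring guide needs tightening before you scale up.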
If you're working through something like this and want to compare approaches, I'd be glad to hear how your context differs. Find me on LinkedIn.