I've been thinking about a pattern that might solve a recurring problem in AI-assisted testing: how do you get an AI to generate test cases that actually match your team's standards, not just generic "correct" output? The approach below is something I explored (partly built, partly hypothetical), and I think it's worth sharing as a potential model for anyone doing serious agent testing work.
The problem with naive test case generation
The simplest approach fails in a predictable way. Here's what that looks like:
"You are a QA engineer.
Write test cases for:
'User resets password via email link.'"

What you get: generic steps, in the wrong format, missing your domain rules. Looks fine. Isn't.
Now compare the retrieval-augmented version of the same prompt:

"Here are 3 similar req→TC pairs from our history:
[retrieved examples in our format]
Now write test cases for:
'User resets password via email link.'"

What you get: your format, your style, output you can actually use.
The insight: the model doesn't need to be taught QA; it needs to be shown your version of QA. Retrieved examples do that better than any system prompt.
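Here's roughly what that prompt assembly looks like in code. This is a sketch: the pair fields (`requirement`, `test_case`) and the exact wording are illustrative choices, not anything canonical.

```python
# Sketch of few-shot prompt assembly. The pair fields ("requirement",
# "test_case") and the prompt wording are illustrative, not fixed.

def build_prompt(new_requirement: str, retrieved_pairs: list[dict]) -> str:
    """Turn retrieved (requirement, test-case) pairs into a few-shot prompt."""
    lines = [f"Here are {len(retrieved_pairs)} similar req->TC pairs from our history:", ""]
    for i, pair in enumerate(retrieved_pairs, 1):
        lines.append(f"Example {i}")
        lines.append(f"Requirement: {pair['requirement']}")
        lines.append(f"Test cases:\n{pair['test_case']}")
        lines.append("")
    lines.append("Now write test cases in the same format for:")
    lines.append(new_requirement)
    return "\n".join(lines)

examples = [{
    "requirement": "User logs in with email and password.",
    "test_case": "TC-1: valid credentials open the dashboard\n"
                 "TC-2: wrong password shows an inline error",
}]
prompt = build_prompt("User resets password via email link.", examples)
```

The mechanism is mechanical, which is the point: the model sees your historical format immediately before the new requirement, so it mirrors it.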
How this differs from standard RAG
| Dimension | Standard RAG | This Approach |
|---|---|---|
| What's retrieved | Relevant text passages | Structured (Requirement → Test Case) pairs |
| How it's used | Context for the model to summarize/reason | Worked examples: the model performs the same transformation |
| Output format | Depends on model defaults | Mirrors the format of retrieved examples |
| Gets better over time? | Only if KB is updated manually | Yes, via the human curation feedback loop |
| Domain knowledge | Whatever's in the KB text | Embedded in the example pairs themselves |
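The retrieval difference is easy to show in code: the unit of retrieval is the whole (requirement → test case) pair, ranked by similarity of the requirement text. A toy bag-of-words similarity stands in for a real embedding model here; everything else is an illustrative assumption.

```python
# Sketch of pair retrieval: rank whole (requirement, test case) pairs by
# similarity of the requirement text. Bag-of-words cosine is a stand-in
# for a real embedding model.
import math
from collections import Counter

def embed(text: str) -> Counter:
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query: str, kb: list[dict], k: int = 3) -> list[dict]:
    qv = embed(query)
    ranked = sorted(kb, key=lambda p: cosine(qv, embed(p["requirement"])), reverse=True)
    return ranked[:k]  # whole pairs, not passages: each carries its test case

kb = [
    {"requirement": "User resets PIN via SMS code", "test_case": "TC-1: valid code resets PIN"},
    {"requirement": "Admin exports audit log", "test_case": "TC-1: export produces a file"},
]
top = retrieve("User resets password via email link", kb, k=1)
```

Because each hit drags its test case along with it, the model never sees a requirement without its worked answer.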
The system architecture
The architecture reduces to a three-stage pipeline:

- Knowledge base: your historical requirement/test-case pairs, embedded, enriched, and deduplicated into pairs with weights and metadata.
- Generation: retrieved pairs are injected as few-shot examples for each new requirement.
- Curation: testers see the exact prompt and output, then edit, approve, and rate.
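The ingest stage could be as simple as the sketch below. The dedupe key and metadata fields are assumptions I'd expect to vary by team; embedding is left out, since any embedding model slots in.

```python
# Sketch of KB ingest: normalize, deduplicate, and attach the weight and
# metadata that later stages use. Field names are illustrative.
import hashlib

def ingest(raw_pairs: list[dict]) -> list[dict]:
    kb, seen = [], set()
    for pair in raw_pairs:
        # Deduplicate on normalized requirement text
        key = hashlib.sha256(pair["requirement"].strip().lower().encode()).hexdigest()
        if key in seen:
            continue
        seen.add(key)
        kb.append({
            **pair,
            "weight": 1.0,  # neutral start; curation feedback adjusts this later
            "n_steps": pair["test_case"].count("\n") + 1,  # cheap enrichment metadata
        })
    return kb

raw_pairs = [
    {"requirement": "User resets password via email link",
     "test_case": "TC-1: request link\nTC-2: expired link shows error"},
    {"requirement": "  user resets password via EMAIL link ",
     "test_case": "near-duplicate of the first"},
]
kb = ingest(raw_pairs)  # the near-duplicate is dropped
```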
The three core components
In short: a knowledge base of curated pairs, a retrieval-plus-generation step that turns them into few-shot prompts, and a feedback loop of human curation.
The feedback loop in detail
This is the part I think most teams would skip, and the part that makes the whole thing work.
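One way to wire ratings back into retrieval, as a sketch: each pair carries a weight, tester ratings nudge it, and ranking multiplies similarity by weight. The update rule, learning rate, and bounds here are arbitrary assumptions, not tuned values.

```python
# Sketch of the curation feedback loop: tester ratings adjust a per-pair
# weight, and retrieval ranks by similarity * weight. The update rule,
# learning rate, and clamping bounds are assumptions.

def apply_rating(pair: dict, rating: int, lr: float = 0.1) -> None:
    """rating is 1..5 with 3 neutral; weight is clamped to [0.1, 3.0]."""
    pair["weight"] = min(3.0, max(0.1, pair["weight"] + lr * (rating - 3)))

def score(similarity: float, pair: dict) -> float:
    # A frequently approved pair outranks a similar but poorly rated one.
    return similarity * pair["weight"]

pair = {"requirement": "User resets password", "weight": 1.0}
apply_rating(pair, 5)  # approval nudges the weight up
apply_rating(pair, 1)  # a later rejection nudges it back down
```

The key property is that no model weights change: the same ratings UI that testers already use quietly reorders what gets retrieved next time.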
What the tester experience looks like
- Upload a spreadsheet of new requirements.
- For each requirement, see which KB examples were retrieved and why.
- Review the generated test cases per requirement.
- Edit, rate, and download the results as .xlsx.
Why this is a general pattern, not just a test case tool
The underlying structure is: learn (A, B) pairs → retrieve similar pairs → use as few-shot examples → generate new B → curate → improve. That pattern applies anywhere you have a learnable transformation:
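That loop fits in a few lines once the retriever and generator are parameters. `generate_fn` below is a placeholder for the actual LLM call; the stubs exist only to make the sketch runnable.

```python
# The general pattern as two functions: retrieve similar (A, B) pairs, use
# them as few-shot examples to generate a new B, and let curated results
# grow the store. generate_fn is a placeholder for a real model call.

def transform(a_new, store, retrieve_fn, generate_fn, k=3):
    examples = retrieve_fn(a_new, store, k)   # retrieve similar (A, B) pairs
    b_new = generate_fn(a_new, examples)      # few-shot generate the new B
    return b_new, examples                    # examples returned for transparency

def curate(store, a, b_edited):
    store.append({"a": a, "b": b_edited})     # curated output becomes a future example

# Stubs standing in for real retrieval and generation:
store = [{"a": "User logs in", "b": "TC-1: valid login opens dashboard"}]
fake_retrieve = lambda a, s, k: s[:k]
fake_generate = lambda a, ex: f"test cases for: {a} (from {len(ex)} examples)"
b_new, used = transform("User resets password", store, fake_retrieve, fake_generate)
curate(store, "User resets password", "TC-1: edited by a tester")
```

Swapping the domain means swapping the store's contents and the prompt template, nothing else.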
| Input (A) | Output (B) | Domain | Difficulty |
|---|---|---|---|
| Requirement | Test cases | QA / Testing | Low: well-structured inputs |
| Business requirement | User story | Product / BA | Low: clear transformation |
| Bug description | Root cause category | Support / Triage | Medium: needs good label examples |
| Code diff | Review comment | Engineering | Medium: style is opinionated |
| Policy document | Compliance check output | Legal / Risk | High: high stakes, low margin for error |
| Customer query | Suggested response | Customer service | High: tone, policy, and domain all matter |
Honest tradeoffs
What works:

- Output matches your team's format from day one, provided your KB is good
- Gets measurably better as more curated pairs accumulate
- Prompt transparency means failures are diagnosable, not mysterious
- No retraining required: all improvement happens through retrieval weighting
- The generic pattern applies across many A → B transformation problems

What's hard:

- Cold start: it needs a decent seed KB (50+ quality pairs) before it's useful
- Curation discipline is hard to sustain; testers skip rating when under pressure
- Retrieval degrades for edge cases outside the KB's coverage area
- Non-functional requirements (performance, security) get weak output; they're hard to example-ify
- Someone needs to own KB hygiene, because bad examples poison future retrievals
If you try this: things I'd do differently
Is this the right model for agent testing more broadly?
I think it might be. The core insight (use curated human examples as few-shot context, then let human feedback improve the retrieval priority over time) seems broadly applicable anywhere you're trying to get an AI to perform a structured transformation consistently.
If you're experimenting with something similar, or if you think this hypothesis is wrong, I'd genuinely love to hear about it. Find me on LinkedIn.