Test Automation · Architecture

One Way to Generate Test Cases Using Agents —
A RAG + Few-Shot Approach Worth Exploring

Natarajan (Nattu) Ramakrishnan · May 2025 · 10 min read

I've been thinking about a pattern that might solve a recurring problem in AI-assisted testing: how do you get an AI to generate test cases that actually match your team's standards, not just generic "correct" output? The approach below is something I explored — partly built, partly hypothetical — and I think it's worth sharing as a potential model for anyone doing serious agent testing work.

The problem with naive test case generation

The simplest approach fails in a predictable way. Here's what that looks like:

❌ Naive Prompt Approach
Prompt:
"You are a QA engineer.
Write test cases for:
'User resets password
via email link.'"

What you get:
Generic steps. Wrong format.
Misses your domain rules.
Looks fine. Isn't.
✅ RAG + Few-Shot Approach
Prompt:
"Here are 3 similar req→TC
pairs from our history:
[retrieved examples in our format]

Now write test cases for:
'User resets password
via email link.'"

What you get:
Your format. Your style. Usable.

The insight: the model doesn't need to be taught QA — it needs to be shown your version of QA. Retrieved examples do that better than any system prompt.
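The few-shot prompt on the right-hand side can be assembled in a few lines. This is an illustrative sketch, not the tool's actual code: `build_prompt` and the pair fields are names I've made up, and `retrieved` would really come from the vector store rather than being hard-coded.

```python
# Sketch of few-shot prompt assembly from retrieved (requirement, test_cases)
# pairs. All names here are illustrative, not a real API.

def build_prompt(retrieved, new_requirement):
    """Assemble a few-shot prompt from retrieved (requirement, test_cases) pairs."""
    parts = [f"Here are {len(retrieved)} similar requirement -> test-case pairs from our history:\n"]
    for i, pair in enumerate(retrieved, 1):
        parts.append(
            f"Example {i}\n"
            f"Requirement: {pair['requirement']}\n"
            f"Test cases:\n{pair['test_cases']}\n"
        )
    parts.append(f"Now write test cases, in the same format, for:\n{new_requirement}")
    return "\n".join(parts)

# Hard-coded stand-in for what retrieval would return:
retrieved = [
    {"requirement": "User logs in with valid credentials",
     "test_cases": "TC-1: Given a registered user, When they submit valid credentials, Then the dashboard loads"},
]
prompt = build_prompt(retrieved, "User resets password via email link.")
print(prompt)
```

Because the examples are inlined verbatim, whatever format the retrieved pairs use is the format the model is nudged to reproduce.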

How this differs from standard RAG

Dimension | Standard RAG | This Approach
What's retrieved | Relevant text passages | Structured (Requirement → Test Case) pairs
How it's used | Context for the model to summarize/reason | Worked examples — the model performs the same transformation
Output format | Depends on model defaults | Mirrors the format of retrieved examples
Gets better over time? | Only if the KB is updated manually | Yes — via the human curation feedback loop
Domain knowledge | Whatever's in the KB text | Embedded in the example pairs themselves

The system architecture

RAG Test Case Engine — how the pieces connect

Ingest side:
📋 Existing Req/TC Pairs (your historical work)
→ ⚙️ Ingest Pipeline (embed · enrich · deduplicate)
→ 🗄️ Vector DB (pairs + weights + metadata)

Generation side (where a new requirement enters):
🗄️ Vector DB — retrieve similar pairs → 🤖 LLM (few-shot generation)
→ ✏️ Review UI (see prompt + output)
→ ♻️ Human Curation (edit · approve · rate)
→ curated output written back to the Vector DB with higher priority, closing the loop

The three core components

🗄️ Knowledge Base — Chroma / any vector store
A vector DB of (requirement, test cases) pairs from your team's real history. Stores format type, domain tags, and curation weight per entry. Deduplicates on similarity score at ingest, so it stays clean.

🧩 Generation Pipeline — Python orchestrator + LLM API
Retrieves the top-N similar pairs for a new requirement and builds a few-shot prompt: "here are examples of how we do this; now do it for this requirement." Shows the tester the full prompt, not just the output.

♻️ Feedback Loop — the most important part
When a tester edits and approves output, it's written back to the KB with elevated weight. Thumbs up/down adjusts retrieval priority. The system improves without retraining, purely through human signal.
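The dedup-at-ingest behaviour of the Knowledge Base can be sketched without any vector-store dependency. This is a toy stand-in under loud assumptions: Jaccard similarity over word sets replaces real embedding similarity, the 0.9 threshold is invented, and `KnowledgeBase` and its field names are hypothetical — a real build would use Chroma or another vector store.

```python
# Toy in-memory stand-in for the knowledge base. Jaccard similarity over
# lowercase word sets substitutes for embedding similarity; the threshold
# and all field names are illustrative assumptions.

DEDUP_THRESHOLD = 0.9  # skip ingest if a near-identical pair already exists


def similarity(a: str, b: str) -> float:
    """Jaccard similarity over lowercase word sets (embedding stand-in)."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0


class KnowledgeBase:
    def __init__(self):
        self.entries = []  # each: {requirement, test_cases, weight, tags}

    def ingest(self, requirement, test_cases, tags=None, weight=1.0):
        # Deduplicate on similarity score at ingest so the KB stays clean.
        for e in self.entries:
            if similarity(e["requirement"], requirement) >= DEDUP_THRESHOLD:
                return False  # near-duplicate, rejected
        self.entries.append({"requirement": requirement, "test_cases": test_cases,
                             "weight": weight, "tags": tags or []})
        return True


kb = KnowledgeBase()
kb.ingest("User resets password via email link", "TC-1: ...", tags=["auth"])
added = kb.ingest("User resets password via email link", "TC-1: ...")  # duplicate
print(added)  # False: rejected at ingest, KB still holds one entry
```

The point the sketch makes is structural: dedup happens on the way in, not as a later cleanup pass, so retrieval never has to reason about near-identical examples.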

The feedback loop in detail

This is the part I think most teams would skip — and the part that makes the whole thing work.

1. New requirement in
System retrieves the top 3–5 most similar (req, TC) pairs from the KB. These become the few-shot examples in the prompt.

2. Tester sees the prompt
Before the output, the tester sees which examples were retrieved and why. Full transparency — no black box. Makes debugging instant.

3. Curate the output
Edit what's wrong, approve what's right. Rate overall quality. The curation work IS the quality investment, not an afterthought.

4. Better next time
Approved output re-enters the KB with elevated priority. Future similar requirements get better examples. Quality compounds without retraining.

What the tester experience looks like

Step-by-step tester workflow

① Upload — spreadsheet of new requirements
→ ② Inspect Prompt — see which KB examples were retrieved and why
→ ③ Review Output — generated TCs per requirement
→ ④ Curate + Export — edit · rate · download as .xlsx
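The upload-and-export bookends of that workflow can be sketched with the standard library. Here csv stands in for .xlsx (a real build would reach for openpyxl or pandas), and `generate_test_cases` is a placeholder for the retrieval-plus-LLM call — both substitutions are mine, not the tool's.

```python
# Batch workflow sketch: read requirements from a spreadsheet, generate test
# cases per row, write out the results. csv stands in for .xlsx here, and
# generate_test_cases is a placeholder for the retrieval + LLM step.
import csv
import io


def generate_test_cases(requirement):
    # Placeholder: the real pipeline retrieves similar pairs and calls the LLM.
    return f"TC-1: Verify '{requirement}' happy path\nTC-2: Verify error handling"


# In-memory stand-ins for the uploaded and exported files.
uploaded = io.StringIO("requirement\nUser resets password via email link\n")
out = io.StringIO()

writer = csv.writer(out)
writer.writerow(["requirement", "generated_test_cases"])
for row in csv.DictReader(uploaded):
    writer.writerow([row["requirement"], generate_test_cases(row["requirement"])])

print(out.getvalue().splitlines()[0])  # header row of the export
```

One row in, one curatable row out; everything interesting happens inside the placeholder.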

Why this is a general pattern, not just a test case tool

The underlying structure is: learn (A, B) pairs → retrieve similar pairs → use as few-shot examples → generate new B → curate → improve. That pattern applies anywhere you have a learnable transformation:

Input (A) | Output (B) | Domain | Difficulty
Requirement | Test cases | QA / Testing | Low — well-structured inputs
Business requirement | User story | Product / BA | Low — clear transformation
Bug description | Root cause category | Support / Triage | Medium — needs good label examples
Code diff | Review comment | Engineering | Medium — style is opinionated
Policy document | Compliance check output | Legal / Risk | High — high stakes, low margin for error
Customer query | Suggested response | Customer service | High — tone + policy + domain all matter

Honest tradeoffs

✓ What works well
  • Output matches your team's format from day one — if your KB is good
  • Gets measurably better as more curated pairs accumulate
  • Prompt transparency means failures are diagnosable, not mysterious
  • No retraining required — all improvement is through retrieval weighting
  • Generic pattern applies across many A→B transformation problems

△ Watch out for
  • Cold-start problem — needs a decent seed KB before it's useful (50+ quality pairs)
  • Curation discipline is hard to sustain — testers skip rating when under pressure
  • Retrieval degrades for edge cases outside the KB's coverage area
  • Non-functional requirements (performance, security) get weak output — they're hard to example-ify
  • Someone needs to own KB hygiene — bad examples poison future retrievals

If you try this — things I'd do differently

1. Seed the KB before you go live. Don't bootstrap it from user curation alone. Manually curate 50–100 high-quality pairs that represent the full range of requirements you'll encounter. The system's quality ceiling is set by the seed quality.

2. Design your metadata schema upfront. Format type (BDD vs plain text), domain tags, feature area — add these at ingest, not reactively. Retrieval quality for cross-domain requirements depends on it.

3. Make curation feel rewarding, not like admin. Show testers their ratings improving future output quality. Frame it as "your curation makes the tool smarter for you next time", not just a quality signal you're collecting.

4. Show the prompt, always. Transparency about which examples were retrieved is what makes failures debuggable. An opaque system generates distrust; a visible one generates trust even when it's wrong.

5. Treat ambiguous requirements as a feature, not a failure. When the system struggles to generate clean output for a requirement, that's usually a signal the requirement is under-specified, not that the AI is broken.
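Point 2 above ("design your metadata schema upfront") is concrete enough to sketch. Every name below — `FormatType`, `PairMetadata`, the field names and defaults — is my hypothetical rendering of such a schema, not a spec the article defines.

```python
# One possible metadata schema for a (requirement, test cases) pair, designed
# upfront as suggested in point 2. All names and defaults are assumptions.
from dataclasses import dataclass, field
from enum import Enum


class FormatType(Enum):
    BDD = "bdd"           # Given/When/Then style
    PLAIN = "plain_text"  # numbered plain-text steps


@dataclass
class PairMetadata:
    format_type: FormatType
    domain_tags: list = field(default_factory=list)  # e.g. ["auth", "email"]
    feature_area: str = ""                           # e.g. "password-reset"
    curation_weight: float = 1.0                     # elevated on tester approval


meta = PairMetadata(FormatType.BDD, ["auth"], "password-reset")
print(meta.format_type.value)  # -> bdd
```

Attaching this at ingest is what later lets retrieval filter "BDD-format auth examples" instead of falling back on raw text similarity alone.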

Is this the right model for agent testing more broadly?

I think it might be. The core insight — use curated human examples as few-shot context, then let human feedback improve the retrieval priority over time — seems broadly applicable anywhere you're trying to get an AI to perform a structured transformation consistently.

💭
The hypothesis I'm sitting with: most AI testing failures aren't model failures — they're knowledge representation failures. The model could do the right thing if it had better examples. RAG-based approaches with human feedback loops might be the most pragmatic way to give it those examples without retraining.

If you're experimenting with something similar — or if you think this hypothesis is wrong — I'd genuinely love to hear about it. Find me on LinkedIn.