Test Automation · Architecture

One Way to Generate Test Cases Using Agents —
A RAG + Few-Shot Approach Worth Exploring

Natarajan (Nattu) Ramakrishnan · May 2025 · 10 min read

I've been thinking about a pattern that might solve a recurring problem in AI-assisted testing: how do you get an AI to generate test cases that actually match your team's standards, not just generic "correct" output? The approach below is something I explored — partly built, partly hypothetical — and I think it's worth sharing as a potential model for anyone doing serious agent testing work.

The problem with naive test case generation

The simplest approach fails in a predictable way. Here's what that looks like:

❌ Naive Prompt Approach
Prompt:
"You are a QA engineer.
Write test cases for:
'User resets password
via email link.'"

What you get:
Generic steps. Wrong format.
Misses your domain rules.
Looks fine. Isn't.
✅ RAG + Few-Shot Approach
Prompt:
"Here are 3 similar req→TC
pairs from our history:
[retrieved examples in our format]

Now write test cases for:
'User resets password
via email link.'"

What you get:
Your format. Your style. Usable.

The insight: the model doesn't need to be taught QA — it needs to be shown your version of QA. Retrieved examples do that better than any system prompt.
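The few-shot prompt on the right-hand side can be assembled in a few lines. This is an illustrative sketch, not the tool's actual code: `build_prompt` and the pair fields are names I've made up, and `retrieved` would really come from the vector store rather than being hard-coded.

```python
# Sketch of few-shot prompt assembly from retrieved (requirement, test_cases)
# pairs. All names here are illustrative, not a real API.

def build_prompt(retrieved, new_requirement):
    """Assemble a few-shot prompt from retrieved (requirement, test_cases) pairs."""
    parts = [f"Here are {len(retrieved)} similar requirement -> test-case pairs from our history:\n"]
    for i, pair in enumerate(retrieved, 1):
        parts.append(
            f"Example {i}\n"
            f"Requirement: {pair['requirement']}\n"
            f"Test cases:\n{pair['test_cases']}\n"
        )
    parts.append(f"Now write test cases, in the same format, for:\n{new_requirement}")
    return "\n".join(parts)

# Hard-coded stand-in for what retrieval would return:
retrieved = [
    {"requirement": "User logs in with valid credentials",
     "test_cases": "TC-1: Given a registered user, When they submit valid credentials, Then the dashboard loads"},
]
prompt = build_prompt(retrieved, "User resets password via email link.")
print(prompt)
```

Because the examples are inlined verbatim, whatever format the retrieved pairs use is the format the model is nudged to reproduce.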

How this differs from standard RAG

Dimension | Standard RAG | This Approach
What's retrieved | Relevant text passages | Structured (Requirement → Test Case) pairs
How it's used | Context for the model to summarize/reason | Worked examples — the model performs the same transformation
Output format | Depends on model defaults | Mirrors the format of retrieved examples
Gets better over time? | Only if the KB is updated manually | Yes — via the human curation feedback loop
Domain knowledge | Whatever's in the KB text | Embedded in the example pairs themselves

The system architecture

RAG Test Case Engine — how the pieces connect

Ingest side:
📋 Existing Req/TC Pairs (your historical work)
→ ⚙️ Ingest Pipeline (embed · enrich · deduplicate)
→ 🗄️ Vector DB (pairs + weights + metadata)

Generation side (where a new requirement enters):
🗄️ Vector DB — retrieve similar pairs → 🤖 LLM (few-shot generation)
→ ✏️ Review UI (see prompt + output)
→ ♻️ Human Curation (edit · approve · rate)
→ curated output written back to the Vector DB with higher priority, closing the loop

The three core components

🗄️ Knowledge Base — Chroma / any vector store
A vector DB of (requirement, test cases) pairs from your team's real history. Stores format type, domain tags, and curation weight per entry. Deduplicates on similarity score at ingest, so it stays clean.

🧩 Generation Pipeline — Python orchestrator + LLM API
Retrieves the top-N similar pairs for a new requirement and builds a few-shot prompt: "here are examples of how we do this; now do it for this requirement." Shows the tester the full prompt, not just the output.

♻️ Feedback Loop — the most important part
When a tester edits and approves output, it's written back to the KB with elevated weight. Thumbs up/down adjusts retrieval priority. The system improves without retraining, purely through human signal.
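The dedup-at-ingest behaviour of the Knowledge Base can be sketched without any vector-store dependency. This is a toy stand-in under loud assumptions: Jaccard similarity over word sets replaces real embedding similarity, the 0.9 threshold is invented, and `KnowledgeBase` and its field names are hypothetical — a real build would use Chroma or another vector store.

```python
# Toy in-memory stand-in for the knowledge base. Jaccard similarity over
# lowercase word sets substitutes for embedding similarity; the threshold
# and all field names are illustrative assumptions.

DEDUP_THRESHOLD = 0.9  # skip ingest if a near-identical pair already exists


def similarity(a: str, b: str) -> float:
    """Jaccard similarity over lowercase word sets (embedding stand-in)."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0


class KnowledgeBase:
    def __init__(self):
        self.entries = []  # each: {requirement, test_cases, weight, tags}

    def ingest(self, requirement, test_cases, tags=None, weight=1.0):
        # Deduplicate on similarity score at ingest so the KB stays clean.
        for e in self.entries:
            if similarity(e["requirement"], requirement) >= DEDUP_THRESHOLD:
                return False  # near-duplicate, rejected
        self.entries.append({"requirement": requirement, "test_cases": test_cases,
                             "weight": weight, "tags": tags or []})
        return True


kb = KnowledgeBase()
kb.ingest("User resets password via email link", "TC-1: ...", tags=["auth"])
added = kb.ingest("User resets password via email link", "TC-1: ...")  # duplicate
print(added)  # False: rejected at ingest, KB still holds one entry
```

The point the sketch makes is structural: dedup happens on the way in, not as a later cleanup pass, so retrieval never has to reason about near-identical examples.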

The feedback loop in detail

This is the part I think most teams would skip — and the part that makes the whole thing work.

1. New requirement in
System retrieves the top 3–5 most similar (req, TC) pairs from the KB. These become the few-shot examples in the prompt.

2. Tester sees the prompt
Before the output, the tester sees which examples were retrieved and why. Full transparency — no black box. Makes debugging instant.

3. Curate the output
Edit what's wrong, approve what's right. Rate overall quality. The curation work IS the quality investment, not an afterthought.

4. Better next time
Approved output re-enters the KB with elevated priority. Future similar requirements get better examples. Quality compounds without retraining.

What the tester experience looks like

Step-by-step tester workflow

① Upload — spreadsheet of new requirements
→ ② Inspect Prompt — see which KB examples were retrieved and why
→ ③ Review Output — generated TCs per requirement
→ ④ Curate + Export — edit · rate · download as .xlsx
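The upload-and-export bookends of that workflow can be sketched with the standard library. Here csv stands in for .xlsx (a real build would reach for openpyxl or pandas), and `generate_test_cases` is a placeholder for the retrieval-plus-LLM call — both substitutions are mine, not the tool's.

```python
# Batch workflow sketch: read requirements from a spreadsheet, generate test
# cases per row, write out the results. csv stands in for .xlsx here, and
# generate_test_cases is a placeholder for the retrieval + LLM step.
import csv
import io


def generate_test_cases(requirement):
    # Placeholder: the real pipeline retrieves similar pairs and calls the LLM.
    return f"TC-1: Verify '{requirement}' happy path\nTC-2: Verify error handling"


# In-memory stand-ins for the uploaded and exported files.
uploaded = io.StringIO("requirement\nUser resets password via email link\n")
out = io.StringIO()

writer = csv.writer(out)
writer.writerow(["requirement", "generated_test_cases"])
for row in csv.DictReader(uploaded):
    writer.writerow([row["requirement"], generate_test_cases(row["requirement"])])

print(out.getvalue().splitlines()[0])  # header row of the export
```

One row in, one curatable row out; everything interesting happens inside the placeholder.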

Why this is a general pattern, not just a test case tool

The underlying structure is: learn (A, B) pairs → retrieve similar pairs → use as few-shot examples → generate new B → curate → improve. That pattern applies anywhere you have a learnable transformation:

Input (A) | Output (B) | Domain | Difficulty
Requirement | Test cases | QA / Testing | Low — well-structured inputs
Business requirement | User story | Product / BA | Low — clear transformation
Bug description | Root cause category | Support / Triage | Medium — needs good label examples
Code diff | Review comment | Engineering | Medium — style is opinionated
Policy document | Compliance check output | Legal / Risk | High — high stakes, low margin for error
Customer query | Suggested response | Customer service | High — tone + policy + domain all matter

Honest tradeoffs

✓ What works well
  • Output matches your team's format from day one — if your KB is good
  • Gets measurably better as more curated pairs accumulate
  • Prompt transparency means failures are diagnosable, not mysterious
  • No retraining required — all improvement is through retrieval weighting
  • Generic pattern applies across many A→B transformation problems

△ Watch out for
  • Cold-start problem — needs a decent seed KB before it's useful (50+ quality pairs)
  • Curation discipline is hard to sustain — testers skip rating when under pressure
  • Retrieval degrades for edge cases outside the KB's coverage area
  • Non-functional requirements (performance, security) get weak output — they're hard to example-ify
  • Someone needs to own KB hygiene — bad examples poison future retrievals

If you try this — things I'd do differently

1. Seed the KB before you go live. Don't bootstrap it from user curation alone. Manually curate 50–100 high-quality pairs that represent the full range of requirements you'll encounter. The system's quality ceiling is set by the seed quality.

2. Design your metadata schema upfront. Format type (BDD vs plain text), domain tags, feature area — add these at ingest, not reactively. Retrieval quality for cross-domain requirements depends on it.

3. Make curation feel rewarding, not like admin. Show testers their ratings improving future output quality. Frame it as "your curation makes the tool smarter for you next time", not just a quality signal you're collecting.

4. Show the prompt, always. Transparency about which examples were retrieved is what makes failures debuggable. An opaque system generates distrust; a visible one generates trust even when it's wrong.

5. Treat ambiguous requirements as a feature, not a failure. When the system struggles to generate clean output for a requirement, that's usually a signal the requirement is under-specified, not that the AI is broken.
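Point 2 above ("design your metadata schema upfront") is concrete enough to sketch. Every name below — `FormatType`, `PairMetadata`, the field names and defaults — is my hypothetical rendering of such a schema, not a spec the article defines.

```python
# One possible metadata schema for a (requirement, test cases) pair, designed
# upfront as suggested in point 2. All names and defaults are assumptions.
from dataclasses import dataclass, field
from enum import Enum


class FormatType(Enum):
    BDD = "bdd"           # Given/When/Then style
    PLAIN = "plain_text"  # numbered plain-text steps


@dataclass
class PairMetadata:
    format_type: FormatType
    domain_tags: list = field(default_factory=list)  # e.g. ["auth", "email"]
    feature_area: str = ""                           # e.g. "password-reset"
    curation_weight: float = 1.0                     # elevated on tester approval


meta = PairMetadata(FormatType.BDD, ["auth"], "password-reset")
print(meta.format_type.value)  # -> bdd
```

Attaching this at ingest is what later lets retrieval filter "BDD-format auth examples" instead of falling back on raw text similarity alone.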

Is this the right model for agent testing more broadly?

I think it might be. The core insight — use curated human examples as few-shot context, then let human feedback improve the retrieval priority over time — seems broadly applicable anywhere you're trying to get an AI to perform a structured transformation consistently.

💭
The hypothesis I'm sitting with: most AI testing failures aren't model failures — they're knowledge representation failures. The model could do the right thing if it had better examples. RAG-based approaches with human feedback loops might be the most pragmatic way to give it those examples without retraining.

If you're experimenting with something similar — or if you think this hypothesis is wrong — I'd genuinely love to hear about it. Find me on LinkedIn.