# Review & QA — Evaluating Agent Output
Agents are fast but not infallible. Your role shifts from "writer" to "reviewer." This guide teaches you when to check in, what to look for, and how to course-correct.
## The New Developer Workflow
Traditional development:

```text
Think → Write Code → Test → Debug → Commit
```

Agent-assisted development:

```text
Spec → Agent Builds → You Review → Course-Correct → Agent Continues
```
Your job is no longer writing code. It's:
- Defining what to build (the spec)
- Reviewing what the agent built
- Course-correcting when it drifts
This is a fundamentally different skill. Most developers struggle with it at first because reviewing feels passive. It's not — it's the highest-leverage activity in the loop.
## When to Review

### Mandatory Review Points
| Moment | What to check |
|---|---|
| After planning | Task decomposition makes sense, dependencies are correct |
| After Phase 1 (scaffold) | Build works, tests pass, project structure is sane |
| At every phase boundary | Phase deliverables match spec, no drift from requirements |
| When agent signals STUCK | Understand the blocker, clarify or unblock |
| Before merging to main | Final quality pass on the complete work |
### Optional Review Points
| Moment | Why |
|---|---|
| Every 3-5 iterations | Spot drift early before it compounds |
| After complex tasks | Tasks involving algorithms, business logic, or external APIs |
| When you're curious | Honestly, watching agents work is fascinating and educational |
### When NOT to Review
- Every single iteration (you'll burn out and slow the process)
- Simple scaffolding tasks (directory creation, config files)
- Tasks where the test suite provides sufficient verification
## What to Look For

### 1. Correctness — Does it do what the spec says?
Check against acceptance criteria:

```text
# Spec says:
FR-003: Parse QFX files and extract: date, amount, payee, memo, type

# Review:
- Does the parser handle all five fields? ✅
- Does it handle missing memo fields? (spec says optional) ✅
- Does it handle negative amounts correctly? 🤔 (not specified — add to spec)
```
Run the code yourself:

```bash
# Don't just read the code — actually run it
node build/cli.js templates list
node build/cli.js auth
npm test
```

Agent code that passes its own tests might still not do what YOU expected.
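One way to close that gap is to write a single acceptance test of your own, straight from the spec, for anything the agent's tests leave ambiguous — like the negative-amount question above. A minimal sketch, assuming a Jest/Vitest-style runner and a hypothetical `parse` function; the fixture and field names are illustrative, not from the real project:

```typescript
import { test, expect } from 'vitest';      // assumed test runner
import { parse } from '../src/parser';     // hypothetical parser module

// Minimal QFX-like fixture for the case the spec leaves open: debits.
const qfxDebit = `<STMTTRN>
<TRNTYPE>DEBIT
<DTPOSTED>20240115
<TRNAMT>-42.50
<NAME>COSTCO WHOLESALE
<MEMO>Purchase
</STMTTRN>`;

test('parser preserves the sign of debit amounts', () => {
  const [txn] = parse(qfxDebit);
  // Fails if the agent silently normalized amounts to absolute values —
  // a mismatch its own tests may never surface.
  expect(txn.amount).toBe(-42.5);
  expect(txn.type).toBe('debit');
});
```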
### 2. Architecture — Does the structure make sense?
Check for:
- Are files in the right places? (following the spec's project structure)
- Are dependencies flowing the right direction? (no circular imports)
- Is shared code actually shared? (no copy-paste between packages)
- Are abstractions appropriate? (not too many layers, not too few)
Red flags:
- A 500-line function (should be decomposed)
- Utility functions scattered across packages (should be in shared/)
- Direct database access from API handlers (should go through a service layer; see the sketch below)
- Hardcoded values that should be configurable
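To make the service-layer red flag concrete, here is a minimal sketch of the pattern to look for instead. All names (`TransactionService`, `Db`) are hypothetical; the point is that handlers call the service, and only the service writes SQL:

```typescript
interface Transaction { id: string; date: string; amount: number; payee: string }

// Hypothetical thin database wrapper — the only place SQL is allowed.
interface Db { query<T>(sql: string, params?: unknown[]): Promise<T[]> }

class TransactionService {
  constructor(private readonly db: Db) {}

  // Queries live here, so they are testable and reusable across handlers.
  list(limit = 100): Promise<Transaction[]> {
    return this.db.query<Transaction>(
      'SELECT id, date, amount, payee FROM transactions LIMIT $1', [limit]);
  }
}

// The handler stays thin: validate input, delegate, shape the response.
// app.get('/transactions', async (_req, res) => res.json(await service.list()));
```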
### 3. Quality — Is it well-written?
Check for:
- Error handling (what happens when things fail?)
- Edge cases (empty input, null values, large datasets)
- Type safety (any `any` types sneaking in?)
- Naming (do function/variable names make sense?)
- Comments (only where code ISN'T self-explanatory)
Agents tend to:
- Over-comment obvious code (`// increment counter` above `counter++`; see the before/after below)
- Under-handle errors (happy path works, sad path crashes)
- Use generic variable names in complex logic (`data`, `result`, `item`)
- Create unnecessary abstractions to seem "architecturally sound"
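A before/after sketch of the first and third habits; the names are illustrative:

```typescript
// Typical agent output: the comment restates the code, the names say nothing.
function process(data: number[]): number {
  let result = 0;
  for (const item of data) {
    result += item; // add item to result
  }
  return result;
}

// After review: the names carry the meaning, so no comment is needed.
function sumTransactionAmounts(amounts: number[]): number {
  return amounts.reduce((total, amount) => total + amount, 0);
}
```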
### 4. Tests — Are they actually testing anything?
The worst agent habit: Tests that always pass because they don't test anything meaningful.
```typescript
// BAD — this tests nothing
test('parser works', () => {
  const result = parse(sampleData);
  expect(result).toBeDefined(); // Always passes if parse returns anything
});

// GOOD — this tests behavior
test('parser extracts all QFX fields', () => {
  const result = parse(sampleQFX);
  expect(result[0].date).toBe('2024-01-15');
  expect(result[0].amount).toBe(-42.50);
  expect(result[0].payee).toBe('COSTCO WHOLESALE');
  expect(result[0].memo).toBe('Purchase');
  expect(result[0].type).toBe('debit');
});
```
Check for:
- Tests with meaningful assertions (not just `toBeDefined()`)
- Edge case tests (empty input, invalid data, boundary values; see the sketch after this list)
- Error path tests (what happens with bad input?)
- Integration tests (do components work together?)
- Test data that's realistic (not `"test"`, `"foo"`, `"bar"`)
- Net test count increased — if the agent added a feature and `Tests-Added: 0`, that's a failure
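For the edge-case and error-path items, the shape to look for is roughly this (same hypothetical `parse` function as above; the error message is illustrative):

```typescript
import { test, expect } from 'vitest';   // assumed test runner
import { parse } from '../src/parser';  // hypothetical parser module

// Edge case: empty input should yield an empty list, not throw.
test('parser returns an empty array for empty input', () => {
  expect(parse('')).toEqual([]);
});

// Error path: malformed input should fail loudly and specifically.
test('parser rejects input with no STMTTRN blocks', () => {
  expect(() => parse('not a QFX file')).toThrow(/STMTTRN/);
});
```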
### 4a. TypeScript Hygiene — Did it actually compile?
Runtime tests and TypeScript compilation test different things. An agent can pass all tests while introducing type errors that only surface at build time or when another component uses the broken type.
Always verify:

```bash
# Backend
npx tsc --noEmit

# Frontend (if it exists)
cd frontend && npx tsc --noEmit
```
Red flags in the diff:
- New fields used in code but not added to TypeScript interfaces (see the sketch after this list)
- `as unknown as Record<string, unknown>` casts to bypass type checking
- `any` type appearing in new code
- Interface definitions missing fields that the API/DB actually returns
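In a diff, the first and last red flags usually appear together, in roughly this shape (illustrative interface and field names):

```typescript
// The API now returns `memo`, but the interface was never updated.
interface Transaction {
  date: string;
  amount: number;
  payee: string;
  // memo: string;  // <- the missing field
}

// Smell: casts silence the compiler instead of fixing the interface.
function getMemo(txn: Transaction): string | undefined {
  return (txn as unknown as Record<string, unknown>).memo as string | undefined;
}

// Fix: add `memo?: string` to Transaction, then this collapses to `txn.memo`.
```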
### 5. Drift — Is it still aligned with the spec?
Over many iterations, agents can drift:
- Adding features not in the spec (scope creep)
- Changing patterns established in earlier iterations
- Ignoring constraints listed in the spec
- "Improving" things that were working fine
Drift detection:

```bash
# Check what changed
git diff main..agent-branch --stat

# Look for unexpected file changes.
# If the agent modified files outside the current task scope, investigate.
```
## How to Course-Correct

### Light Touch: Notes in the Plan
For minor issues, add notes to IMPLEMENTATION_PLAN.md:

```markdown
- [ ] Implement transaction query filters
  > AGENT NOTE: Use parameterized SQL queries. Previous iteration used
  > string concatenation — revert that pattern. See FR-003 acceptance criteria.
```
### Medium Touch: Spec Clarification
If the agent is misunderstanding a requirement, clarify the spec:

```markdown
# Before (ambiguous)
FR-005: Handle errors gracefully

# After (clear)
FR-005: Error Handling
- [ ] Expired token → auto-refresh and retry (no user action needed)
- [ ] Network failure → show "Cannot reach DocuSign API. Check your connection."
- [ ] Invalid input → show specific field that failed and why
- [ ] Never show raw stack traces to the user
```
### Heavy Touch: Reset and Redirect
If the agent has gone significantly off-track:

```bash
# Find the last good state
git log --oneline

# Reset to it
git reset --hard <last-good-commit>

# Update the plan to reflect the reset,
# and add notes explaining what went wrong.
```
Then update the spec or plan to prevent the same drift.
### Nuclear Option: Rewrite the Task
Sometimes a task is fundamentally misconceived:

```markdown
# Remove the bad task
- [~] ~~Build real-time WebSocket dashboard~~ (REMOVED: overkill for MVP)

# Replace with something appropriate
- [ ] Build static HTML dashboard that refreshes on page load
```
## Review Checklist Template
Use this for phase boundary reviews:

```markdown
# Review: Phase [N] — [Name]
**Date:** YYYY-MM-DD
**Iterations completed:** X
**Tasks completed:** Y/Z
## Spec Alignment
- [ ] All phase acceptance criteria met
- [ ] No features added beyond spec
- [ ] Constraints respected (MUST/MUST NOT)
## Code Quality
- [ ] Build passes with zero errors
- [ ] All tests pass
- [ ] TypeScript compiles clean — `npx tsc --noEmit` (backend AND frontend)
- [ ] New code has new tests (Tests-Added > 0 for all feature commits)
- [ ] No obvious code smells
- [ ] Error handling present for failure paths
- [ ] No hardcoded secrets or credentials
- [ ] Shared logic is extracted — no duplication across components
## Architecture
- [ ] File structure matches spec layout
- [ ] Shared code is actually shared (not duplicated)
- [ ] Dependencies flow in one direction
- [ ] No unnecessary abstractions
- [ ] TypeScript interfaces match actual API/DB fields
## Tests
- [ ] Tests assert meaningful behavior (not just "it exists")
- [ ] Edge cases covered
- [ ] Error paths tested
- [ ] Test data is realistic
- [ ] Net test count increased from previous phase
## Model Attribution
- [ ] All commits include `Agent:` trailer
- [ ] All commits include `Tests:` and `Tests-Added:` trailers
- [ ] All commits include `TypeScript:` trailer
- [ ] Run `npm run model-report` (if available) and review output
## Performance
- [ ] Meets NFR thresholds (if specified)
- [ ] No obvious performance issues (N+1 queries, unnecessary loops)
## Issues Found
1. [Issue description] → [Resolution: fix now / add to plan / defer]
2. ...
## Verdict
- [ ] Approved for next phase
- [ ] Needs fixes (list above, re-review after)
- [ ] Needs significant rework (reset to [commit])
```
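One checklist item worth unpacking: the N+1 query pattern under Performance. A sketch of what to scan for in the diff, using a hypothetical `Db` wrapper:

```typescript
interface Account { id: string }
interface Db { query<T>(sql: string, params?: unknown[]): Promise<T[]> }

// N+1: one query for the account list, then one more query per account.
async function loadSlow(db: Db, accounts: Account[]) {
  return Promise.all(accounts.map((a) =>
    db.query('SELECT * FROM transactions WHERE account_id = $1', [a.id])));
}

// Fix: a single round trip, grouped in memory afterwards.
async function loadFast(db: Db, accounts: Account[]) {
  return db.query(
    'SELECT * FROM transactions WHERE account_id = ANY($1)',
    [accounts.map((a) => a.id)]);
}
```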
## The Learning Loop
Every review teaches you something:
- About the agent: What patterns it follows well, where it struggles
- About your spec: What was clear enough, what caused confusion
- About your process: What review cadence works, what to check
Capture these lessons:

```markdown
# Lessons Learned
## Spec Writing
- Agents interpret "handle errors" as "catch and log."
Must specify WHAT to show the user.
- Including example I/O prevents 90% of format misunderstandings.
## Agent Behavior
- Tends to over-abstract on iteration 1 (creates interfaces before implementations)
- Reliable at following MUST/MUST NOT constraints
- Struggles with tasks that require external API knowledge (add docs links)
## Process
- Reviewing every 5 iterations is about right for this project size
- Phase boundary reviews catch 80% of drift
- Running the code myself (not just reading tests) catches the other 20%
```
Feed these lessons back into your spec templates and interview protocol.
Review is not overhead. Review is the product. The code is just the artifact.