# Review & QA — Evaluating Agent Output
Agents are fast but not infallible. Your role shifts from "writer" to "reviewer." This guide teaches you when to check in, what to look for, and how to course-correct.
## The New Developer Workflow
Traditional development:

```text
Think → Write Code → Test → Debug → Commit
```

Agent-assisted development:

```text
Spec → Agent Builds → You Review → Course-Correct → Agent Continues
```
Your job is no longer writing code. It's:
- Defining what to build (the spec)
- Reviewing what the agent built
- Course-correcting when it drifts
This is a fundamentally different skill. Most developers struggle with it at first because reviewing feels passive. It's not — it's the highest-leverage activity in the loop.
## When to Review

### Mandatory Review Points
| Moment | What to check |
|---|---|
| After planning | Task decomposition makes sense, dependencies are correct |
| After Phase 1 (scaffold) | Build works, tests pass, project structure is sane |
| At every phase boundary | Phase deliverables match spec, no drift from requirements |
| When agent signals STUCK | Understand the blocker, clarify or unblock |
| Before merging to main | Final quality pass on the complete work |
### Optional Review Points
| Moment | Why |
|---|---|
| Every 3-5 iterations | Spot drift early before it compounds |
| After complex tasks | Tasks involving algorithms, business logic, or external APIs |
| When you're curious | Honestly, watching agents work is fascinating and educational |
### When NOT to Review
- Every single iteration (you'll burn out and slow the process)
- Simple scaffolding tasks (directory creation, config files)
- Tasks where the test suite provides sufficient verification
## What to Look For

### 1. Correctness — Does it do what the spec says?
Check against acceptance criteria:

```text
# Spec says:
FR-003: Parse QFX files and extract: date, amount, payee, memo, type

# Review:
- Does the parser handle all five fields? ✅
- Does it handle missing memo fields? (spec says optional) ✅
- Does it handle negative amounts correctly? 🤔 (not specified — add to spec)
```
Run the code yourself:

```bash
# Don't just read the code — actually run it
node build/cli.js templates list
node build/cli.js auth
npm test
```

Agent code that passes its own tests might still not do what YOU expected.
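One way to close that gap is to write a single acceptance test of your own, straight from the spec, for anything the agent's tests leave ambiguous — like the negative-amount question above. A minimal sketch, assuming a Jest/Vitest-style runner and a hypothetical `parse` function; the fixture and field names are illustrative, not from the real project:

```typescript
import { test, expect } from 'vitest';      // assumed test runner
import { parse } from '../src/parser';     // hypothetical parser module

// Minimal QFX-like fixture for the case the spec leaves open: debits.
const qfxDebit = `<STMTTRN>
<TRNTYPE>DEBIT
<DTPOSTED>20240115
<TRNAMT>-42.50
<NAME>COSTCO WHOLESALE
<MEMO>Purchase
</STMTTRN>`;

test('parser preserves the sign of debit amounts', () => {
  const [txn] = parse(qfxDebit);
  // Fails if the agent silently normalized amounts to absolute values —
  // a mismatch its own tests may never surface.
  expect(txn.amount).toBe(-42.5);
  expect(txn.type).toBe('debit');
});
```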
### 2. Architecture — Does the structure make sense?
Check for:
- Are files in the right places? (following the spec's project structure)
- Are dependencies flowing the right direction? (no circular imports)
- Is shared code actually shared? (no copy-paste between packages)
- Are abstractions appropriate? (not too many layers, not too few)
Red flags:
- A 500-line function (should be decomposed)
- Utility functions scattered across packages (should be in shared/)
- Direct database access from API handlers (should go through a service layer; see the sketch below)
- Hardcoded values that should be configurable
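To make the service-layer red flag concrete, here is a minimal sketch of the pattern to look for instead. All names (`TransactionService`, `Db`) are hypothetical; the point is that handlers call the service, and only the service writes SQL:

```typescript
interface Transaction { id: string; date: string; amount: number; payee: string }

// Hypothetical thin database wrapper — the only place SQL is allowed.
interface Db { query<T>(sql: string, params?: unknown[]): Promise<T[]> }

class TransactionService {
  constructor(private readonly db: Db) {}

  // Queries live here, so they are testable and reusable across handlers.
  list(limit = 100): Promise<Transaction[]> {
    return this.db.query<Transaction>(
      'SELECT id, date, amount, payee FROM transactions LIMIT $1', [limit]);
  }
}

// The handler stays thin: validate input, delegate, shape the response.
// app.get('/transactions', async (_req, res) => res.json(await service.list()));
```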
### 3. Quality — Is it well-written?
Check for:
- Error handling (what happens when things fail?)
- Edge cases (empty input, null values, large datasets)
- Type safety (any `any` types sneaking in?)
- Naming (do function/variable names make sense?)
- Comments (only where code ISN'T self-explanatory)
Agents tend to:
- Over-comment obvious code (`// increment counter` above `counter++`; see the before/after below)
- Under-handle errors (happy path works, sad path crashes)
- Use generic variable names in complex logic (`data`, `result`, `item`)
- Create unnecessary abstractions to seem "architecturally sound"
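A before/after sketch of the first and third habits; the names are illustrative:

```typescript
// Typical agent output: the comment restates the code, the names say nothing.
function process(data: number[]): number {
  let result = 0;
  for (const item of data) {
    result += item; // add item to result
  }
  return result;
}

// After review: the names carry the meaning, so no comment is needed.
function sumTransactionAmounts(amounts: number[]): number {
  return amounts.reduce((total, amount) => total + amount, 0);
}
```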
### 4. Tests — Are they actually testing anything?
The worst agent habit: Tests that always pass because they don't test anything meaningful.
```typescript
// BAD — this tests nothing
test('parser works', () => {
  const result = parse(sampleData);
  expect(result).toBeDefined(); // Always passes if parse returns anything
});

// GOOD — this tests behavior
test('parser extracts all QFX fields', () => {
  const result = parse(sampleQFX);
  expect(result[0].date).toBe('2024-01-15');
  expect(result[0].amount).toBe(-42.50);
  expect(result[0].payee).toBe('COSTCO WHOLESALE');
  expect(result[0].memo).toBe('Purchase');
  expect(result[0].type).toBe('debit');
});
```
Check for:
- Tests with meaningful assertions (not just `toBeDefined()`)
- Edge case tests (empty input, invalid data, boundary values; see the sketch after this list)
- Error path tests (what happens with bad input?)
- Integration tests (do components work together?)
- Test data that's realistic (not `"test"`, `"foo"`, `"bar"`)
- Net test count increased — if the agent added a feature and `Tests-Added: 0`, that's a failure
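For the edge-case and error-path items, the shape to look for is roughly this (same hypothetical `parse` function as above; the error message is illustrative):

```typescript
import { test, expect } from 'vitest';   // assumed test runner
import { parse } from '../src/parser';  // hypothetical parser module

// Edge case: empty input should yield an empty list, not throw.
test('parser returns an empty array for empty input', () => {
  expect(parse('')).toEqual([]);
});

// Error path: malformed input should fail loudly and specifically.
test('parser rejects input with no STMTTRN blocks', () => {
  expect(() => parse('not a QFX file')).toThrow(/STMTTRN/);
});
```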
### 4a. TypeScript Hygiene — Did it actually compile?
Runtime tests and TypeScript compilation test different things. An agent can pass all tests while introducing type errors that only surface at build time or when another component uses the broken type.
Always verify:

```bash
# Backend
npx tsc --noEmit

# Frontend (if it exists)
cd frontend && npx tsc --noEmit
```
Red flags in the diff:
- New fields used in code but not added to TypeScript interfaces (see the sketch after this list)
- `as unknown as Record<string, unknown>` casts to bypass type checking
- `any` type appearing in new code
- Interface definitions missing fields that the API/DB actually returns
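In a diff, the first and last red flags usually appear together, in roughly this shape (illustrative interface and field names):

```typescript
// The API now returns `memo`, but the interface was never updated.
interface Transaction {
  date: string;
  amount: number;
  payee: string;
  // memo: string;  // <- the missing field
}

// Smell: casts silence the compiler instead of fixing the interface.
function getMemo(txn: Transaction): string | undefined {
  return (txn as unknown as Record<string, unknown>).memo as string | undefined;
}

// Fix: add `memo?: string` to Transaction, then this collapses to `txn.memo`.
```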
### 5. Drift — Is it still aligned with the spec?
Over many iterations, agents can drift:
- Adding features not in the spec (scope creep)
- Changing patterns established in earlier iterations
- Ignoring constraints listed in the spec
- "Improving" things that were working fine
Drift detection:

```bash
# Check what changed
git diff main..agent-branch --stat

# Look for unexpected file changes.
# If the agent modified files outside the current task scope, investigate.
```
## How to Course-Correct

### Light Touch: Notes in the Plan
For minor issues, add notes to IMPLEMENTATION_PLAN.md:

```markdown
- [ ] Implement transaction query filters
  > AGENT NOTE: Use parameterized SQL queries. Previous iteration used
  > string concatenation — revert that pattern. See FR-003 acceptance criteria.
```
### Medium Touch: Spec Clarification
If the agent is misunderstanding a requirement, clarify the spec:

```markdown
# Before (ambiguous)
FR-005: Handle errors gracefully

# After (clear)
FR-005: Error Handling
- [ ] Expired token → auto-refresh and retry (no user action needed)
- [ ] Network failure → show "Cannot reach DocuSign API. Check your connection."
- [ ] Invalid input → show specific field that failed and why
- [ ] Never show raw stack traces to the user
```
### Heavy Touch: Reset and Redirect
If the agent has gone significantly off-track:

```bash
# Find the last good state
git log --oneline

# Reset to it
git reset --hard <last-good-commit>

# Update the plan to reflect the reset,
# and add notes explaining what went wrong.
```
Then update the spec or plan to prevent the same drift.
### Nuclear Option: Rewrite the Task
Sometimes a task is fundamentally misconceived:

```markdown
# Remove the bad task
- [~] ~~Build real-time WebSocket dashboard~~ (REMOVED: overkill for MVP)

# Replace with something appropriate
- [ ] Build static HTML dashboard that refreshes on page load
```
## Review Checklist Template
Use this for phase boundary reviews:

```markdown
# Review: Phase [N] — [Name]
**Date:** YYYY-MM-DD
**Iterations completed:** X
**Tasks completed:** Y/Z
## Spec Alignment
- [ ] All phase acceptance criteria met
- [ ] No features added beyond spec
- [ ] Constraints respected (MUST/MUST NOT)
## Code Quality
- [ ] Build passes with zero errors
- [ ] All tests pass
- [ ] TypeScript compiles clean — `npx tsc --noEmit` (backend AND frontend)
- [ ] New code has new tests (Tests-Added > 0 for all feature commits)
- [ ] No obvious code smells
- [ ] Error handling present for failure paths
- [ ] No hardcoded secrets or credentials
- [ ] Shared logic is extracted — no duplication across components
## Architecture
- [ ] File structure matches spec layout
- [ ] Shared code is actually shared (not duplicated)
- [ ] Dependencies flow in one direction
- [ ] No unnecessary abstractions
- [ ] TypeScript interfaces match actual API/DB fields
## Tests
- [ ] Tests assert meaningful behavior (not just "it exists")
- [ ] Edge cases covered
- [ ] Error paths tested
- [ ] Test data is realistic
- [ ] Net test count increased from previous phase
## Model Attribution
- [ ] All commits include `Agent:` trailer
- [ ] All commits include `Tests:` and `Tests-Added:` trailers
- [ ] All commits include `TypeScript:` trailer
- [ ] Run `npm run model-report` (if available) and review output
## Performance
- [ ] Meets NFR thresholds (if specified)
- [ ] No obvious performance issues (N+1 queries, unnecessary loops)
## Issues Found
1. [Issue description] → [Resolution: fix now / add to plan / defer]
2. ...
## Verdict
- [ ] Approved for next phase
- [ ] Needs fixes (list above, re-review after)
- [ ] Needs significant rework (reset to [commit])
```
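One checklist item worth unpacking: the N+1 query pattern under Performance. A sketch of what to scan for in the diff, using a hypothetical `Db` wrapper:

```typescript
interface Account { id: string }
interface Db { query<T>(sql: string, params?: unknown[]): Promise<T[]> }

// N+1: one query for the account list, then one more query per account.
async function loadSlow(db: Db, accounts: Account[]) {
  return Promise.all(accounts.map((a) =>
    db.query('SELECT * FROM transactions WHERE account_id = $1', [a.id])));
}

// Fix: a single round trip, grouped in memory afterwards.
async function loadFast(db: Db, accounts: Account[]) {
  return db.query(
    'SELECT * FROM transactions WHERE account_id = ANY($1)',
    [accounts.map((a) => a.id)]);
}
```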
## The Learning Loop
Every review teaches you something:
- About the agent: What patterns it follows well, where it struggles
- About your spec: What was clear enough, what caused confusion
- About your process: What review cadence works, what to check
Capture these lessons:

```markdown
# Lessons Learned
## Spec Writing
- Agents interpret "handle errors" as "catch and log."
Must specify WHAT to show the user.
- Including example I/O prevents 90% of format misunderstandings.
## Agent Behavior
- Tends to over-abstract on iteration 1 (creates interfaces before implementations)
- Reliable at following MUST/MUST NOT constraints
- Struggles with tasks that require external API knowledge (add docs links)
## Process
- Reviewing every 5 iterations is about right for this project size
- Phase boundary reviews catch 80% of drift
- Running the code myself (not just reading tests) catches the other 20%
```
Feed these lessons back into your spec templates and interview protocol.
Review is not overhead. Review is the product. The code is just the artifact.