# Wave-Based Project Management
> The biggest gap in most agentic projects: **planning only one task at a time.**
> This guide captures the wave-based approach — planning a full stream's worth of work
> before writing a single line of implementation code.
>
> Proven in practice: 44 tasks across 4 waves, 1,254 → 1,597 tests, zero regressions.
---
## The Core Insight: Plan the Stream, Not the Task
The basic harness has you plan one task at a time. This works for small projects.
For larger projects, it creates problems:
- **Scope drift:** Agent picks up the next task without understanding how it fits the stream
- **Missing dependencies:** Packet 3 turns out to need something Packet 1 should have built
- **Missing known-answer tests discovered too late:** Financial formulas validated by feel, not against published CRA/ESDC figures
- **No clear "done":** What does stream completion actually mean?
The solution: **write the entire execution board for a stream before implementing any of it.**
```
❌ Old approach:
Plan task → Implement task → Plan next task → Implement → ...
✅ Wave approach:
Plan ENTIRE stream → Review plan → Implement packet-by-packet → Close stream
```
---
## The Four Levels of Structure
```
Project
└── Waves (groups of streams, sequenced by dependency)
    └── Streams (a feature or module; has its own branch)
        └── Packets (atomic unit of work; one commit per packet)
            └── Tasks (sub-steps within a packet)
```
### Waves
A wave is a set of streams that logically belong together and can be started in parallel (or have light dependencies between them). Waves are gated — Wave N+1 doesn't start until Wave N is fully merged and green.
**Example:**
- Wave 1: Core data models + calculation engines (everything else depends on this)
- Wave 2: Advisory layer + specialized tools (uses Wave 1 outputs)
- Wave 3: Infrastructure + integrations (can be parallel with Wave 2)
- Wave 4: Future vision / stretch goals
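One way to derive wave groupings from stream dependencies is to layer them: every stream with no unmet dependency goes into the earliest possible wave. The sketch below is an illustration only. The `StreamDependencies` shape and the `groupIntoWaves` name are assumptions, not part of the harness, and in practice waves are also shaped by judgment about what "logically belongs together".

```typescript
// Hypothetical input: each stream name maps to the streams it depends on.
type StreamDependencies = Record<string, string[]>;

// Groups streams into ordered layers. Every stream in layer N depends only
// on streams placed in earlier layers, so each layer can start in parallel.
function groupIntoWaves(deps: StreamDependencies): string[][] {
  const remaining = new Set(Object.keys(deps));
  const placed = new Set<string>();
  const waves: string[][] = [];

  while (remaining.size > 0) {
    // A stream is ready once all of its dependencies have been placed.
    const ready = Array.from(remaining)
      .filter((s) => (deps[s] ?? []).every((d) => placed.has(d)))
      .sort();
    if (ready.length === 0) {
      throw new Error(
        `Cyclic or unknown dependency among: ${Array.from(remaining).join(", ")}`
      );
    }
    for (const s of ready) {
      remaining.delete(s);
      placed.add(s);
    }
    waves.push(ready);
  }
  return waves;
}
```

Given two independent streams and a third that depends on one of them, the function returns two layers: the independent pair first, then the dependent stream.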
### Streams
A stream is a feature branch with a defined scope. It has:
- One `execution-board.md` (written before any code)
- 2–6 packets
- One `process-eval.md` (written after merge)
- Validation evidence per packet
### Packets
A packet is the atomic unit — one focused chunk of work that produces a commit. It has:
- A clear goal (one sentence)
- Explicit steps
- Known-answer tests (mandatory for calculation work)
- Programmatically verifiable acceptance criteria
- One validation evidence file
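The four levels can be sketched as TypeScript types. This is an illustrative model only: every type and field name below is an assumption chosen to mirror the lists above, not a format the harness prescribes.

```typescript
// Hypothetical data model for the Wave → Stream → Packet → Task hierarchy.
interface Task {
  description: string;
  done: boolean;
}

interface Packet {
  id: string;                   // e.g. "XX-01"
  goal: string;                 // one sentence
  tasks: Task[];                // sub-steps within the packet
  knownAnswerTests: string[];   // mandatory for calculation work
  acceptanceCriteria: string[]; // each should be programmatically verifiable
  evidenceFile?: string;        // path to the validation evidence, once written
}

interface Stream {
  name: string;
  branch: string;
  packets: Packet[];
}

interface Wave {
  number: number;
  streams: Stream[];
}

// In this sketch a packet is finished when every task is done
// and its validation evidence file has been recorded.
function isPacketDone(packet: Packet): boolean {
  return packet.tasks.every((t) => t.done) && packet.evidenceFile !== undefined;
}

// Counts the packets across all streams of a wave.
function countPackets(wave: Wave): number {
  return wave.streams.reduce((sum, s) => sum + s.packets.length, 0);
}
```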
---
## The Execution Board: Your Planning Artifact
The execution board lives at `.harness/<stream>/execution-board.md`.
Copy `EXECUTION-BOARD-TEMPLATE.md` and fill it in.
**The rule:** The board must be complete before you write a single line of implementation.
### What "complete" means:
- Every packet is defined with goal, steps, files, and acceptance criteria
- Known-answer tests are written out (not "TBD") for any calculation
- Dependency order between packets is explicit
- Stream completion criteria are listed
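The completeness rules above lend themselves to a mechanical check. The following is a minimal sketch, assuming the board has already been parsed into a hypothetical `Board` object; the field names and the `findBoardGaps` function are illustrative and not part of the template.

```typescript
// Hypothetical shape of one packet entry parsed from an execution board.
interface BoardPacket {
  id: string;
  goal: string;
  steps: string[];
  files: string[];
  acceptanceCriteria: string[];
  isCalculation: boolean;
  knownAnswerTests: string[]; // written-out tests; "TBD" does not count
  dependsOn: string[];        // ids of packets that must land first
}

interface Board {
  packets: BoardPacket[];
  completionCriteria: string[];
}

// Returns human-readable gaps; an empty array means the board is complete.
function findBoardGaps(board: Board): string[] {
  const gaps: string[] = [];
  const ids = new Set(board.packets.map((p) => p.id));
  const seen = new Set<string>();

  for (const p of board.packets) {
    if (p.goal.trim() === "") gaps.push(`${p.id}: missing goal`);
    if (p.steps.length === 0) gaps.push(`${p.id}: no steps`);
    if (p.files.length === 0) gaps.push(`${p.id}: no files listed`);
    if (p.acceptanceCriteria.length === 0) gaps.push(`${p.id}: no acceptance criteria`);

    // Placeholder entries such as "TBD" are not real known-answer tests.
    const realTests = p.knownAnswerTests.filter(
      (t) => t.trim() !== "" && !/^tbd$/i.test(t.trim())
    );
    if (p.isCalculation && realTests.length === 0) {
      gaps.push(`${p.id}: calculation packet without known-answer tests`);
    }

    // Dependencies must exist and must appear earlier in the board.
    for (const dep of p.dependsOn) {
      if (!ids.has(dep)) gaps.push(`${p.id}: depends on unknown packet ${dep}`);
      else if (!seen.has(dep)) gaps.push(`${p.id}: depends on ${dep}, which is not listed before it`);
    }
    seen.add(p.id);
  }

  if (board.completionCriteria.length === 0) gaps.push("stream: no completion criteria");
  return gaps;
}
```

Returning a list of gaps rather than a boolean gives the next agent session something concrete to fix before implementation starts.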
### What happens if you skip it:
- You discover mid-stream that Packet 3 needs something Packet 1 didn't build
- You commit calculation code with no ground-truth validation
- You have no clear definition of "done" for the stream
- The next agent session doesn't know what state the stream is in
---
## Known-Answer Tests: The Most Important Rule
For any stream that touches domain-specific calculations (financial math, scientific formulas, regulatory thresholds, physical constants), every calculation module **must** include at least one known-answer test citing an official source.
```typescript
// ✅ Correct: cites official source, tests exact value
test('CPP at 70 is exactly 42% more than at 65', () => {
  // Source: ESDC https://www.canada.ca/en/services/benefits/publicpensions/cpp/benefit-amount.html
  // Formula: +0.7% per month after 65 × 60 months = +42%
  expect(calculateCPPBenefitAtAge(1000, 70) / calculateCPPBenefitAtAge(1000, 65)).toBeCloseTo(1.42, 5);
});

// ❌ Wrong: no source, tests implementation against itself
test('CPP at 70 returns more than at 65', () => {
  expect(calculateCPPBenefitAtAge(1000, 70)).toBeGreaterThan(calculateCPPBenefitAtAge(1000, 65));
});
```
**Why this matters:** An agent can write a plausible-looking formula that's subtly wrong. Without a known-answer test from an authoritative source, you won't catch it until someone gets incorrect results in production. With known-answer tests, errors are caught immediately.
### What qualifies as a "known-answer source":
- Government publications (CRA, ESDC, IRS, HMRC, etc.)
- Official standards documents (ISO, RFC, IEEE)
- Published academic results
- Regulatory filings with specific numerical requirements
- Product specifications with exact values
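To keep the citation attached to the expected value, known-answer cases can be stored as data. The sketch below is framework-free and purely illustrative: `KnownAnswerCase`, `runKnownAnswerCases`, and `cppDeferralFactor` are hypothetical names, and the stand-in formula encodes only the +0.7% per month rule quoted in the test above.

```typescript
// A known-answer case pairs an expected value with the source it came from.
interface KnownAnswerCase {
  name: string;
  source: string;    // citation for the expected value (URL or document title)
  expected: number;
  tolerance: number; // absolute tolerance for floating-point comparison
  compute: () => number;
}

// Returns one message per failing case; an empty array means all cases pass.
// A case with no cited source fails outright, whatever its computed value.
function runKnownAnswerCases(cases: KnownAnswerCase[]): string[] {
  const failures: string[] = [];
  for (const c of cases) {
    if (c.source.trim() === "") {
      failures.push(`${c.name}: no source cited`);
      continue;
    }
    const actual = c.compute();
    if (!(Math.abs(actual - c.expected) <= c.tolerance)) {
      failures.push(`${c.name}: expected ${c.expected}, got ${actual}`);
    }
  }
  return failures;
}

// Stand-in for the code under test (assumption): applies only the rule
// "+0.7% per month of deferral after age 65".
function cppDeferralFactor(monthsAfter65: number): number {
  return 1 + 0.007 * monthsAfter65;
}

const cppCases: KnownAnswerCase[] = [
  {
    name: "CPP at 70 vs 65",
    source: "ESDC https://www.canada.ca/en/services/benefits/publicpensions/cpp/benefit-amount.html",
    expected: 1.42, // 0.7% × 60 months = 42%
    tolerance: 1e-9,
    compute: () => cppDeferralFactor(60),
  },
];
```

Treating a missing source as a failure makes the "cite an official source" rule enforceable rather than a convention.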
### The financial accuracy eval pattern
For financial software, create a separate calibration test suite that lives outside the normal unit tests:
```
evals/
└── code-quality/
    └── financial-accuracy.test.ts   ← Run with: npm run eval:financial-accuracy
```
This suite contains ONLY known-answer tests from official sources. It grows over time as you add calculation modules. Run it independently to verify the app's financial accuracy hasn't drifted.
---
## EXECUTION_MASTER.md: The Project Dashboard
Every project using wave-based management should have a single coordination file — typically `EXECUTION_MASTER.md` or equivalent — that shows:
```markdown
# Project Execution Master
## Wave Status
| Wave | Description | Status |
|------|-------------|--------|
| Wave 1 | Core foundations | ✅ Complete |
| Wave 2 | Advisory layer | 🟡 In progress |
| Wave 3 | Infrastructure | ⏸️ Not started |
## Active Streams
| Stream | Branch | Status | Blocker |
|--------|--------|--------|---------|
| cpp-optimizer | feat/cpp-optimizer | ✅ Merged | — |
| rrsp-meltdown | feat/rrsp-meltdown | 🟡 In progress | — |
| estate-planning | feat/estate-planning | ⏸️ Planned | Needs rrsp-meltdown |
## Parallelism Rules
1. Max 2 active streams simultaneously
2. Shared schema changes are always sequential
3. Integration gate before any merge: full test suite must stay green
```
**Every agent session starts by reading this file.** From it, the agent immediately knows:
- What wave is active
- Which streams are running
- What's blocked and why
- What can run in parallel
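Because the dashboard is a plain Markdown table, an agent or script can read it mechanically. The sketch below assumes the four-column "Active Streams" layout shown above and the "Max 2 active streams" rule; `parseActiveStreams` and `canStartAnotherStream` are hypothetical helpers, not part of the harness.

```typescript
interface StreamRow {
  stream: string;
  branch: string;
  status: string;
  blocker: string;
}

// Extracts data rows from a four-column Markdown table
// (Stream | Branch | Status | Blocker). Other tables are ignored.
function parseActiveStreams(markdown: string): StreamRow[] {
  const rows: StreamRow[] = [];
  for (const line of markdown.split("\n")) {
    const trimmed = line.trim();
    if (!trimmed.startsWith("|") || !trimmed.endsWith("|")) continue;

    const cells = trimmed.split("|").slice(1, -1).map((c) => c.trim());
    if (cells.length !== 4) continue;                       // not the streams table
    if (cells.every((c) => /^:?-+:?$/.test(c))) continue;   // separator row

    const [stream = "", branch = "", status = "", blocker = ""] = cells;
    if (stream === "Stream") continue;                      // header row
    rows.push({ stream, branch, status, blocker });
  }
  return rows;
}

// Applies the parallelism cap: a new stream may start only while fewer
// than maxActive streams are marked "In progress".
function canStartAnotherStream(rows: StreamRow[], maxActive = 2): boolean {
  const active = rows.filter((r) => r.status.includes("In progress")).length;
  return active < maxActive;
}
```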
---
## The Wave Gate
Before starting Wave N+1, verify:
```
[ ] All streams in Wave N merged to main
[ ] Full test suite green (count ≥ baseline)
[ ] Domain-specific accuracy suite passing (if applicable)
[ ] All regression baselines saved
[ ] Process evals written for all Wave N streams
[ ] process-eval-history.json updated
[ ] IMPLEMENTATION_PLAN: all Wave N tasks marked [x]
[ ] EXECUTION_MASTER: Wave N status updated to ✅
[ ] Human sign-off: outputs are producing correct/plausible results
```
The gate exists because Wave N+1 often builds on Wave N's outputs. If Wave N has silent bugs, they compound in Wave N+1. Catch them at the gate.
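The checklist can also be expressed as a function that reports what is still blocking the next wave. This is a sketch under assumptions: the `WaveGateState` fields are hypothetical, and the three bookkeeping items (`process-eval-history.json`, `IMPLEMENTATION_PLAN`, `EXECUTION_MASTER`) are collapsed into a single `trackingFilesUpdated` flag for brevity.

```typescript
// Hypothetical snapshot of the facts the wave gate checklist asks about.
interface WaveGateState {
  unmergedStreams: string[];            // Wave N streams not yet merged to main
  suiteGreen: boolean;
  testCount: number;
  baselineTestCount: number;
  accuracySuitePassing: boolean | null; // null when not applicable
  regressionBaselinesSaved: boolean;
  streamsMissingProcessEval: string[];
  trackingFilesUpdated: boolean;        // history JSON, plan, and dashboard
  humanSignOff: boolean;
}

// Returns whether Wave N+1 may start, plus every reason it may not.
function evaluateWaveGate(s: WaveGateState): { open: boolean; blockers: string[] } {
  const blockers: string[] = [];
  if (s.unmergedStreams.length > 0) {
    blockers.push(`Unmerged streams: ${s.unmergedStreams.join(", ")}`);
  }
  if (!s.suiteGreen) blockers.push("Full test suite is not green");
  if (s.testCount < s.baselineTestCount) {
    blockers.push(`Test count ${s.testCount} is below baseline ${s.baselineTestCount}`);
  }
  if (s.accuracySuitePassing === false) blockers.push("Accuracy suite is failing");
  if (!s.regressionBaselinesSaved) blockers.push("Regression baselines not saved");
  if (s.streamsMissingProcessEval.length > 0) {
    blockers.push(`Missing process evals: ${s.streamsMissingProcessEval.join(", ")}`);
  }
  if (!s.trackingFilesUpdated) blockers.push("Tracking files not updated");
  if (!s.humanSignOff) blockers.push("No human sign-off");
  return { open: blockers.length === 0, blockers };
}
```

Human sign-off stays an explicit input: the function can confirm that someone approved the outputs, but it cannot replace that review.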
---
## File Organization
```
<project-root>/
├── AGENT.md                        ← Agent instructions (adapted from AGENT-INSTRUCTIONS.md)
├── IMPLEMENTATION_PLAN.md          ← Master backlog (tasks 1-N, all waves)
├── PROJECT-SPEC.md                 ← What to build (never changes)
├── DECISIONS.md                    ← Architecture Decision Records
└── .harness/
    ├── EXECUTION_MASTER.md         ← Wave/stream dashboard
    ├── EXECUTION-BOARD-TEMPLATE.md ← Copy this for new streams
    ├── VALIDATION-TEMPLATE.md      ← Copy this for packet evidence
    ├── PROCESS-EVAL-TEMPLATE.md    ← Copy this for stream retrospectives
    ├── regression-baselines/       ← Deterministic output snapshots
    ├── <stream-A>/
    │   ├── execution-board.md      ← Written BEFORE implementation
    │   ├── process-eval.md         ← Written AFTER merge
    │   └── validation/
    │       ├── <XX-01>-validation.md
    │       └── <XX-02>-validation.md
    └── <stream-B>/
        └── ...
```
---
## Adapting for Your Project
### Projects WITHOUT domain-specific calculations
Skip the known-answer tests and financial accuracy eval. Keep everything else.
### Projects with a small scope (< 10 tasks)
Skip waves entirely — just use streams. One execution board per logical feature group.
### Projects with a single developer (no parallelism)
Streams are still valuable for planning discipline even if run sequentially.
### Non-TypeScript / non-test projects
Adapt the commit trailers to your stack. The key things to track are:
- **What model did the work** (for attribution and quality tracking)
- **Test counts** or equivalent quality metric
- **Build / type check status**
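For reference, the TypeScript-flavored trailer set named in the quick reference (`Agent / Tests / Tests-Added / TypeScript`) could be generated as below. This is a sketch: the value formats, the `PacketTrailers` shape, and the `formatCommitMessage` helper are assumptions, and a non-TypeScript project would swap the last key for its own build or lint status.

```typescript
// Hypothetical per-packet metadata recorded in the commit message.
interface PacketTrailers {
  agent: string;      // which model did the work
  tests: number;      // total passing tests after this commit
  testsAdded: number; // tests introduced by this packet
  typeCheck: string;  // build / type check status, e.g. "clean"
}

// Builds a commit message whose trailers follow the Git convention:
// "Key: value" lines in a final block separated from the subject by a blank line.
function formatCommitMessage(subject: string, t: PacketTrailers): string {
  return [
    subject,
    "",
    `Agent: ${t.agent}`,
    `Tests: ${t.tests}`,
    `Tests-Added: ${t.testsAdded}`,
    `TypeScript: ${t.typeCheck}`,
  ].join("\n");
}
```

Keeping the trailers in the standard "Key: value" layout means ordinary Git tooling can read them back out of the history later.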
---
## Quick Reference: The Discipline in One Page
```
BEFORE CODING:
  ✅ Write execution board for the entire stream
  ✅ Define known-answer tests for ALL calculation modules
  ✅ Make acceptance criteria programmatically verifiable

PER PACKET:
  ✅ Code + tests in same commit
  ✅ Full suite green before moving on
  ✅ Write validation evidence immediately after
  ✅ Commit trailer: Agent / Tests / Tests-Added / TypeScript

PER STREAM:
  ✅ Write process eval honestly
  ✅ Merge with --no-ff
  ✅ Update EXECUTION_MASTER

PER WAVE:
  ✅ Run wave gate checklist before starting next wave
  ✅ Human sign-off on outputs
```
---
## Why This Works
The wave-based approach solves three failure modes common in agent projects:
**1. Scope drift** — The execution board defines the stream's boundaries upfront. Agents can't drift into unrelated work because the plan is explicit.
**2. Hidden inaccuracies** — Known-answer tests with official citations are written in the planning phase, before any implementation. This forces precision in the spec, which translates directly into correct implementations.
**3. No definition of done** — The stream completion criteria (in the execution board) tell every agent, every session: "the stream is done when these boxes are checked." No ambiguity.
---
*This pattern was developed through practice on the Fintrove project (2026-03-31 → 2026-04-01): 4 waves, 11 streams, 44 tasks, 1,254 → 1,597 tests, zero regressions.*