# Changelog

All notable changes to the Agent Harness project will be documented in this file.

The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).

---

## [2.0.0] - 2026-04-01

### The Wave-Based Management Release

Patterns developed during the Fintrove project (2026-03-31 → 2026-04-01):
4 waves, 11 streams, 44 tasks, 1,254 → 1,597 tests, zero regressions.

The key insight: **the harness was missing a planning artifact between "the spec" and "the task."**
The execution board fills that gap — a stream-level plan written entirely before any code is written.

### Added

#### New Templates
- **EXECUTION-BOARD-TEMPLATE.md** — Pre-implementation planning artifact for a stream. Defines ALL packets (goal, steps, files, known-answer tests, acceptance criteria) before any code is written. The board is the contract.
- **VALIDATION-TEMPLATE.md** — Per-packet evidence file. Written immediately after each packet completes. Records: test count delta, known-answer test results, acceptance criteria pass/fail.
- **PROCESS-EVAL-TEMPLATE.md** — Stream retrospective written after merge. Covers task sizing accuracy, test-first compliance, known-answer coverage, architecture integrity, model attribution.
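
As a rough sketch of what an execution board captures, the packet fields listed above can be modelled as a TypeScript type with a completeness check. The names `Packet`, `ExecutionBoard`, and `incompletePackets` are hypothetical and do not come from the templates, which are Markdown files; the rule that known-answer tests may be empty for a packet is also an assumption.

```typescript
// Hypothetical model of an execution board. Field names are illustrative and
// mirror the packet contents listed for EXECUTION-BOARD-TEMPLATE.md.
interface Packet {
  id: string;
  goal: string;
  steps: string[];
  files: string[];
  knownAnswerTests: string[]; // assumed optional: may be empty for non-domain packets
  acceptanceCriteria: string[];
}

interface ExecutionBoard {
  stream: string;
  packets: Packet[];
}

// Returns the ids of packets that are not fully specified yet.
// An empty result means the board is complete and coding may begin.
function incompletePackets(board: ExecutionBoard): string[] {
  return board.packets
    .filter(
      (p) =>
        p.goal.trim() === "" ||
        p.steps.length === 0 ||
        p.files.length === 0 ||
        p.acceptanceCriteria.length === 0
    )
    .map((p) => p.id);
}

// Example board with one finished packet and one that is still a stub.
const draftBoard: ExecutionBoard = {
  stream: "example-stream",
  packets: [
    {
      id: "P1",
      goal: "Add parser",
      steps: ["write tests", "implement"],
      files: ["src/parser.ts"],
      knownAnswerTests: [],
      acceptanceCriteria: ["all tests pass"],
    },
    { id: "P2", goal: "Add report", steps: [], files: [], knownAnswerTests: [], acceptanceCriteria: [] },
  ],
};

const stillOpen = incompletePackets(draftBoard); // ["P2"]
```

A check like this could run before the first implementation commit of a stream, so that a partially written board is caught early.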

#### New Guide
- **WAVE-BASED-MANAGEMENT.md** — Complete guide to the wave/stream/packet hierarchy. The plan-then-implement discipline, execution boards, known-answer tests, EXECUTION_MASTER.md pattern, wave gates, file organization.

### New Patterns Documented

#### The Plan-Then-Implement Discipline
Before writing any implementation code for a stream:
1. Write the execution board (all packets, all acceptance criteria, known-answer tests)
2. Only then: start coding

#### Known-Answer Tests
For domain-specific calculations, every module must include ≥1 test citing an official source:
```typescript
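// cppMonthlyBenefit is a hypothetical stand-in for the module under test.
test('CPP at 70 is exactly 42% more than at 65', () => {
  // Source: ESDC https://www.canada.ca/en/services/benefits/publicpensions/cpp/benefit-amount.html
  // Deferral adds 0.7% per month after 65: 0.7% × 60 months = 42%.
  const at65 = cppMonthlyBenefit({ startAge: 65 });
  const at70 = cppMonthlyBenefit({ startAge: 70 });
  expect(at70 / at65).toBeCloseTo(1.42, 5);
});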
```

#### Wave Gates
Explicit checklist before Wave N+1: all streams merged, domain accuracy suite passing, process evals written, human sign-off.
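
One way to make the gate mechanical is to record the four checklist items as booleans and report which ones are still unmet. This is a minimal sketch; `WaveGate` and `unmetGateItems` are hypothetical names, not part of the harness.

```typescript
// Hypothetical wave-gate record. The four fields mirror the checklist above.
interface WaveGate {
  allStreamsMerged: boolean;
  domainAccuracySuitePassing: boolean;
  processEvalsWritten: boolean;
  humanSignOff: boolean;
}

// Returns the names of unmet gate items.
// An empty array means Wave N+1 may start.
function unmetGateItems(gate: WaveGate): string[] {
  return (Object.keys(gate) as (keyof WaveGate)[]).filter((key) => !gate[key]);
}

// Example: everything is green except the human sign-off.
const waveOneGate: WaveGate = {
  allStreamsMerged: true,
  domainAccuracySuitePassing: true,
  processEvalsWritten: true,
  humanSignOff: false,
};

const blocking = unmetGateItems(waveOneGate); // ["humanSignOff"]
```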

#### EXECUTION_MASTER.md Pattern
Project-level dashboard: wave status, active streams, blockers, parallelism rules.

### Metrics (Fintrove, 2026-04-01)
- Waves: 4 | Streams: 11 | Tasks: 44/44
- Test growth: 1,254 → 1,597 (+343) | Regressions: 0

---

## [1.0.0] - 2024-03-18

### Added

#### Core Templates
- **AGENT-INSTRUCTIONS.md** — The agent's system prompt defining the core loop: Orient → Plan → Pick ONE task → Implement → Verify → Commit → Exit
- **PROJECT-SPEC.md** — Comprehensive template for defining projects with sections for overview, tech stack, requirements with acceptance criteria, data models, API design, constraints, phasing, and anti-patterns
- **DECISIONS.md** — Architecture Decision Record (ADR) template for documenting non-obvious technical choices and preventing agent drift
- **ralph-loop.sh** — Bash script implementing the Ralph Wiggum loop pattern: spawns fresh agent instances, checks for completion signals, restarts until done

#### Process Guides
- **SPEC-CREATION-GUIDE.md** — Complete interview protocol for creating high-quality specifications through structured conversation between human and agent. Covers vision, requirements extraction, technical discovery, constraint mapping, and spec assembly
- **PLAN-MANAGEMENT.md** — Guide for managing IMPLEMENTATION_PLAN.md as a living document. Covers task decomposition patterns, intervention strategies, progress tracking, and plan anti-patterns
- **REVIEW-AND-QA.md** — Framework for evaluating agent output. Includes review timing, quality checklists, drift detection, course-correction strategies, and review templates
- **COST-OPTIMIZATION.md** — Comprehensive guide to model billing (request-based vs token-based), optimal strategies per provider, model selection, context management, and the hybrid approach
- **OPENCLAW-INTEGRATION.md** — Running the harness in OpenClaw with sessions_spawn, cron jobs, and shell scripts. Covers model selection, monitoring, and OpenClaw-specific agent instructions
- **TROUBLESHOOTING.md** — Failure taxonomy covering five common failure modes (stuck loop, drift, overengineering, test theater, context overflow) with root causes and recovery steps
- **TUTORIAL.md** — Complete 30-minute walkthrough building a markdown link checker CLI tool from zero using the harness. Concrete, copy-pasteable example demonstrating the entire workflow

#### Examples & Documentation
- **EXAMPLES.md** — Worked example of a Fintrove-style personal finance app with complete PROJECT-SPEC.md. Compares three approaches (Ezward, Ralph Wiggum, Nate Jones) and provides best practices
- **README.md** — Project overview with file index, quick start guide, and core insights
- **PARALLEL-AGENTS.md** — Guide for running multiple agents simultaneously on independent tasks, covering parallelization strategies, work splitting, result merging, and conflict resolution

### Features

#### The Core Loop Pattern
- Stateless iteration model: each agent starts fresh with clean context
- Orient phase: agent reads spec, plan, and git history
- Single-task focus: agents complete ONE task per iteration
- Mandatory verification: build and test must pass before commit
- Promise-based signaling: `<promise>PLANNED|DONE|STUCK|ERROR</promise>`
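
To illustrate the promise-based signaling, a loop driver could extract the signal from an agent's output roughly as follows. The four signal values come from the entry above; the function name `parsePromise` and the parsing rules (first tag wins, surrounding whitespace ignored, unknown values rejected) are assumptions, not the behaviour of ralph-loop.sh.

```typescript
// The four values are taken from the changelog entry; everything else here
// is an illustrative assumption about how a driver might read them.
type PromiseSignal = "PLANNED" | "DONE" | "STUCK" | "ERROR";

// Returns the first recognised signal in the agent output, or null when the
// output contains no well-formed <promise> tag.
function parsePromise(output: string): PromiseSignal | null {
  const match = /<promise>\s*(PLANNED|DONE|STUCK|ERROR)\s*<\/promise>/.exec(output);
  return match ? (match[1] as PromiseSignal) : null;
}

const finished = parsePromise("All tasks complete.\n<promise>DONE</promise>"); // "DONE"
const missing = parsePromise("Agent exited without a signal"); // null
```

Returning `null` for a missing or unrecognised tag lets the caller decide how to treat an agent that exited without signaling.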

#### Interview Protocol
- Five-phase structured interview for spec creation
- Domain knowledge extraction techniques
- Technical discovery patterns
- Constraint mapping (MUST/MUST NOT/PREFER)
- Spec quality checklist

#### Plan Management Patterns
- Scaffold-first pattern
- Vertical slice pattern
- Test-first pattern
- Dependency chain pattern
- Human intervention mechanisms (notes, task splitting, reprioritization)

#### Cost Optimization Strategies
- Request-based optimization (batch tasks, compound requests)
- Token-based optimization (fresh sub-agents, minimal context)
- Model selection by task complexity
- Hybrid strategy using multiple subscriptions
- Usage monitoring and budget allocation

#### OpenClaw Integration
- Manual orchestration via sessions_spawn
- Cron-based automation for overnight work
- Shell script orchestration
- Model selection per iteration
- Sub-agent monitoring and session history

#### Troubleshooting Framework
- Stuck loop detection and resolution
- Architecture drift prevention with ADRs
- Overengineering constraints
- Test quality validation
- Context overflow mitigation

### Documentation Quality Standards
- Comprehensive examples with real code
- Anti-pattern documentation
- Copy-pasteable templates
- Concrete acceptance criteria
- Decision record patterns

### Supported Agents
- Claude CLI (via ralph-loop.sh)
- OpenAI Codex CLI (via ralph-loop.sh)
- OpenClaw sessions_spawn (any model)
- Extensible to other agent frameworks

### Supported Workflows
- CLI loop (ralph-loop.sh)
- OpenClaw manual orchestration
- OpenClaw cron automation
- Hybrid approaches

---

## [Unreleased]

### Planned
- Additional language-specific examples (Python, Go, Rust)
- Integration templates for common CI/CD systems
- Cost calculator tool (estimate iterations × model cost)
- Spec validator (check completeness before starting)
- Template variations for different project types (API, CLI, library, web app)

---

## Version History Summary

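- **2.0.0** (2026-04-01): Wave-based management release with execution board, validation, and process-eval templates, plus the wave/stream/packet guide
- **1.0.0** (2024-03-18): Initial release with complete harness system: core templates, process guides, examples, and multi-platform support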

---

## Contributing

This harness is a living system. If you:
- Discover new failure modes
- Develop better patterns
- Find gaps in the guides
- Create examples for other project types

Please document them and contribute back. The harness improves as we learn what works.

---

## License

This project is released into the public domain. Use it, modify it, share it. No attribution required.

---

_The harness reached 1.0 because it works. It reached 2.0 because real project use taught us how to run it better._