# Changelog
All notable changes to the Agent Harness project will be documented in this file.
The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).

---
## [2.0.0] - 2026-04-01
### The Wave-Based Management Release
Patterns developed during the Fintrove project (2026-03-31 → 2026-04-01):
4 waves, 11 streams, 44 tasks, 1,254 → 1,597 tests, zero regressions.
The key insight: **the harness was missing a planning artifact between "the spec" and "the task."**
The execution board fills that gap — a stream-level plan written entirely before any code is written.
### Added
#### New Templates
- **EXECUTION-BOARD-TEMPLATE.md** — Pre-implementation planning artifact for a stream. Defines ALL packets (goal, steps, files, known-answer tests, acceptance criteria) before any code is written. The board is the contract.
- **VALIDATION-TEMPLATE.md** — Per-packet evidence file. Written immediately after each packet completes. Records: test count delta, known-answer test results, acceptance criteria pass/fail.
- **PROCESS-EVAL-TEMPLATE.md** — Stream retrospective written after merge. Covers task sizing accuracy, test-first compliance, known-answer coverage, architecture integrity, model attribution.
#### New Guide
- **WAVE-BASED-MANAGEMENT.md** — Complete guide to the wave/stream/packet hierarchy. Covers the plan-then-implement discipline, execution boards, known-answer tests, the EXECUTION_MASTER.md pattern, wave gates, and file organization.
### New Patterns Documented
#### The Plan-Then-Implement Discipline
Before writing any implementation code for a stream:
1. Write the execution board (all packets, all acceptance criteria, known-answer tests)
2. Only then: start coding
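
The discipline above can be pictured as a completeness check that runs before any implementation starts. The sketch below is illustrative only: the `Packet` and `ExecutionBoard` shapes and the `boardGaps` helper are hypothetical names, chosen to mirror the packet fields listed for EXECUTION-BOARD-TEMPLATE.md, and are not part of the harness itself.

```typescript
// Hypothetical shapes: field names mirror the packet fields named in this changelog.
interface Packet {
  goal: string;
  steps: string[];
  files: string[];
  knownAnswerTests: string[];
  acceptanceCriteria: string[];
}

interface ExecutionBoard {
  stream: string;
  packets: Packet[];
}

// Returns human-readable gaps; an empty array means the board is fully planned.
function boardGaps(board: ExecutionBoard): string[] {
  const found: string[] = [];
  if (board.packets.length === 0) {
    found.push(`${board.stream}: no packets defined`);
  }
  board.packets.forEach((packet, index) => {
    const label = `${board.stream} packet ${index + 1}`;
    if (packet.goal.trim() === '') found.push(`${label}: missing goal`);
    if (packet.steps.length === 0) found.push(`${label}: no steps`);
    if (packet.acceptanceCriteria.length === 0) {
      found.push(`${label}: no acceptance criteria`);
    }
  });
  return found;
}

// Example: a draft board whose only packet still lacks acceptance criteria.
const draft: ExecutionBoard = {
  stream: 'example-stream',
  packets: [
    {
      goal: 'Add CPP deferral calculation',
      steps: ['write known-answer test', 'implement calculation'],
      files: ['src/cpp.ts'],
      knownAnswerTests: ['CPP at 70 is exactly 42% more than at 65'],
      acceptanceCriteria: [],
    },
  ],
};

const gaps = boardGaps(draft);
```

Here `gaps` holds a single entry for the missing acceptance criteria, so under this sketch coding on the stream would not start yet.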
#### Known-Answer Tests
For domain-specific calculations, every module must include ≥1 test citing an official source:
```typescript
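test('CPP at 70 is exactly 42% more than at 65', () => {
  // Source: ESDC https://www.canada.ca/en/services/benefits/publicpensions/cpp/benefit-amount.html
  // cppMonthlyBenefit is a hypothetical stand-in for the module's function under test.
  const at65 = cppMonthlyBenefit(65);
  const at70 = cppMonthlyBenefit(70);
  expect(at70 / at65).toBeCloseTo(1.42, 5);
});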
```
#### Wave Gates
Explicit checklist before Wave N+1: all streams merged, domain accuracy suite passing, process evals written, human sign-off.
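
As a rough illustration, the four gate conditions can be expressed as a function that lists whatever still blocks the next wave. The `WaveGateStatus` shape and `waveGateBlockers` name are assumptions made for this sketch; the harness describes the gate as a checklist, not as code.

```typescript
// Hypothetical status record: one field per item on the wave-gate checklist.
interface WaveGateStatus {
  unmergedStreams: string[];
  domainAccuracySuitePassing: boolean;
  missingProcessEvals: string[];
  humanSignOff: boolean;
}

// Returns the reasons Wave N+1 cannot start; an empty array means the gate is open.
function waveGateBlockers(status: WaveGateStatus): string[] {
  const reasons: string[] = [];
  if (status.unmergedStreams.length > 0) {
    reasons.push(`unmerged streams: ${status.unmergedStreams.join(', ')}`);
  }
  if (!status.domainAccuracySuitePassing) {
    reasons.push('domain accuracy suite is failing');
  }
  if (status.missingProcessEvals.length > 0) {
    reasons.push(`process evals missing for: ${status.missingProcessEvals.join(', ')}`);
  }
  if (!status.humanSignOff) {
    reasons.push('no human sign-off');
  }
  return reasons;
}

// Example: everything is merged and green, but nobody has signed off yet.
const blockers = waveGateBlockers({
  unmergedStreams: [],
  domainAccuracySuitePassing: true,
  missingProcessEvals: [],
  humanSignOff: false,
});
```

In this example `blockers` contains only the sign-off entry, which keeps the human approval step explicit even when every automated check passes.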
#### EXECUTION_MASTER.md Pattern
Project-level dashboard: wave status, active streams, blockers, parallelism rules.
### Metrics (Fintrove, 2026-04-01)
- Waves: 4 | Streams: 11 | Tasks: 44/44
- Test growth: 1,254 → 1,597 (+343) | Regressions: 0
---
## [1.0.0] - 2024-03-18
### Added
#### Core Templates
- **AGENT-INSTRUCTIONS.md** — The agent's system prompt defining the core loop: Orient → Plan → Pick ONE task → Implement → Verify → Commit → Exit
- **PROJECT-SPEC.md** — Comprehensive template for defining projects with sections for overview, tech stack, requirements with acceptance criteria, data models, API design, constraints, phasing, and anti-patterns
- **DECISIONS.md** — Architecture Decision Record (ADR) template for documenting non-obvious technical choices and preventing agent drift
- **ralph-loop.sh** — Bash script implementing the Ralph Wiggum loop pattern: spawns fresh agent instances, checks for completion signals, restarts until done
#### Process Guides
- **SPEC-CREATION-GUIDE.md** — Complete interview protocol for creating high-quality specifications through structured conversation between human and agent. Covers vision, requirements extraction, technical discovery, constraint mapping, and spec assembly
- **PLAN-MANAGEMENT.md** — Guide for managing IMPLEMENTATION_PLAN.md as a living document. Covers task decomposition patterns, intervention strategies, progress tracking, and plan anti-patterns
- **REVIEW-AND-QA.md** — Framework for evaluating agent output. Includes review timing, quality checklists, drift detection, course-correction strategies, and review templates
- **COST-OPTIMIZATION.md** — Comprehensive guide to model billing (request-based vs token-based), optimal strategies per provider, model selection, context management, and the hybrid approach
- **OPENCLAW-INTEGRATION.md** — Running the harness in OpenClaw with sessions_spawn, cron jobs, and shell scripts. Covers model selection, monitoring, and OpenClaw-specific agent instructions
- **TROUBLESHOOTING.md** — Failure taxonomy covering five common failure modes (stuck loop, drift, overengineering, test theater, context overflow) with root causes and recovery steps
- **TUTORIAL.md** — Complete 30-minute walkthrough building a markdown link checker CLI tool from zero using the harness. Concrete, copy-pasteable example demonstrating the entire workflow
#### Examples & Documentation
- **EXAMPLES.md** — Worked example of a Fintrove-style personal finance app with complete PROJECT-SPEC.md. Compares three approaches (Ezward, Ralph Wiggum, Nate Jones) and provides best practices
- **README.md** — Project overview with file index, quick start guide, and core insights
- **PARALLEL-AGENTS.md** — Guide for running multiple agents simultaneously on independent tasks, covering parallelization strategies, work splitting, result merging, and conflict resolution
### Features
#### The Core Loop Pattern
- Stateless iteration model: each agent starts fresh with clean context
- Orient phase: agent reads spec, plan, and git history
- Single-task focus: agents complete ONE task per iteration
- Mandatory verification: build and test must pass before commit
- Promise-based signaling: `<promise>PLANNED|DONE|STUCK|ERROR</promise>`
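
A minimal sketch of how an orchestrator might read the promise signal from an agent's output follows. The `parseSignal` helper is hypothetical, and treating the last tag in the output as authoritative is an assumption of this example, not documented harness behavior.

```typescript
type Signal = 'PLANNED' | 'DONE' | 'STUCK' | 'ERROR';

// Scans agent output for <promise>…</promise> tags and returns the last valid one.
// Assumption: when several tags appear, the final one reflects the agent's end state.
function parseSignal(output: string): Signal | null {
  const pattern = /<promise>(PLANNED|DONE|STUCK|ERROR)<\/promise>/g;
  let last: Signal | null = null;
  let match: RegExpExecArray | null;
  while ((match = pattern.exec(output)) !== null) {
    last = match[1] as Signal;
  }
  return last;
}

// Example inputs: a finished run, a run with no signal, and a run that changed state.
const finished = parseSignal('All tests pass.\n<promise>DONE</promise>\n');
const silent = parseSignal('Agent exited without emitting a signal.');
const changed = parseSignal('<promise>PLANNED</promise>\nwork...\n<promise>STUCK</promise>');
```

With these inputs, `finished` is `'DONE'`, `silent` is `null`, and `changed` is `'STUCK'`. Returning `null` rather than guessing leaves the decision about a missing signal to the caller.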
#### Interview Protocol
- Five-phase structured interview for spec creation
- Domain knowledge extraction techniques
- Technical discovery patterns
- Constraint mapping (MUST/MUST NOT/PREFER)
- Spec quality checklist
#### Plan Management Patterns
- Scaffold-first pattern
- Vertical slice pattern
- Test-first pattern
- Dependency chain pattern
- Human intervention mechanisms (notes, task splitting, reprioritization)
#### Cost Optimization Strategies
- Request-based optimization (batch tasks, compound requests)
- Token-based optimization (fresh sub-agents, minimal context)
- Model selection by task complexity
- Hybrid strategy using multiple subscriptions
- Usage monitoring and budget allocation
#### OpenClaw Integration
- Manual orchestration via sessions_spawn
- Cron-based automation for overnight work
- Shell script orchestration
- Model selection per iteration
- Sub-agent monitoring and session history
#### Troubleshooting Framework
- Stuck loop detection and resolution
- Architecture drift prevention with ADRs
- Overengineering constraints
- Test quality validation
- Context overflow mitigation
### Documentation Quality Standards
- Comprehensive examples with real code
- Anti-pattern documentation
- Copy-pasteable templates
- Concrete acceptance criteria
- Decision record patterns
### Supported Agents
- Claude CLI (via ralph-loop.sh)
- OpenAI Codex CLI (via ralph-loop.sh)
- OpenClaw sessions_spawn (any model)
- Extensible to other agent frameworks
### Supported Workflows
- CLI loop (ralph-loop.sh)
- OpenClaw manual orchestration
- OpenClaw cron automation
- Hybrid approaches
---
## [Unreleased]
### Planned
- Additional language-specific examples (Python, Go, Rust)
- Integration templates for common CI/CD systems
- Cost calculator tool (estimate iterations × model cost)
- Spec validator (check completeness before starting)
- Template variations for different project types (API, CLI, library, web app)
---
## Version History Summary
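- **2.0.0** (2026-04-01): The Wave-Based Management Release, adding the execution board, validation, and process-eval templates and the WAVE-BASED-MANAGEMENT.md guide
- **1.0.0** (2024-03-18): Initial release with complete harness system, including core templates, process guides, examples, and multi-platform support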
---
## Contributing
This harness is a living system. Contributions are especially welcome if you:
- Discover new failure modes
- Develop better patterns
- Find gaps in the guides
- Create examples for other project types

Document what you find and contribute it back. The harness improves as we learn what works.

---
## License
This project is released into the public domain. Use it, modify it, share it. No attribution required.

---
_The harness reached 1.0 because it works. It reached 2.0 because the Fintrove project taught us how to use it better._