# Agent Harness Templates
A complete system for running autonomous AI coding agents on complex projects.
## Files
### Core Templates (copy into your project)
| File | Purpose |
|------|---------|
| `AGENT-INSTRUCTIONS.md` | The agent's "system prompt", reread at the start of every iteration. Defines the core loop, the mandatory pre-commit checklist (tests + TypeScript), the commit attribution format, the Tests-Added rule, and known anti-patterns |
| `PROJECT-SPEC.md` | Template for defining your problem. Sections for: overview, tech stack, requirements with acceptance criteria, data model, API design, constraints, phasing, anti-patterns |
| `DECISIONS.md` | Architecture Decision Record (ADR) template for documenting non-obvious technical choices. Prevents agent drift by creating continuity across fresh contexts |
| `EXECUTION-BOARD-TEMPLATE.md` | **⭐ New.** Pre-implementation planning artifact for a stream. Defines ALL packets, known-answer tests, and acceptance criteria BEFORE any code is written. The core of the plan-then-implement discipline. |
| `VALIDATION-TEMPLATE.md` | **⭐ New.** Per-packet evidence file written after each packet completes. Records test counts, known-answer results, and acceptance criteria tick-off. |
| `PROCESS-EVAL-TEMPLATE.md` | **⭐ New.** Stream retrospective written after merge. Honest assessment of task sizing, test-first compliance, and model quality. |
| `ralph-loop.sh` | The Ralph Wiggum bash loop — spawns fresh agent instances, checks for completion signals, restarts until done. Supports Claude, Codex, Aider, Gemini, and custom agents |
| `model-report.ts` | Parses `Agent:` trailers from the git log to generate a per-model quality table (commits, tests added, TypeScript errors). Copy to `scripts/model-report.ts` and add `"model-report": "ts-node scripts/model-report.ts"` to package.json |
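The `Agent:` trailer convention that `model-report.ts` consumes can be sketched with plain git. A minimal, self-contained example; the trailer keys (`Agent`, `Tests-Added`) and the model name are illustrative, not prescribed here:

```shell
#!/usr/bin/env bash
# Sketch of the commit-trailer convention a report script might parse.
# Trailer keys and values are illustrative assumptions.
set -euo pipefail
tmp=$(mktemp -d) && cd "$tmp"
git init -q
git -c user.name=demo -c user.email=demo@example.com commit -q \
  --allow-empty -m 'feat: add link parser' -m 'Agent: claude-sonnet
Tests-Added: 4'
# Extract the trailer value the way a report script might:
git log -1 --format='%(trailers:key=Agent,valueonly)'
```

Running it prints `claude-sonnet`; aggregating those values across the whole log is all the per-model table needs.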
### Process Guides (read before you start)
| File | Purpose |
|------|---------|
| `SPEC-CREATION-GUIDE.md` | **Start here.** How to create a great spec through structured interview. The interview protocol, domain knowledge extraction, and spec quality checklist |
| `TUTORIAL.md` | **Best way to learn.** Complete 30-minute walkthrough building a markdown link checker CLI tool from zero. Concrete, copy-pasteable example of the entire workflow |
| `WAVE-BASED-MANAGEMENT.md` | **⭐ New.** How to structure larger projects into waves, streams, and packets. The plan-then-implement discipline, execution boards, known-answer tests, and wave gates. Essential for projects with 10+ tasks. |
| `PLAN-MANAGEMENT.md` | How the IMPLEMENTATION_PLAN.md works — the living document agents update. Task decomposition patterns, intervention strategies, progress tracking |
| `REVIEW-AND-QA.md` | How to evaluate agent output. When to review, what to look for, how to course-correct. Review checklist template including model attribution and TypeScript hygiene checks |
| `COST-OPTIMIZATION.md` | Getting more work per dollar. Request-based vs token-based billing, optimal strategies per provider, model selection guide, the hybrid strategy, anti-patterns |
| `OPENCLAW-INTEGRATION.md` | Running the harness in OpenClaw with sessions_spawn, cron jobs, and shell scripts. Model selection, monitoring, cost optimization |
| `TROUBLESHOOTING.md` | When things go wrong. The five failure modes (stuck loop, drift, overengineering, test theater, context overflow) and how to fix each |
| `PARALLEL-AGENTS.md` | Running multiple agents simultaneously on independent tasks. When to parallelize, how to split work, how to merge results, conflict resolution, OpenClaw patterns |
### Examples & Reference
| File | Purpose |
|------|---------|
| `EXAMPLES.md` | Worked example: Fintrove-style finance app spec + comparison of three approaches (Ezward, Ralph Wiggum, Nate Jones) |
| `CHANGELOG.md` | Version history and evolution of the agent harness project itself |
## Quick Start
### New to the Harness? (Start Here)
1. **Read** `TUTORIAL.md` — 30-minute hands-on walkthrough building a real CLI tool
2. **Read** `SPEC-CREATION-GUIDE.md` — learn the interview protocol
3. **Try it** — build your own project using the workflow
### Ready to Build? (Simple project, <10 tasks)
1. **Read** `COST-OPTIMIZATION.md` — understand your billing model before you start burning budget
2. **Interview** — work with your agent to create the spec (or do it solo)
3. **Fill out** `PROJECT-SPEC.md` with your problem definition
4. **Copy** `PROJECT-SPEC.md`, `AGENT-INSTRUCTIONS.md`, and `DECISIONS.md` into your project root
5. **Run** `./ralph-loop.sh` (CLI) or use OpenClaw sessions_spawn (see `OPENCLAW-INTEGRATION.md`)
6. **Review** at phase boundaries using `REVIEW-AND-QA.md` checklist
7. **Troubleshoot** failures using `TROUBLESHOOTING.md`
### Building Something Larger? (10+ tasks, multiple features)
1. **Read** `WAVE-BASED-MANAGEMENT.md` — the plan-then-implement discipline
2. **Create** your `IMPLEMENTATION_PLAN.md` with all tasks grouped into waves
3. **Create** `.harness/EXECUTION_MASTER.md` — your wave/stream dashboard
4. **For each stream:** copy `EXECUTION-BOARD-TEMPLATE.md`, fill ALL packets before coding any
5. **After each packet:** copy `VALIDATION-TEMPLATE.md` and fill it in
6. **After each stream:** copy `PROCESS-EVAL-TEMPLATE.md` and write the retrospective
7. **At each wave boundary:** run the wave gate checklist before starting the next wave
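One plausible on-disk shape for those artifacts, sketched below. The directory names under `.harness/` are an assumption for illustration; `WAVE-BASED-MANAGEMENT.md` is the authority on naming:

```shell
#!/usr/bin/env bash
# Sketch of the per-stream artifact layout from the steps above.
# All paths under .harness/ are illustrative, not mandated.
set -euo pipefail
tmp=$(mktemp -d) && cd "$tmp"
mkdir -p .harness/wave-1/stream-a
: > .harness/EXECUTION_MASTER.md                      # step 3: wave/stream dashboard
: > .harness/wave-1/stream-a/EXECUTION_BOARD.md       # step 4: filled-in board
: > .harness/wave-1/stream-a/VALIDATION-packet-01.md  # step 5: per-packet evidence
: > .harness/wave-1/stream-a/PROCESS_EVAL.md          # step 6: stream retrospective
find .harness -type f | sort
```

Keeping one directory per stream makes the wave-gate check at step 7 a matter of confirming every stream directory contains a board, one validation file per packet, and a retrospective.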
## The Core Insight
All successful agent approaches share the same loop:
```
Orient (read spec + plan) → Pick ONE task → Build → Test → Commit → Exit → Restart fresh
```
The spec defines WHAT. The plan tracks WHERE we are. Fresh context each iteration prevents drift. The human reviews and course-corrects.
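A minimal bash sketch of that loop, with the agent spawn stubbed out: `run_agent` here just simulates three iterations before writing a completion signal, where a real loop (see `ralph-loop.sh`) would exec an agent CLI each pass:

```shell
#!/usr/bin/env bash
# Minimal sketch of the orient -> build -> exit -> restart loop.
# run_agent is a stand-in for spawning a real agent with a fresh context.
set -euo pipefail
tmp=$(mktemp -d) && cd "$tmp"
i=0
run_agent() {
  i=$((i + 1))              # a real call: the agent reads the spec + plan,
  if [ "$i" -ge 3 ]; then   # does ONE task, commits, and exits
    touch DONE              # completion signal checked by the outer loop
  fi
}
while [ ! -f DONE ]; do
  run_agent                 # fresh instance every pass: no carried context
done
echo "iterations: $i"
```

Run it and it prints `iterations: 3`, then stops. The only state that survives between passes lives on disk (git history, plan file, `DONE` marker), which is exactly the point.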
See each file for detailed instructions.
## When to Use Which Guide
```
┌─────────────────────────────────────────────────┐
│ "Which guide do I need?"                        │
├─────────────────────────────────────────────────┤
│                                                 │
│ Just starting?                                  │
│   → TUTORIAL.md (hands-on learning)             │
│                                                 │
│ Creating a spec?                                │
│   → SPEC-CREATION-GUIDE.md (interview)          │
│                                                 │
│ Agent is stuck?                                 │
│   → TROUBLESHOOTING.md (failure modes)          │
│                                                 │
│ Reviewing agent output?                         │
│   → REVIEW-AND-QA.md (what to check)            │
│                                                 │
│ Worried about cost?                             │
│   → COST-OPTIMIZATION.md (billing models)       │
│                                                 │
│ Multiple independent features?                  │
│   → PARALLEL-AGENTS.md (coordination)           │
│                                                 │
│ Using OpenClaw?                                 │
│   → OPENCLAW-INTEGRATION.md (sessions_spawn)    │
│                                                 │
│ Agent keeps changing past decisions?            │
│   → DECISIONS.md (ADR template)                 │
│                                                 │
│ Want to see it in action?                       │
│   → EXAMPLES.md (real project example)          │
│                                                 │
└─────────────────────────────────────────────────┘
```
## Philosophy
### Fresh Context > Long Context
Each iteration starts with a fresh agent. No accumulated confusion, no stale reasoning. The git history and plan file provide continuity.
### One Task > Many Tasks
Agents that try to do everything in one session produce spaghetti. Agents that focus on ONE task produce clean commits.
### Spec Quality > Agent Quality
A great spec with a mediocre agent beats a vague spec with a great agent. The spec is your leverage point.
### Review > Repair
It's easier to review and guide than to debug and fix. Catch drift early through periodic reviews.
### Explicit > Implicit
Agents can't read your mind. Write down constraints, anti-patterns, and decisions. What's obvious to you is invisible to the agent.
## Contributing
This harness is a living system. If you:
- Discover new failure modes
- Develop better patterns
- Find gaps in the guides
- Create examples for other project types
Document them and contribute back. The harness improves as we learn what works.
## Version
Current version: **2.0.0** (see `CHANGELOG.md` for history)
## License
Public domain. Use it, modify it, share it. No attribution required.
---
_The harness doesn't write code. It creates conditions where agents can write code reliably._