# Agent Harness Templates

A complete system for running autonomous AI coding agents on complex projects.

## Files

### Core Templates (copy into your project)
| File | Purpose |
|------|---------|
| `AGENT.md` | The agent's "system prompt", read at the start of every iteration. Defines the core loop, the mandatory pre-commit checklist (tests + TypeScript), the commit attribution format, the Tests-Added rule, and known anti-patterns. |
| `PROJECT-SPEC.md` | Template for defining your problem. Sections for overview, tech stack, requirements with acceptance criteria, data model, API design, constraints, phasing, and anti-patterns. |
| `DECISIONS.md` | Architecture Decision Record (ADR) template for documenting non-obvious technical choices. Prevents agent drift by creating continuity across fresh contexts. |
| `EXECUTION-BOARD-TEMPLATE.md` | **⭐ New.** Pre-implementation planning artifact for a stream. Defines ALL packets, known-answer tests, and acceptance criteria BEFORE any code is written. The core of the plan-then-implement discipline. |
| `VALIDATION-TEMPLATE.md` | **⭐ New.** Per-packet evidence file written after each packet completes. Records test counts, known-answer results, and acceptance criteria tick-off. |
| `PROCESS-EVAL-TEMPLATE.md` | **⭐ New.** Stream retrospective written after merge. Honest assessment of task sizing, test-first compliance, and model quality. |
| `TASK-SPEC-TEMPLATE.md` | Reusable pre-delegation contract for non-trivial tasks. Defines objective, acceptance criteria, constraints, boundaries, verification, and proof artifact before work starts. |
| `ralph-loop.sh` | The Ralph Wiggum bash loop: spawns fresh agent instances, checks for completion signals, and restarts until done. Supports Claude, Codex, Aider, Gemini, and custom agents. |
| `model-report.ts` | Parses git log `Agent:` trailers to generate a per-model quality table (commits, tests added, TypeScript errors). Copy it to `scripts/model-report.ts` and add `"model-report": "ts-node scripts/model-report.ts"` to the `scripts` section of `package.json`. |
| `scaffold-project.sh` | Helper script to scaffold a new simple or large project with core harness files, starter docs, and an optional `.harness/` structure. |
| `PROJECT-KICKOFF.md` | Project-local kickoff checklist template to confirm that spec, tooling, evals, and runtime choices are ready before implementation begins. |
| `GAP-AUDIT-2026-04-04.md` | Point-in-time audit of the harness. Documents current strengths, gaps, priorities, and the consolidation work package. |

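
To make the `model-report.ts` row more concrete, here is a minimal shell sketch of the tallying idea only. It assumes each commit carries a trailer of the form `Agent: <model-name>`. The function name `count_agent_commits` is hypothetical and not part of the harness, and the real script is TypeScript and also reports tests added and TypeScript errors.

```bash
#!/usr/bin/env bash
# Hypothetical helper: tally commits per agent from "Agent:" trailer values.
# Reads one trailer value per line on stdin and prints "<count> <agent>",
# highest count first. Blank lines (commits without the trailer) are skipped.
count_agent_commits() {
  grep -v '^[[:space:]]*$' \
    | sort \
    | uniq -c \
    | sort -k1,1nr -k2 \
    | awk '{ count = $1; $1 = ""; sub(/^ /, ""); print count, $0 }'
}

# Usage inside a git repository (needs a git version that supports the
# %(trailers:key=...,valueonly) pretty-format placeholder):
#   git log --format='%(trailers:key=Agent,valueonly)' | count_agent_commits
```

Reading values from stdin keeps the tally independent of git, so it can be checked against a hand-written list before being pointed at real history.
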
### Process Guides (read before you start)
| File | Purpose |
|------|---------|
| `SPEC-CREATION-GUIDE.md` | **Start here.** How to create a great spec through structured interview. Covers the interview protocol, domain knowledge extraction, and a spec quality checklist. |
| `TUTORIAL.md` | **Best way to learn.** Complete 30-minute walkthrough building a markdown link checker CLI tool from zero. A concrete, copy-pasteable example of the entire workflow. |
| `GETTING-STARTED.md` | Practical startup/scaffolding guide for real projects: create the project root, copy templates, choose a harness mode, scaffold `.harness/`, and start the first loop cleanly. |
| `CURRENT-STATE.md` | One-page executive summary of the harness: what is mature, what improved recently, and what should be improved next. |
| `WAVE-BASED-MANAGEMENT.md` | **⭐ New.** How to structure larger projects into waves, streams, and packets. Covers the plan-then-implement discipline, execution boards, known-answer tests, and wave gates. Essential for projects with 10+ tasks. |
| `PLAN-MANAGEMENT.md` | How `IMPLEMENTATION_PLAN.md` works: the living document that agents update. Covers task decomposition patterns, intervention strategies, and progress tracking. |
| `REVIEW-AND-QA.md` | How to evaluate agent output. When to review, what to look for, how to course-correct. Includes a review checklist template with model attribution and TypeScript hygiene checks. |
| `EVAL-INFRASTRUCTURE.md` | Consolidated guide to the harness eval stack: implementation correctness, domain correctness, regression protection, and process quality. |
| `POST-RUN-VALIDATION.md` | How the harness decides work is really done after execution. Especially important for script-orchestrated runtimes that must not trust agent self-reporting blindly. |
| `SUPERVISION.md` | Optional operations layer for unattended Ralph runs. Covers supervisor/watchdog patterns, state files, and audit trails for long-running script-orchestrated sessions. |
| `WORKFLOW-SEAMS.md` | Map of the handoffs between spec, plan, execution boards, validation evidence, review, process evals, and runtime orchestration. |
| `WORKFLOW-DIAGRAM.md` | Visual map of the harness showing project phases, where each document matters most, and how the artifacts connect across the lifecycle. |
| `COST-OPTIMIZATION.md` | Getting more work per dollar. Covers request-based vs token-based billing, optimal strategies per provider, a model selection guide, the hybrid strategy, and anti-patterns. |
| `OPENCLAW-INTEGRATION.md` | Running the harness in OpenClaw with `sessions_spawn`, cron jobs, and shell scripts. Covers model selection, monitoring, and cost optimization. |
| `TROUBLESHOOTING.md` | When things go wrong. The five failure modes (stuck loop, drift, overengineering, test theater, context overflow) and how to fix each. |
| `PARALLEL-AGENTS.md` | Running multiple agents simultaneously on independent tasks. When to parallelize, how to split work, how to merge results, conflict resolution, and OpenClaw patterns. |

### Examples & Reference
| File | Purpose |
|------|---------|
| `EXAMPLES.md` | Worked example: a Fintrove-style finance app spec, plus a comparison of three approaches (Ezward, Ralph Wiggum, Nate Jones). |
| `CHANGELOG.md` | Version history and evolution of the agent harness project itself. |

## Quick Start

## Runtime Models

This system has two distinct harness runtime models, and it helps to keep them separate:

### 1. Agent-Orchestrated Runtime

This is the OpenClaw/manual-orchestration model.

- a supervising agent decides what to run next
- that agent can inspect execution boards, validation evidence, git history, and prior results
- that agent can spawn sub-agents, review outcomes, and adapt the workflow dynamically

Use this when:
- you want a smart orchestrator in the loop
- you want sub-agent fan-out
- you want richer judgment between iterations

Primary guide:
- `OPENCLAW-INTEGRATION.md`

### 2. Script-Orchestrated Runtime

This is the `ralph-loop.sh` model.

- the shell script is the orchestrator
- the script must interpret completion, stuck, and error signals itself
- any judgment the supervising agent would normally provide must be encoded into runtime checks

Use this when:
- you want a portable terminal-native loop
- you want tmux/background shell operation
- you want minimal dependencies beyond the CLI agent itself

Important implications:
- if the script is the orchestrator, reliability has to come from explicit checks, not from assuming the agent will always judge correctly
- for long unattended runs, add a separate supervisor/watchdog layer rather than assuming tmux alone is sufficient

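
As a rough illustration of what "the script must interpret signals itself" can look like, the sketch below classifies each iteration explicitly. The marker strings `ALL_TASKS_COMPLETE` and `AGENT_STUCK`, the function name `run_loop`, and the exit codes are all assumptions made for this example; they are not taken from `ralph-loop.sh`.

```bash
#!/usr/bin/env bash
# Sketch only: marker strings, function name, and exit codes are assumptions
# for illustration, not the signals that ralph-loop.sh actually uses.
DONE_MARKER="ALL_TASKS_COMPLETE"
STUCK_MARKER="AGENT_STUCK"

# run_loop MAX_ITERATIONS COMMAND [ARGS...]
# Starts COMMAND as a fresh process each iteration and classifies the result.
# Returns 0 on completion, 1 on agent error, 2 if stuck, 3 if the cap is hit.
run_loop() {
  local max_iterations="$1" i=1 output rc
  shift
  while [ "$i" -le "$max_iterations" ]; do
    rc=0
    output="$("$@" 2>&1)" || rc=$?
    if [ "$rc" -ne 0 ]; then
      echo "iteration $i: agent exited with status $rc" >&2
      return 1
    fi
    case "$output" in
      *"$STUCK_MARKER"*)
        echo "iteration $i: agent reported it is stuck" >&2
        return 2 ;;
      *"$DONE_MARKER"*)
        echo "iteration $i: completion signal seen" >&2
        return 0 ;;
    esac
    i=$((i + 1))
  done
  echo "no completion signal after $max_iterations iterations" >&2
  return 3
}

# Usage (my-agent-cli is a placeholder for whichever CLI agent you run):
#   run_loop 25 my-agent-cli --prompt-file AGENT.md
```

The iteration cap and the distinct return codes are the point of the sketch: every outcome the supervising agent would otherwise judge is turned into something a script can branch on.
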
### New to the Harness? (Start Here)
1. **Read** `CURRENT-STATE.md`: understand what the harness is good at right now
2. **Read** `WORKFLOW-DIAGRAM.md`: get the phase map before diving into details
3. **Read** `GETTING-STARTED.md`: scaffold a real project cleanly
4. **Use** `scaffold-project.sh` or `new-harness-project` if you want the fastest reliable setup
5. **Read** `TUTORIAL.md`: 30-minute hands-on walkthrough building a real CLI tool
6. **Read** `SPEC-CREATION-GUIDE.md`: learn the interview protocol
7. **Read** `TASK-SPEC-TEMPLATE.md`: learn the packet-sized contract for non-trivial delegation
8. **Use** `PROJECT-KICKOFF.md` in your new project as the readiness checklist
9. **Try it**: build your own project using the workflow

### Ready to Build? (Simple project, <10 tasks)
1. **Read** `COST-OPTIMIZATION.md`: understand your billing model before you start burning budget
2. **Interview**: work with your agent to create the spec (or do it solo)
3. **Fill out** `PROJECT-SPEC.md` with your problem definition
4. **Read** `EVAL-INFRASTRUCTURE.md` if the project has calculations, regulated logic, or other high-cost-to-be-wrong behavior
5. **Read** `POST-RUN-VALIDATION.md` if the runtime will need to validate task completion mechanically
6. **Read** `SUPERVISION.md` if the script-orchestrated runtime will run unattended for hours
7. **Copy** `PROJECT-SPEC.md`, `AGENT.md`, and `DECISIONS.md` into your project root
8. **Choose a runtime**:
   - `./ralph-loop.sh` for the script-orchestrated model
   - OpenClaw sessions/sub-agents for the agent-orchestrated model
9. **Review** at phase boundaries using the `REVIEW-AND-QA.md` checklist
10. **Troubleshoot** failures using `TROUBLESHOOTING.md`

For unattended script-orchestrated runs, consider adding an optional supervisor/watchdog wrapper around `ralph-loop.sh` so that process death, stale waits, and silent stalls can be detected independently of the tmux pane.
The optional guide and starter templates live in `SUPERVISION.md`, `supervise-ralph-loop.template.sh`, and `audit-ralph-loop.template.sh`.

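
One common way to detect a silent stall is a heartbeat file that the loop refreshes and a separate process checks. The sketch below shows that idea only. The state-file format (a single epoch timestamp) and the function names `write_heartbeat` and `heartbeat_is_stale` are assumptions for illustration, not the format used by the harness's supervisor templates.

```bash
#!/usr/bin/env bash
# Hypothetical heartbeat check. The state-file format (one epoch timestamp)
# is an assumption, not the format used by the harness templates.

# write_heartbeat FILE: record "now" as seconds since the epoch.
write_heartbeat() {
  date +%s > "$1"
}

# heartbeat_is_stale FILE MAX_AGE_SECONDS [NOW]
# Succeeds (exit 0) when the heartbeat is missing, unreadable, malformed,
# or older than MAX_AGE_SECONDS. NOW defaults to the current time.
heartbeat_is_stale() {
  local file="$1" max_age="$2" now="${3:-$(date +%s)}" last
  [ -r "$file" ] || return 0
  last="$(cat "$file")"
  case "$last" in
    ''|*[!0-9]*) return 0 ;;   # empty or non-numeric content counts as stale
  esac
  [ $((now - last)) -gt "$max_age" ]
}

# Usage: the loop calls write_heartbeat once per iteration, and a separate
# watchdog process polls (paths and thresholds are placeholders):
#   if heartbeat_is_stale .harness/heartbeat 900; then
#     echo "loop appears stalled" >&2
#   fi
```

Treating a missing or malformed file as stale is a deliberate choice here: a watchdog that fails open would miss exactly the process-death case it exists to catch.
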
### Building Something Larger? (10+ tasks, multiple features)
1. **Read** `WAVE-BASED-MANAGEMENT.md`: the plan-then-implement discipline
2. **Read** `WORKFLOW-SEAMS.md`: understand how the harness artifacts hand off to each other
3. **Read** `EVAL-INFRASTRUCTURE.md`: define the eval stack before implementation starts
4. **Read** `POST-RUN-VALIDATION.md`: define how the runtime will decide packet completion is real
5. **Create** your `IMPLEMENTATION_PLAN.md` with all tasks grouped into waves
6. **Create** `.harness/EXECUTION_MASTER.md`: your wave/stream dashboard
7. **For each stream:** copy `EXECUTION-BOARD-TEMPLATE.md` and fill in ALL packets before coding any
8. **For non-trivial delegated packets:** create a task spec from `TASK-SPEC-TEMPLATE.md`
9. **After each packet:** copy `VALIDATION-TEMPLATE.md` and fill it in
10. **After each stream:** copy `PROCESS-EVAL-TEMPLATE.md` and write the retrospective
11. **At each wave boundary:** run the wave gate checklist before starting the next wave

If you use `ralph-loop.sh` for a larger project, pass the active board explicitly:

```bash
./ralph-loop.sh --board .harness/<stream>/execution-board.md
```

## The Core Insight

All successful agent approaches share the same loop:

```
Orient (read spec + plan) → Pick ONE task → Build → Test → Commit → Exit → Restart fresh
```

The spec defines WHAT. The plan tracks WHERE we are. Fresh context each iteration prevents drift. The human reviews and course-corrects.

In the script-orchestrated runtime, some of that review must be encoded into the loop itself.
In the agent-orchestrated runtime, a supervising agent can supply more of that judgment dynamically.
Task specs improve the preconditions for delegation. Post-run validation improves the postconditions.

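
A postcondition check for a single iteration might look like the sketch below: the run is accepted only if a new commit exists and the project's test command passes, regardless of what the agent reported. The function name `validate_iteration` and its exit codes are hypothetical, and the checks described in `POST-RUN-VALIDATION.md` may differ.

```bash
#!/usr/bin/env bash
# Hypothetical post-run check: accept an iteration only if HEAD moved and the
# project's test command passes. Names and exit codes are illustrative.

# validate_iteration BEFORE_SHA AFTER_SHA TEST_COMMAND [ARGS...]
# Returns 0 if accepted, 1 if there is no new commit, 2 if the tests fail.
validate_iteration() {
  local before="$1" after="$2"
  shift 2
  if [ "$before" = "$after" ]; then
    echo "rejected: no new commit since $before" >&2
    return 1
  fi
  if ! "$@" >/dev/null 2>&1; then
    echo "rejected: test command failed" >&2
    return 2
  fi
  echo "accepted: $before -> $after" >&2
  return 0
}

# Usage around one agent run (npm test stands in for your test command):
#   before="$(git rev-parse HEAD)"
#   ... run the agent ...
#   validate_iteration "$before" "$(git rev-parse HEAD)" npm test
```

Passing the two commit hashes in as arguments keeps the check free of any dependency on the agent's own output, which is the property the script-orchestrated runtime needs.
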
See each file for detailed instructions.

## When to Use Which Guide

```
┌─────────────────────────────────────────────────┐
│  "Which guide do I need?"                       │
├─────────────────────────────────────────────────┤
│                                                 │
│  Just starting a real project?                  │
│    → GETTING-STARTED.md                         │
│                                                 │
│  Want it scaffolded for you?                    │
│    → scaffold-project.sh / new-harness-project  │
│                                                 │
│  Need a kickoff checklist inside the project?   │
│    → PROJECT-KICKOFF.md                         │
│                                                 │
│  Want the one-page status view?                 │
│    → CURRENT-STATE.md                           │
│                                                 │
│  Want the visual phase map?                     │
│    → WORKFLOW-DIAGRAM.md                        │
│                                                 │
│  Want hands-on learning?                        │
│    → TUTORIAL.md (hands-on learning)            │
│                                                 │
│  Creating a spec?                               │
│    → SPEC-CREATION-GUIDE.md (interview)         │
│                                                 │
│  Delegating a non-trivial task?                 │
│    → TASK-SPEC-TEMPLATE.md                      │
│                                                 │
│  Agent is stuck?                                │
│    → TROUBLESHOOTING.md (failure modes)         │
│                                                 │
│  Reviewing agent output?                        │
│    → REVIEW-AND-QA.md (what to check)           │
│                                                 │
│  Need a full eval strategy?                     │
│    → EVAL-INFRASTRUCTURE.md                     │
│                                                 │
│  Need runtime completion checks?                │
│    → POST-RUN-VALIDATION.md                     │
│                                                 │
│  Need unattended-run supervision?               │
│    → SUPERVISION.md                             │
│                                                 │
│  Confused about how docs hand off?              │
│    → WORKFLOW-SEAMS.md                          │
│                                                 │
│  Worried about cost?                            │
│    → COST-OPTIMIZATION.md (billing models)      │
│                                                 │
│  Multiple independent features?                 │
│    → PARALLEL-AGENTS.md (coordination)          │
│                                                 │
│  Using OpenClaw?                                │
│    → OPENCLAW-INTEGRATION.md (sessions_spawn)   │
│                                                 │
│  Agent keeps changing past decisions?           │
│    → DECISIONS.md (ADR template)                │
│                                                 │
│  Want to see it in action?                      │
│    → EXAMPLES.md (real project example)         │
│                                                 │
└─────────────────────────────────────────────────┘
```

## Philosophy

### Fresh Context > Long Context
Each iteration starts with a fresh agent. No accumulated confusion, no stale reasoning. The git history and plan file provide continuity.

### One Task > Many Tasks
Agents that try to do everything in one session produce spaghetti. Agents that focus on ONE task produce clean commits.

### Spec Quality > Agent Quality
A great spec with a mediocre agent beats a vague spec with a great agent. The spec is your leverage point.

### Review > Repair
It's easier to review and guide than to debug and fix. Catch drift early through periodic reviews.

### Explicit > Implicit
Agents can't read your mind. Write down constraints, anti-patterns, and decisions. What's obvious to you is invisible to the agent.

## Contributing

This harness is a living system. If you:
- Discover new failure modes
- Develop better patterns
- Find gaps in the guides
- Create examples for other project types

Document them and contribute back. The harness improves as we learn what works.

## Version

Current version: **2.0.0** (see `CHANGELOG.md` for history).

## License

Public domain. Use it, modify it, share it. No attribution required.

---

_The harness doesn't write code. It creates conditions where agents can write code reliably._