# Agent Harness Templates

A complete system for running autonomous AI coding agents on complex projects.
## Files

### Core Templates (copy into your project)

| File | Purpose |
|---|---|
| `AGENT-INSTRUCTIONS.md` | The agent's "system prompt" — read every iteration. Defines the core loop, the mandatory pre-commit checklist (tests + TypeScript), the commit attribution format, the Tests-Added rule, and known anti-patterns. |
| `PROJECT-SPEC.md` | Template for defining your problem. Sections for overview, tech stack, requirements with acceptance criteria, data model, API design, constraints, phasing, and anti-patterns. |
| `DECISIONS.md` | Architecture Decision Record (ADR) template for documenting non-obvious technical choices. Prevents agent drift by creating continuity across fresh contexts. |
| `EXECUTION-BOARD-TEMPLATE.md` | ⭐ New. Pre-implementation planning artifact for a stream. Defines ALL packets, known-answer tests, and acceptance criteria BEFORE any code is written. The core of the plan-then-implement discipline. |
| `VALIDATION-TEMPLATE.md` | ⭐ New. Per-packet evidence file written after each packet completes. Records test counts, known-answer results, and acceptance-criteria tick-off. |
| `PROCESS-EVAL-TEMPLATE.md` | ⭐ New. Stream retrospective written after merge. An honest assessment of task sizing, test-first compliance, and model quality. |
| `ralph-loop.sh` | The Ralph Wiggum bash loop — spawns fresh agent instances, checks for completion signals, and restarts until done. Supports Claude, Codex, Aider, Gemini, and custom agents. |
| `model-report.ts` | Parses `Agent:` trailers in the git log to generate a per-model quality table (commits, tests added, TypeScript errors). Copy to `scripts/model-report.ts` and add `"model-report": "ts-node scripts/model-report.ts"` to `package.json`. |
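You can get a rough, shell-only version of what `model-report.ts` does. The sketch below only assumes the `Agent: <model>` trailer format described above; the function name and the model names in the example are hypothetical.

```shell
# Count commits per model from `Agent:` trailers on stdin.
# A shell approximation of model-report.ts's commit count column.
commits_per_agent() {
  awk '/^ *Agent:/ { counts[$2]++ }
       END { for (a in counts) print a, counts[a] }' | sort
}
```

Against a real repo you would pipe the log in, e.g. `git log --format='%(trailers:key=Agent)' | commits_per_agent` (the `%(trailers:key=Agent)` pretty-format placeholder requires a reasonably recent git).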
### Process Guides (read before you start)

| File | Purpose |
|---|---|
| `SPEC-CREATION-GUIDE.md` | Start here. How to create a great spec through a structured interview. The interview protocol, domain knowledge extraction, and a spec quality checklist. |
| `TUTORIAL.md` | Best way to learn. A complete 30-minute walkthrough building a markdown link checker CLI tool from zero. A concrete, copy-pasteable example of the entire workflow. |
| `WAVE-BASED-MANAGEMENT.md` | ⭐ New. How to structure larger projects into waves, streams, and packets. The plan-then-implement discipline, execution boards, known-answer tests, and wave gates. Essential for projects with 10+ tasks. |
| `PLAN-MANAGEMENT.md` | How `IMPLEMENTATION_PLAN.md` works — the living document agents update. Task decomposition patterns, intervention strategies, progress tracking. |
| `REVIEW-AND-QA.md` | How to evaluate agent output. When to review, what to look for, how to course-correct. Review checklist template including model attribution and TypeScript hygiene checks. |
| `COST-OPTIMIZATION.md` | Getting more work per dollar. Request-based vs token-based billing, optimal strategies per provider, a model selection guide, the hybrid strategy, anti-patterns. |
| `OPENCLAW-INTEGRATION.md` | Running the harness in OpenClaw with `sessions_spawn`, cron jobs, and shell scripts. Model selection, monitoring, cost optimization. |
| `TROUBLESHOOTING.md` | When things go wrong. The five failure modes (stuck loop, drift, overengineering, test theater, context overflow) and how to fix each. |
| `PARALLEL-AGENTS.md` | Running multiple agents simultaneously on independent tasks. When to parallelize, how to split work, how to merge results, conflict resolution, OpenClaw patterns. |
### Examples & Reference

| File | Purpose |
|---|---|
| `EXAMPLES.md` | Worked example: a Fintrove-style finance app spec plus a comparison of three approaches (Ezward, Ralph Wiggum, Nate Jones). |
| `CHANGELOG.md` | Version history and evolution of the agent harness project itself. |
## Quick Start

### New to the Harness? (Start Here)

- Read `TUTORIAL.md` — 30-minute hands-on walkthrough building a real CLI tool
- Read `SPEC-CREATION-GUIDE.md` — learn the interview protocol
- Try it — build your own project using the workflow
### Ready to Build? (Simple project, <10 tasks)

- Read `COST-OPTIMIZATION.md` — understand your billing model before you start burning budget
- Interview — work with your agent to create the spec (or do it solo)
- Fill out `PROJECT-SPEC.md` with your problem definition
- Copy `PROJECT-SPEC.md`, `AGENT-INSTRUCTIONS.md`, and `DECISIONS.md` into your project root
- Run `./ralph-loop.sh` (CLI) or use OpenClaw `sessions_spawn` (see `OPENCLAW-INTEGRATION.md`)
- Review at phase boundaries using the `REVIEW-AND-QA.md` checklist
- Troubleshoot failures using `TROUBLESHOOTING.md`
### Building Something Larger? (10+ tasks, multiple features)

- Read `WAVE-BASED-MANAGEMENT.md` — the plan-then-implement discipline
- Create your `IMPLEMENTATION_PLAN.md` with all tasks grouped into waves
- Create `.harness/EXECUTION_MASTER.md` — your wave/stream dashboard
- For each stream: copy `EXECUTION-BOARD-TEMPLATE.md` and fill ALL packets before coding any
- After each packet: copy `VALIDATION-TEMPLATE.md` and fill it in
- After each stream: copy `PROCESS-EVAL-TEMPLATE.md` and write the retrospective
- At each wave boundary: run the wave gate checklist before starting the next wave
## The Core Insight

All successful agent approaches share the same loop:

> Orient (read spec + plan) → Pick ONE task → Build → Test → Commit → Exit → Restart fresh

The spec defines WHAT. The plan tracks WHERE we are. Fresh context each iteration prevents drift. The human reviews and course-corrects.
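The loop can be sketched in a few lines of bash. This is a simplification, not `ralph-loop.sh` itself: the `run_agent_once` hook and the `.harness/DONE` completion signal are hypothetical stand-ins for the real agent spawning and agent-specific completion checks.

```shell
# Skeleton of the orient -> one task -> commit -> restart-fresh loop.
# run_agent_once stands in for spawning a fresh agent instance;
# .harness/DONE is a hypothetical completion signal.
run_loop() {
  local max_iters=$1 i=0
  while [ "$i" -lt "$max_iters" ]; do
    if [ -f .harness/DONE ]; then   # agent signalled the plan is complete
      echo "complete after $i iterations"
      return 0
    fi
    run_agent_once                  # fresh context: orient, ONE task, test, commit, exit
    i=$((i + 1))
  done
  echo "iteration cap reached; review the plan"
  return 1
}
```

The iteration cap is the human's safety valve: a stuck loop burns budget until someone reads `TROUBLESHOOTING.md`.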
See each file for detailed instructions.
## When to Use Which Guide

| "Which guide do I need?" | Read |
|---|---|
| Just starting? | `TUTORIAL.md` (hands-on learning) |
| Creating a spec? | `SPEC-CREATION-GUIDE.md` (interview) |
| Agent is stuck? | `TROUBLESHOOTING.md` (failure modes) |
| Reviewing agent output? | `REVIEW-AND-QA.md` (what to check) |
| Worried about cost? | `COST-OPTIMIZATION.md` (billing models) |
| Multiple independent features? | `PARALLEL-AGENTS.md` (coordination) |
| Using OpenClaw? | `OPENCLAW-INTEGRATION.md` (`sessions_spawn`) |
| Agent keeps changing past decisions? | `DECISIONS.md` (ADR template) |
| Want to see it in action? | `EXAMPLES.md` (real project example) |
## Philosophy

### Fresh Context > Long Context

Each iteration starts with a fresh agent. No accumulated confusion, no stale reasoning. The git history and plan file provide continuity.

### One Task > Many Tasks

Agents that try to do everything in one session produce spaghetti. Agents that focus on ONE task produce clean commits.

### Spec Quality > Agent Quality

A great spec with a mediocre agent beats a vague spec with a great agent. The spec is your leverage point.

### Review > Repair

It's easier to review and guide than to debug and fix. Catch drift early through periodic reviews.

### Explicit > Implicit

Agents can't read your mind. Write down constraints, anti-patterns, and decisions. What's obvious to you is invisible to the agent.
## Contributing

This harness is a living system. If you:

- Discover new failure modes
- Develop better patterns
- Find gaps in the guides
- Create examples for other project types

document them and contribute back. The harness improves as we learn what works.
## Version

Current version: 2.0.0 (see `CHANGELOG.md` for history)
## License

Public domain. Use it, modify it, share it. No attribution required.

*The harness doesn't write code. It creates conditions where agents can write code reliably.*