# Agent Harness Templates
A complete system for running autonomous AI coding agents on complex projects.
## Files

### Core Templates (copy into your project)

| File | Purpose |
|---|---|
| `AGENT.md` | The agent's "system prompt" — re-read at the start of every iteration. Defines the core loop, the mandatory pre-commit checklist (tests + TypeScript), the commit attribution format, the Tests-Added rule, and known anti-patterns |
| `PROJECT-SPEC.md` | Template for defining your problem. Sections for overview, tech stack, requirements with acceptance criteria, data model, API design, constraints, phasing, and anti-patterns |
| `DECISIONS.md` | Architecture Decision Record (ADR) template for documenting non-obvious technical choices. Prevents agent drift by creating continuity across fresh contexts |
| `EXECUTION-BOARD-TEMPLATE.md` | ⭐ New. Pre-implementation planning artifact for a stream. Defines ALL packets, known-answer tests, and acceptance criteria BEFORE any code is written. The core of the plan-then-implement discipline |
| `VALIDATION-TEMPLATE.md` | ⭐ New. Per-packet evidence file written after each packet completes. Records test counts, known-answer results, and acceptance-criteria tick-off |
| `PROCESS-EVAL-TEMPLATE.md` | ⭐ New. Stream retrospective written after merge. Honest assessment of task sizing, test-first compliance, and model quality |
| `TASK-SPEC-TEMPLATE.md` | Reusable pre-delegation contract for non-trivial tasks. Defines objective, acceptance criteria, constraints, boundaries, verification, and proof artifact before work starts |
| `ralph-loop.sh` | The Ralph Wiggum bash loop — spawns fresh agent instances, checks for completion signals, and restarts until done. Supports Claude, Codex, Aider, Gemini, and custom agents (see the sketch after this table) |
| `model-report.ts` | Parses `Agent:` trailers in the git log to generate a per-model quality table (commits, tests added, TypeScript errors). Copy to `scripts/model-report.ts` and add `"model-report": "ts-node scripts/model-report.ts"` to `package.json` |
| `scaffold-project.sh` | Helper script to scaffold a new simple or large project with core harness files, starter docs, and an optional `.harness/` structure |
| `PROJECT-KICKOFF.md` | Project-local kickoff checklist template to confirm spec, tooling, evals, and runtime choices are ready before implementation begins |
| `GAP-AUDIT-2026-04-04.md` | Point-in-time audit of the harness. Documents current strengths, gaps, priorities, and the consolidation work package |
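For orientation, here is a minimal sketch of the shape of the Ralph loop. It assumes a hypothetical `agent` CLI and a `DONE` marker file as the completion signal; both are illustrative placeholders, and the real `ralph-loop.sh` supports multiple agent CLIs and richer signal handling:

```bash
#!/usr/bin/env bash
# Minimal sketch of the Ralph loop shape. Illustrative only:
# `agent` is a hypothetical CLI and DONE is an assumed completion marker.
set -uo pipefail

MAX_ITERATIONS=50

for ((i = 1; i <= MAX_ITERATIONS; i++)); do
  echo "=== Iteration $i: spawning a fresh agent instance ==="

  # Fresh context each time: the agent re-reads AGENT.md, the spec, and the plan.
  agent --prompt-file AGENT.md || echo "agent exited non-zero; looping anyway"

  # Check for an explicit completion signal rather than parsing agent output.
  if [[ -f DONE ]]; then
    echo "Completion signal found after $i iteration(s)."
    break
  fi
done
```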
### Process Guides (read before you start)

| File | Purpose |
|---|---|
| `SPEC-CREATION-GUIDE.md` | Start here. How to create a great spec through a structured interview. Covers the interview protocol, domain-knowledge extraction, and the spec quality checklist |
| `TUTORIAL.md` | Best way to learn. Complete 30-minute walkthrough building a markdown link-checker CLI tool from zero. A concrete, copy-pasteable example of the entire workflow |
| `GETTING-STARTED.md` | Practical startup/scaffolding guide for real projects: create the project root, copy templates, choose a harness mode, scaffold `.harness/`, and start the first loop cleanly |
| `CURRENT-STATE.md` | One-page executive summary of the harness: what is mature, what improved recently, and what should be improved next |
| `WAVE-BASED-MANAGEMENT.md` | ⭐ New. How to structure larger projects into waves, streams, and packets. The plan-then-implement discipline, execution boards, known-answer tests, and wave gates. Essential for projects with 10+ tasks |
| `PLAN-MANAGEMENT.md` | How `IMPLEMENTATION_PLAN.md` works — the living document agents update. Task decomposition patterns, intervention strategies, and progress tracking |
| `REVIEW-AND-QA.md` | How to evaluate agent output: when to review, what to look for, how to course-correct. Includes a review checklist template covering model attribution and TypeScript hygiene checks |
| `EVAL-INFRASTRUCTURE.md` | Consolidated guide to the harness eval stack: implementation correctness, domain correctness, regression protection, and process quality |
| `POST-RUN-VALIDATION.md` | How the harness decides work is really done after execution. Especially important for script-orchestrated runtimes, which must not blindly trust agent self-reporting |
| `SUPERVISION.md` | Optional operations layer for unattended Ralph runs. Covers supervisor/watchdog patterns, state files, and audit trails for long-running script-orchestrated sessions |
| `WORKFLOW-SEAMS.md` | Map of the handoffs between spec, plan, execution boards, validation evidence, review, process evals, and runtime orchestration |
| `WORKFLOW-DIAGRAM.md` | Visual map of the harness showing project phases, where each document matters most, and how the artifacts connect across the lifecycle |
| `COST-OPTIMIZATION.md` | Getting more work per dollar: request-based vs token-based billing, optimal strategies per provider, a model selection guide, the hybrid strategy, and anti-patterns |
| `OPENCLAW-INTEGRATION.md` | Running the harness in OpenClaw with `sessions_spawn`, cron jobs, and shell scripts. Model selection, monitoring, and cost optimization |
| `TROUBLESHOOTING.md` | When things go wrong: the five failure modes (stuck loop, drift, overengineering, test theater, context overflow) and how to fix each |
| `PARALLEL-AGENTS.md` | Running multiple agents simultaneously on independent tasks: when to parallelize, how to split work, how to merge results, conflict resolution, and OpenClaw patterns |
Examples & Reference
| File | Purpose |
|---|---|
EXAMPLES.md |
Worked example: Fintrove-style finance app spec + comparison of three approaches (Ezward, Ralph Wiggum, Nate Jones) |
CHANGELOG.md |
Version history and evolution of the agent harness project itself |
## Quick Start

### Runtime Models
There are two different harness runtime models in this system, and it helps to keep them separate:
#### 1. Agent-Orchestrated Runtime
This is the OpenClaw/manual-orchestration model.
- a supervising agent decides what to run next
- that agent can inspect execution boards, validation evidence, git history, and prior results
- that agent can spawn sub-agents, review outcomes, and adapt the workflow dynamically
Use this when:
- you want a smart orchestrator in the loop
- you want sub-agent fan-out
- you want richer judgment between iterations
Primary guide: `OPENCLAW-INTEGRATION.md`
#### 2. Script-Orchestrated Runtime
This is the `ralph-loop.sh` model.
- the shell script is the orchestrator
- the script must interpret completion/stuck/error signals itself
- any judgment the supervising agent would normally provide must be encoded into runtime checks
Use this when:
- you want a portable terminal-native loop
- you want tmux/background shell operation
- you want minimal dependencies beyond the CLI agent itself
Important implications:
- if the script is the orchestrator, reliability has to come from explicit checks, not from assuming the agent will always judge correctly (see the sketch after this list)
- for long unattended runs, add a separate supervisor/watchdog layer rather than assuming tmux alone is sufficient
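To make that concrete, here is a hedged sketch of what explicit checks can look like after each iteration. The `.harness/COMPLETE` marker and `npm test` command are assumptions chosen for the example; `POST-RUN-VALIDATION.md` defines the real completion contract:

```bash
#!/usr/bin/env bash
# Sketch of explicit post-iteration checks (assumed conventions:
# a .harness/COMPLETE marker file and `npm test` as the test command).

validate_iteration() {
  # 1. Completion is a file the agent must create, not a phrase it prints.
  [[ -f .harness/COMPLETE ]] || return 1

  # 2. Re-run the test suite ourselves instead of trusting the agent's report.
  npm test || return 1

  # 3. Require that the iteration actually produced a commit.
  git log -1 --since="30 minutes ago" --oneline | grep -q . || return 1
}

if validate_iteration; then
  echo "Iteration verified: marker present, tests pass, commit recorded."
else
  echo "Verification failed; restarting the loop." >&2
fi
```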
### New to the Harness? (Start Here)

- Read `CURRENT-STATE.md` — understand what the harness is good at right now
- Read `WORKFLOW-DIAGRAM.md` — get the phase map before diving into details
- Read `GETTING-STARTED.md` — scaffold a real project cleanly
- Use `scaffold-project.sh` or `new-harness-project` if you want the fastest reliable setup
- Read `TUTORIAL.md` — 30-minute hands-on walkthrough building a real CLI tool
- Read `SPEC-CREATION-GUIDE.md` — learn the interview protocol
- Read `TASK-SPEC-TEMPLATE.md` — learn the packet-sized contract for non-trivial delegation
- Use `PROJECT-KICKOFF.md` in your new project as the readiness checklist
- Try it — build your own project using the workflow
Ready to Build? (Simple project, <10 tasks)
- Read
COST-OPTIMIZATION.md— understand your billing model before you start burning budget - Interview — work with your agent to create the spec (or do it solo)
- Fill out
PROJECT-SPEC.mdwith your problem definition - Read
EVAL-INFRASTRUCTURE.mdif the project has calculations, regulated logic, or other high-cost-to-be-wrong behavior - Read
POST-RUN-VALIDATION.mdif the runtime will need to validate task completion mechanically - Read
SUPERVISION.mdif the script-orchestrated runtime will run unattended for hours - Copy
PROJECT-SPEC.md,AGENT.md, andDECISIONS.mdinto your project root - Choose a runtime:
./ralph-loop.shfor the script-orchestrated model- OpenClaw sessions/sub-agents for the agent-orchestrated model
- Review at phase boundaries using
REVIEW-AND-QA.mdchecklist - Troubleshoot failures using
TROUBLESHOOTING.md
For unattended script-orchestrated runs, consider adding an optional supervisor/watchdog wrapper around `ralph-loop.sh` so that process death, stale waits, and silent stalls can be detected independently of the tmux pane. The optional guide and starter templates live in `SUPERVISION.md`, `supervise-ralph-loop.template.sh`, and `audit-ralph-loop.template.sh`.
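As a rough illustration of the pattern (not the contents of the real templates), a watchdog can be as small as the following; the PID-file path and the 15-minute staleness threshold are assumptions chosen for the example:

```bash
#!/usr/bin/env bash
# Minimal watchdog sketch. Runs outside the loop's tmux pane.
# Assumed conventions: loop PID recorded in .harness/ralph.pid,
# and "no commit for 15 minutes" treated as a possible stall.

PID_FILE=".harness/ralph.pid"
STALL_SECONDS=$((15 * 60))

while sleep 60; do
  # Detect process death independently of tmux.
  if ! kill -0 "$(cat "$PID_FILE" 2>/dev/null)" 2>/dev/null; then
    echo "$(date -u +%FT%TZ) loop process is gone" >> watchdog.log
    continue
  fi

  # Detect silent stalls: a healthy run keeps producing commits.
  last_commit=$(git log -1 --format=%ct 2>/dev/null || echo 0)
  if (( $(date +%s) - last_commit > STALL_SECONDS )); then
    echo "$(date -u +%FT%TZ) no commit for 15+ minutes" >> watchdog.log
  fi
done
```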
### Building Something Larger? (10+ tasks, multiple features)

- Read `WAVE-BASED-MANAGEMENT.md` — the plan-then-implement discipline
- Read `WORKFLOW-SEAMS.md` — understand how the harness artifacts hand off to each other
- Read `EVAL-INFRASTRUCTURE.md` — define the eval stack before implementation starts
- Read `POST-RUN-VALIDATION.md` — define how the runtime will decide packet completion is real
- Create your `IMPLEMENTATION_PLAN.md` with all tasks grouped into waves
- Create `.harness/EXECUTION_MASTER.md` — your wave/stream dashboard
- For each stream: copy `EXECUTION-BOARD-TEMPLATE.md`, fill ALL packets before coding any
- For non-trivial delegated packets: create a task spec from `TASK-SPEC-TEMPLATE.md`
- After each packet: copy `VALIDATION-TEMPLATE.md` and fill it in
- After each stream: copy `PROCESS-EVAL-TEMPLATE.md` and write the retrospective
- At each wave boundary: run the wave gate checklist before starting the next wave
If you use `ralph-loop.sh` for a larger project, pass the active board explicitly:

```bash
./ralph-loop.sh --board .harness/<stream>/execution-board.md
```
## The Core Insight
All successful agent approaches share the same loop:
Orient (read spec + plan) → Pick ONE task → Build → Test → Commit → Exit → Restart fresh
The spec defines WHAT. The plan tracks WHERE we are. Fresh context each iteration prevents drift. The human reviews and course-corrects.
In the script-orchestrated runtime, some of that review must be encoded into the loop itself. In the agent-orchestrated runtime, a supervising agent can supply more of that judgment dynamically. Task specs improve the preconditions for delegation. Post-run validation improves the postconditions.
See each file for detailed instructions.
## When to Use Which Guide

```
┌─────────────────────────────────────────────────┐
│ "Which guide do I need?" │
├─────────────────────────────────────────────────┤
│ │
│ Just starting a real project? │
│ → GETTING-STARTED.md │
│ │
│ Want it scaffolded for you? │
│ → scaffold-project.sh / new-harness-project │
│ │
│ Need a kickoff checklist inside the project? │
│ → PROJECT-KICKOFF.md │
│ │
│ Want the one-page status view? │
│ → CURRENT-STATE.md │
│ │
│ Want the visual phase map? │
│ → WORKFLOW-DIAGRAM.md │
│ │
│ Want hands-on learning? │
│ → TUTORIAL.md (hands-on learning) │
│ │
│ Creating a spec? │
│ → SPEC-CREATION-GUIDE.md (interview) │
│ │
│ Delegating a non-trivial task? │
│ → TASK-SPEC-TEMPLATE.md │
│ │
│ Agent is stuck? │
│ → TROUBLESHOOTING.md (failure modes) │
│ │
│ Reviewing agent output? │
│ → REVIEW-AND-QA.md (what to check) │
│ │
│ Need a full eval strategy? │
│ → EVAL-INFRASTRUCTURE.md │
│ │
│ Need runtime completion checks? │
│ → POST-RUN-VALIDATION.md │
│ │
│ Need unattended-run supervision? │
│ → SUPERVISION.md │
│ │
│ Confused about how docs hand off? │
│ → WORKFLOW-SEAMS.md │
│ │
│ Worried about cost? │
│ → COST-OPTIMIZATION.md (billing models) │
│ │
│ Multiple independent features? │
│ → PARALLEL-AGENTS.md (coordination) │
│ │
│ Using OpenClaw? │
│ → OPENCLAW-INTEGRATION.md (sessions_spawn) │
│ │
│ Agent keeps changing past decisions? │
│ → DECISIONS.md (ADR template) │
│ │
│ Want to see it in action? │
│ → EXAMPLES.md (real project example) │
│ │
└─────────────────────────────────────────────────┘
```
## Philosophy

### Fresh Context > Long Context

Each iteration starts with a fresh agent. No accumulated confusion, no stale reasoning. The git history and plan file provide continuity.

### One Task > Many Tasks

Agents that try to do everything in one session produce spaghetti. Agents that focus on ONE task produce clean commits.

### Spec Quality > Agent Quality

A great spec with a mediocre agent beats a vague spec with a great agent. The spec is your leverage point.

### Review > Repair

It's easier to review and guide than to debug and fix. Catch drift early through periodic reviews.

### Explicit > Implicit

Agents can't read your mind. Write down constraints, anti-patterns, and decisions. What's obvious to you is invisible to the agent.
## Contributing
This harness is a living system. If you:
- Discover new failure modes
- Develop better patterns
- Find gaps in the guides
- Create examples for other project types
Document them and contribute back. The harness improves as we learn what works.
## Version

Current version: 2.0.0 (see `CHANGELOG.md` for history)

## License

Public domain. Use it, modify it, share it. No attribution required.
The harness doesn't write code. It creates conditions where agents can write code reliably.