# Agent Harness Templates
A complete system for running autonomous AI coding agents on complex projects.
## Files

### Core Templates (copy into your project)

| File | Purpose |
|---|---|
| `AGENT.md` | The agent's "system prompt" — re-read at the start of every iteration. Defines the core loop, the mandatory pre-commit checklist (tests + TypeScript), the commit attribution format, the Tests-Added rule, and known anti-patterns |
| `PROJECT-SPEC.md` | Template for defining your problem. Sections for overview, tech stack, requirements with acceptance criteria, data model, API design, constraints, phasing, and anti-patterns |
| `DECISIONS.md` | Architecture Decision Record (ADR) template for documenting non-obvious technical choices. Prevents agent drift by creating continuity across fresh contexts |
| `EXECUTION-BOARD-TEMPLATE.md` | ⭐ New. Pre-implementation planning artifact for a stream. Defines ALL packets, known-answer tests, and acceptance criteria BEFORE any code is written. The core of the plan-then-implement discipline |
| `VALIDATION-TEMPLATE.md` | ⭐ New. Per-packet evidence file written after each packet completes. Records test counts, known-answer results, and acceptance-criteria tick-off |
| `PROCESS-EVAL-TEMPLATE.md` | ⭐ New. Stream retrospective written after merge. Honest assessment of task sizing, test-first compliance, and model quality |
| `TASK-SPEC-TEMPLATE.md` | Reusable pre-delegation contract for non-trivial tasks. Defines objective, acceptance criteria, constraints, boundaries, verification, and proof artifact before work starts |
| `ralph-loop.sh` | The Ralph Wiggum bash loop — spawns fresh agent instances, checks for completion signals, and restarts until done. Supports Claude, Codex, Aider, Gemini, and custom agents (see the sketch after this table) |
| `model-report.ts` | Parses `Agent:` trailers in the git log to generate a per-model quality table (commits, tests added, TypeScript errors). Copy to `scripts/model-report.ts` and add `"model-report": "ts-node scripts/model-report.ts"` to `package.json` |
| `scaffold-project.sh` | Helper script to scaffold a new simple or large project with core harness files, starter docs, and an optional `.harness/` structure |
| `PROJECT-KICKOFF.md` | Project-local kickoff checklist template to confirm spec, tooling, evals, and runtime choices are ready before implementation begins |
| `GAP-AUDIT-2026-04-04.md` | Point-in-time audit of the harness. Documents current strengths, gaps, priorities, and the consolidation work package |
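For orientation, here is a minimal sketch of the shape of the Ralph loop. It assumes a hypothetical `agent` CLI and a `DONE` marker file as the completion signal; both are illustrative placeholders, and the real `ralph-loop.sh` supports multiple agent CLIs and richer signal handling:

```bash
#!/usr/bin/env bash
# Minimal sketch of the Ralph loop shape. Illustrative only:
# `agent` is a hypothetical CLI and DONE is an assumed completion marker.
set -uo pipefail

MAX_ITERATIONS=50

for ((i = 1; i <= MAX_ITERATIONS; i++)); do
  echo "=== Iteration $i: spawning a fresh agent instance ==="

  # Fresh context each time: the agent re-reads AGENT.md, the spec, and the plan.
  agent --prompt-file AGENT.md || echo "agent exited non-zero; looping anyway"

  # Check for an explicit completion signal rather than parsing agent output.
  if [[ -f DONE ]]; then
    echo "Completion signal found after $i iteration(s)."
    break
  fi
done
```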
### Process Guides (read before you start)

| File | Purpose |
|---|---|
| `SPEC-CREATION-GUIDE.md` | Start here. How to create a great spec through a structured interview. Covers the interview protocol, domain-knowledge extraction, and the spec quality checklist |
| `TUTORIAL.md` | Best way to learn. Complete 30-minute walkthrough building a markdown link-checker CLI tool from zero. A concrete, copy-pasteable example of the entire workflow |
| `GETTING-STARTED.md` | Practical startup/scaffolding guide for real projects: create the project root, copy templates, choose a harness mode, scaffold `.harness/`, and start the first loop cleanly |
| `CURRENT-STATE.md` | One-page executive summary of the harness: what is mature, what improved recently, and what should be improved next |
| `WAVE-BASED-MANAGEMENT.md` | ⭐ New. How to structure larger projects into waves, streams, and packets. The plan-then-implement discipline, execution boards, known-answer tests, and wave gates. Essential for projects with 10+ tasks |
| `PLAN-MANAGEMENT.md` | How `IMPLEMENTATION_PLAN.md` works — the living document agents update. Task decomposition patterns, intervention strategies, and progress tracking |
| `REVIEW-AND-QA.md` | How to evaluate agent output: when to review, what to look for, how to course-correct. Includes a review checklist template covering model attribution and TypeScript hygiene checks |
| `EVAL-INFRASTRUCTURE.md` | Consolidated guide to the harness eval stack: implementation correctness, domain correctness, regression protection, and process quality |
| `POST-RUN-VALIDATION.md` | How the harness decides work is really done after execution. Especially important for script-orchestrated runtimes, which must not blindly trust agent self-reporting |
| `SUPERVISION.md` | Optional operations layer for unattended Ralph runs. Covers supervisor/watchdog patterns, state files, and audit trails for long-running script-orchestrated sessions |
| `WORKFLOW-SEAMS.md` | Map of the handoffs between spec, plan, execution boards, validation evidence, review, process evals, and runtime orchestration |
| `WORKFLOW-DIAGRAM.md` | Visual map of the harness showing project phases, where each document matters most, and how the artifacts connect across the lifecycle |
| `COST-OPTIMIZATION.md` | Getting more work per dollar: request-based vs token-based billing, optimal strategies per provider, a model selection guide, the hybrid strategy, and anti-patterns |
| `OPENCLAW-INTEGRATION.md` | Running the harness in OpenClaw with `sessions_spawn`, cron jobs, and shell scripts. Model selection, monitoring, and cost optimization |
| `TROUBLESHOOTING.md` | When things go wrong: the five failure modes (stuck loop, drift, overengineering, test theater, context overflow) and how to fix each |
| `PARALLEL-AGENTS.md` | Running multiple agents simultaneously on independent tasks: when to parallelize, how to split work, how to merge results, conflict resolution, and OpenClaw patterns |
Examples & Reference
| File | Purpose |
|---|---|
EXAMPLES.md |
Worked example: Fintrove-style finance app spec + comparison of three approaches (Ezward, Ralph Wiggum, Nate Jones) |
CHANGELOG.md |
Version history and evolution of the agent harness project itself |
## Quick Start

### Runtime Models
There are two different harness runtime models in this system, and it helps to keep them separate:
#### 1. Agent-Orchestrated Runtime
This is the OpenClaw/manual-orchestration model.
- a supervising agent decides what to run next
- that agent can inspect execution boards, validation evidence, git history, and prior results
- that agent can spawn sub-agents, review outcomes, and adapt the workflow dynamically
Use this when:
- you want a smart orchestrator in the loop
- you want sub-agent fan-out
- you want richer judgment between iterations
Primary guide: `OPENCLAW-INTEGRATION.md`
#### 2. Script-Orchestrated Runtime
This is the `ralph-loop.sh` model.
- the shell script is the orchestrator
- the script must interpret completion/stuck/error signals itself
- any judgment the supervising agent would normally provide must be encoded into runtime checks
Use this when:
- you want a portable terminal-native loop
- you want tmux/background shell operation
- you want minimal dependencies beyond the CLI agent itself
Important implications:
- if the script is the orchestrator, reliability has to come from explicit checks, not from assuming the agent will always judge correctly (see the sketch after this list)
- for long unattended runs, add a separate supervisor/watchdog layer rather than assuming tmux alone is sufficient
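To make that concrete, here is a hedged sketch of what explicit checks can look like after each iteration. The `.harness/COMPLETE` marker and `npm test` command are assumptions chosen for the example; `POST-RUN-VALIDATION.md` defines the real completion contract:

```bash
#!/usr/bin/env bash
# Sketch of explicit post-iteration checks (assumed conventions:
# a .harness/COMPLETE marker file and `npm test` as the test command).

validate_iteration() {
  # 1. Completion is a file the agent must create, not a phrase it prints.
  [[ -f .harness/COMPLETE ]] || return 1

  # 2. Re-run the test suite ourselves instead of trusting the agent's report.
  npm test || return 1

  # 3. Require that the iteration actually produced a commit.
  git log -1 --since="30 minutes ago" --oneline | grep -q . || return 1
}

if validate_iteration; then
  echo "Iteration verified: marker present, tests pass, commit recorded."
else
  echo "Verification failed; restarting the loop." >&2
fi
```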
### New to the Harness? (Start Here)

- Read `CURRENT-STATE.md` — understand what the harness is good at right now
- Read `WORKFLOW-DIAGRAM.md` — get the phase map before diving into details
- Read `GETTING-STARTED.md` — scaffold a real project cleanly
- Use `scaffold-project.sh` or `new-harness-project` if you want the fastest reliable setup
- Read `TUTORIAL.md` — 30-minute hands-on walkthrough building a real CLI tool
- Read `SPEC-CREATION-GUIDE.md` — learn the interview protocol
- Read `TASK-SPEC-TEMPLATE.md` — learn the packet-sized contract for non-trivial delegation
- Use `PROJECT-KICKOFF.md` in your new project as the readiness checklist
- Try it — build your own project using the workflow
Ready to Build? (Simple project, <10 tasks)
- Read
COST-OPTIMIZATION.md— understand your billing model before you start burning budget - Interview — work with your agent to create the spec (or do it solo)
- Fill out
PROJECT-SPEC.mdwith your problem definition - Read
EVAL-INFRASTRUCTURE.mdif the project has calculations, regulated logic, or other high-cost-to-be-wrong behavior - Read
POST-RUN-VALIDATION.mdif the runtime will need to validate task completion mechanically - Read
SUPERVISION.mdif the script-orchestrated runtime will run unattended for hours - Copy
PROJECT-SPEC.md,AGENT.md, andDECISIONS.mdinto your project root - Choose a runtime:
./ralph-loop.shfor the script-orchestrated model- OpenClaw sessions/sub-agents for the agent-orchestrated model
- Review at phase boundaries using
REVIEW-AND-QA.mdchecklist - Troubleshoot failures using
TROUBLESHOOTING.md
For unattended script-orchestrated runs, consider adding an optional supervisor/watchdog wrapper around `ralph-loop.sh` so that process death, stale waits, and silent stalls can be detected independently of the tmux pane. The optional guide and starter templates live in `SUPERVISION.md`, `supervise-ralph-loop.template.sh`, and `audit-ralph-loop.template.sh`.
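As a rough illustration of the pattern (not the contents of the real templates), a watchdog can be as small as the following; the PID-file path and the 15-minute staleness threshold are assumptions chosen for the example:

```bash
#!/usr/bin/env bash
# Minimal watchdog sketch. Runs outside the loop's tmux pane.
# Assumed conventions: loop PID recorded in .harness/ralph.pid,
# and "no commit for 15 minutes" treated as a possible stall.

PID_FILE=".harness/ralph.pid"
STALL_SECONDS=$((15 * 60))

while sleep 60; do
  # Detect process death independently of tmux.
  if ! kill -0 "$(cat "$PID_FILE" 2>/dev/null)" 2>/dev/null; then
    echo "$(date -u +%FT%TZ) loop process is gone" >> watchdog.log
    continue
  fi

  # Detect silent stalls: a healthy run keeps producing commits.
  last_commit=$(git log -1 --format=%ct 2>/dev/null || echo 0)
  if (( $(date +%s) - last_commit > STALL_SECONDS )); then
    echo "$(date -u +%FT%TZ) no commit for 15+ minutes" >> watchdog.log
  fi
done
```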
### Building Something Larger? (10+ tasks, multiple features)

- Read `WAVE-BASED-MANAGEMENT.md` — the plan-then-implement discipline
- Read `WORKFLOW-SEAMS.md` — understand how the harness artifacts hand off to each other
- Read `EVAL-INFRASTRUCTURE.md` — define the eval stack before implementation starts
- Read `POST-RUN-VALIDATION.md` — define how the runtime will decide packet completion is real
- Create your `IMPLEMENTATION_PLAN.md` with all tasks grouped into waves
- Create `.harness/EXECUTION_MASTER.md` — your wave/stream dashboard
- For each stream: copy `EXECUTION-BOARD-TEMPLATE.md`, fill ALL packets before coding any
- For non-trivial delegated packets: create a task spec from `TASK-SPEC-TEMPLATE.md`
- After each packet: copy `VALIDATION-TEMPLATE.md` and fill it in
- After each stream: copy `PROCESS-EVAL-TEMPLATE.md` and write the retrospective
- At each wave boundary: run the wave gate checklist before starting the next wave
If you use `ralph-loop.sh` for a larger project, pass the active board explicitly:

```bash
./ralph-loop.sh --board .harness/<stream>/execution-board.md
```
## The Core Insight
All successful agent approaches share the same loop:
Orient (read spec + plan) → Pick ONE task → Build → Test → Commit → Exit → Restart fresh
The spec defines WHAT. The plan tracks WHERE we are. Fresh context each iteration prevents drift. The human reviews and course-corrects.
In the script-orchestrated runtime, some of that review must be encoded into the loop itself. In the agent-orchestrated runtime, a supervising agent can supply more of that judgment dynamically. Task specs improve the preconditions for delegation. Post-run validation improves the postconditions.
See each file for detailed instructions.
## When to Use Which Guide

```
┌─────────────────────────────────────────────────┐
│ "Which guide do I need?" │
├─────────────────────────────────────────────────┤
│ │
│ Just starting a real project? │
│ → GETTING-STARTED.md │
│ │
│ Want it scaffolded for you? │
│ → scaffold-project.sh / new-harness-project │
│ │
│ Need a kickoff checklist inside the project? │
│ → PROJECT-KICKOFF.md │
│ │
│ Want the one-page status view? │
│ → CURRENT-STATE.md │
│ │
│ Want the visual phase map? │
│ → WORKFLOW-DIAGRAM.md │
│ │
│ Want hands-on learning? │
│ → TUTORIAL.md (hands-on learning) │
│ │
│ Creating a spec? │
│ → SPEC-CREATION-GUIDE.md (interview) │
│ │
│ Delegating a non-trivial task? │
│ → TASK-SPEC-TEMPLATE.md │
│ │
│ Agent is stuck? │
│ → TROUBLESHOOTING.md (failure modes) │
│ │
│ Reviewing agent output? │
│ → REVIEW-AND-QA.md (what to check) │
│ │
│ Need a full eval strategy? │
│ → EVAL-INFRASTRUCTURE.md │
│ │
│ Need runtime completion checks? │
│ → POST-RUN-VALIDATION.md │
│ │
│ Need unattended-run supervision? │
│ → SUPERVISION.md │
│ │
│ Confused about how docs hand off? │
│ → WORKFLOW-SEAMS.md │
│ │
│ Worried about cost? │
│ → COST-OPTIMIZATION.md (billing models) │
│ │
│ Multiple independent features? │
│ → PARALLEL-AGENTS.md (coordination) │
│ │
│ Using OpenClaw? │
│ → OPENCLAW-INTEGRATION.md (sessions_spawn) │
│ │
│ Agent keeps changing past decisions? │
│ → DECISIONS.md (ADR template) │
│ │
│ Want to see it in action? │
│ → EXAMPLES.md (real project example) │
│ │
└─────────────────────────────────────────────────┘
```
## Philosophy

### Fresh Context > Long Context

Each iteration starts with a fresh agent. No accumulated confusion, no stale reasoning. The git history and plan file provide continuity.

### One Task > Many Tasks

Agents that try to do everything in one session produce spaghetti. Agents that focus on ONE task produce clean commits.

### Spec Quality > Agent Quality

A great spec with a mediocre agent beats a vague spec with a great agent. The spec is your leverage point.

### Review > Repair

It's easier to review and guide than to debug and fix. Catch drift early through periodic reviews.

### Explicit > Implicit

Agents can't read your mind. Write down constraints, anti-patterns, and decisions. What's obvious to you is invisible to the agent.
## Contributing
This harness is a living system. If you:
- Discover new failure modes
- Develop better patterns
- Find gaps in the guides
- Create examples for other project types
Document them and contribute back. The harness improves as we learn what works.
## Version

Current version: 2.0.0 (see `CHANGELOG.md` for history)

## License

Public domain. Use it, modify it, share it. No attribution required.
The harness doesn't write code. It creates conditions where agents can write code reliably.