
Project Specification: Agent Harness System

1. Project Overview

What are we building?

The Agent Harness System is a collection of templates, scripts, and best practices for running autonomous AI-powered coding agents on complex software projects. It provides a structured framework to decompose large projects into manageable tasks, execute them iteratively with fresh agent contexts, and maintain high-quality code through mandatory testing and verification.

Why does it matter?

Traditional AI coding assistants struggle with large, multi-step projects due to context window limitations and the need for iterative refinement. The Agent Harness addresses this by providing a "Ralph Wiggum Loop" mechanism that spawns fresh agents for each task iteration, preventing context drift while maintaining project coherence through structured documentation and git-based memory.

Success criteria

  • Agents can autonomously decompose complex project specs into testable tasks
  • Fresh agent iterations prevent context overflow and stale reasoning
  • Mandatory build/test cycles ensure code quality
  • Git history serves as reliable inter-iteration memory
  • System works with multiple AI agents (Claude, Codex, etc.)
  • Clear signals for completion, stuck states, and errors
  • Comprehensive documentation enables easy adoption

2. Technical Foundation

Tech stack

  • Language: Bash (for the loop script), Markdown (for templates)
  • Tools: Git, shell commands, AI agent CLIs (claude, codex)
  • Build system: N/A (templates for various project types)
  • Test framework: Project-specific (agents run their own tests)
  • Package manager: N/A

Project structure

docs/agent-harness/
├── README.md              # Quick overview and file purposes
├── AGENT-INSTRUCTIONS.md  # Template for agent system prompts
├── PROJECT-SPEC.md        # Template for project specifications
├── ralph-loop.sh          # The loop execution script
└── EXAMPLES.md            # Worked examples and best practices

Build & test commands

The harness itself doesn't have build/test commands, but agents using it must define them in their PROJECT-SPEC.md.

Coding standards

  • Markdown files use consistent formatting with headers, lists, code blocks
  • Bash scripts use set -euo pipefail for error handling
  • Templates include clear placeholders and examples
  • Documentation focuses on actionable, specific guidance
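The set -euo pipefail convention above can be shown as a minimal script preamble; the ERR trap is an illustrative addition for debugging, not something the spec mandates:

```shell
#!/usr/bin/env bash
# Strict mode: -e exits on any command failure, -u errors on unset
# variables, and pipefail makes a pipeline fail if any stage fails.
set -euo pipefail

# Illustrative addition: report the failing line before exiting.
trap 'echo "Error on line $LINENO" >&2' ERR

echo "strict mode enabled"
```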

3. Requirements

Functional Requirements

FR-001: Project Specification Template

Description: A comprehensive template that captures all necessary project details for autonomous agent work.
Acceptance criteria:

  • Covers project overview, technical foundation, requirements, data models
  • Includes phasing for large projects
  • Provides reference materials and anti-patterns
  • Enables agents to work without human intervention

FR-002: Agent Instructions Template

Description: System prompt template that defines agent behavior, the core loop, and rules.
Acceptance criteria:

  • Defines senior engineer role with full codebase access
  • Specifies exact sequence: orient → plan → pick task → implement → verify → commit → exit
  • Includes output signals for loop control (<promise> tags)
  • Enforces one-task-per-iteration rule

FR-003: Ralph Wiggum Loop Script

Description: Bash script that orchestrates agent iterations with fresh contexts.
Acceptance criteria:

  • Spawns fresh agent processes each iteration
  • Supports planning mode and build mode
  • Monitors output signals for completion/stuck/error states
  • Logs all iterations for debugging
  • Configurable max iterations and agent type

FR-004: Implementation Plan Management

Description: Dynamic task decomposition and tracking system.
Acceptance criteria:

  • Agents create IMPLEMENTATION_PLAN.md from project spec
  • Tasks ordered by dependency with checkboxes
  • Plan updated after each completed task
  • Git commits preserve plan history
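An IMPLEMENTATION_PLAN.md satisfying these criteria might look like the following excerpt; the phase and task names are hypothetical, loosely borrowed from the FinPlan example later in this document:

```markdown
# IMPLEMENTATION_PLAN.md

## Phase 1: Foundation
- [x] Set up project scaffolding and build configuration
- [x] Define database schema and migrations
- [ ] Implement QFX transaction import

## Phase 2: Features
- [ ] Add categorization rules engine

## Notes
- QFX import: parser chosen in iteration 3; see git log for rationale.
```

Each completed iteration flips one checkbox and commits the updated plan, so git history doubles as the plan's changelog.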

FR-005: Quality Assurance Integration

Description: Mandatory build and test verification in each iteration.
Acceptance criteria:

  • Agents run project-specific build commands
  • All tests must pass before committing
  • Build failures prevent progression
  • Linting enforced if configured

Non-Functional Requirements

NFR-001: Simplicity

  • No complex dependencies or frameworks
  • Works with standard shell and git
  • Easy to copy templates into any project
  • Minimal setup required

NFR-002: Reliability

  • Fresh contexts prevent reasoning drift
  • Git history provides audit trail
  • Clear error signals for human intervention
  • Handles agent failures gracefully

NFR-003: Flexibility

  • Supports multiple AI agents (Claude, Codex, etc.)
  • Works with various project types and tech stacks
  • Configurable iteration limits and modes
  • Extensible for custom workflows

4. Data Model

The Agent Harness is documentation-focused, not data-focused. The "data" is the project files themselves.

Entities

Entity: Project Spec

  • Overview: what/why/success criteria
  • Technical foundation: stack, structure, commands
  • Requirements: functional/non-functional
  • Data model: project-specific entities
  • Architecture: constraints, decisions
  • Phasing: optional breakdown
  • References: docs, examples, anti-patterns

Entity: Implementation Plan

  • Tasks: discrete, testable, dependency-ordered
  • Status: checkbox per task
  • Notes: agent comments on stuck tasks
  • History: git commits track plan evolution

Entity: Agent Iteration

  • Context: fresh read of spec + plan + git log
  • Task: one unchecked item from plan
  • Changes: code modifications + tests
  • Verification: build + test results
  • Commit: descriptive message + plan update

Relationships

  • Project Spec → Implementation Plan (agent creates from spec)
  • Implementation Plan → Agent Iterations (one task per iteration)
  • Agent Iterations → Git Commits (each iteration commits changes)

5. API / Interface Design

The harness provides command-line interfaces:

ralph-loop.sh Commands

./ralph-loop.sh              # Build mode (default)
./ralph-loop.sh plan         # Planning mode
./ralph-loop.sh --max 20     # Limit iterations
./ralph-loop.sh --agent claude  # Specify agent

Template Files

  • PROJECT-SPEC.md: Fill with project details
  • AGENT.md: Copy from AGENT-INSTRUCTIONS.md
  • IMPLEMENTATION_PLAN.md: Generated by agent

Output Signals

Agents output special tags that the loop monitors:

  • <promise>PLANNED</promise>: Plan created
  • <promise>DONE</promise>: All tasks complete
  • <promise>STUCK</promise>: Needs human help
  • <promise>ERROR</promise>: Unrecoverable error
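One way a loop can watch for these tags is to grep the agent's captured output; the following is a sketch, assuming the output has been saved to a log file (get_signal is a hypothetical helper, not part of ralph-loop.sh):

```shell
#!/usr/bin/env bash
set -euo pipefail

# Extract the last <promise> signal from an agent's captured output.
# Prints the signal name (PLANNED/DONE/STUCK/ERROR) or "NONE".
get_signal() {
  local log_file="$1"
  local signal
  signal=$(grep -o '<promise>[A-Z]*</promise>' "$log_file" | tail -1 \
    | sed 's/<promise>\(.*\)<\/promise>/\1/') || true
  echo "${signal:-NONE}"
}

# Demo with a fake log file.
log=$(mktemp)
echo 'Task complete. <promise>DONE</promise>' > "$log"
get_signal "$log"   # prints: DONE
rm -f "$log"
```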

6. Architecture Decisions

Constraints

  • MUST: Use fresh agent contexts each iteration
  • MUST: One task per agent iteration
  • MUST: Mandatory build/test verification
  • MUST NOT: Allow context compaction or memory accumulation
  • PREFER: Git as the coordination mechanism
  • PREFER: Simple bash orchestration over complex frameworks

Dependencies

  • Git (version control)
  • AI agent CLI (claude, codex, etc.)
  • Shell environment (bash)
  • Project-specific build tools (npm, etc.)

Known Challenges

  • Context window limitations of AI agents
  • Maintaining coherence across iterations
  • Handling agent failures or stuck states
  • Balancing specificity vs flexibility in templates

7. Phasing (Optional)

The harness itself is complete in one phase, but projects using it should phase their work.

Phase 1: Foundation

  • Copy templates into project
  • Fill PROJECT-SPEC.md
  • Run planning mode to create IMPLEMENTATION_PLAN.md

Phase 2: Execution

  • Run build iterations until completion
  • Monitor for stuck/error signals
  • Intervene as needed

Phase 3: Refinement

  • Review final codebase
  • Update templates based on lessons learned
  • Document improvements for future use

8. Reference Materials

External docs

  • Geoffrey Huntley's Ralph Wiggum approach
  • Nate Jones's task decomposition method
  • Ezward's sequential PRD style
  • OpenClaw sessions_spawn documentation

Existing code to learn from

  • ralph-loop.sh: Clean bash scripting with error handling
  • Templates: Structured markdown with clear sections
  • Examples: Real-world project specifications

Anti-patterns

  • Don't try to pass context between iterations
  • Don't let agents work on multiple tasks simultaneously
  • Don't skip build/test verification
  • Don't use complex orchestration when bash loop suffices
  • Don't make templates too rigid — they should be adapted per project

All Template Files and Their Roles

AGENT-INSTRUCTIONS.md

Role: System prompt template for the AI agent. Defines the senior engineer role, core workflow loop, strict rules, and output signals. Agents read this each iteration to understand their behavior.

Key Sections:

  • Role definition and capabilities
  • Core loop: orient → plan/pick → implement → verify → commit → exit
  • Rules: one task per iteration, mandatory testing, no over-engineering
  • Output signals: <promise> tags for loop control
  • Context management: fresh starts with git as memory

PROJECT-SPEC.md

Role: Comprehensive project definition template. The single source of truth that agents read every iteration. Captures all requirements, constraints, and context needed for autonomous work.

Key Sections:

  • Project overview (what, why, success criteria)
  • Technical foundation (stack, structure, commands)
  • Detailed requirements (functional + non-functional)
  • Data models and API design
  • Architecture decisions and constraints
  • Phasing and reference materials

ralph-loop.sh

Role: Bash script implementing the Ralph Wiggum Loop mechanism. Orchestrates agent iterations, monitors completion signals, handles errors, and maintains logs.

Key Features:

  • Fresh agent spawning each iteration
  • Planning mode vs build mode
  • Signal monitoring (<promise> tags)
  • Configurable agents and iteration limits
  • Comprehensive logging

EXAMPLES.md

Role: Worked examples, comparisons of approaches, and best practices. Shows how to write good specs, compares different methodologies, and provides integration examples.

Key Content:

  • Comparison of Ezward/Ralph/Nate approaches
  • Complete FinPlan project spec example
  • Best practices for spec writing
  • OpenClaw integration examples

The Ralph Wiggum Loop Mechanism

The Ralph Wiggum Loop is named after the Simpsons character who forgets everything immediately, forcing a fresh start every time. This is the core innovation:

How It Works

  1. Fresh Context Each Time: Every iteration spawns a completely new agent process with no accumulated context from previous runs.

  2. Read-Only Memory: Agents rely on:

    • PROJECT-SPEC.md (static requirements)
    • IMPLEMENTATION_PLAN.md (current task status)
    • Git log (recent changes)
    • Codebase state
    • Test results
  3. One Task Per Iteration: Agents pick exactly one unchecked task, implement it completely, verify with build/tests, commit, and exit.

  4. Signal-Based Control: Agents output <promise> tags that the bash loop monitors to determine the next action.

  5. Git as Coordination: Each iteration's changes are committed, creating an audit trail and allowing the next agent to see what was done.
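The five points above can be sketched as a bash loop. In this sketch, run_agent is a stub standing in for the real agent CLI (claude, codex, etc.), and the function names are illustrative rather than taken from ralph-loop.sh:

```shell
#!/usr/bin/env bash
set -euo pipefail

# Stub standing in for the real agent CLI invocation (claude, codex, ...).
run_agent() { echo 'committed work. <promise>DONE</promise>'; }

ralph_loop() {
  local max="$1" log_dir
  log_dir=$(mktemp -d)   # the real script logs to .ralph-logs/
  for i in $(seq 1 "$max"); do
    local log="$log_dir/iteration-$i.log"
    # 1. Fresh context: spawn a brand-new agent process each iteration.
    run_agent > "$log" 2>&1 || true
    # 4. Signal-based control: the loop only inspects the output tags.
    if grep -q '<promise>DONE</promise>' "$log"; then
      echo "All tasks complete after $i iteration(s)."; return 0
    elif grep -qE '<promise>(STUCK|ERROR)</promise>' "$log"; then
      echo "Human intervention needed (see $log)." >&2; return 1
    fi
    # 5. Git as coordination: the agent committed before exiting, so the
    #    next fresh agent sees its work via git log. Loop again.
  done
  echo "Reached max iterations without DONE signal." >&2; return 1
}

ralph_loop 50
```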

Benefits

  • Prevents context window overflow
  • Eliminates stale reasoning problems
  • Enables indefinite project scaling
  • Provides clear intervention points
  • Maintains code quality through iteration

Flow Diagram

Start Loop
├── Read PROJECT-SPEC.md
├── Run Agent with Fresh Context
├── Agent: Orient (read plan, git log)
├── Agent: Pick ONE Task
├── Agent: Implement + Verify
├── Agent: Commit + Mark Done
├── Check Output Signals
├── If DONE: Exit Success
├── If STUCK/ERROR: Exit with Warning
└── Else: Loop Again

How to Use for Autonomous Coding Workflows

Quick Start

  1. Copy templates into your project root
  2. Fill out PROJECT-SPEC.md with complete project details
  3. Run ./ralph-loop.sh plan to generate IMPLEMENTATION_PLAN.md
  4. Run ./ralph-loop.sh to start autonomous building
  5. Monitor progress; intervene if agent gets stuck

Detailed Workflow

  1. Preparation:

    • Choose project directory
    • Copy all 4 template files
    • Customize PROJECT-SPEC.md with your requirements
    • Ensure build/test commands work
  2. Planning Phase:

    • Run ./ralph-loop.sh plan
    • Agent reads spec and creates task decomposition
    • Review IMPLEMENTATION_PLAN.md for completeness
  3. Build Iterations:

    • Run ./ralph-loop.sh --max 50 (or your preferred limit)
    • Each iteration: fresh agent → one task → verify → commit
    • Loop continues until DONE or max iterations
  4. Monitoring:

    • Check .ralph-logs/ for iteration details
    • Look for STUCK/ERROR signals requiring intervention
    • Review git log for progress
  5. Intervention:

    • If stuck: update IMPLEMENTATION_PLAN.md with notes
    • If error: fix the issue and restart loop
    • If plan needs changes: edit and restart
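The monitoring steps above reduce to a few shell one-liners; a sketch, using a temporary directory to stand in for .ralph-logs/:

```shell
#!/usr/bin/env bash
set -euo pipefail

# Demo setup: fake log directory standing in for .ralph-logs/.
logs=$(mktemp -d)
printf 'building...\n' > "$logs/iteration-1.log"
printf 'blocked. <promise>STUCK</promise>\n' > "$logs/iteration-2.log"

# List the logs whose iteration needs human attention.
grep -lE '<promise>(STUCK|ERROR)</promise>' "$logs"/*.log

# In a real project you would also review progress with:
#   git log --oneline -20
rm -rf "$logs"
```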

Configuration Options

  • --max N: Limit iterations (default 50)
  • --agent claude|codex: Choose AI agent
  • plan mode: Just create implementation plan
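These options imply a small argument parser. The following is a sketch of how it might look, not the script's actual parsing code; the claude default for --agent is an assumption:

```shell
#!/usr/bin/env bash
set -euo pipefail

MODE="build"          # default mode
MAX_ITERATIONS=50     # default limit (matches the documented --max default)
AGENT="claude"        # assumed default agent CLI

while [ $# -gt 0 ]; do
  case "$1" in
    plan)    MODE="plan"; shift ;;
    --max)   MAX_ITERATIONS="$2"; shift 2 ;;
    --agent) AGENT="$2"; shift 2 ;;
    *) echo "Unknown option: $1" >&2; exit 1 ;;
  esac
done

echo "mode=$MODE max=$MAX_ITERATIONS agent=$AGENT"
```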

Examples and Use Cases

Personal Finance App (FinPlan)

Complete example in EXAMPLES.md showing:

  • Privacy-first local finance dashboard
  • Transaction import, categorization, projections
  • Monte Carlo retirement simulations
  • Tech stack: TypeScript, Express, SQLite, vanilla JS
  • 15+ features decomposed into phases

Key Patterns from Examples

  • Be Specific: Acceptance criteria like "Parse QFX files and extract: date, amount, payee, memo, type"
  • Define Tech Stack: Don't let agents choose — specify "TypeScript, Express.js, SQLite"
  • Include Data Models: Explicit entity definitions with constraints
  • Phase Large Projects: Independent deployable phases
  • Anti-Patterns: "Don't use localStorage — SQLite is source of truth"

Use Cases

  • Complex Web Apps: Multi-feature applications with databases
  • Libraries/Frameworks: API design and implementation
  • Data Processing: ETL pipelines, analysis tools
  • CLI Tools: Command-line utilities with multiple commands
  • Prototypes to Production: Start with working prototype, iterate to full product

Integration with OpenClaw sessions_spawn

OpenClaw provides sessions_spawn for agent orchestration, offering an alternative to the bash loop.

Basic Usage

# Planning phase
sessions_spawn --task "Read PROJECT-SPEC.md. Decompose into tasks. Write IMPLEMENTATION_PLAN.md." --model opus

# Build iterations
sessions_spawn --task "Read AGENT.md. Follow core loop: pick one task, implement, test, commit." --model sonnet

Advanced Integration

  • Parallel Tasks: Spawn multiple agents for independent tasks
  • Different Models: Use opus for planning, sonnet for coding
  • Cron Scheduling: Automate iterations with cron jobs
  • Channel Output: Direct results to specific channels

Benefits Over Bash Loop

  • Model selection per task type
  • Parallel execution for independent work
  • Integration with OpenClaw's session management
  • Richer output formatting and notifications

When to Use Each

  • Ralph Loop: Simple sequential projects, bash environments
  • OpenClaw: Complex projects, parallel work, advanced features

Best Practices for Agent-Driven Development

Writing Project Specs

  1. Be Exhaustively Specific: Include exact acceptance criteria, not vague requirements
  2. Define Everything: Tech stack, directory structure, build commands, coding standards
  3. Provide Examples: Sample data, API responses, UI mockups
  4. Phase Appropriately: Break large projects into independent phases
  5. Document Constraints: What MUST/MUST NOT do, plus preferences
  6. Include Anti-Patterns: Lessons from previous attempts

Agent Instructions

  1. Role Definition: Clear capabilities and limitations
  2. Strict Rules: One task per iteration, mandatory testing, no refactoring unrelated code
  3. Clear Signals: Use <promise> tags for loop control
  4. Context Boundaries: Fresh start each time, rely on files/git

Loop Management

  1. Monitor Logs: Check .ralph-logs/ for issues
  2. Set Reasonable Limits: --max 20-50 iterations depending on project size
  3. Plan Reviews: Always review IMPLEMENTATION_PLAN.md after planning phase
  4. Intervention Ready: Be prepared to help when agents get stuck

Quality Assurance

  1. Test Everything: Unit, integration, end-to-end tests
  2. Build Verification: Every iteration must pass build
  3. Code Standards: Lint, format, document consistently
  4. Manual Reviews: Spot-check critical functionality

Scaling Up

  1. Phase Work: Complete foundations before features
  2. Parallel Execution: Use OpenClaw for independent tasks
  3. Iterative Refinement: Start with working prototype, enhance gradually
  4. Documentation Updates: Improve templates based on lessons learned

Common Pitfalls

  • Vague Specs: Leads to agent confusion and poor decomposition
  • Missing Build/Test: Code quality suffers without verification
  • Context Sharing: Don't try to pass state between iterations
  • Over-Parallelization: Dependencies must be respected
  • Ignoring Signals: STUCK/ERROR states need attention

This system transforms AI coding assistants from helpful sidekicks into autonomous development partners capable of delivering complete, tested software projects.