From 3e269a4d4c8381a911aeb72627741f99c7a70f8b Mon Sep 17 00:00:00 2001
From: Paul Huliganga
Date: Tue, 24 Mar 2026 22:39:05 -0400
Subject: [PATCH] docs(runbook): add agent harness failure modes and recovery guide

---
 RUNBOOK.md | 186 +++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 186 insertions(+)
 create mode 100644 RUNBOOK.md

diff --git a/RUNBOOK.md b/RUNBOOK.md
new file mode 100644
index 0000000..4ff079c
--- /dev/null
+++ b/RUNBOOK.md
@@ -0,0 +1,186 @@
+# Recipe Manager Agentic Runbook
+
+Last updated: 2026-03-24
+
+## Purpose
+Operational guide for running the Recipe Manager agent harness reliably.
+
+---
+
+## Core Execution Model
+
+- One task per iteration
+- One commit per iteration
+- TODO.md is the authoritative queue
+- Work only in:
+  `/home/paulh/.openclaw/workspace/projects/recipe-manager`
+
+---
+
+## Required Guards (Must Pass Before Coding)
+
+### Pre-flight checks
+Before any iteration starts, verify these files exist:
+- `AGENT_INSTRUCTIONS.md`
+- `TODO.md`
+
+If missing, fail with:
+`STUCK: bad working dir or missing harness files at /home/paulh/.openclaw/workspace/projects/recipe-manager`
+
+---
+
+## Monitoring Signals (How we know it's working)
+
+A run is healthy only when all 3 are true:
+1. Active session updated recently (`recipe-v1-iter*`)
+2. New git commits are landing
+3. TODO checkboxes advance
+
+---
+
+## Known Failure Modes and Fixes
+
+## 1) Wrong working directory
+### Symptom
+Agent says AGENT_INSTRUCTIONS.md / TODO.md missing in `/workspace`.
+
+### Root cause
+Spawner started outside project root.
+
+### Fix
+- Force absolute project path in every task prompt
+- Add mandatory pre-flight guard
+- Relaunch fresh iteration
+
+---
+
+## 2) False “iteration already running”
+### Symptom
+Auto-iterator repeatedly prints SKIP even when no coding progress occurs.
+
+### Root cause
+It treated stale historical sessions as active.
+
+### Fix
+- Treat a session as active only if updated recently (freshness window)
+- Use current phase labels only (`recipe-v1-iter*`)
+
+---
+
+## 3) Label mismatch across phases
+### Symptom
+Monitor reports wrong status or misses active runs.
+
+### Root cause
+MVP labels (`recipe-mvp-*`) used during v1 phase.
+
+### Fix
+- Update monitor + iterator to phase-specific labels
+- Standardize naming per phase:
+  - MVP: `recipe-mvp-iter*`
+  - v1: `recipe-v1-iter*`
+
+---
+
+## 4) Model/provider auth mismatch
+### Symptom
+Cron jobs fail with:
+- `No API key found for provider openai`
+- or Copilot cooldown rate-limit errors
+
+### Root cause
+Using `openai/...` models without OpenAI API key.
+
+### Fix
+- Use OAuth provider model prefix: `openai-codex/...`
+- For this project, prefer:
+  `openai-codex/gpt-5.3-codex`
+
+---
+
+## 5) Environment capability mismatch (Docker)
+### Symptom
+Task fails with `docker: command not found`.
+
+### Root cause
+Agent runtime host lacks Docker.
+
+### Fix
+- Mark as manual host validation task
+- Continue with unblocked tasks
+
+---
+
+## 6) Runtime module mismatch (ESM/CommonJS)
+### Symptom
+Backend runtime error: `require is not defined`.
+
+### Root cause
+Using `require()` in ESM code path.
+
+### Fix
+- Replace `require('fs')` calls with ESM imports (`writeFileSync`)
+- Build + rerun server
+
+---
+
+## Operational Controls
+
+## Pause automation
+Disable both jobs:
+- Recipe Manager Auto-Iterator
+- Recipe Manager Progress Monitor
+
+## Resume automation
+Enable both jobs, then manually kick one fresh iteration.
+
+## Manual override iteration (safe restart)
+Spawn one explicit iteration with:
+- absolute project path
+- pre-flight guard
+- one-task/one-commit rule
+
+---
+
+## Completion Definition
+
+A phase is complete when:
+1. No unchecked tasks remain in that phase section of TODO.md
+2. Latest iteration exits without STUCK/ERROR
+3. Commit + TODO update are present
+
+---
+
+## Recommended Cadence
+
+- Auto-iterator: every 15 minutes
+- Progress monitor: every 5 minutes (high visibility mode)
+
+If noisy, set monitor to every 10–15 minutes.
+
+---
+
+## Handoff Checklist (Before ending a session)
+
+- [ ] Confirm latest commit hash
+- [ ] Confirm active phase + next unchecked task
+- [ ] Confirm auto-iterator enabled/disabled status
+- [ ] Confirm monitor enabled/disabled status
+- [ ] Confirm no stale active-session false positives
+
+---
+
+## Quick Status Commands
+
+### Latest commit
+`git log -1 --oneline`
+
+### Next tasks
+`grep -n "^- \[ \]" TODO.md | head`
+
+### Recent progress
+`git log --oneline -5`
+
+---
+
+This runbook should be updated whenever a new failure mode appears.
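
The pre-flight check described under "Required Guards" could be expressed as a small shell function along the following lines. This is a minimal sketch under stated assumptions, not the harness's actual implementation: the function name `preflight_check` and the `PROJECT_DIR` variable are hypothetical names introduced here for illustration. Only the two required file names and the `STUCK:` message are taken from the runbook.

```shell
#!/usr/bin/env bash
# Pre-flight guard sketch. `preflight_check` and PROJECT_DIR are
# hypothetical names; the required files and the STUCK message follow
# the "Pre-flight checks" section of the runbook.

PROJECT_DIR="${PROJECT_DIR:-/home/paulh/.openclaw/workspace/projects/recipe-manager}"

# Returns 0 if both harness files exist in the given directory.
# Otherwise prints the STUCK message to stderr and returns 1.
preflight_check() {
  local dir="$1"
  local required
  for required in AGENT_INSTRUCTIONS.md TODO.md; do
    if [ ! -f "$dir/$required" ]; then
      echo "STUCK: bad working dir or missing harness files at $dir" >&2
      return 1
    fi
  done
  return 0
}

# Example call site (left commented out so sourcing this file has no side effects):
# preflight_check "$PROJECT_DIR" || exit 1
```

Passing the directory as an argument, rather than relying on the current working directory, mirrors the runbook's advice to force an absolute project path, so the guard still fails loudly if the spawner started somewhere else.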