docs(runbook): add agent harness failure modes and recovery guide
parent d4aed475a2
commit 3e269a4d4c
@ -0,0 +1,186 @@
# Recipe Manager Agentic Runbook

Last updated: 2026-03-24

## Purpose
Operational guide for running the Recipe Manager agent harness reliably.

---

## Core Execution Model

- One task per iteration
- One commit per iteration
- TODO.md is the authoritative queue
- Work only in:
  `/home/paulh/.openclaw/workspace/projects/recipe-manager`
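
Because TODO.md is the queue, the task for an iteration is simply its first unchecked item. A minimal sketch of that selection, assuming tasks are written as `- [ ]` checkboxes at the start of a line; `next_task` is a hypothetical helper name, not part of the harness:

```shell
# Print the first unchecked task in a TODO file (hypothetical helper).
# Returns non-zero when no unchecked task remains.
next_task() {
  local todo_file="${1:-TODO.md}"
  # -m1 stops after the first match, so exactly one task is selected.
  grep -m1 '^- \[ \]' "$todo_file"
}
```

Taking only the first match keeps the one-task-per-iteration rule mechanical: an iteration cannot pick up a second task by accident.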

---

## Required Guards (Must Pass Before Coding)

### Pre-flight checks
Before any iteration starts, verify that these files exist in the project root:
- `AGENT_INSTRUCTIONS.md`
- `TODO.md`

If either file is missing, fail with:
`STUCK: bad working dir or missing harness files at /home/paulh/.openclaw/workspace/projects/recipe-manager`
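
The guard above could be sketched as a small shell function. This is an illustration under assumptions: the function name `preflight`, the `PROJECT_DIR` variable, and writing the message to stderr are all hypothetical choices, not the harness's actual implementation.

```shell
# Pre-flight guard: refuse to start unless the harness files are present.
PROJECT_DIR="/home/paulh/.openclaw/workspace/projects/recipe-manager"

preflight() {
  local dir="${1:-$PROJECT_DIR}"
  local file
  for file in AGENT_INSTRUCTIONS.md TODO.md; do
    if [ ! -f "$dir/$file" ]; then
      # Same message the runbook specifies, with the checked directory filled in.
      echo "STUCK: bad working dir or missing harness files at $dir" >&2
      return 1
    fi
  done
}
```

A caller would run `preflight || exit 1` before doing any work, so a wrong working directory stops the iteration before code is touched.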

---

## Monitoring Signals (How we know it's working)

A run is healthy only when all 3 are true:
1. Active session updated recently (`recipe-v1-iter*`)
2. New git commits are landing
3. TODO checkboxes advance
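
The third signal can be checked mechanically by comparing the number of checked boxes between two monitor runs. A rough sketch, assuming completed tasks appear as `- [x]` at the start of a line; `count_checked` and `todo_advanced` are hypothetical names, and where the previous count is stored is left open:

```shell
# Count completed checkboxes ("- [x]" or "- [X]") in a TODO file.
count_checked() {
  # grep -c prints 0 but exits non-zero when nothing matches; ignore that status.
  grep -ci '^- \[x\]' "${1:-TODO.md}" || true
}

# Succeeds when more boxes are checked now than at the previous observation.
todo_advanced() {
  local previous_count="$1" todo_file="${2:-TODO.md}"
  [ "$(count_checked "$todo_file")" -gt "$previous_count" ]
}
```

For example, a monitor that recorded 7 checked boxes on its last run would call `todo_advanced 7` and treat a failure as "no TODO progress since last check".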

---

## Known Failure Modes and Fixes

## 1) Wrong working directory
### Symptom
The agent reports that `AGENT_INSTRUCTIONS.md` or `TODO.md` is missing in `/workspace`.

### Root cause
The spawner started the agent outside the project root.

### Fix
- Force the absolute project path in every task prompt
- Add the mandatory pre-flight guard
- Relaunch a fresh iteration

---

## 2) False “iteration already running”
### Symptom
The auto-iterator repeatedly prints SKIP even though no coding progress occurs.

### Root cause
The auto-iterator treated stale historical sessions as active.

### Fix
- Treat a session as active only if it was updated recently (within a freshness window)
- Match current phase labels only (`recipe-v1-iter*`)
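
A freshness window can be expressed as a modification-time test. The sketch below rests on a loud assumption: that each session leaves behind a file whose mtime tracks its last update. The harness may expose session timestamps differently, and both the name `is_session_fresh` and the 30-minute default are illustrative.

```shell
# Succeeds only if the session file was modified within the last N minutes.
# Assumption: each session has a file whose mtime tracks its last update.
is_session_fresh() {
  local session_file="$1" window_minutes="${2:-30}"
  # find prints the path only when its mtime is newer than the window.
  [ -n "$(find "$session_file" -maxdepth 0 -mmin "-$window_minutes" 2>/dev/null)" ]
}
```

With a check like this, the iterator would print SKIP only when `is_session_fresh` succeeds for a session carrying the current phase label. A missing or old file counts as "not active".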

---

## 3) Label mismatch across phases
### Symptom
The monitor reports the wrong status or misses active runs.

### Root cause
MVP labels (`recipe-mvp-*`) were still in use during the v1 phase.

### Fix
- Update monitor + iterator to phase-specific labels
- Standardize naming per phase:
  - MVP: `recipe-mvp-iter*`
  - v1: `recipe-v1-iter*`
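
One way to keep the monitor and the iterator in agreement is to derive the label pattern from the phase in a single place. A small sketch using the two patterns listed above; the function names and the phase keys `mvp` and `v1` are hypothetical:

```shell
# Map a phase name to the session-label glob used in this runbook.
phase_label_glob() {
  case "$1" in
    mvp) echo 'recipe-mvp-iter*' ;;
    v1)  echo 'recipe-v1-iter*' ;;
    *)   echo "unknown phase: $1" >&2; return 1 ;;
  esac
}

# Succeeds when a session label belongs to the given phase.
label_matches_phase() {
  local label="$1" phase="$2" glob
  glob="$(phase_label_glob "$phase")" || return 1
  # Unquoted $glob is intentional: case treats it as a pattern.
  case "$label" in
    $glob) return 0 ;;
    *)     return 1 ;;
  esac
}
```

If both jobs call the same helper, switching phases means changing one argument rather than editing patterns in two scripts.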

---

## 4) Model/provider auth mismatch
### Symptom
Cron jobs fail with:
- `No API key found for provider openai`
- or Copilot cooldown rate-limit errors

### Root cause
Using `openai/...` models without an OpenAI API key configured.

### Fix
- Use the OAuth provider model prefix: `openai-codex/...`
- For this project, prefer:
  `openai-codex/gpt-5.3-codex`
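
A job definition could be validated before it is scheduled by checking the model prefix. This is only a sketch of such a check; `check_model_id`, its messages, and its exit codes are hypothetical and not part of any provider tooling:

```shell
# Reject model ids that need an OpenAI API key; accept the OAuth-backed prefix.
check_model_id() {
  case "$1" in
    openai-codex/*) return 0 ;;
    openai/*)
      echo "model '$1' needs an OpenAI API key; use the openai-codex/ prefix" >&2
      return 1 ;;
    *)
      # Other providers are outside the scope of this failure mode.
      echo "model '$1' is not covered by this check" >&2
      return 2 ;;
  esac
}
```

The `openai-codex/*` branch is listed first for readability. The two patterns do not overlap, because `openai/*` requires a slash immediately after `openai`.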

---

## 5) Environment capability mismatch (Docker)
### Symptom
Task fails with `docker: command not found`.

### Root cause
Agent runtime host lacks Docker.

### Fix
- Mark as manual host validation task
- Continue with unblocked tasks
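
A task that depends on host tooling can test for it up front instead of failing midway. A minimal sketch using the standard `command -v` builtin; the helper name `require_commands` and the `BLOCKED:` message are hypothetical:

```shell
# Succeeds when every named command is available on PATH.
require_commands() {
  local cmd missing=0
  for cmd in "$@"; do
    if ! command -v "$cmd" >/dev/null 2>&1; then
      echo "BLOCKED: '$cmd' not found on this host; needs manual host validation" >&2
      missing=1
    fi
  done
  [ "$missing" -eq 0 ]
}
```

A Docker-dependent task would start with `require_commands docker`, and on failure be marked for manual host validation so the iterator can move on to the next unblocked task.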

---

## 6) Runtime module mismatch (ESM/CommonJS)
### Symptom
Backend runtime error: `require is not defined`.

### Root cause
Using `require()` in an ESM code path.

### Fix
- Replace `require('fs')` calls with ESM imports (for example, `import { writeFileSync } from 'fs'`)
- Build + rerun server
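
Before rebuilding, it can help to list any `require(` calls that are still left in the backend sources. A rough sketch, assuming the sources live under a directory such as `src` and use `.js`, `.mjs`, or `.ts` extensions (both are assumptions about the project layout); `find_require_calls` is a hypothetical helper:

```shell
# List remaining require( calls under a source directory as file:line:text.
# Empty output means none were found.
find_require_calls() {
  local src_dir="${1:-src}"
  find "$src_dir" -type f \( -name '*.js' -o -name '*.mjs' -o -name '*.ts' \) \
    -exec grep -Hn 'require(' {} +
}
```

This is a plain text search, so it also reports matches inside comments and strings. Treat the output as a list to review, not as proof of a bug.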

---

## Operational Controls

## Pause automation
Disable both jobs:
- Recipe Manager Auto-Iterator
- Recipe Manager Progress Monitor

## Resume automation
Enable both jobs, then manually kick one fresh iteration.

## Manual override iteration (safe restart)
Spawn one explicit iteration with:
- absolute project path
- pre-flight guard
- one-task/one-commit rule

---

## Completion Definition

A phase is complete when:
1. No unchecked tasks remain in that phase section of TODO.md
2. Latest iteration exits without STUCK/ERROR
3. Commit + TODO update are present
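
The first condition can be checked per phase by counting open checkboxes inside one section of TODO.md. The sketch below assumes each phase is a `## <name>` heading followed by `- [ ]` / `- [x]` items. That layout is an assumption about TODO.md, and `unchecked_in_phase` is a hypothetical helper.

```shell
# Count unchecked tasks under one "## <heading>" section of a TODO file.
unchecked_in_phase() {
  local heading="$1" todo_file="${2:-TODO.md}"
  awk -v want="## $heading" '
    /^## /                   { in_section = ($0 == want); next }
    in_section && /^- \[ \]/ { count++ }
    END                      { print count + 0 }
  ' "$todo_file"
}
```

A result of `0` for the active phase satisfies condition 1. Conditions 2 and 3 still need the iteration's exit status and the commit history.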

---

## Recommended Cadence

- Auto-iterator: every 15 minutes
- Progress monitor: every 5 minutes (high visibility mode)

If noisy, set monitor to every 10–15 minutes.

---

## Handoff Checklist (Before ending a session)

- [ ] Confirm latest commit hash
- [ ] Confirm active phase + next unchecked task
- [ ] Confirm auto-iterator enabled/disabled status
- [ ] Confirm monitor enabled/disabled status
- [ ] Confirm no stale active-session false positives

---

## Quick Status Commands

### Latest commit
`git log -1 --oneline`

### Next tasks
`grep -n "^- \[ \]" TODO.md | head`

### Recent progress
`git log --oneline -5`

---

This runbook should be updated whenever a new failure mode appears.