Recipe Manager Agentic Runbook

Last updated: 2026-03-24

Purpose

Operational guide for running the Recipe Manager agent harness reliably.


Core Execution Model

  • One task per iteration
  • One commit per iteration
  • TODO.md is the authoritative queue
  • Work only in: /home/paulh/.openclaw/workspace/projects/recipe-manager

Required Guards (Must Pass Before Coding)

Pre-flight checks

Before any iteration starts, verify these files exist:

  • AGENT_INSTRUCTIONS.md
  • TODO.md

If missing, fail with: STUCK: bad working dir or missing harness files at /home/paulh/.openclaw/workspace/projects/recipe-manager
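The pre-flight check above could be sketched as a small shell function. This is a minimal illustration, not the harness's actual guard: the function name `preflight` and the optional root argument are assumptions added so the check can be exercised against any directory.

```shell
# Pre-flight guard sketch (hypothetical helper, not part of the harness).
# Usage: preflight [PROJECT_ROOT]
# Returns 0 when both harness files exist, otherwise prints the STUCK
# message from this runbook to stderr and returns 1.
preflight() {
  root="${1:-/home/paulh/.openclaw/workspace/projects/recipe-manager}"
  for f in AGENT_INSTRUCTIONS.md TODO.md; do
    if [ ! -f "$root/$f" ]; then
      echo "STUCK: bad working dir or missing harness files at $root" >&2
      return 1
    fi
  done
  return 0
}
```

A caller would typically run `preflight || exit 1` before doing any work, so a wrong working directory stops the iteration before it touches code.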


Monitoring Signals (How we know it's working)

A run is healthy only when all three are true:

  1. Active session updated recently (recipe-v1-iter*)
  2. New git commits are landing
  3. TODO checkboxes advance

Known Failure Modes and Fixes

1) Wrong working directory

Symptom

Agent reports that AGENT_INSTRUCTIONS.md or TODO.md is missing in /workspace.

Root cause

Spawner started outside project root.

Fix

  • Force absolute project path in every task prompt
  • Add mandatory pre-flight guard
  • Relaunch fresh iteration

2) False “iteration already running”

Symptom

Auto-iterator repeatedly prints SKIP even when no coding progress occurs.

Root cause

The auto-iterator treated stale historical sessions as active.

Fix

  • Treat a session as active only if updated recently (freshness window)
  • Use current phase labels only (recipe-v1-iter*)
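The two fixes above amount to a label filter plus a freshness test. The sketch below shows one way to combine them. The function names, the epoch-seconds inputs, and the 1800-second default window are all assumptions for illustration; this runbook does not specify how session timestamps are stored or how long the freshness window is.

```shell
# is_fresh UPDATED_EPOCH [WINDOW_SECONDS] [NOW_EPOCH]
# Succeeds when the session was updated within the window.
# The 1800 s default is an assumed value, not one taken from the harness.
is_fresh() {
  updated="$1"
  window="${2:-1800}"
  now="${3:-$(date +%s)}"
  [ $(( now - updated )) -le "$window" ]
}

# is_current_label LABEL
# Only v1-phase labels count as current; MVP labels are ignored.
is_current_label() {
  case "$1" in
    recipe-v1-iter*) return 0 ;;
    *) return 1 ;;
  esac
}

# session_is_active LABEL UPDATED_EPOCH [WINDOW_SECONDS] [NOW_EPOCH]
# A session blocks a new iteration only if it is both current and fresh.
session_is_active() {
  is_current_label "$1" && is_fresh "$2" "${3:-1800}" "${4:-$(date +%s)}"
}
```

With this shape, a stale `recipe-v1-iter*` session or any `recipe-mvp-*` session no longer causes a SKIP.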

3) Label mismatch across phases

Symptom

Monitor reports wrong status or misses active runs.

Root cause

MVP labels (recipe-mvp-*) used during v1 phase.

Fix

  • Update monitor + iterator to phase-specific labels
  • Standardize naming per phase:
    • MVP: recipe-mvp-iter*
    • v1: recipe-v1-iter*

4) Model/provider auth mismatch

Symptom

Cron jobs fail with:

  • No API key found for provider openai
  • or Copilot cooldown rate-limit errors

Root cause

Using openai/... models without OpenAI API key.

Fix

  • Use OAuth provider model prefix: openai-codex/...
  • For this project, prefer: openai-codex/gpt-5.3-codex

5) Environment capability mismatch (Docker)

Symptom

Task fails with docker: command not found.

Root cause

Agent runtime host lacks Docker.

Fix

  • Mark as manual host validation task
  • Continue with unblocked tasks
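One way to detect this before a task fails is to probe for the tool up front and route the task accordingly. The helper names and the `manual-host-validation` marker below are hypothetical; only `command -v` is standard shell.

```shell
# has_tool NAME: succeeds when NAME resolves to a command on this host.
has_tool() {
  command -v "$1" >/dev/null 2>&1
}

# classify_task TOOL: prints "run" when the tool is available,
# otherwise prints the (hypothetical) marker "manual-host-validation".
classify_task() {
  if has_tool "$1"; then
    echo "run"
  else
    echo "manual-host-validation"
  fi
}
```

For example, `classify_task docker` on a host without Docker prints `manual-host-validation`, which signals that the task should be deferred to a human while the iterator moves on to unblocked tasks.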

6) Runtime module mismatch (ESM/CommonJS)

Symptom

Backend runtime error: require is not defined.

Root cause

Using require() in ESM code path.

Fix

  • Replace require('fs') calls with ESM imports, e.g. import { writeFileSync } from 'fs'
  • Build + rerun server

Operational Controls

Pause automation

Disable both jobs:

  • Recipe Manager Auto-Iterator
  • Recipe Manager Progress Monitor

Resume automation

Enable both jobs, then manually kick one fresh iteration.

Manual override iteration (safe restart)

Spawn one explicit iteration with:

  • absolute project path
  • pre-flight guard
  • one-task/one-commit rule

Workflow Periodic Execution (cron + systemd)

All commands assume project root: /home/paulh/.openclaw/workspace/projects/recipe-manager

Manual commands

# Resume from checkpoint (default mode)
npm run workflow:run

# Force restart from stage 1
npm run workflow:run -- --mode restart

# Scheduled run entrypoint (resume + morning report)
npm run workflow:schedule

# Health signal for automation (0=healthy, 1=failed/blocked/unknown)
npm run workflow:health-check

Cron example

Run scheduler every 15 minutes, health check every 5 minutes:

*/15 * * * * cd /home/paulh/.openclaw/workspace/projects/recipe-manager && /usr/bin/npm run workflow:schedule >> /home/paulh/.openclaw/workspace/projects/recipe-manager/status/workflow-schedule.log 2>&1
*/5 * * * * cd /home/paulh/.openclaw/workspace/projects/recipe-manager && /usr/bin/npm run workflow:health-check >> /home/paulh/.openclaw/workspace/projects/recipe-manager/status/workflow-health.log 2>&1

systemd example

Create one-shot services and timers:

/etc/systemd/system/recipe-workflow-schedule.service

[Unit]
Description=Recipe Manager scheduled workflow run
After=network.target

[Service]
Type=oneshot
WorkingDirectory=/home/paulh/.openclaw/workspace/projects/recipe-manager
ExecStart=/usr/bin/npm run workflow:schedule

/etc/systemd/system/recipe-workflow-schedule.timer

[Unit]
Description=Run Recipe Manager scheduled workflow every 15 minutes

[Timer]
OnCalendar=*:0/15
Persistent=true

[Install]
WantedBy=timers.target

/etc/systemd/system/recipe-workflow-health.service

[Unit]
Description=Recipe Manager workflow health check
After=network.target

[Service]
Type=oneshot
WorkingDirectory=/home/paulh/.openclaw/workspace/projects/recipe-manager
ExecStart=/usr/bin/npm run workflow:health-check

/etc/systemd/system/recipe-workflow-health.timer

[Unit]
Description=Run Recipe Manager workflow health check every 5 minutes

[Timer]
OnCalendar=*:0/5
Persistent=true

[Install]
WantedBy=timers.target

Enable timers:

sudo systemctl daemon-reload
sudo systemctl enable --now recipe-workflow-schedule.timer recipe-workflow-health.timer

Troubleshooting failed/blocked status

When npm run workflow:health-check returns exit code 1 with {"status":"failed"} or {"status":"blocked"}:

  1. Check current workflow status payload:
    cat status/workflow-status.json
    
  2. Check recent progress log entries:
    tail -n 50 status/workflow-progress.jsonl
    
  3. Retry from checkpoint:
    npm run workflow:run
    
  4. If still blocked/failed, force a clean restart:
    npm run workflow:run -- --mode restart
    
  5. Re-run health check and confirm healthy output (idle, running, or completed):
    npm run workflow:health-check
    

If status file is missing or malformed, the health check prints status_read_failed and exits 1; regenerate state with npm run workflow:run -- --mode restart.
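The escalation in steps 3 to 5 (retry from checkpoint, then clean restart, re-checking health after each) can be sketched as a wrapper. The `recover` function and its result strings are hypothetical; the commands are passed in as arguments so the same logic works with the npm scripts above or with stubs.

```shell
# recover HEALTH_CMD RESUME_CMD RESTART_CMD
# Tries the cheapest recovery first and reports which step worked.
# Each argument is a command line that is word-split when executed.
# Command output goes to stderr so stdout carries only the result.
recover() {
  health="$1"
  resume="$2"
  restart="$3"
  if $health 1>&2; then echo "healthy"; return 0; fi
  $resume 1>&2 || true          # step 3: retry from checkpoint
  if $health 1>&2; then echo "recovered-by-resume"; return 0; fi
  $restart 1>&2 || true         # step 4: force a clean restart
  if $health 1>&2; then echo "recovered-by-restart"; return 0; fi
  echo "still-unhealthy"
  return 1
}
```

From the project root this might be invoked as `recover "npm run workflow:health-check" "npm run workflow:run" "npm run workflow:run -- --mode restart"`. A `still-unhealthy` result means the status payload and progress log from steps 1 and 2 need manual inspection.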


Completion Definition

A phase is complete when:

  1. No unchecked tasks remain in that phase section of TODO.md
  2. Latest iteration exits without STUCK/ERROR
  3. Commit + TODO update are present
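Criteria 1 and 2 can be checked mechanically. The sketch below is an assumption-laden illustration: it counts unchecked boxes across the whole file passed in rather than a single phase section (the section layout of TODO.md is not described here), and it takes the latest iteration's output as a string argument. Criterion 3 still needs a look at the git history.

```shell
# unchecked_count FILE: prints the number of "- [ ]" lines in FILE.
# grep -c exits 1 when the count is 0, hence the "|| true".
unchecked_count() {
  grep -c '^- \[ \]' "$1" || true
}

# phase_complete TODO_FILE LAST_ITERATION_OUTPUT
# Succeeds when no unchecked tasks remain and the output contains
# neither STUCK nor ERROR. (Hypothetical helper.)
phase_complete() {
  todo="$1"
  last_output="$2"
  [ "$(unchecked_count "$todo")" -eq 0 ] || return 1
  case "$last_output" in
    *STUCK*|*ERROR*) return 1 ;;
  esac
  return 0
}
```

To scope the check to one phase, the caller would first extract that phase's section of TODO.md into a temporary file and pass that file instead.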

Recommended Cadence

  • Auto-iterator: every 15 minutes
  • Progress monitor: every 5 minutes (high visibility mode)

If the monitor is too noisy, set it to every 10-15 minutes.


Handoff Checklist (Before ending a session)

  • Confirm latest commit hash
  • Confirm active phase + next unchecked task
  • Confirm auto-iterator enabled/disabled status
  • Confirm monitor enabled/disabled status
  • Confirm no stale active-session false positives

Quick Status Commands

Latest commit

git log -1 --oneline

Next tasks

grep -n "^- \[ \]" TODO.md | head

Recent progress

git log --oneline -5


This runbook should be updated whenever a new failure mode appears.

See also: INCIDENT_LOG.md for timestamped operational incidents and fixes.