Incident Log — Recipe Manager Harness

Purpose: track operational failures, impact, root cause, and permanent fixes.


Template

[YYYY-MM-DD HH:MM TZ] Incident Title

  • Severity: Low / Medium / High
  • Status: Open / Mitigated / Resolved
  • Detected by: Monitor / Human / Agent
  • Impact:
    • What stopped or degraded
    • Duration
  • Symptoms:
    • Exact error text
    • Observable behavior
  • Root cause:
    • Why it happened
  • Immediate mitigation:
    • What was done to restore service
  • Permanent fix:
    • Config/code/process changes
  • Verification:
    • How we confirmed it works
  • Prevention follow-up:
    • Guardrails/tests added
  • Links:
    • Commit(s):
    • Related files:
    • Session/cron IDs:

Recorded Incidents

[2026-03-24 08:00 EDT] Auto-iterator/monitor stalls due to model auth mismatch

  • Severity: High
  • Status: Resolved
  • Detected by: Human
  • Impact:
    • Iterations stopped for ~10 hours
    • No new recipe-manager commits during outage
  • Symptoms:
    • Cron failures: No API key found for provider "openai"
    • Repeated job errors with no productive iteration
  • Root cause:
    • Cron jobs referenced an openai/... model path, which requires an API key, while the environment was authenticated only via openai-codex OAuth
  • Immediate mitigation:
    • Disabled broken jobs
    • Manually spawned recovery iterations
  • Permanent fix:
    • Cron jobs updated to openai-codex/gpt-5.3-codex
  • Verification:
    • Iterations resumed and commits landed again
  • Prevention follow-up:
    • Runbook updated with provider-prefix rule
  • Links:
    • Related files: RUNBOOK.md
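
The provider-prefix rule from this incident could be checked mechanically before a cron job is scheduled. The following is a minimal sketch, not the harness's actual code: the function names `provider_of` and `check_model_path`, and the idea of passing in the set of authenticated providers, are assumptions introduced for illustration.

```python
from __future__ import annotations


def provider_of(model_path: str) -> str:
    """Return the provider prefix of a 'provider/model' path."""
    provider, sep, model = model_path.partition("/")
    if not sep or not provider or not model:
        raise ValueError(f"expected 'provider/model', got {model_path!r}")
    return provider


def check_model_path(model_path: str, authenticated: set[str]) -> None:
    """Raise if the job's provider is not one the environment is authenticated for.

    `authenticated` is a hypothetical input: how the harness discovers which
    providers have credentials is not described in this log.
    """
    provider = provider_of(model_path)
    if provider not in authenticated:
        raise RuntimeError(
            f"model path {model_path!r} uses provider {provider!r}, "
            f"but only {sorted(authenticated)} are authenticated"
        )
```

With this sketch, `check_model_path("openai-codex/gpt-5.3-codex", {"openai-codex"})` passes silently, while any `openai/`-prefixed path raises before the job ever runs.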

[2026-03-24 21:40 EDT] Iteration skips due to stale session detection + wrong working dir

  • Severity: High
  • Status: Resolved
  • Detected by: Human + monitor alerts
  • Impact:
    • Auto-iterator repeatedly skipped or produced STUCK responses
  • Symptoms:
    • SKIP: iteration already running with no new commit
    • STUCK: ... AGENT_INSTRUCTIONS.md and TODO.md missing from /workspace
  • Root cause:
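    • Stale, already-completed sessions were counted as active, so new iterations were skipped
    • Iteration prompts sometimes lacked an explicit project-root guard, so agents started in /workspace instead of the project root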
  • Immediate mitigation:
    • Spawned manual iteration with absolute path + pre-flight checks
  • Permanent fix:
    • Added mandatory pre-flight guard in AGENT_INSTRUCTIONS.md
    • Updated auto-iterator to require absolute path and freshness-based active-run detection
  • Verification:
    • New iterations completed successfully with commits:
      • 87e9181 (import test)
      • 276e03c (import UI page/form)
      • d4aed47 (parsed preview)
  • Prevention follow-up:
    • Monitor updated to track recipe-v1-iter* labels for v1 phase
  • Links:
    • Commit(s): 37b17f7, d4aed47, 276e03c, 87e9181
    • Related files: AGENT_INSTRUCTIONS.md, TODO.md, RUNBOOK.md
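
The two permanent fixes for this incident (a pre-flight guard and freshness-based active-run detection) might look roughly like the sketch below. It is illustrative only: the function names, the `last_activity` timestamp, and the 30-minute freshness threshold are assumptions, not values taken from the harness.

```python
from __future__ import annotations

import time
from pathlib import Path

# Files named in the STUCK symptom for this incident.
REQUIRED_FILES = ("AGENT_INSTRUCTIONS.md", "TODO.md")


def preflight(project_root: str) -> Path:
    """Fail fast unless project_root is absolute and contains the required files."""
    root = Path(project_root)
    if not root.is_absolute():
        raise ValueError(f"project root must be an absolute path: {project_root!r}")
    missing = [name for name in REQUIRED_FILES if not (root / name).is_file()]
    if missing:
        raise FileNotFoundError(f"missing from {root}: {', '.join(missing)}")
    return root


def is_active(
    last_activity: float,
    now: float | None = None,
    max_age_s: float = 1800.0,  # hypothetical threshold: 30 minutes
) -> bool:
    """Count a session as active only if it showed activity within max_age_s seconds."""
    now = time.time() if now is None else now
    return (now - last_activity) <= max_age_s
```

Judging activity by recency rather than by the mere existence of a session record means a session that finished long ago stops blocking new iterations once it ages past the threshold.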

[2026-03-24 17:55 EDT] Docker validation blocked in runtime host

  • Severity: Medium
  • Status: Mitigated (manual follow-up required)
  • Detected by: Agent
  • Impact:
    • Could not complete the local Docker deployment test from the agent environment
  • Symptoms:
    • docker: command not found
  • Root cause:
    • Runtime host lacks Docker CLI/daemon
  • Immediate mitigation:
    • Marked task as manual host validation
  • Permanent fix:
    • Keep as explicit manual step in TODO for host with Docker installed
  • Verification:
    • Non-Docker dev run validated manually and separately; the Docker deployment test itself remains unverified
  • Prevention follow-up:
    • Documented as environment capability mismatch in RUNBOOK.md
  • Links:
    • Commit(s): 1a4b984
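
An agent could detect this capability mismatch up front instead of failing on `docker: command not found`. The sketch below is one possible approach under stated assumptions: the function names and the returned labels are hypothetical. `shutil.which` only confirms that a `docker` executable is on PATH, not that a daemon is running.

```python
import shutil


def docker_cli_available() -> bool:
    """Return True if a `docker` executable is found on PATH."""
    return shutil.which("docker") is not None


def plan_docker_validation(available: bool) -> str:
    """Run the Docker test if possible, otherwise defer it to a manual host step."""
    return "run" if available else "manual-host-validation"
```

Calling `plan_docker_validation(docker_cli_available())` at the start of a task lets the agent mark the step for manual host validation immediately, which matches the mitigation recorded above.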