docs(ops): add INCIDENT_LOG and link from RUNBOOK

Paul Huliganga 2026-03-24 22:41:49 -04:00
parent 3e269a4d4c
commit 4c512a5161
2 changed files with 111 additions and 0 deletions

INCIDENT_LOG.md Normal file

@ -0,0 +1,109 @@
# Incident Log — Recipe Manager Harness
Purpose: track operational failures, impact, root cause, and permanent fixes.
---
## Template
## [YYYY-MM-DD HH:MM TZ] Incident Title
- **Severity:** Low / Medium / High
- **Status:** Open / Mitigated / Resolved
- **Detected by:** Monitor / Human / Agent
- **Impact:**
  - What stopped or degraded
  - Duration
- **Symptoms:**
  - Exact error text
  - Observable behavior
- **Root cause:**
  - Why it happened
- **Immediate mitigation:**
  - What was done to restore service
- **Permanent fix:**
  - Config/code/process changes
- **Verification:**
  - How we confirmed it works
- **Prevention follow-up:**
  - Guardrails/tests added
- **Links:**
  - Commit(s):
  - Related files:
  - Session/cron IDs:
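
The template fields can also be checked mechanically before an entry is committed. The following is a minimal sketch, assuming entries keep the bold `**Field:**` labels exactly as written in the template; `REQUIRED_FIELDS` and `missing_fields` are hypothetical names, not part of the harness.

```python
# Hypothetical checker: reports template fields that an entry does not contain.
# Assumes each field appears as a bold label, e.g. "**Severity:**".
REQUIRED_FIELDS = (
    "Severity",
    "Status",
    "Detected by",
    "Impact",
    "Symptoms",
    "Root cause",
    "Immediate mitigation",
    "Permanent fix",
    "Verification",
    "Prevention follow-up",
    "Links",
)


def missing_fields(entry_text: str) -> list:
    """Return the template field names that are absent from entry_text."""
    return [name for name in REQUIRED_FIELDS if f"**{name}:**" not in entry_text]
```

An empty result means every label from the template is present; it does not check that the fields are filled in.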
---
## Recorded Incidents
## [2026-03-24 08:00 EDT] Auto-iterator/monitor stalls due to model auth mismatch
- **Severity:** High
- **Status:** Resolved
- **Detected by:** Human
- **Impact:**
  - Iterations stopped for ~10 hours
  - No new recipe-manager commits during outage
- **Symptoms:**
  - Cron failures: `No API key found for provider "openai"`
  - Repeated job errors with no productive iteration
- **Root cause:**
  - Cron jobs used `openai/...` model path (API-key provider) while environment was authenticated via `openai-codex` OAuth
- **Immediate mitigation:**
  - Disabled broken jobs
  - Manually spawned recovery iterations
- **Permanent fix:**
  - Cron jobs updated to `openai-codex/gpt-5.3-codex`
- **Verification:**
  - Iterations resumed and commits landed again
- **Prevention follow-up:**
  - Runbook updated with provider-prefix rule
- **Links:**
  - Related files: RUNBOOK.md
## [2026-03-24 21:40 EDT] Iteration skips due to stale session detection + wrong working dir
- **Severity:** High
- **Status:** Resolved
- **Detected by:** Human + monitor alerts
- **Impact:**
  - Auto-iterator repeatedly skipped or produced STUCK responses
- **Symptoms:**
  - `SKIP: iteration already running` with no new commit
  - `STUCK: ... AGENT_INSTRUCTIONS.md and TODO.md missing from /workspace`
- **Root cause:**
  - Stale completed sessions counted as active
  - Iteration prompts sometimes lacked explicit project-root guard
- **Immediate mitigation:**
  - Spawned manual iteration with absolute path + pre-flight checks
- **Permanent fix:**
  - Added mandatory pre-flight guard in AGENT_INSTRUCTIONS.md
  - Updated auto-iterator to require absolute path and freshness-based active-run detection
- **Verification:**
  - New iterations completed successfully with commits:
    - `87e9181` (import test)
    - `276e03c` (import UI page/form)
    - `d4aed47` (parsed preview)
- **Prevention follow-up:**
  - Monitor updated to track `recipe-v1-iter*` labels for v1 phase
- **Links:**
  - Commit(s): `37b17f7`, `d4aed47`, `276e03c`, `87e9181`
  - Related files: AGENT_INSTRUCTIONS.md, TODO.md, RUNBOOK.md
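
The two fixes from this incident, the pre-flight guard and freshness-based active-run detection, can be sketched roughly as follows. This is not the harness's actual implementation: the function names, the session fields (`last_activity_ts`, `completed`), and the 30-minute threshold are all assumptions made for illustration.

```python
import time
from pathlib import Path
from typing import Optional

# Files named in the STUCK message; their presence marks the project root.
REQUIRED_FILES = ("AGENT_INSTRUCTIONS.md", "TODO.md")
# Assumed threshold: a session with no activity for this long is stale.
MAX_AGE_SECONDS = 30 * 60


def preflight(project_root: str) -> Path:
    """Require an absolute project root that contains the expected files."""
    root = Path(project_root)
    if not root.is_absolute():
        raise ValueError(f"project root must be absolute, got {project_root!r}")
    missing = [name for name in REQUIRED_FILES if not (root / name).is_file()]
    if missing:
        raise FileNotFoundError(f"missing from {root}: {', '.join(missing)}")
    return root


def is_active(
    last_activity_ts: float,
    completed: bool,
    now: Optional[float] = None,
    max_age: float = MAX_AGE_SECONDS,
) -> bool:
    """Count a session as active only if it is unfinished and recently active."""
    if completed:
        return False  # completed sessions never block a new iteration
    current = time.time() if now is None else now
    return (current - last_activity_ts) <= max_age
```

With a check like `is_active`, a completed or long-idle session no longer produces `SKIP: iteration already running`, and `preflight` fails fast when the iteration starts in the wrong directory.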
## [2026-03-24 17:55 EDT] Docker validation blocked in runtime host
- **Severity:** Medium
- **Status:** Mitigated (manual follow-up required)
- **Detected by:** Agent
- **Impact:**
  - Could not complete local docker deployment test from agent environment
- **Symptoms:**
  - `docker: command not found`
- **Root cause:**
  - Runtime host lacks Docker CLI/daemon
- **Immediate mitigation:**
  - Marked task as manual host validation
- **Permanent fix:**
  - Keep as explicit manual step in TODO for host with Docker installed
- **Verification:**
  - Manual non-docker dev run validated separately
- **Prevention follow-up:**
  - Documented as environment capability mismatch in RUNBOOK.md
- **Links:**
  - Commit: `1a4b984`
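
A capability check at the start of a task would let the agent route Docker validation to the manual step instead of failing with `docker: command not found`. The sketch below is a hedged example: `docker_available` and `plan_validation` are hypothetical names, and the check only looks for the CLI on `PATH`, so it does not prove that a daemon is running.

```python
import shutil


def docker_available() -> bool:
    """Return True if a `docker` executable is found on PATH (CLI only)."""
    return shutil.which("docker") is not None


def plan_validation(has_docker: bool) -> str:
    """Choose the validation route based on the host's capabilities."""
    return "docker" if has_docker else "manual-host-validation"


# Example: decide the route for the current host.
route = plan_validation(docker_available())
```

Keeping the decision in `plan_validation` separate from the probe makes the fallback easy to test on hosts with or without Docker.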


@ -184,3 +184,5 @@ If noisy, set monitor to every 10-15 minutes.
---
This runbook should be updated whenever a new failure mode appears.
See also: `INCIDENT_LOG.md` for timestamped operational incidents and fixes.