docs(ops): add INCIDENT_LOG and link from RUNBOOK
parent 3e269a4d4c
commit 4c512a5161

@@ -0,0 +1,109 @@
# Incident Log — Recipe Manager Harness

Purpose: track operational failures, impact, root cause, and permanent fixes.

---

## Template

## [YYYY-MM-DD HH:MM TZ] Incident Title
- **Severity:** Low / Medium / High
- **Status:** Open / Mitigated / Resolved
- **Detected by:** Monitor / Human / Agent
- **Impact:**
  - What stopped or degraded
  - Duration
- **Symptoms:**
  - Exact error text
  - Observable behavior
- **Root cause:**
  - Why it happened
- **Immediate mitigation:**
  - What was done to restore service
- **Permanent fix:**
  - Config/code/process changes
- **Verification:**
  - How we confirmed it works
- **Prevention follow-up:**
  - Guardrails/tests added
- **Links:**
  - Commit(s):
  - Related files:
  - Session/cron IDs:

---

## Recorded Incidents

## [2026-03-24 08:00 EDT] Auto-iterator/monitor stalls due to model auth mismatch
- **Severity:** High
- **Status:** Resolved
- **Detected by:** Human
- **Impact:**
  - Iterations stopped for ~10 hours
  - No new recipe-manager commits during outage
- **Symptoms:**
  - Cron failures: `No API key found for provider "openai"`
  - Repeated job errors with no productive iteration
- **Root cause:**
  - Cron jobs used `openai/...` model path (API-key provider) while environment was authenticated via `openai-codex` OAuth
- **Immediate mitigation:**
  - Disabled broken jobs
  - Manually spawned recovery iterations
- **Permanent fix:**
  - Cron jobs updated to `openai-codex/gpt-5.3-codex`
- **Verification:**
  - Iterations resumed and commits landed again
- **Prevention follow-up:**
  - Runbook updated with provider-prefix rule
- **Links:**
  - Related files: RUNBOOK.md

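The provider-prefix rule from this incident could be enforced before a cron job starts. The following is a minimal sketch, not the harness's actual code: the function names `provider_of` and `check_model_auth` are hypothetical, and it assumes the set of authenticated providers is known to the caller.

```python
from typing import Iterable


def provider_of(model_path: str) -> str:
    """Return the provider prefix of a 'provider/model' path."""
    provider, sep, model = model_path.partition("/")
    if not sep or not provider or not model:
        raise ValueError(f"expected 'provider/model', got {model_path!r}")
    return provider


def check_model_auth(model_path: str, authenticated: Iterable[str]) -> None:
    """Raise if the model's provider is not among the authenticated providers."""
    provider = provider_of(model_path)
    allowed = set(authenticated)
    if provider not in allowed:
        raise RuntimeError(
            f"model {model_path!r} needs provider {provider!r}, "
            f"but only {sorted(allowed)} are authenticated"
        )


# Assumed setup: the environment is authenticated via openai-codex OAuth only.
check_model_auth("openai-codex/gpt-5.3-codex", {"openai-codex"})  # passes silently
```

Failing at job setup gives one clear error instead of repeated cron failures with no productive iteration.
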
## [2026-03-24 21:40 EDT] Iteration skips due to stale session detection + wrong working dir
- **Severity:** High
- **Status:** Resolved
- **Detected by:** Human + monitor alerts
- **Impact:**
  - Auto-iterator repeatedly skipped or produced STUCK responses
- **Symptoms:**
  - `SKIP: iteration already running` with no new commit
  - `STUCK: ... AGENT_INSTRUCTIONS.md and TODO.md missing from /workspace`
- **Root cause:**
  - Stale completed sessions counted as active
  - Iteration prompts sometimes lacked explicit project-root guard
- **Immediate mitigation:**
  - Spawned manual iteration with absolute path + pre-flight checks
- **Permanent fix:**
  - Added mandatory pre-flight guard in AGENT_INSTRUCTIONS.md
  - Updated auto-iterator to require absolute path and freshness-based active-run detection
- **Verification:**
  - New iterations completed successfully with commits:
    - `87e9181` (import test)
    - `276e03c` (import UI page/form)
    - `d4aed47` (parsed preview)
- **Prevention follow-up:**
  - Monitor updated to track `recipe-v1-iter*` labels for v1 phase
- **Links:**
  - Commit(s): `37b17f7`, `d4aed47`, `276e03c`, `87e9181`
  - Related files: AGENT_INSTRUCTIONS.md, TODO.md, RUNBOOK.md

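The two fixes in this incident, freshness-based active-run detection and a project-root pre-flight guard, might look roughly like the sketch below. It is illustrative only: the status strings, the 30-minute window, and the function names are assumptions, and the real guard is written as prose in AGENT_INSTRUCTIONS.md rather than as code.

```python
from datetime import datetime, timedelta, timezone
from pathlib import Path
from typing import Optional

# Files named in the STUCK message above.
REQUIRED_FILES = ("AGENT_INSTRUCTIONS.md", "TODO.md")
# Assumed freshness window; tune to the real iteration length.
MAX_AGE = timedelta(minutes=30)


def is_active(status: str, last_activity: datetime,
              now: Optional[datetime] = None) -> bool:
    """Count a session as active only if it is running AND recently active."""
    now = now or datetime.now(timezone.utc)
    return status == "running" and (now - last_activity) <= MAX_AGE


def preflight(project_root: str) -> Path:
    """Fail fast unless project_root is absolute and holds the required files."""
    root = Path(project_root)
    if not root.is_absolute():
        raise ValueError(f"project root must be an absolute path: {project_root!r}")
    missing = [name for name in REQUIRED_FILES if not (root / name).is_file()]
    if missing:
        raise FileNotFoundError(f"missing from {root}: {', '.join(missing)}")
    return root
```

Checking both status and recency means a completed or long-silent session no longer triggers `SKIP: iteration already running`. The pre-flight check fails with a clear error before any work starts in the wrong directory.
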
## [2026-03-24 17:55 EDT] Docker validation blocked in runtime host
- **Severity:** Medium
- **Status:** Mitigated (manual follow-up required)
- **Detected by:** Agent
- **Impact:**
  - Could not complete local docker deployment test from agent environment
- **Symptoms:**
  - `docker: command not found`
- **Root cause:**
  - Runtime host lacks Docker CLI/daemon
- **Immediate mitigation:**
  - Marked task as manual host validation
- **Permanent fix:**
  - Keep as explicit manual step in TODO for host with Docker installed
- **Verification:**
  - Manual non-docker dev run validated separately
- **Prevention follow-up:**
  - Documented as environment capability mismatch in RUNBOOK.md
- **Links:**
  - Commit: `1a4b984`

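A capability check could detect this environment mismatch up front instead of failing with `docker: command not found`. This is a hedged sketch using only the Python standard library; the function names and the `"manual-host-validation"` label are hypothetical.

```python
import shutil


def docker_available() -> bool:
    """True if a `docker` executable is on PATH.

    This does not prove that the Docker daemon is running or reachable.
    """
    return shutil.which("docker") is not None


def validation_mode() -> str:
    """Choose how the deployment test should be handled on this host."""
    return "docker" if docker_available() else "manual-host-validation"
```

An agent could call `validation_mode()` before the deployment step and record the task as a manual step when Docker is absent.
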
@@ -184,3 +184,5 @@ If noisy, set monitor to every 10–15 minutes.
---

This runbook should be updated whenever a new failure mode appears.

See also: `INCIDENT_LOG.md` for timestamped operational incidents and fixes.