docs: ADR-017 — always save best model, never just latest

Documents the root cause of losing the mountain_track model that was
doing 20-second laps at step 30k but crashed at the step-90k final eval.

Phase 2 (13k steps, simple track): final = best. Assumption carried
forward incorrectly into Wave 4 (90k steps, policy can drift).

Mandatory rule: every training script uses train_multitrack() best_model
tracking OR SB3 EvalCallback. No exceptions.

Agent: pi
Tests: 102 passed
Tests-Added: 0
TypeScript: N/A
Paul Huliganga 2026-04-17 16:03:59 -04:00
parent 4f77b8a468
commit fc01057c14
1 changed file with 35 additions and 0 deletions


@ -315,3 +315,38 @@ GP tunes learning_rate — a higher LR will more aggressively overwrite speciali
**Why:** Multiple fixes (90k step cap, checkpointing, rescue eval) were committed but the controller was never restarted. The fixes never ran. This caused several more timeouts that the fixes were meant to prevent.
## ADR-017: Always Save the BEST Model During Training, Never Just the Latest
**Date:** 2026-04-17
**Status:** Active — enforced
**Decision:** Every training script must save the best model found during
training, not just the final weights. Two mechanisms are approved:
1. `train_multitrack()` in multitrack_runner.py — tracks `best_segment_reward`,
saves `best_model.zip` on every new high score, reloads it at the end.
2. SB3 `EvalCallback(best_model_save_path=..., deterministic=True)` for
standalone scripts.
No training script may be written or run without one of these two mechanisms.
**Why this matters:**
PPO policy weights can and do drift during long training runs. A model that
could drive at step 30k may be broken at step 90k. Saving only the final
weights throws away the best model found during training.
**What was lost because this wasn't in place:**
- Wave 4 mountain_track Exp3/4/5: model was doing 20-second laps at step 30k.
Final model at step 90k crashed in 13 steps. Irrecoverable.
- An unknown number of mid-training peaks across Wave 3 and Wave 4 that were never captured.
**Root cause of the oversight:**
Phase 2 autoresearch used 13k-step trials on a simple single track, where the
final model happened to also be the best model (too few steps for the policy
to drift). That assumption was carried forward into longer multi-track
training, where it no longer held. The word "checkpoint" was also misleading:
we were saving the latest model, not the best one.
**Implementation:** See `train_multitrack()` in multitrack_runner.py — the
`best_segment_reward` tracking and `best_model.zip` save logic added 2026-04-17.