docs: TEST_HISTORY updated with Exp8 results and Exp9 plan

Exp8 results: reward peaked at 567 at step 60k, then the policy diverged. best_model.zip was saved correctly. mini_monaco crashed after a mean of 91 steps at the same corner every time, because throttle_min=0.5 is baked into the action space.

Exp9 plan: throttle_min=0.2, v5 reward unchanged. Tests the hypothesis that the v5 gradient is sufficient for the hill without a forced 0.5 minimum.

Agent: pi
Tests: 102 passed
Tests-Added: 0
TypeScript: N/A
2026-04-18 13:40:45 -04:00 · 2026-04-18 13:40:45 -04:00 · eb4fd39056
parent 041481916d
commit eb4fd39056
1 changed files with 41 additions and 0 deletions
--- a/docs/TEST_HISTORY.md
+++ b/docs/TEST_HISTORY.md
@@ -154,3 +154,44 @@ All experiments: mountain_track only, lr=0.000725, throttle_min varies, 90k step
 3. **Test combined model** on all 4 tracks — can it generalise to mini_monaco like Trial 9 did?
 4. **If yes:** We have reproduced Trial 9 reliably with a better reward function
 ### Exp 8 — Mountain track, v5 reward, throttle_min=0.5, CORRECT checkpointing ✅ COMPLETED
 - **Reward:** v5 (speed × CTE-quality)
 - **throttle_min:** 0.5
 - **Method:** Direct model.learn() loop, single TCP connection, NO close_and_switch
 - **Steps:** 90,000 total | 6,000 per segment | 15 checkpoints
 - **Circle exploit fix:** Short-lap terminates episode immediately
 - **Peak segment:** Seg 10 (step 60,000) — 567 reward / 2000 steps (FULL EVAL on mountain_track!)
 - **Policy diverged:** Seg 11-15 (segment rewards such as 31 and 20); best_model.zip captured the peak correctly
 - **Checkpoints saved:** checkpoint_0006000.zip through checkpoint_0090000.zip + best_model.zip
 - **Final eval results using best_model.zip (step 60k weights):**
 | Track | Ep1 | Ep2 | Ep3 | Mean steps | Result |
 |---|---|---|---|---|---|
 | mountain_track (training) | 382 | 529 | 182 | 364 | ❌ crashes |
 | generated_track (zero-shot) | 63 | 61 | 61 | 62 | ❌ crashes |
 | mini_monaco (zero-shot) | 154 | 155 | 104 | 138 | ❌ crashes at one corner |
 | generated_road (zero-shot) | 41 | 42 | 41 | 41 | ❌ crashes |
 - **Throttle test:** mini_monaco at throttle_min=0.5 over 5 episodes: 93/94/79/95/94 steps (mean=91). The tight spread indicates a crash at the same corner every time. A throttle_min=0.2 test was impossible because the action space is baked in at training time.
 - **Key findings:**
  1. ✅ Circle exploit fully eliminated — no short laps observed
  2. ✅ Best model saving worked — captured step 60k peak, not step 90k drift
  3. ✅ Genuine 20-22 second laps during training from step ~18k onward
  4. ❌ Model crashes at exactly the same corner on mini_monaco every time (too fast)
  5. ❌ throttle_min=0.5 baked into action space — model cannot output throttle < 0.5, cannot slow for corners
  6. 🔑 INSIGHT: v4 with throttle_min=0.2 failed because the v4 gradient is 0 on the hill. The v5 gradient is non-zero, so the model CAN learn to apply high throttle when needed even with a 0.2 floor
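
The segment schedule and best-model bookkeeping from Exp 8 can be sketched as follows. This is a minimal illustration, not the actual training script: `checkpoint_name`, `segment_end_steps`, and `BestTracker` are hypothetical names, and the real loop would call `model.learn()` and save `best_model.zip` where the comments indicate. Only the step counts, the file naming, and the rewards 567, 31, and 20 come from the log above. Which of Seg 11-15 produced 31 and 20 is assumed.

```python
from dataclasses import dataclass
from typing import List, Optional

TOTAL_STEPS = 90_000    # from the Exp 8 log
SEGMENT_STEPS = 6_000   # from the Exp 8 log


def checkpoint_name(step: int) -> str:
    """Zero-padded file name, e.g. checkpoint_0006000.zip."""
    return f"checkpoint_{step:07d}.zip"


def segment_end_steps(total: int = TOTAL_STEPS,
                      per_segment: int = SEGMENT_STEPS) -> List[int]:
    """Step count at the end of each training segment."""
    return list(range(per_segment, total + 1, per_segment))


@dataclass
class BestTracker:
    """Remembers the best evaluation reward seen so far (hypothetical helper)."""
    best_reward: float = float("-inf")
    best_step: Optional[int] = None

    def update(self, step: int, eval_reward: float) -> bool:
        """Return True when this segment beats the previous best.

        The caller would save best_model.zip only when this returns True,
        so later divergence cannot overwrite the peak weights.
        """
        if eval_reward > self.best_reward:
            self.best_reward = eval_reward
            self.best_step = step
            return True
        return False


if __name__ == "__main__":
    tracker = BestTracker()
    for step in segment_end_steps():
        # A real loop would train one segment here, evaluate, and then
        # call tracker.update(step, eval_reward).
        print(checkpoint_name(step))
```

Keeping the best-so-far comparison separate from the per-segment checkpoint save is what lets the step 60k weights survive the later drift.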
 ### Exp 9 — Mountain track, v5 reward, throttle_min=0.2 (RUNNING)
 - **Change from Exp8:** throttle_min: 0.5 → **0.2** (only change)
 - **Reward:** v5 (speed × CTE-quality) — UNCHANGED
 - **Hypothesis:** v5 reward provides non-zero gradient signal on hill (∂reward/∂speed is non-zero). 
  Model CAN learn to output high throttle on hill. With 0.2 floor, model has full range [0.2, 1.0]
  and can apply lower throttle on corners — potentially solving mini_monaco corner crash.
 - **What we never tested:** as (throttle_min, reward) pairs, (0.2, v4) failed and (0.5, v5) worked, but (0.2, v5) was never tried.
 - **Risk:** Model may still stall on hill if gradient convergence is slow in early training.
  StuckTermination (-1.0) + v5 speed gradient together should push toward higher throttle.
 - **Next test (Exp10):** Add track_progress bonus to reward (v6) — one variable at a time.
 - **Save dir:** models/exp9-mountain-v5-throttle02/
 - **Watch:** tail -f /tmp/exp9.log
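
The Exp 9 hypothesis rests on two mechanics: a throttle floor limits how slowly the car can go, and a reward of the form speed × CTE-quality has a non-zero derivative with respect to speed. The sketch below illustrates both under stated assumptions. The clipping in `scale_throttle`, the linear `quality` term, and `max_cte=2.0` are hypothetical stand-ins, since the actual v5 formula and action-space code are not shown in this log.

```python
def scale_throttle(raw: float, throttle_min: float,
                   throttle_max: float = 1.0) -> float:
    """Clip a raw throttle output into [throttle_min, throttle_max].

    Assumption: the environment enforces the floor by clipping.
    """
    return min(max(raw, throttle_min), throttle_max)


def v5_like_reward(speed: float, cte: float, max_cte: float = 2.0) -> float:
    """Hypothetical stand-in for v5: speed times a CTE-quality factor in [0, 1]."""
    quality = max(0.0, 1.0 - abs(cte) / max_cte)
    return speed * quality


def d_reward_d_speed(speed: float, cte: float, eps: float = 1e-6) -> float:
    """Central finite-difference estimate of d(reward)/d(speed)."""
    upper = v5_like_reward(speed + eps, cte)
    lower = v5_like_reward(speed - eps, cte)
    return (upper - lower) / (2.0 * eps)


if __name__ == "__main__":
    # With a 0.5 floor the policy cannot command less than 0.5 throttle.
    print(scale_throttle(0.3, throttle_min=0.5))
    # With a 0.2 floor the same raw output passes through unchanged.
    print(scale_throttle(0.3, throttle_min=0.2))
    # Even at zero speed the reward slope is positive while on track.
    print(d_reward_d_speed(speed=0.0, cte=0.0))
```

For any product-form reward like this, the slope with respect to speed equals the quality factor, so it stays positive whenever the car is within `max_cte` of the centre line, including when it is stalled on the hill.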