docs: TEST_HISTORY updated with Exp8 results and Exp9 plan

Exp8 results: reward peaked at 567 at step 60k; the policy diverged afterwards.
best_model.zip was saved correctly. mini_monaco crashed after a mean of 91 steps,
at the same corner every time: throttle_min=0.5 is baked into the action space.

Exp9 plan: throttle_min=0.2, v5 reward unchanged. Tests hypothesis
that v5 gradient is sufficient for hill without forced 0.5 minimum.

Agent: pi
Tests: 102 passed
Tests-Added: 0
TypeScript: N/A
Paul Huliganga 2026-04-18 13:40:45 -04:00
parent 041481916d
commit eb4fd39056
1 changed files with 41 additions and 0 deletions

@@ -154,3 +154,44 @@ All experiments: mountain_track only, lr=0.000725, throttle_min varies, 90k step
3. **Test combined model** on all 4 tracks: can it generalise to mini_monaco like Trial 9 did?
4. **If yes:** We have reproduced Trial 9 reliably with a better reward function
### Exp 8 — Mountain track, v5 reward, throttle_min=0.5, CORRECT checkpointing ✅ COMPLETED
- **Reward:** v5 (speed × CTE-quality)
- **throttle_min:** 0.5
- **Method:** Direct model.learn() loop, single TCP connection, NO close_and_switch
- **Steps:** 90,000 total | 6,000 per segment | 15 checkpoints
- **Circle exploit fix:** Short-lap terminates episode immediately
- **Peak segment:** Seg 10 (step 60,000) — 567 reward / 2000 steps (FULL EVAL on mountain_track!)
- **Policy diverged:** Seg 11-15 (31, 20 reward) — best_model.zip captured the peak correctly
- **Checkpoints saved:** checkpoint_0006000.zip through checkpoint_0090000.zip + best_model.zip
- **Final eval results using best_model.zip (step 60k weights):**

| Track | Ep1 | Ep2 | Ep3 | Mean steps | Result |
|---|---|---|---|---|---|
| mountain_track (training) | 382 | 529 | 182 | 364 | ❌ crashes |
| generated_track (zero-shot) | 63 | 61 | 61 | 62 | ❌ crashes |
| mini_monaco (zero-shot) | 154 | 155 | 104 | 138 | ❌ crashes at one corner |
| generated_road (zero-shot) | 41 | 42 | 41 | 41 | ❌ crashes |

- **Throttle test:** mini_monaco at throttle_min=0.5 over 5 episodes: 93/94/79/95/94 steps (mean=91; the tight spread indicates the same corner every time). Testing this model at throttle_min=0.2 was not possible because the action space is fixed at training time.
- **Key findings:**
1. ✅ Circle exploit fully eliminated — no short laps observed
2. ✅ Best model saving worked — captured step 60k peak, not step 90k drift
3. ✅ Genuine 20-22 second laps during training from step ~18k onward
4. ❌ Model crashes at exactly the same corner on mini_monaco every time (too fast)
5. ❌ throttle_min=0.5 baked into action space — model cannot output throttle < 0.5, cannot slow for corners
6. 🔑 INSIGHT: v4 with throttle_min=0.2 failed because the v4 reward gradient is 0 on the hill. The v5 gradient is non-zero, so the model CAN learn to apply high throttle when needed even with a 0.2 floor
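
The segment-and-checkpoint scheme described for Exp 8 can be sketched roughly as follows. This is a minimal illustration, not the project's training script: it assumes a Stable-Baselines3-style model exposing `learn()` and `save()`, and `train_in_segments` and `evaluate` are hypothetical names introduced here.

```python
from pathlib import Path


def train_in_segments(model, evaluate, save_dir, total_steps=90_000, segment_steps=6_000):
    """Train in fixed segments, checkpoint each one, and keep the best-scoring weights.

    Assumptions: `model` provides .learn(total_timesteps=..., reset_num_timesteps=...)
    and .save(path); `evaluate(model)` is a hypothetical callable returning a scalar
    evaluation reward.
    """
    save_dir = Path(save_dir)
    save_dir.mkdir(parents=True, exist_ok=True)
    best_reward, best_step = float("-inf"), None
    history = []

    for step in range(segment_steps, total_steps + 1, segment_steps):
        # Continue the same run rather than restarting the step counter.
        model.learn(total_timesteps=segment_steps, reset_num_timesteps=False)
        model.save(str(save_dir / f"checkpoint_{step:07d}.zip"))

        reward = evaluate(model)
        history.append((step, reward))
        if reward > best_reward:
            # Snapshot the peak immediately so later divergence cannot overwrite it.
            best_reward, best_step = reward, step
            model.save(str(save_dir / "best_model.zip"))

    return best_step, best_reward, history
```

Saving `best_model.zip` at the moment a new peak is seen is what lets a run keep its peak weights even if later segments diverge.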
### Exp 9 — Mountain track, v5 reward, throttle_min=0.2 (RUNNING)
- **Change from Exp8:** throttle_min: 0.5 → **0.2** (only change)
- **Reward:** v5 (speed × CTE-quality) — UNCHANGED
- **Hypothesis:** v5 reward provides non-zero gradient signal on hill (∂reward/∂speed is non-zero).
Model CAN learn to output high throttle on hill. With 0.2 floor, model has full range [0.2, 1.0]
and can apply lower throttle on corners — potentially solving mini_monaco corner crash.
- **What we never tested:** Of the (throttle_min, reward) combinations, (0.2, v4) failed and (0.5, v5) worked, but (0.2, v5) was never tried.
- **Risk:** Model may still stall on hill if gradient convergence is slow in early training.
StuckTermination (-1.0) + v5 speed gradient together should push toward higher throttle.
- **Next test (Exp10):** Add track_progress bonus to reward (v6) — one variable at a time.
- **Save dir:** models/exp9-mountain-v5-throttle02/
- **Watch:** tail -f /tmp/exp9.log
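
To make the Exp 9 hypothesis concrete, here is a hedged sketch of the two pieces involved: a throttle floor applied when mapping policy output to throttle, and a product-form reward whose derivative with respect to speed is non-zero. The affine mapping, the `max_cte` value, and the form of the CTE-quality term are assumptions for illustration; the log only states that v5 is speed × CTE-quality.

```python
def scale_throttle(raw_action: float, throttle_min: float, throttle_max: float = 1.0) -> float:
    """Map a policy output in [-1, 1] onto [throttle_min, throttle_max].

    Assumed affine mapping; the real environment wrapper may differ.
    """
    clipped = max(-1.0, min(1.0, raw_action))
    return throttle_min + (clipped + 1.0) / 2.0 * (throttle_max - throttle_min)


def v5_reward_sketch(speed: float, cte: float, max_cte: float = 2.0) -> float:
    """Illustrative speed x CTE-quality product; the quality term is an assumption."""
    cte_quality = max(0.0, 1.0 - abs(cte) / max_cte)
    return speed * cte_quality


def d_reward_d_speed(speed: float, cte: float, eps: float = 1e-3) -> float:
    """Central finite difference of the sketch reward with respect to speed."""
    upper = v5_reward_sketch(speed + eps, cte)
    lower = v5_reward_sketch(speed - eps, cte)
    return (upper - lower) / (2 * eps)
```

Under these assumptions the lowest reachable throttle equals `throttle_min`, so a model trained with a 0.5 floor can never command less than 0.5, while the derivative of the reward with respect to speed equals the CTE-quality term and stays positive whenever the car is within `max_cte` of the centre line.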