docs: TEST_HISTORY updated with Exp8 results and Exp9 plan
Exp8 results: reward peaked at 567 at step 60k and the policy diverged afterwards; best_model was saved correctly. On mini_monaco the model crashed after 91 steps on average, at the same corner every time, because throttle_min=0.5 is baked into the action space. Exp9 plan: throttle_min=0.2 with the v5 reward unchanged, testing the hypothesis that the v5 gradient is sufficient for climbing the hill without a forced 0.5 minimum.

Agent: pi
Tests: 102 passed
Tests-Added: 0
TypeScript: N/A
Parent: 041481916d
Commit: eb4fd39056
@@ -154,3 +154,44 @@ All experiments: mountain_track only, lr=0.000725, throttle_min varies, 90k step
3. **Test combined model** on all 4 tracks — can it generalise to mini_monaco like Trial 9 did?
4. **If yes:** We have reproduced Trial 9 reliably with a better reward function
### Exp 8 — Mountain track, v5 reward, throttle_min=0.5, CORRECT checkpointing ✅ COMPLETED
- **Reward:** v5 (speed × CTE-quality)
- **throttle_min:** 0.5
- **Method:** Direct `model.learn()` loop, single TCP connection, NO `close_and_switch`
- **Steps:** 90,000 total | 6,000 per segment | 15 checkpoints
- **Circle exploit fix:** A short lap terminates the episode immediately
- **Peak segment:** Seg 10 (step 60,000): 567 reward over 2000 steps (FULL EVAL on mountain_track!)
- **Policy diverged:** Seg 11-15 (rewards of 31 and 20); best_model.zip captured the peak correctly
- **Checkpoints saved:** checkpoint_0006000.zip through checkpoint_0090000.zip, plus best_model.zip
- **Final eval results using best_model.zip (step 60k weights):**

| Track | Ep1 steps | Ep2 steps | Ep3 steps | Mean steps | Result |
|---|---|---|---|---|---|
| mountain_track (training) | 382 | 529 | 182 | 364 | ❌ crashes |
| generated_track (zero-shot) | 63 | 61 | 61 | 62 | ❌ crashes |
| mini_monaco (zero-shot) | 154 | 155 | 104 | 138 | ❌ crashes at one corner |
| generated_road (zero-shot) | 41 | 42 | 41 | 41 | ❌ crashes |

- **Throttle test:** mini_monaco at throttle_min=0.5 over 5 episodes: 93/94/79/95/94 steps (mean = 91; very consistent, i.e. the same corner every time). A throttle_min=0.2 test was impossible because the action space is baked in at training time.
- **Key findings:**
   1. ✅ Circle exploit fully eliminated: no short laps observed
   2. ✅ Best model saving worked: it captured the step 60k peak, not the step 90k drift
   3. ✅ Genuine 20-22 second laps during training from step ~18k onward
   4. ❌ Model crashes at exactly the same corner on mini_monaco every time (too fast)
   5. ❌ throttle_min=0.5 is baked into the action space: the model cannot output throttle < 0.5, so it cannot slow down for corners
   6. 🔑 INSIGHT: v4 + 0.2 failed because the v4 gradient is 0 on the hill. The v5 gradient is non-zero, so the model CAN learn to apply high throttle when needed, even with a 0.2 floor
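The circle-exploit fix (a short lap terminates the episode immediately) amounts to one check per completed lap. The sketch below is a minimal illustration, not the project's actual wrapper: `check_lap`, `LapCheck`, and the 15 s `MIN_LAP_SECONDS` threshold are hypothetical, picked only because genuine mountain_track laps took about 20-22 s.

```python
from dataclasses import dataclass
from typing import Optional

# HYPOTHETICAL threshold: genuine Exp 8 laps took ~20-22 s, so a lap well
# under that is treated as the circling exploit. Not the project's real value.
MIN_LAP_SECONDS = 15.0


@dataclass
class LapCheck:
    terminated: bool
    reason: str = ""


def check_lap(last_lap_time: Optional[float],
              min_lap_seconds: float = MIN_LAP_SECONDS) -> LapCheck:
    """Hypothetical helper: end the episode as soon as a completed lap is implausibly short."""
    if last_lap_time is None:
        # No lap was completed on this step, so there is nothing to check.
        return LapCheck(terminated=False)
    if last_lap_time < min_lap_seconds:
        # Short lap: terminate the episode immediately.
        return LapCheck(terminated=True, reason=f"short lap ({last_lap_time:.1f}s)")
    return LapCheck(terminated=False)


if __name__ == "__main__":
    for lap_time in (None, 4.2, 21.0):
        print(lap_time, check_lap(lap_time))
```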
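Finding 2 depends on writing a rolling checkpoint every segment and overwriting best_model.zip only when the evaluation score improves, so later divergence cannot clobber the peak. A minimal sketch under stated assumptions: `train_segment`, `evaluate`, and `save` are hypothetical callables standing in for the real loop (if the project uses Stable-Baselines3, which this log does not confirm, they would wrap `model.learn(..., reset_num_timesteps=False)`, an evaluation run, and `model.save(path)`). Only the checkpoint file naming follows the pattern recorded above.

```python
from typing import Callable, List, Tuple


def train_with_best_checkpoint(
    n_segments: int,
    steps_per_segment: int,
    train_segment: Callable[[int], None],
    evaluate: Callable[[], float],
    save: Callable[[str], None],
) -> Tuple[int, float, List[float]]:
    """Train in fixed-size segments; keep every checkpoint plus the best-scoring one."""
    best_reward = float("-inf")
    best_step = 0
    history: List[float] = []
    for segment in range(1, n_segments + 1):
        train_segment(steps_per_segment)
        step = segment * steps_per_segment
        # Rolling checkpoint, e.g. checkpoint_0006000.zip ... checkpoint_0090000.zip
        save(f"checkpoint_{step:07d}.zip")
        reward = evaluate()
        history.append(reward)
        # Only overwrite best_model.zip on a strict improvement, so a policy
        # that diverges later never replaces the earlier peak.
        if reward > best_reward:
            best_reward, best_step = reward, step
            save("best_model.zip")
    return best_step, best_reward, history


if __name__ == "__main__":
    # Toy reward curve invented for illustration only (NOT Exp 8 data).
    toy_rewards = iter([10.0, 30.0, 5.0])
    saved_paths: List[str] = []
    result = train_with_best_checkpoint(
        n_segments=3,
        steps_per_segment=6_000,
        train_segment=lambda n_steps: None,  # stand-in for the real training call
        evaluate=lambda: next(toy_rewards),
        save=saved_paths.append,
    )
    print(result)
    print(saved_paths)
```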
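As a quick arithmetic check, the Mean steps column and the throttle-test mean can be re-derived from the per-episode step counts recorded in this entry (rounded to the nearest whole step). No numbers beyond those already logged are used.

```python
from statistics import mean

# Per-episode step counts copied from the Exp 8 eval table (best_model.zip, step 60k).
EVAL_STEPS = {
    "mountain_track": [382, 529, 182],
    "generated_track": [63, 61, 61],
    "mini_monaco": [154, 155, 104],
    "generated_road": [41, 42, 41],
}
# mini_monaco throttle test at throttle_min=0.5, 5 episodes.
THROTTLE_TEST_STEPS = [93, 94, 79, 95, 94]


def mean_steps(episodes):
    """Mean episode length, rounded to the nearest whole step as in the table."""
    return round(mean(episodes))


if __name__ == "__main__":
    for track, steps in EVAL_STEPS.items():
        print(f"{track}: {mean_steps(steps)}")  # 364, 62, 138, 41
    print("throttle test mean:", mean_steps(THROTTLE_TEST_STEPS))  # 91
```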
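Finding 5 and the Exp 9 plan hinge on how the throttle floor enters the action space. A minimal sketch, assuming (not confirmed by this log) that the policy emits a normalised action in [-1, 1] that is linearly rescaled to the bounds [throttle_min, throttle_max] fixed when the environment is created. Under that assumption the lowest reachable throttle is exactly throttle_min. `to_throttle` is a hypothetical helper.

```python
def to_throttle(action: float, throttle_min: float, throttle_max: float = 1.0) -> float:
    """Hypothetical helper: map a normalised action in [-1, 1] to [throttle_min, throttle_max]."""
    # Clip first, as a bounded (Box-style) action space would.
    a = max(-1.0, min(1.0, action))
    # Linear rescaling: -1 -> throttle_min, +1 -> throttle_max.
    return throttle_min + 0.5 * (a + 1.0) * (throttle_max - throttle_min)


if __name__ == "__main__":
    for floor in (0.5, 0.2):
        lowest = to_throttle(-1.0, floor)
        middle = to_throttle(0.0, floor)
        print(f"throttle_min={floor}: lowest={lowest:.2f}, action 0 -> {middle:.2f}")
```

With throttle_min=0.5, no action gets below 0.5 (action 0 maps to 0.75). With 0.2, the same action 0 maps to 0.6, so under this assumption a policy trained with the 0.5 floor would mean something different if re-evaluated with a 0.2 floor.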
### Exp 9 — Mountain track, v5 reward, throttle_min=0.2 (RUNNING)
- **Change from Exp8:** throttle_min: 0.5 → **0.2** (only change)
- **Reward:** v5 (speed × CTE-quality), UNCHANGED
- **Hypothesis:** The v5 reward provides a non-zero gradient signal on the hill (∂reward/∂speed ≠ 0), so the model CAN learn to output high throttle there. With a 0.2 floor the model has the full range [0.2, 1.0] and can apply lower throttle in corners, potentially solving the mini_monaco corner crash.
- **What we never tested:** As (throttle_min, reward) pairs, (0.2, v4) failed, (0.5, v5) worked, and (0.2, v5) was never tried.
- **Risk:** The model may still stall on the hill if gradient convergence is slow early in training. StuckTermination (-1.0) and the v5 speed gradient together should push it toward higher throttle.
- **Next test (Exp10):** Add a track_progress bonus to the reward (v6), changing one variable at a time.
- **Save dir:** `models/exp9-mountain-v5-throttle02/`
- **Watch:** `tail -f /tmp/exp9.log`
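The Exp 9 hypothesis rests on ∂reward/∂speed being non-zero under v5. If v5 is a product speed × q(cte) where q does not depend on speed, then ∂reward/∂speed = q(cte) exactly, which stays positive at low hill-climbing speeds whenever the car is close enough to the centre line for q > 0. The sketch below checks this numerically; the linear falloff `q = max(0, 1 - |cte| / max_cte)` and `max_cte = 2.0` are hypothetical stand-ins, since the actual CTE-quality term is not spelled out here.

```python
def cte_quality(cte: float, max_cte: float = 2.0) -> float:
    """HYPOTHETICAL CTE-quality term: 1 on the centre line, falling linearly to 0 at max_cte."""
    return max(0.0, 1.0 - abs(cte) / max_cte)


def reward_v5(speed: float, cte: float, max_cte: float = 2.0) -> float:
    """v5 as described in this log: speed x CTE-quality (quality term assumed above)."""
    return speed * cte_quality(cte, max_cte)


def d_reward_d_speed(speed: float, cte: float, eps: float = 1e-3, max_cte: float = 2.0) -> float:
    """Central finite difference of reward_v5 with respect to speed."""
    hi = reward_v5(speed + eps, cte, max_cte)
    lo = reward_v5(speed - eps, cte, max_cte)
    return (hi - lo) / (2.0 * eps)


if __name__ == "__main__":
    # Slow hill-climb near the centre line: gradient equals the quality term (0.75, up to rounding).
    print(d_reward_d_speed(speed=0.1, cte=0.5))
    # Off track (|cte| >= max_cte): quality is 0, so the speed gradient vanishes too.
    print(d_reward_d_speed(speed=0.1, cte=2.5))
```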
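The Risk bullet leans on StuckTermination's -1.0. A minimal sketch of how such a rule can work, assuming it ends the episode after a run of consecutive near-zero-speed steps and that the -1.0 replaces the step reward; `StuckTerminationSketch`, `speed_threshold`, and `patience` are hypothetical, and only the -1.0 comes from this log. Under these assumptions, stalling on the hill forgoes the v5 speed reward and then also incurs the terminal penalty.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class StuckTerminationSketch:
    speed_threshold: float = 0.1  # HYPOTHETICAL: below this the car counts as stuck
    patience: int = 50            # HYPOTHETICAL: consecutive stuck steps allowed
    penalty: float = -1.0         # the -1.0 recorded for StuckTermination
    stuck_steps: int = 0

    def step(self, speed: float, reward: float) -> Tuple[float, bool]:
        """Return (reward, terminated) for one environment step."""
        if speed < self.speed_threshold:
            self.stuck_steps += 1
        else:
            # Any real movement resets the counter.
            self.stuck_steps = 0
        if self.stuck_steps >= self.patience:
            # Assumption: the penalty replaces the step reward on termination.
            return self.penalty, True
        return reward, False


if __name__ == "__main__":
    rule = StuckTerminationSketch(patience=3)
    for speed in (0.0, 0.0, 0.5, 0.0, 0.0, 0.0):
        print(speed, rule.step(speed, reward=speed))
```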