Test History — DonkeyCar RL Autoresearch
Last updated: 2026-04-18
This document records every significant training experiment, what was changed, what was observed, and what was learned. Use this to make methodical decisions rather than random changes.
Baseline Models (Phase 1 & 2)
Phase 2 Champion
- Model: models/champion/model.zip
- Track trained on: generated_road only
- Steps: 13,328
- Hyperparams: lr=0.000225, PPO continuous actions, ThrottleClamp(0.2), v4 reward
- Result: ✅ Drives generated_road perfectly, stays in right lane
- Zero-shot: ❌ Fails on generated_track (confirmed), ❌ Fails on mini_monaco
- Notes: Single track, simple road, model converged cleanly. Final model = best model (no divergence in 13k steps)
Mountain Track Experiments
All experiments: mountain_track only, lr=0.000725, throttle_min varies, 90k steps unless noted
Exp 1 — Mountain track, old v4 reward, throttle_min=0.2
- Reward: v4 (CTE × efficiency × speed)
- throttle_min: 0.2
- Key observation: Car gets partway up the hill, slows, stops, and rolls back. It always crashes at about the same step (~153-166). Logged throttle at the hill was 0.200, which is not enough power to climb.
- Root cause: v4 reward gives zero gradient signal on hill (efficiency→0, speed→0, reward→0 simultaneously, no direction for "apply more throttle")
- Learned: v4 reward is broken for inclined terrain
Exp 2 — Mountain track, old v4 reward, throttle_min=0.2, continued to 200k
- Reward: v4
- throttle_min: 0.2
- Key observation: Only 2 behaviors: turn left and hit barrier, or go straight and hit barrier at turn
- Result: ❌ Killed early — no improvement
- Learned: More steps alone cannot fix a broken reward signal
Exp 3 — Mountain track, old v4 reward, throttle_min=0.5
- Reward: v4
- throttle_min: 0.5 (increased to overcome hill)
- Key observation: Circle exploit dominated entire run — 0.5-1.75 second laps throughout
- Lap times logged: All short (exploit)
- Result: ❌ Model useless (reward=4.99 after 90k steps)
- Learned: Higher throttle got car over hill but circle exploit took over because v4 has no efficiency penalty when throttle is high
Exp 4 — Continued from Exp 3 (200k total), old v4 reward, throttle_min=0.5
- Reward: v4
- throttle_min: 0.5
- Key observation: Killed early — same 2 behaviors (left into barrier, straight into barrier)
- Result: ❌ Killed
- Learned: Continuing bad training does not help
Exp 5 — Mountain track, v5 reward, throttle_min=0.5 ⭐ KEY EXPERIMENT
- Reward: v5 (speed × CTE-quality) — NEW reward that directly incentivises throttle on hills
- throttle_min: 0.5
- Method: Direct model.learn() — NO train_multitrack(), ONE connection throughout
- Key observation: Genuine 20-22 second laps appearing from step ~30,000 onward
- Lap times: 19-22 seconds (genuine), consistently for 60k steps
- Result: ❌ Final model poor — best model was at step ~30k but we only saved final (step 90k) model
- Root cause of failure: No best-model saving. Policy peaked at 30k, diverged by 90k
- Learned:
- v5 reward WORKS for mountain track
- throttle_min=0.5 WORKS for hill
- Direct model.learn() (no track switching) avoids phantom car issues
- MUST save best model during training, not just final
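
The last lesson above can be sketched as a segmented training loop that writes a numbered checkpoint after every segment and overwrites best_model.zip only when the segment's evaluation score improves. This is a minimal sketch, not the project's training script: `train_in_segments` and its `learn`, `evaluate`, and `save` callables are hypothetical names, and the scores in the demo are made up to mimic a policy that peaks at 30k steps and then drifts.

```python
from pathlib import Path
from typing import Callable, Tuple


def train_in_segments(
    learn: Callable[[int], None],      # trains for n more steps on the same connection
    evaluate: Callable[[], float],     # returns a score for the current policy
    save: Callable[[Path], None],      # writes the current policy to a path
    save_dir: Path,
    total_steps: int = 90_000,
    steps_per_segment: int = 6_000,
) -> Tuple[int, float]:
    """Train in fixed-size segments, keeping numbered checkpoints and the best model."""
    best_step, best_score = 0, float("-inf")
    steps_done = 0
    while steps_done < total_steps:
        n = min(steps_per_segment, total_steps - steps_done)
        learn(n)
        steps_done += n
        # Numbered checkpoint: a mid-training peak is never lost.
        save(save_dir / f"checkpoint_{steps_done:07d}.zip")
        score = evaluate()
        if score > best_score:
            best_step, best_score = steps_done, score
            save(save_dir / "best_model.zip")
    return best_step, best_score


# Demo with made-up scores: peak at segment 5 (step 30,000), then decline.
# With a Stable-Baselines3 model (an assumption about the setup), the callables
# could be:
#   learn=lambda n: model.learn(total_timesteps=n, reset_num_timesteps=False)
#   save=lambda p: model.save(p)
scores = iter([1.0, 2.0, 3.0, 4.0, 9.0, 8.0, 5.0, 4.0, 3.0, 2.0, 2.0, 1.0, 1.0, 1.0, 1.0])
saved = []
best_step, best_score = train_in_segments(
    learn=lambda n: None,
    evaluate=lambda: next(scores),
    save=lambda p: saved.append(p.name),
    save_dir=Path("models/demo"),
)
print(best_step, best_score, len(saved))
```

Keeping both the numbered checkpoints and best_model.zip is deliberate: the numbered files allow a later re-evaluation with a different metric, while best_model.zip gives a single file to load for deterministic eval.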
Exp 6 — Mountain track, v5 reward, throttle_min=0.5, train_multitrack (1 segment)
- Reward: v5
- throttle_min: 0.5 (first segment only — close_and_switch used 0.2 for subsequent segments)
- Method: train_multitrack() with steps_per_switch=90000 (one giant segment = one checkpoint)
- Key observation: Circle exploit dominated — only 0.5-1.75 second laps throughout
- Result: ❌ Only 1 checkpoint saved (at step 90k). Best reward=4.99
- Root cause: Using steps_per_switch=TOTAL_STEPS defeated checkpointing (one segment = one save). Circle exploit reappeared (different from Exp5 — random seed variation)
- Learned: steps_per_switch=TOTAL_STEPS is WRONG for single-track training with checkpointing
Exp 7 — Mountain track, v5 reward + episode termination on short lap, throttle_min mixed
- Reward: v5 + short-lap now TERMINATES episode (not just penalty)
- throttle_min: 0.5 initial, 0.2 after segment 1 (bug: close_and_switch used module default)
- Method: train_multitrack() with steps_per_switch=6000 (15 segments)
- Key observation: Car in LEFT lane, sitting doing nothing. Not normal spawn position.
- Hypothesis: A phantom car left over from Exp 6 was still in the sim. Two TCP connections spawned two cars: the user watched the phantom (left lane, receiving no commands) while training drove a different car.
- Result: ❌ Killed — phantom car issue
- Learned:
- close_and_switch() between segments creates phantom car risk for single-track training
- throttle_min MUST be passed consistently — module default is 0.2, not 0.5
- For single-track training: do NOT use close_and_switch() at all
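
One way to guard against the silent fallback to the module default is to make throttle_min a required field with no default, so that any code path that forgets it fails immediately instead of training with 0.2. The sketch below is illustrative only: `RunConfig` and `env_kwargs` are hypothetical names and do not exist in the project.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunConfig:
    """Hypothetical run settings; throttle_min has no default on purpose."""
    track: str
    throttle_min: float


def env_kwargs(config: RunConfig) -> dict:
    """Build environment arguments from the one shared config object."""
    return {"track": config.track, "throttle_min": config.throttle_min}


config = RunConfig(track="mountain_track", throttle_min=0.5)

# Every segment reads from the same frozen config, so the floor cannot drift.
segment_kwargs = [env_kwargs(config) for _ in range(15)]

# Forgetting throttle_min is an error rather than a silent 0.2.
try:
    RunConfig(track="mountain_track")
    missing_floor_rejected = False
except TypeError:
    missing_floor_rejected = True

print(segment_kwargs[0], missing_floor_rejected)
```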
Exp 8 — Mountain track, v5 reward + episode termination, throttle_min=0.5 consistently (RUNNING NOW)
- Reward: v5 + short-lap terminates episode
- throttle_min: 0.5 throughout (no close_and_switch = no module default override)
- Method: Direct model.learn() in loop — ONE connection throughout entire run
- Checkpoints: 15 numbered saves (every 6,000 steps) + best_model.zip
- PID: 2941877, log: /tmp/exp8.log
- Status: Running since 11:17, ~1h45m total
- Watch: tail -f /tmp/exp8.log
- Success criteria: Genuine 19-22 second laps appearing during training AND best_model.zip drives cleanly in deterministic eval
Wave 4 Multi-Track Experiments (generated_track + mountain_track)
Trial 9 ⭐ BEST OVERALL MODEL
- Model: models/wave4-trial-0009/model.zip
- Tracks: generated_track + mountain_track (round-robin, switch every 6,851 steps)
- Steps: 89,893 total (~45k per track)
- Hyperparams: lr=0.000725, switch=6,851
- Reward: v4 (old — before exploit patches)
- Result:
- ✅ Drives generated_track (3/3 episodes, 13-16 second genuine laps)
- ✅ Drives mini_monaco zero-shot (2000 steps, 40-second genuine laps — never seen in training)
- ❌ Crashes on mountain_track (~200 steps — hill + corner)
- ❌ Crashes on generated_road (~46 steps — turns right immediately)
- Notes: Only 1 of 25 Wave 4 trials succeeded, so random-seed luck is suspected. The same hyperparameters, repeated in Exp2 (overnight), produced a useless model.
Wave 4 Other Trials (1-25 except Trial 9)
- Result: All crashed on mini_monaco within 20-265 steps
- Median mini_monaco score: ~112 (crashes at ~130 steps)
- Trials 14, 25: Scored 1573, 1543 — suspected shuttle exploit (car going back and forth on straight)
- Learned: Multi-track training is highly sensitive to random seed. GP+UCB did not converge reliably.
Key Decisions Made (What We Keep)
| Decision | Reason |
|---|---|
| v5 reward: speed × CTE-quality | Directly incentivises throttle on hills. v4 gave zero gradient on inclines. |
| throttle_min=0.5 for mountain_track | Overcomes hill. Car can now reach first corner. |
| Short-lap penalty + EPISODE TERMINATION | Penalty alone insufficient — model stayed alive and accumulated rewards between laps. Termination makes circling strictly unprofitable. |
| Numbered checkpoints every segment | Never lose a good mid-training model again (ADR-017) |
| best_model.zip updated on new best segment score | Final model ≠ best model. Peak can be at 30k even if final is at 90k. |
| Single TCP connection for single-track training | Avoids phantom car problem from close_and_switch() |
| lr=0.000725 | From Trial 9 (best model). Consistent with good results. |
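
The short-lap rule in the table can be sketched as a small post-processing step on each environment transition. This is a hedged illustration, not the project's code: the function name, the 5-second threshold, and the -1.0 penalty are assumptions chosen only because the exploit laps logged above were 0.5-1.75 seconds and genuine laps were 19-22 seconds.

```python
from typing import Optional, Tuple

MIN_GENUINE_LAP_SECONDS = 5.0  # assumed threshold between exploit and genuine laps
SHORT_LAP_PENALTY = -1.0       # assumed penalty value


def apply_short_lap_rule(
    lap_time: Optional[float], reward: float, terminated: bool
) -> Tuple[float, bool]:
    """End the episode with a penalty when a lap completes implausibly fast.

    lap_time is None on steps where no lap was completed.
    """
    if lap_time is not None and lap_time < MIN_GENUINE_LAP_SECONDS:
        # Termination stops the agent from collecting further reward by circling.
        return SHORT_LAP_PENALTY, True
    return reward, terminated


print(apply_short_lap_rule(1.75, 0.8, False))  # exploit-length lap
print(apply_short_lap_rule(20.0, 0.8, False))  # genuine lap
print(apply_short_lap_rule(None, 0.8, False))  # no lap completed this step
```

Terminating rather than only penalising matters because a penalty can be outweighed by reward collected between short laps, whereas an ended episode collects nothing further.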
Key Problems Still Open
| Problem | Status |
|---|---|
| Mountain track circle exploit | Partially fixed — episode termination added. Exp8 will show if it holds. |
| Mountain track — car can't navigate first corner reliably | Still being investigated. Exp5 showed genuine laps so it IS solvable. |
| Multi-track generalization is random-seed dependent | No reliable solution yet. Trial 9 was lucky. |
| Mountain track model doesn't generalise to other tracks | Expected — single track training generalises poorly. Next step after Exp8 succeeds. |
Next Steps (Proposed, Not Yet Run)
1. Exp 8 result: If best_model.zip drives mountain_track reliably → proceed to Step 2
2. Combine mountain_track + generated_track using v5 reward, throttle_min=0.5, proper checkpointing
3. Test combined model on all 4 tracks: can it generalise to mini_monaco like Trial 9 did?
4. If yes: We have reproduced Trial 9 reliably with a better reward function
Exp 8 — Mountain track, v5 reward, throttle_min=0.5, CORRECT checkpointing ✅ COMPLETED
- Reward: v5 (speed × CTE-quality)
- throttle_min: 0.5
- Method: Direct model.learn() loop, single TCP connection, NO close_and_switch
- Steps: 90,000 total | 6,000 per segment | 15 checkpoints
- Circle exploit fix: Short-lap terminates episode immediately
- Peak segment: Seg 10 (step 60,000) — 567 reward / 2000 steps (FULL EVAL on mountain_track!)
- Policy diverged: Seg 11-15 (31, 20 reward) — best_model.zip captured the peak correctly
- Checkpoints saved: checkpoint_0006000.zip through checkpoint_0090000.zip + best_model.zip
- Final eval results using best_model.zip (step 60k weights):
| Track | Ep1 | Ep2 | Ep3 | Mean steps | Result |
|---|---|---|---|---|---|
| mountain_track (training) | 382 | 529 | 182 | 364 | ❌ crashes |
| generated_track (zero-shot) | 63 | 61 | 61 | 62 | ❌ crashes |
| mini_monaco (zero-shot) | 154 | 155 | 104 | 138 | ❌ crashes at one corner |
| generated_road (zero-shot) | 41 | 42 | 41 | 41 | ❌ crashes |
- Throttle test: mini_monaco at throttle_min=0.5 over 5 episodes: 93/94/79/95/94 steps (mean=91; the tight spread suggests the car fails at the same corner every time). A throttle_min=0.2 test was not possible because the action space is fixed at training time.
- Key findings:
- ✅ Circle exploit fully eliminated — no short laps observed
- ✅ Best model saving worked — captured step 60k peak, not step 90k drift
- ✅ Genuine 20-22 second laps during training from step ~18k onward
- ❌ Model crashes at exactly the same corner on mini_monaco every time (too fast)
- ❌ throttle_min=0.5 baked into action space — model cannot output throttle < 0.5, cannot slow for corners
- 🔑 INSIGHT: v4 + 0.2 failed because v4 gradient = 0 on hill. v5 gradient is non-zero — model CAN learn to apply high throttle when needed even with 0.2 floor
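
The gradient argument in the insight above can be checked numerically. The sketch below assumes simple functional forms that match only the descriptions in this document (v4 as CTE term × efficiency × speed, v5 as speed × CTE-quality); the real reward functions may differ, and `cte_quality`, `max_cte`, and the stalled-car values are hypothetical.

```python
def cte_quality(cte: float, max_cte: float = 2.0) -> float:
    """Assumed lane-keeping term: 1.0 on the centerline, 0.0 at max_cte."""
    return max(0.0, 1.0 - abs(cte) / max_cte)


def reward_v4(cte: float, efficiency: float, speed: float) -> float:
    """Assumed shape of v4: product of three terms."""
    return cte_quality(cte) * efficiency * speed


def reward_v5(cte: float, speed: float) -> float:
    """Assumed shape of v5: speed times CTE-quality."""
    return speed * cte_quality(cte)


def d_reward_d_speed(reward_of_speed, speed: float, eps: float = 1e-3) -> float:
    """Forward finite-difference estimate of d(reward)/d(speed)."""
    return (reward_of_speed(speed + eps) - reward_of_speed(speed)) / eps


# A car stalled on the hill: near the centerline, but efficiency and speed are 0.
cte, efficiency, speed = 0.2, 0.0, 0.0

grad_v4 = d_reward_d_speed(lambda s: reward_v4(cte, efficiency, s), speed)
grad_v5 = d_reward_d_speed(lambda s: reward_v5(cte, s), speed)

print(f"v4 d(reward)/d(speed) = {grad_v4}")
print(f"v5 d(reward)/d(speed) = {grad_v5}")
```

Under these assumptions the v4 derivative is exactly zero because the efficiency factor multiplies it away, while the v5 derivative equals the CTE-quality term, so a small speed increase is rewarded even from a standstill.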
Exp 9 — Mountain track, v5 reward, throttle_min=0.2 (RUNNING)
- Change from Exp8: throttle_min: 0.5 → 0.2 (only change)
- Reward: v5 (speed × CTE-quality) — UNCHANGED
- Hypothesis: v5 reward provides non-zero gradient signal on hill (∂reward/∂speed is non-zero). Model CAN learn to output high throttle on hill. With 0.2 floor, model has full range [0.2, 1.0] and can apply lower throttle on corners — potentially solving mini_monaco corner crash.
- What we never tested: (0.2, v4) failed. (0.5, v5) worked. (0.2, v5) was never tried.
- Risk: Model may still stall on hill if gradient convergence is slow in early training. StuckTermination (-1.0) + v5 speed gradient together should push toward higher throttle.
- Next test (Exp10): Add track_progress bonus to reward (v6) — one variable at a time.
- Save dir: models/exp9-mountain-v5-throttle02/
- Watch: tail -f /tmp/exp9.log
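
To make the range argument in the hypothesis concrete, the sketch below shows one way a throttle floor could be applied to a policy output. It assumes a linear rescale from a raw action in [-1, 1] to [throttle_min, 1.0]; the project's actual ThrottleClamp may clamp instead of rescaling, and `scale_throttle` is a hypothetical name.

```python
def scale_throttle(raw: float, throttle_min: float) -> float:
    """Map a raw policy output in [-1, 1] linearly onto [throttle_min, 1.0]."""
    raw = max(-1.0, min(1.0, raw))  # guard against out-of-range policy outputs
    return throttle_min + (raw + 1.0) / 2.0 * (1.0 - throttle_min)


for floor in (0.5, 0.2):
    lowest = scale_throttle(-1.0, floor)
    highest = scale_throttle(1.0, floor)
    print(f"throttle_min={floor}: reachable throttle range [{lowest:.2f}, {highest:.2f}]")
```

With either floor the policy can still command full throttle for the hill; only the 0.2 floor leaves room to slow below 0.5 for corners, which is the behaviour Exp 9 is meant to test.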