# Test History: DonkeyCar RL Autoresearch

Last updated: 2026-04-18

This document records every significant training experiment, what was changed, what was observed, and what was learned. Use this to make methodical decisions rather than random changes.

---

## Baseline Models (Phase 1 & 2)

### Phase 2 Champion
- **Model:** `models/champion/model.zip`
- **Track trained on:** generated_road only
- **Steps:** 13,328
- **Hyperparams:** lr=0.000225, PPO continuous actions, ThrottleClamp(0.2), v4 reward
- **Result:** ✅ Drives generated_road perfectly, stays in right lane
- **Zero-shot:** ❌ Fails on generated_track (confirmed), ❌ Fails on mini_monaco
- **Notes:** Single track, simple road, model converged cleanly. Final model = best model (no divergence in 13k steps)

---

## Mountain Track Experiments

All experiments: mountain_track only, lr=0.000725, throttle_min varies, 90k steps

### Exp 1: Mountain track, old v4 reward, throttle_min=0.2
- **Reward:** v4 (CTE × efficiency × speed)
- **throttle_min:** 0.2
- **Key observation:** Car gets partway up hill, slows, stops, rolls back. Always crashes at same step (~153-166). Steps logged: 0.200 throttle at hill = not enough power
- **Root cause:** v4 reward gives zero gradient signal on hill (efficiency→0, speed→0, reward→0 simultaneously, no direction for "apply more throttle")
- **Learned:** v4 reward is broken for inclined terrain

### Exp 2: Mountain track, old v4 reward, throttle_min=0.2, continued to 200k
- **Reward:** v4
- **throttle_min:** 0.2
- **Key observation:** Only 2 behaviors: turn left and hit barrier, or go straight and hit barrier at turn
- **Result:** ❌ Killed early, no improvement
- **Learned:** More steps alone cannot fix a broken reward signal

### Exp 3: Mountain track, old v4 reward, throttle_min=0.5
- **Reward:** v4
- **throttle_min:** 0.5 (increased to overcome hill)
- **Key observation:** Circle exploit dominated entire run: 0.5-1.75 second laps throughout
- **Lap times logged:** All short (exploit)
- **Result:** ❌ Model useless (reward=4.99 after 90k steps)
- **Learned:** Higher throttle got car over hill but circle exploit took over because v4 has no efficiency penalty when throttle is high

### Exp 4: Continued from Exp 3 (200k total), old v4 reward, throttle_min=0.5
- **Reward:** v4
- **throttle_min:** 0.5
- **Key observation:** Killed early, same 2 behaviors (left into barrier, straight into barrier)
- **Result:** ❌ Killed
- **Learned:** Continuing bad training does not help

### Exp 5: Mountain track, v5 reward, throttle_min=0.5 ⭐ KEY EXPERIMENT
- **Reward:** v5 (speed × CTE-quality), a NEW reward that directly incentivises throttle on hills
- **throttle_min:** 0.5
- **Method:** Direct model.learn(): NO train_multitrack(), ONE connection throughout
- **Key observation:** Genuine 20-22 second laps appearing from step ~30,000 onward
- **Lap times:** 19-22 seconds (genuine), consistently for 60k steps
- **Result:** ❌ Final model poor: best model was at step ~30k but we only saved final (step 90k) model
- **Root cause of failure:** No best-model saving. Policy peaked at 30k, diverged by 90k
- **Learned:**
  1. v5 reward WORKS for mountain track
  2. throttle_min=0.5 WORKS for hill
  3. Direct model.learn() (no track switching) avoids phantom car issues
  4. MUST save best model during training, not just final

### Exp 6: Mountain track, v5 reward, throttle_min=0.5, train_multitrack (1 segment)
- **Reward:** v5
- **throttle_min:** 0.5 (first segment only; close_and_switch used 0.2 for subsequent segments)
- **Method:** train_multitrack() with steps_per_switch=90000 (one giant segment = one checkpoint)
- **Key observation:** Circle exploit dominated, only 0.5-1.75 second laps throughout
- **Result:** ❌ Only 1 checkpoint saved (at step 90k). Best reward=4.99
- **Root cause:** Using steps_per_switch=TOTAL_STEPS defeated checkpointing (one segment = one save). Circle exploit reappeared (different from Exp5: random seed variation)
- **Learned:** steps_per_switch=TOTAL_STEPS is WRONG for single-track training with checkpointing

### Exp 7: Mountain track, v5 reward + episode termination on short lap, throttle_min mixed
- **Reward:** v5 + short-lap now TERMINATES episode (not just penalty)
- **throttle_min:** 0.5 initial, 0.2 after segment 1 (bug: close_and_switch used module default)
- **Method:** train_multitrack() with steps_per_switch=6000 (15 segments)
- **Key observation:** Car in LEFT lane, sitting doing nothing. Not normal spawn position.
- **Hypothesis:** Phantom car from Exp6's ghost car still in sim. Two TCP connections spawned two cars. User watched phantom (left lane, no commands). Training went to different car.
- **Result:** ❌ Killed, phantom car issue
- **Learned:**
  1. close_and_switch() between segments creates phantom car risk for single-track training
  2. throttle_min MUST be passed consistently: module default is 0.2, not 0.5
  3. For single-track training: do NOT use close_and_switch() at all

### Exp 8: Mountain track, v5 reward + episode termination, throttle_min=0.5 consistently (RUNNING NOW)
- **Reward:** v5 + short-lap terminates episode
- **throttle_min:** 0.5 throughout (no close_and_switch = no module default override)
- **Method:** Direct model.learn() in loop, ONE connection throughout entire run
- **Checkpoints:** 15 numbered saves (every 6,000 steps) + best_model.zip
- **PID:** 2941877, log: /tmp/exp8.log
- **Status:** Running since 11:17, ~1h45m total
- **Watch:** `tail -f /tmp/exp8.log`
- **Success criteria:** Genuine 19-22 second laps appearing during training AND best_model.zip drives cleanly in deterministic eval

---

## Wave 4 Multi-Track Experiments (generated_track + mountain_track)

### Trial 9 ⭐ BEST OVERALL MODEL
- **Model:** `models/wave4-trial-0009/model.zip`
- **Tracks:** generated_track + mountain_track (round-robin, switch every 6,851 steps)
- **Steps:** 89,893 total (~45k per track)
- **Hyperparams:** lr=0.000725, switch=6,851
- **Reward:** v4 (old, before exploit patches)
- **Result:**
  - ✅ Drives generated_track (3/3 episodes, 13-16 second genuine laps)
  - ✅ Drives mini_monaco zero-shot (2000 steps, 40-second genuine laps, never seen in training)
  - ❌ Crashes on mountain_track (~200 steps; hill + corner)
  - ❌ Crashes on generated_road (~46 steps; turns right immediately)
- **Notes:** Only 1 of 25 Wave 4 trials succeeded. Suspected random seed luck. Same hyperparameters repeated in Exp2 (overnight) produced useless model.

### Wave 4 Other Trials (1-25 except Trial 9)
- **Result:** All crashed on mini_monaco within 20-265 steps
- **Median mini_monaco score:** ~112 (crashes at ~130 steps)
- **Trials 14, 25:** Scored 1573, 1543; suspected shuttle exploit (car going back and forth on straight)
- **Learned:** Multi-track training is highly sensitive to random seed. GP+UCB did not converge reliably.
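The checkpointing scheme described for Exp 8 above (numbered saves every 6,000 steps plus a `best_model.zip` that is overwritten on each new best segment score) can be sketched roughly as follows. This is a minimal, framework-agnostic sketch and not the project's actual training script: `train_in_segments`, `learn_fn`, `eval_fn`, and `save_fn` are hypothetical names introduced here for illustration. With a Stable-Baselines3-style PPO model, `learn_fn` would presumably wrap `model.learn(total_timesteps=n, reset_num_timesteps=False)` on one persistent env, and `save_fn` would wrap `model.save(path)`.

```python
from typing import Callable, List, Tuple


def train_in_segments(
    learn_fn: Callable[[int], None],  # hypothetical: train n more steps on the SAME env
    eval_fn: Callable[[], float],     # hypothetical: eval reward after the segment
    save_fn: Callable[[str], None],   # hypothetical: write current weights to a path
    total_steps: int = 90_000,
    steps_per_segment: int = 6_000,
) -> Tuple[int, float, List[str]]:
    """Train in fixed-size segments. After each segment, save a numbered
    checkpoint and overwrite best_model.zip if the eval reward improved."""
    best_step, best_reward = 0, float("-inf")
    saved: List[str] = []
    steps_done = 0
    while steps_done < total_steps:
        n = min(steps_per_segment, total_steps - steps_done)
        learn_fn(n)
        steps_done += n

        # Numbered checkpoint, e.g. checkpoint_0006000.zip
        path = f"checkpoint_{steps_done:07d}.zip"
        save_fn(path)
        saved.append(path)

        # Keep the peak, not the final weights.
        reward = eval_fn()
        if reward > best_reward:
            best_step, best_reward = steps_done, reward
            save_fn("best_model.zip")
    return best_step, best_reward, saved


if __name__ == "__main__":
    # Made-up eval rewards that rise and then collapse, for illustration only.
    fake_rewards = iter([10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 5, 5, 5, 5, 5])
    step, reward, files = train_in_segments(
        learn_fn=lambda n: None,
        eval_fn=lambda: next(fake_rewards),
        save_fn=lambda path: None,
    )
    print(step, reward, len(files))
```

The point of the design is that the final weights and the best weights are tracked separately, so a policy that peaks mid-run and then diverges (as in Exp 5) still leaves a usable model on disk.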
---

## Key Decisions Made (What We Keep)

| Decision | Reason |
|---|---|
| v5 reward: `speed × CTE-quality` | Directly incentivises throttle on hills. v4 gave zero gradient on inclines. |
| throttle_min=0.5 for mountain_track | Overcomes hill. Car can now reach first corner. |
| Short-lap penalty + EPISODE TERMINATION | Penalty alone insufficient: model stayed alive and accumulated rewards between laps. Termination makes circling strictly unprofitable. |
| Numbered checkpoints every segment | Never lose a good mid-training model again (ADR-017) |
| best_model.zip updated on new best segment score | Final model ≠ best model. Peak can be at 30k even if final is at 90k. |
| Single TCP connection for single-track training | Avoids phantom car problem from close_and_switch() |
| lr=0.000725 | From Trial 9 (best model). Consistent with good results. |

## Key Problems Still Open

| Problem | Status |
|---|---|
| Mountain track circle exploit | Partially fixed: episode termination added. Exp8 will show if it holds. |
| Mountain track: car can't navigate first corner reliably | Still being investigated. Exp5 showed genuine laps so it IS solvable. |
| Multi-track generalization is random-seed dependent | No reliable solution yet. Trial 9 was lucky. |
| Mountain track model doesn't generalise to other tracks | Expected, since single track training generalises poorly. Next step after Exp8 succeeds. |

---

## Next Steps (Proposed, Not Yet Run)

1. **Exp 8 result:** If best_model.zip drives mountain_track reliably → proceed to Step 2
2. **Combine mountain_track + generated_track** using v5 reward, throttle_min=0.5, proper checkpointing
3. **Test combined model** on all 4 tracks: can it generalise to mini_monaco like Trial 9 did?
4. **If yes:** We have reproduced Trial 9 reliably with a better reward function

### Exp 8: Mountain track, v5 reward, throttle_min=0.5, CORRECT checkpointing ✅ COMPLETED
- **Reward:** v5 (speed × CTE-quality)
- **throttle_min:** 0.5
- **Method:** Direct model.learn() loop, single TCP connection, NO close_and_switch
- **Steps:** 90,000 total | 6,000 per segment | 15 checkpoints
- **Circle exploit fix:** Short-lap terminates episode immediately
- **Peak segment:** Seg 10 (step 60,000): 567 reward / 2000 steps (FULL EVAL on mountain_track!)
- **Policy diverged:** Seg 11-15 (31, 20 reward); best_model.zip captured the peak correctly
- **Checkpoints saved:** checkpoint_0006000.zip through checkpoint_0090000.zip + best_model.zip
- **Final eval results using best_model.zip (step 60k weights):**

| Track | Ep1 | Ep2 | Ep3 | Mean steps | Result |
|---|---|---|---|---|---|
| mountain_track (training) | 382 | 529 | 182 | 364 | ❌ crashes |
| generated_track (zero-shot) | 63 | 61 | 61 | 62 | ❌ crashes |
| mini_monaco (zero-shot) | 154 | 155 | 104 | 138 | ❌ crashes at one corner |
| generated_road (zero-shot) | 41 | 42 | 41 | 41 | ❌ crashes |

- **Throttle test:** mini_monaco at throttle_min=0.5 over 5 episodes: 93/94/79/95/94 steps (mean=91, very consistent = same corner every time). throttle_min=0.2 test impossible: action space baked in at training time.
- **Key findings:**
  1. ✅ Circle exploit fully eliminated, no short laps observed
  2. ✅ Best model saving worked: captured step 60k peak, not step 90k drift
  3. ✅ Genuine 20-22 second laps during training from step ~18k onward
  4. ❌ Model crashes at exactly the same corner on mini_monaco every time (too fast)
  5. ❌ throttle_min=0.5 baked into action space: model cannot output throttle < 0.5, cannot slow for corners
  6. 🔑 INSIGHT: v4 + 0.2 failed because v4 gradient = 0 on hill. v5 gradient is non-zero, so model CAN learn to apply high throttle when needed even with 0.2 floor

### Exp 9: Mountain track, v5 reward, throttle_min=0.2 (RUNNING)
- **Change from Exp8:** throttle_min: 0.5 → **0.2** (only change)
- **Reward:** v5 (speed × CTE-quality), UNCHANGED
- **Hypothesis:** v5 reward provides non-zero gradient signal on hill (∂reward/∂speed is non-zero). Model CAN learn to output high throttle on hill. With 0.2 floor, model has full range [0.2, 1.0] and can apply lower throttle on corners, potentially solving mini_monaco corner crash.
- **What we never tested:** (0.2, v4) failed. (0.5, v5) worked. (0.2, v5) was never tried.
- **Risk:** Model may still stall on hill if gradient convergence is slow in early training. StuckTermination (-1.0) + v5 speed gradient together should push toward higher throttle.
- **Next test (Exp10):** Add track_progress bonus to reward (v6), one variable at a time.
- **Save dir:** models/exp9-mountain-v5-throttle02/
- **Watch:** tail -f /tmp/exp9.log

### Exp 9: Evaluation Results (3-set test, 1 run per track per set)

**Model tested:** `models/exp9-mountain-v5-throttle02/best_model.zip`

**Date:** 2026-04-18

**Test setup:** 3 independent sets, lighting randomises each run (no fixed seed)

| Track | Set 1 | Set 2 | Set 3 | Mean | Pattern |
|---|---|---|---|---|---|
| mountain_track (trained) | ✅ 2000 | ✅ 2000 | ✅ 2000 | **2000** | Rock solid |
| generated_track (zero-shot) | ❌ 79 | ❌ 61 | ❌ 82 | **74** | Always fails: can't make first corner |
| generated_road (zero-shot) | ❌ 651 | ✅ 2000 | ❌ 1203 | **1285** | Highly variable, lighting dependent |
| mini_monaco (zero-shot) | ❌ 32 | ❌ 60 | ❌ 34 | **42** | Always fails: veers right immediately |

**User observations:**
- mountain_track: 80-90% of time on or near centre yellow line. Solid driving.
- generated_road: Driving looks good when it works, but goes off course. Lighting variation causes inconsistency.
- generated_track: Cannot make first corner at all. Model sees nothing it recognises.
- mini_monaco: Veers right immediately at start before any visible driving. Crashes before reaching the road.

**Key finding (lighting effect confirmed):** Generated_road varies 651→2000→1203 with identical model and track. ONLY lighting changes. Mountain_track is immune because it trained under many random lighting conditions. Generated_track and mini_monaco fail regardless of lighting: visual domain too different.

**What this tells us about next steps:** Train on mountain_track + generated_track together (v5 reward, throttle_min=0.2). Both tracks have random lighting each episode → model forced to learn lighting-invariant features. Goal: model that is reliable on both training tracks, then test generalisation to generated_road and mini_monaco.

### Exp 10: Two tracks (generated_track + mountain_track), v5 reward, throttle_min=0.2
- **Change from Exp9:** Added generated_track as second training track
- **Reward:** v5 (speed × CTE), unchanged
- **throttle_min:** 0.2, unchanged from Exp9
- **Training tracks:** generated_track + mountain_track (round-robin, switch every 6,000 steps)
- **Total steps:** 90,000 | Steps per switch: 6,000 | ~7.5 rotations through both tracks
- **lr:** 0.000725, unchanged
- **Hypothesis:** Adding generated_track visual diversity forces model to learn lighting-invariant road-following features. Mountain_track teaches hill throttle. Together should generalise better to generated_road and potentially mini_monaco.
- **Expected results:** mountain_track reliable, generated_track reliable, generated_road improved, mini_monaco TBD
- **This is essentially Trial 9 repeated with:** v5 reward + throttle_min=0.2 + proper checkpointing + exploit fix

### Exp 10: Evaluation Results (3-set test, 2026-04-19)

**Model tested:** `models/exp10-two-tracks/best_model.zip`

**Result: TOTAL FAILURE. Crashes on every track, every set.**

| Track | Set 1 | Set 2 | Set 3 | Mean | Verdict |
|---|---|---|---|---|---|
| mountain_track (trained) | 178 | 179 | 179 | **179** | ❌ Crashes at same spot every time |
| generated_track (trained) | 99 | 82 | 88 | **90** | ❌ Crashes almost immediately |
| generated_road (zero-shot) | 135 | 223 | 105 | **154** | ❌ Crashes early |
| mini_monaco (zero-shot) | 111 | 133 | 129 | **124** | ❌ Crashes early |

**Comparison to previous best models:**
- Exp 9 (mountain only): mountain_track was 2000/2000 every time → now 179. **91% regression.**
- Wave 4 Trial 9 (generated_track + mountain_track via autoresearch): generated_track 2000/2000, mini_monaco 2000/2000 → now 90 and 124.

**Analysis:**
- The round-robin track switching every 6,000 steps via `multitrack_runner.train_multitrack()` produced a model that learned NEITHER track. This is catastrophic interference.
- Wave 4 Trial 9 used the same two tracks but via the autoresearch controller with different hyperparameters (switch=6,851, lr=0.000725, 90k steps). The key difference is likely in HOW the environment switching works: `multitrack_runner` closes and reopens envs, potentially disrupting PPO's rollout buffer and value function estimates.
- Mountain_track crashes at exactly step 178-179 in all 3 sets, which suggests the model has learned a fixed degenerate policy (always turn one direction) rather than responding to vision.

**Key question:** Why did Wave 4 Trial 9 succeed with similar parameters while Exp 10 failed? Possible causes: (1) env close/reopen resets PPO internal state, (2) `best_model` selection criteria differ, (3) multitrack_runner wrapping chain differs from autoresearch controller.

**Full log:** `agent/test-results/2026-04-19_10-15_exp10-two-tracks.log`

### Exp 9 vs Exp 10: Root Cause Analysis

| Aspect | Exp 9 (worked ✅) | Exp 10 (failed ❌) |
|---|---|---|
| **Tracks** | mountain_track **only** | generated_track + mountain_track (round-robin) |
| **Env setup** | `VecTransposeImage(DummyVecEnv([make_env]))`, created ONCE, never closed | `wrap_env(raw)` passed to PPO, which auto-wraps; **closed and reopened** every 6k steps |
| **Track switching** | None: single env for entire 90k steps | `close_and_switch()`: close env, exit_scene, sleep, gym.make new track |
| **PPO continuity** | `model.learn()` calls in a loop with `reset_num_timesteps=False`, same env | `model.learn()` + `model.set_env(new_env)` after each switch |
| **Eval between segments** | Direct `env.reset()` + predict loop on same env | Same, but env may be a different track than what was just trained |
| **Best model selection** | Based on eval reward on mountain_track | Based on segment reward, which could be from either track |

**Conclusion:** Exp 9 kept a single persistent env connection for all 90k steps. Exp 10 closed and reopened the env every 6k steps with `model.set_env()`. This likely disrupts PPO's rollout buffer, value estimates, and observation normalization. Exp 9 was a completely different (simpler) script with no track switching at all.

### Exp 10 vs Wave 4 Trial 9: Why Did Trial 9 Work?

Wave 4 Trial 9 used nearly identical hyperparameters to Exp 10 and the SAME `multitrack_runner.py` code, yet Trial 9 scored 1435 on mini_monaco (zero-shot) while Exp 10 crashes on every track with a mean under 180 steps.
**Wave 4 Trial 9 parameters:**
- lr=0.000725, steps_per_switch=6,851, total_timesteps=89,893
- Trained on generated_track + mountain_track (same as Exp 10)
- Used `multitrack_runner.py` via CLI subprocess (same close_and_switch logic)

**Exp 10 parameters:**
- lr=0.000725, steps_per_switch=6,000, total_timesteps=90,000
- Nearly identical to Trial 9

**But Wave 4 was mostly failures too:**

| Metric | Value |
|---|---|
| Total Wave 4 trials | 25 |
| Scores > 500 | 4 / 25 (16%) |
| Scores > 200 | 5 / 25 (20%) |
| Median score | 111.3 |
| Mean score | 343.8 |
| Std deviation | 566.2 |

The top 4 scores (1943, 1573, 1543, 1435) are massive outliers; 80% of trials scored below 200. Trial 0 (score 1943) was later found to be pre-exploit-patch and Trial 14 (1573) and Trial 25 (1543) showed inconsistent driving when re-tested (see STATE.md).

**The real conclusion:** Trial 9's success was likely due to **lucky random initialization of CNN weights**. With 80% of trials failing under the same training methodology, the multitrack round-robin approach via close_and_switch is fundamentally unreliable. The few successes are random seed lottery winners, not evidence that the method works.

**Wave 5 reproduction attempt:** We tried training on generated_track only (single track, no switching, same lr=0.000725, 90k steps) to test whether the track-switching was the problem. Result stored in `models/wave5-gentrack-only/`. (Results were poor; could not reproduce Trial 9's quality.)

**Open question:** Is there a reliable way to do multi-track training, or should we focus on single-track training with domain randomization (lighting, camera angle) to achieve generalization instead?
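
As a side check on how close the Trial 9 and Exp 10 schedules really are, the sketch below expands a round-robin schedule into per-track step counts using the `steps_per_switch` and `total_timesteps` values listed above. It rests on two assumptions that are not confirmed for `multitrack_runner.py`: training starts on the first listed track, and the final partial segment is simply truncated at `total_timesteps`. `round_robin_steps` and `segment_count` are hypothetical helpers, not part of the project.

```python
from typing import Dict, Sequence


def round_robin_steps(
    tracks: Sequence[str], total_timesteps: int, steps_per_switch: int
) -> Dict[str, int]:
    """Steps each track receives when tracks are visited in order and the
    runner switches every steps_per_switch steps (assumed behaviour)."""
    if steps_per_switch <= 0:
        raise ValueError("steps_per_switch must be positive")
    per_track = {track: 0 for track in tracks}
    steps_done, segment = 0, 0
    while steps_done < total_timesteps:
        # Assumption: the last segment is cut short at total_timesteps.
        n = min(steps_per_switch, total_timesteps - steps_done)
        per_track[tracks[segment % len(tracks)]] += n
        steps_done += n
        segment += 1
    return per_track


def segment_count(total_timesteps: int, steps_per_switch: int) -> int:
    """Number of segments, counting a trailing partial segment as one."""
    return -(-total_timesteps // steps_per_switch)  # ceiling division


if __name__ == "__main__":
    tracks = ["generated_track", "mountain_track"]
    for name, total, switch in [
        ("Trial 9", 89_893, 6_851),
        ("Exp 10", 90_000, 6_000),
    ]:
        print(name, segment_count(total, switch), round_robin_steps(tracks, total, switch))
```

Under these assumptions Exp 10 splits 48,000 / 42,000 steps over 15 segments and Trial 9 splits 47,957 / 41,936 over 14, so the per-track exposure differs by well under 1%. That is consistent with the reading above that the schedule itself is unlikely to explain the gap between the two runs.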