Test History — DonkeyCar RL Autoresearch
Last updated: 2026-04-19
This document records every significant training experiment, what was changed, what was observed, and what was learned. Use this to make methodical decisions rather than random changes.
Baseline Models (Phase 1 & 2)
Phase 2 Champion
- Model: models/champion/model.zip
- Track trained on: generated_road only
- Steps: 13,328
- Hyperparams: lr=0.000225, PPO continuous actions, ThrottleClamp(0.2), v4 reward
- Result: ✅ Drives generated_road perfectly, stays in right lane
- Zero-shot: ❌ Fails on generated_track (confirmed), ❌ Fails on mini_monaco
- Notes: Single track, simple road, model converged cleanly. Final model = best model (no divergence in 13k steps)
Mountain Track Experiments
All experiments: mountain_track only, lr=0.000725, throttle_min varies, 90k steps
Exp 1 — Mountain track, old v4 reward, throttle_min=0.2
- Reward: v4 (CTE × efficiency × speed)
- throttle_min: 0.2
- Key observation: Car gets partway up the hill, slows, stops, and rolls back. Always crashes at the same step (~153-166). Logged throttle at the hill was 0.200, which is not enough power to climb.
- Root cause: v4 reward gives zero gradient signal on hill (efficiency→0, speed→0, reward→0 simultaneously, no direction for "apply more throttle")
- Learned: v4 reward is broken for inclined terrain
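The zero-gradient argument can be checked numerically. The sketch below assumes v4 is a plain product of three factors normalised to [0, 1] and that v5 is speed times a CTE-quality factor, which is how the rewards are described in this document. The function names and the normalisation are hypothetical and are not taken from the project code.

```python
def reward_v4(speed: float, cte_quality: float, efficiency: float) -> float:
    """Assumed v4 form: product of three factors, each normalised to [0, 1]."""
    return cte_quality * efficiency * speed


def reward_v5(speed: float, cte_quality: float) -> float:
    """Assumed v5 form: speed times CTE quality, with no efficiency factor."""
    return speed * cte_quality


def d_reward_d_speed(reward_fn, speed: float, eps: float = 1e-3, **fixed) -> float:
    """Forward-difference estimate of d(reward)/d(speed), other inputs held fixed."""
    higher = reward_fn(speed=speed + eps, **fixed)
    current = reward_fn(speed=speed, **fixed)
    return (higher - current) / eps


# Car stalled on the hill: centred in lane, but efficiency and speed are both ~0.
stalled_v4 = d_reward_d_speed(reward_v4, speed=0.0, cte_quality=1.0, efficiency=0.0)
stalled_v5 = d_reward_d_speed(reward_v5, speed=0.0, cte_quality=1.0)
print(f"v4 slope={stalled_v4:.2f}  v5 slope={stalled_v5:.2f}")
```

Under these assumptions the v4 slope with respect to speed is `cte_quality * efficiency`, which vanishes when efficiency collapses on the hill, so a small throttle increase earns nothing. The v5 slope is `cte_quality` alone, which stays positive while the car is in lane.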
Exp 2 — Mountain track, old v4 reward, throttle_min=0.2, continued to 200k
- Reward: v4
- throttle_min: 0.2
- Key observation: Only 2 behaviors: turn left and hit barrier, or go straight and hit barrier at turn
- Result: ❌ Killed early — no improvement
- Learned: More steps alone cannot fix a broken reward signal
Exp 3 — Mountain track, old v4 reward, throttle_min=0.5
- Reward: v4
- throttle_min: 0.5 (increased to overcome hill)
- Key observation: Circle exploit dominated entire run — 0.5-1.75 second laps throughout
- Lap times logged: All short (exploit)
- Result: ❌ Model useless (reward=4.99 after 90k steps)
- Learned: Higher throttle got car over hill but circle exploit took over because v4 has no efficiency penalty when throttle is high
Exp 4 — Continued from Exp 3 (200k total), old v4 reward, throttle_min=0.5
- Reward: v4
- throttle_min: 0.5
- Key observation: Killed early — same 2 behaviors (left into barrier, straight into barrier)
- Result: ❌ Killed
- Learned: Continuing bad training does not help
Exp 5 — Mountain track, v5 reward, throttle_min=0.5 ⭐ KEY EXPERIMENT
- Reward: v5 (speed × CTE-quality) — NEW reward that directly incentivises throttle on hills
- throttle_min: 0.5
- Method: Direct model.learn() — NO train_multitrack(), ONE connection throughout
- Key observation: Genuine 20-22 second laps appearing from step ~30,000 onward
- Lap times: 19-22 seconds (genuine), consistently for 60k steps
- Result: ❌ Final model poor — best model was at step ~30k but we only saved final (step 90k) model
- Root cause of failure: No best-model saving. Policy peaked at 30k, diverged by 90k
- Learned:
- v5 reward WORKS for mountain track
- throttle_min=0.5 WORKS for hill
- Direct model.learn() (no track switching) avoids phantom car issues
- MUST save best model during training, not just final
Exp 6 — Mountain track, v5 reward, throttle_min=0.5, train_multitrack (1 segment)
- Reward: v5
- throttle_min: 0.5 (first segment only — close_and_switch used 0.2 for subsequent segments)
- Method: train_multitrack() with steps_per_switch=90000 (one giant segment = one checkpoint)
- Key observation: Circle exploit dominated — only 0.5-1.75 second laps throughout
- Result: ❌ Only 1 checkpoint saved (at step 90k). Best reward=4.99
- Root cause: Using steps_per_switch=TOTAL_STEPS defeated checkpointing (one segment = one save). Circle exploit reappeared (different from Exp5 — random seed variation)
- Learned: steps_per_switch=TOTAL_STEPS is WRONG for single-track training with checkpointing
Exp 7 — Mountain track, v5 reward + episode termination on short lap, throttle_min mixed
- Reward: v5 + short-lap now TERMINATES episode (not just penalty)
- throttle_min: 0.5 initial, 0.2 after segment 1 (bug: close_and_switch used module default)
- Method: train_multitrack() with steps_per_switch=6000 (15 segments)
- Key observation: Car in LEFT lane, sitting doing nothing. Not normal spawn position.
- Hypothesis: A phantom car left over from Exp 6 was still in the sim. Two TCP connections spawned two cars. The user watched the phantom (left lane, no commands) while training drove a different car.
- Result: ❌ Killed — phantom car issue
- Learned:
- close_and_switch() between segments creates phantom car risk for single-track training
- throttle_min MUST be passed consistently — module default is 0.2, not 0.5
- For single-track training: do NOT use close_and_switch() at all
Exp 8 — Mountain track, v5 reward + episode termination, throttle_min=0.5 consistently (RUNNING NOW)
- Reward: v5 + short-lap terminates episode
- throttle_min: 0.5 throughout (no close_and_switch = no module default override)
- Method: Direct model.learn() in loop — ONE connection throughout entire run
- Checkpoints: 15 numbered saves (every 6,000 steps) + best_model.zip
- PID: 2941877, log: /tmp/exp8.log
- Status: Running since 11:17, ~1h45m total
- Watch: tail -f /tmp/exp8.log
- Success criteria: Genuine 19-22 second laps appearing during training AND best_model.zip drives cleanly in deterministic eval
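A minimal sketch of the Exp 8 training loop shape: one persistent env, fixed-size segments, a numbered save after every segment, and best_model overwritten only on a new peak. It assumes a model object with Stable-Baselines3-style `learn()` and `save()` methods. `train_in_segments` and the `evaluate` callback are hypothetical names, not the project's actual script.

```python
from pathlib import Path
from typing import Callable


def train_in_segments(model, evaluate: Callable[[object], float], save_dir: str,
                      total_steps: int = 90_000,
                      segment_steps: int = 6_000) -> tuple[int, float]:
    """Train in segments on one env and return (step, score) of the best segment.

    `model` needs SB3-style learn() and save(); `evaluate` is a hypothetical
    callback that runs a deterministic eval and returns a scalar score.
    """
    out = Path(save_dir)
    out.mkdir(parents=True, exist_ok=True)
    best_step, best_score = 0, float("-inf")

    for step in range(segment_steps, total_steps + 1, segment_steps):
        # reset_num_timesteps=False keeps the step counter running across calls.
        model.learn(total_timesteps=segment_steps, reset_num_timesteps=False)
        # Numbered checkpoint after every segment, e.g. checkpoint_0006000.
        model.save(str(out / f"checkpoint_{step:07d}"))

        score = evaluate(model)
        if score > best_score:
            # Overwrite best_model only when this segment sets a new peak.
            best_step, best_score = step, score
            model.save(str(out / "best_model"))

    return best_step, best_score
```

With the defaults this yields 15 numbered checkpoints. Because best_model is only written on a new peak, a policy that peaks mid-run and then drifts still leaves its peak weights on disk.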
Wave 4 Multi-Track Experiments (generated_track + mountain_track)
Trial 9 ⭐ BEST OVERALL MODEL
- Model: models/wave4-trial-0009/model.zip
- Tracks: generated_track + mountain_track (round-robin, switch every 6,851 steps)
- Steps: 89,893 total (~45k per track)
- Hyperparams: lr=0.000725, switch=6,851
- Reward: v4 (old — before exploit patches)
- Result:
- ✅ Drives generated_track (3/3 episodes, 13-16 second genuine laps)
- ✅ Drives mini_monaco zero-shot (2000 steps, 40-second genuine laps — never seen in training)
- ❌ Crashes on mountain_track (~200 steps — hill + corner)
- ❌ Crashes on generated_road (~46 steps — turns right immediately)
- Notes: Only 1 of 25 Wave 4 trials succeeded. Suspected random seed luck. Same hyperparameters repeated in Exp2 (overnight) produced useless model.
Wave 4 Other Trials (1-25 except Trial 9)
- Result: All crashed on mini_monaco within 20-265 steps
- Median mini_monaco score: ~112 (crashes at ~130 steps)
- Trials 14, 25: Scored 1573, 1543 — suspected shuttle exploit (car going back and forth on straight)
- Learned: Multi-track training is highly sensitive to random seed. GP+UCB did not converge reliably.
Key Decisions Made (What We Keep)
| Decision | Reason |
|---|---|
| v5 reward: speed × CTE-quality | Directly incentivises throttle on hills. v4 gave zero gradient on inclines. |
| throttle_min=0.5 for mountain_track | Overcomes hill. Car can now reach first corner. |
| Short-lap penalty + EPISODE TERMINATION | Penalty alone insufficient — model stayed alive and accumulated rewards between laps. Termination makes circling strictly unprofitable. |
| Numbered checkpoints every segment | Never lose a good mid-training model again (ADR-017) |
| best_model.zip updated on new best segment score | Final model ≠ best model. Peak can be at 30k even if final is at 90k. |
| Single TCP connection for single-track training | Avoids phantom car problem from close_and_switch() |
| lr=0.000725 | From Trial 9 (best model). Consistent with good results. |
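The short-lap decision in the table above can be sketched as a post-processing step on each (reward, done) pair. The 10-second threshold and the penalty value are assumptions chosen for illustration: the exploit laps recorded in this document are under 7 seconds and the genuine laps are over 13 seconds. `apply_short_lap_rule` is a hypothetical helper, not the project's wrapper.

```python
from typing import Optional, Tuple

MIN_LAP_SECONDS = 10.0    # assumed threshold between exploit laps and genuine laps
SHORT_LAP_PENALTY = -1.0  # assumed penalty value


def apply_short_lap_rule(reward: float, done: bool,
                         lap_time_s: Optional[float]) -> Tuple[float, bool]:
    """Adjust one step's (reward, done) when a lap has just been completed.

    lap_time_s is None on steps where no lap finished.
    """
    if lap_time_s is not None and lap_time_s < MIN_LAP_SECONDS:
        # Penalise AND end the episode, so no further reward can accumulate.
        return SHORT_LAP_PENALTY, True
    return reward, done
```

Returning `done=True` is the part that matters. With a penalty alone the episode continues and the agent keeps collecting per-step reward between short laps, which is the failure the table describes.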
Key Problems Still Open
| Problem | Status |
|---|---|
| Mountain track circle exploit | Partially fixed — episode termination added. Exp8 will show if it holds. |
| Mountain track — car can't navigate first corner reliably | Still being investigated. Exp5 showed genuine laps so it IS solvable. |
| Multi-track generalization is random-seed dependent | No reliable solution yet. Trial 9 was lucky. |
| Mountain track model doesn't generalise to other tracks | Expected — single track training generalises poorly. Next step after Exp8 succeeds. |
Next Steps (Proposed, Not Yet Run)
- Exp 8 result: If best_model.zip drives mountain_track reliably → proceed to Step 2
- Combine mountain_track + generated_track using v5 reward, throttle_min=0.5, proper checkpointing
- Test combined model on all 4 tracks — can it generalise to mini_monaco like Trial 9 did?
- If yes: We have reproduced Trial 9 reliably with a better reward function
Exp 8 — Mountain track, v5 reward, throttle_min=0.5, CORRECT checkpointing ✅ COMPLETED
- Reward: v5 (speed × CTE-quality)
- throttle_min: 0.5
- Method: Direct model.learn() loop, single TCP connection, NO close_and_switch
- Steps: 90,000 total | 6,000 per segment | 15 checkpoints
- Circle exploit fix: Short-lap terminates episode immediately
- Peak segment: Seg 10 (step 60,000) — 567 reward / 2000 steps (FULL EVAL on mountain_track!)
- Policy diverged: Seg 11-15 (31, 20 reward) — best_model.zip captured the peak correctly
- Checkpoints saved: checkpoint_0006000.zip through checkpoint_0090000.zip + best_model.zip
- Final eval results using best_model.zip (step 60k weights):
| Track | Ep1 | Ep2 | Ep3 | Mean steps | Result |
|---|---|---|---|---|---|
| mountain_track (training) | 382 | 529 | 182 | 364 | ❌ crashes |
| generated_track (zero-shot) | 63 | 61 | 61 | 62 | ❌ crashes |
| mini_monaco (zero-shot) | 154 | 155 | 104 | 138 | ❌ crashes at one corner |
| generated_road (zero-shot) | 41 | 42 | 41 | 41 | ❌ crashes |
- Throttle test: mini_monaco at throttle_min=0.5 over 5 episodes: 93/94/79/95/94 steps (mean=91, very consistent = same corner every time). throttle_min=0.2 test impossible — action space baked in at training time.
- Key findings:
- ✅ Circle exploit fully eliminated — no short laps observed
- ✅ Best model saving worked — captured step 60k peak, not step 90k drift
- ✅ Genuine 20-22 second laps during training from step ~18k onward
- ❌ Model crashes at exactly the same corner on mini_monaco every time (too fast)
- ❌ throttle_min=0.5 baked into action space — model cannot output throttle < 0.5, cannot slow for corners
- 🔑 INSIGHT: v4 + 0.2 failed because v4 gradient = 0 on hill. v5 gradient is non-zero — model CAN learn to apply high throttle when needed even with 0.2 floor
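To make the throttle-floor finding concrete, here is a sketch that assumes the floor acts as a simple clamp on the throttle the policy requests. The real ThrottleClamp may instead rescale the action or narrow the action space bounds. In either case the car never receives less than the floor.

```python
def clamp_throttle(raw_throttle: float, throttle_min: float) -> float:
    """Assumed floor behaviour: force throttle into [throttle_min, 1.0]."""
    return max(throttle_min, min(1.0, raw_throttle))


# The policy asks for light throttle going into a corner.
requested = 0.25
with_high_floor = clamp_throttle(requested, throttle_min=0.5)  # floor overrides the request
with_low_floor = clamp_throttle(requested, throttle_min=0.2)   # request passes through
print(with_high_floor, with_low_floor)
```

With a 0.5 floor the corner entry is always taken at 0.5 or more, which fits the repeatable crash at the same mini_monaco corner.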
Exp 9 — Mountain track, v5 reward, throttle_min=0.2 (RUNNING)
- Change from Exp8: throttle_min: 0.5 → 0.2 (only change)
- Reward: v5 (speed × CTE-quality) — UNCHANGED
- Hypothesis: v5 reward provides non-zero gradient signal on hill (∂reward/∂speed is non-zero). Model CAN learn to output high throttle on hill. With 0.2 floor, model has full range [0.2, 1.0] and can apply lower throttle on corners — potentially solving mini_monaco corner crash.
- What we never tested: (0.2, v4) failed. (0.5, v5) worked. (0.2, v5) was never tried.
- Risk: Model may still stall on hill if gradient convergence is slow in early training. StuckTermination (-1.0) + v5 speed gradient together should push toward higher throttle.
- Next test (Exp10): Add track_progress bonus to reward (v6) — one variable at a time.
- Save dir: models/exp9-mountain-v5-throttle02/
- Watch: tail -f /tmp/exp9.log
Exp 9 — Evaluation Results (3-set test, 1 run per track per set)
Model tested: models/exp9-mountain-v5-throttle02/best_model.zip
Date: 2026-04-18
Test setup: 3 independent sets, lighting randomises each run (no fixed seed)
| Track | Set 1 | Set 2 | Set 3 | Mean | Pattern |
|---|---|---|---|---|---|
| mountain_track (trained) | ✅ 2000 | ✅ 2000 | ✅ 2000 | 2000 | Rock solid |
| generated_track (zero-shot) | ❌ 79 | ❌ 61 | ❌ 82 | 74 | Always fails — can't make first corner |
| generated_road (zero-shot) | ❌ 651 | ✅ 2000 | ❌ 1203 | 1285 | Highly variable — lighting dependent |
| mini_monaco (zero-shot) | ❌ 32 | ❌ 60 | ❌ 34 | 42 | Always fails — veers right immediately |
User observations:
- mountain_track: 80-90% of time on or near centre yellow line. Solid driving.
- generated_road: Driving looks good when it works, but goes off course. Lighting variation causes inconsistency.
- generated_track: Cannot make first corner at all. Model sees nothing it recognises.
- mini_monaco: Veers right immediately at start before any visible driving. Crashes before reaching the road.
Key finding — Lighting effect confirmed: Generated_road varies 651→2000→1203 with identical model and track. ONLY lighting changes. Mountain_track is immune because it trained under many random lighting conditions. Generated_track and mini_monaco fail regardless of lighting — visual domain too different.
What this tells us about next steps: Train on mountain_track + generated_track together (v5 reward, throttle_min=0.2). Both tracks have random lighting each episode → model forced to learn lighting-invariant features. Goal: model that is reliable on both training tracks, then test generalisation to generated_road and mini_monaco.
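The Exp 9 table can be summarised in a few lines, which makes the lighting-sensitive track stand out by its spread rather than by its mean. The step counts are copied from the table above. `summarise` is a hypothetical helper.

```python
from statistics import mean

# Steps survived per evaluation set, copied from the Exp 9 table.
EXP9_STEPS = {
    "mountain_track": [2000, 2000, 2000],
    "generated_track": [79, 61, 82],
    "generated_road": [651, 2000, 1203],
    "mini_monaco": [32, 60, 34],
}
STEP_CAP = 2000


def summarise(steps: list[int]) -> dict:
    """Mean, spread (max - min), and number of sets that reached the step cap."""
    return {
        "mean": round(mean(steps)),
        "spread": max(steps) - min(steps),
        "full_runs": sum(s >= STEP_CAP for s in steps),
    }


for track, steps in EXP9_STEPS.items():
    print(track, summarise(steps))
```

A large spread with at least one full run (generated_road) points at run-to-run variation such as lighting. A small spread with no full runs (generated_track, mini_monaco) points at a failure that does not depend on it.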
Exp 10 — Two tracks: generated_track + mountain_track, v5 reward, throttle_min=0.2
- Change from Exp9: Added generated_track as second training track
- Reward: v5 (speed × CTE) — unchanged
- throttle_min: 0.2 — unchanged from Exp9
- Training tracks: generated_track + mountain_track (round-robin, switch every 6,000 steps)
- Total steps: 90,000 | Steps per switch: 6,000 | ~7.5 rotations through both tracks
- lr: 0.000725 — unchanged
- Hypothesis: Adding generated_track visual diversity forces model to learn lighting-invariant road-following features. Mountain_track teaches hill throttle. Together should generalise better to generated_road and potentially mini_monaco.
- Expected results: mountain_track reliable, generated_track reliable, generated_road improved, mini_monaco TBD
- This is essentially Trial 9 repeated with: v5 reward + throttle_min=0.2 + proper checkpointing + exploit fix
Exp 10 — Evaluation Results (3-set test, 2026-04-19)
Model tested: models/exp10-two-tracks/best_model.zip
Result: TOTAL FAILURE — crashes on every track, every set.
| Track | Set 1 | Set 2 | Set 3 | Mean | Verdict |
|---|---|---|---|---|---|
| mountain_track (trained) | 178 | 179 | 179 | 179 | ❌ Crashes at same spot every time |
| generated_track (trained) | 99 | 82 | 88 | 90 | ❌ Crashes almost immediately |
| generated_road (zero-shot) | 135 | 223 | 105 | 154 | ❌ Crashes early |
| mini_monaco (zero-shot) | 111 | 133 | 129 | 124 | ❌ Crashes early |
Comparison to previous best models:
- Exp 9 (mountain only): mountain_track was 2000/2000 every time → now 179. 91% regression.
- Wave 4 Trial 9 (generated_track + mountain_track via autoresearch): generated_track 2000/2000, mini_monaco 2000/2000 → now 90 and 124.
Analysis:
- The round-robin track switching every 6,000 steps via multitrack_runner.train_multitrack() produced a model that learned NEITHER track. This is catastrophic interference.
- Wave 4 Trial 9 used the same two tracks but via the autoresearch controller with different hyperparameters (switch=6,851, lr=0.000725, 90k steps). The key difference is likely in HOW the environment switching works: multitrack_runner closes and reopens envs, potentially disrupting PPO's rollout buffer and value function estimates.
- Mountain_track crashes at exactly step 178-179 in all 3 sets, which suggests the model has learned a fixed degenerate policy (always turn one direction) rather than responding to vision.
Key question: Why did Wave 4 Trial 9 succeed with similar parameters but Exp 10 failed?
Possible causes: (1) env close/reopen resets PPO internal state, (2) best_model selection
criteria differs, (3) multitrack_runner wrapping chain differs from autoresearch controller.
Full log: agent/test-results/2026-04-19_10-15_exp10-two-tracks.log
Exp 9 vs Exp 10 — Root Cause Analysis
| Aspect | Exp 9 (worked ✅) | Exp 10 (failed ❌) |
|---|---|---|
| Tracks | mountain_track only | generated_track + mountain_track (round-robin) |
| Env setup | VecTransposeImage(DummyVecEnv([make_env])), created ONCE, never closed | wrap_env(raw) passed to PPO, which auto-wraps; closed and reopened every 6k steps |
| Track switching | None: single env for entire 90k steps | close_and_switch(): close env, exit_scene, sleep, gym.make new track |
| PPO continuity | Repeated model.learn() calls with reset_num_timesteps=False, same env | model.learn() + model.set_env(new_env) after each switch |
| Eval between segments | Direct env.reset() + predict loop on same env | Same, but env may be a different track than what was just trained |
| Best model selection | Based on eval reward on mountain_track | Based on segment reward — could be from either track |
Conclusion: Exp 9 kept a single persistent env connection for all 90k steps.
Exp 10 closed and reopened the env every 6k steps with model.set_env().
This likely disrupts PPO's rollout buffer, value estimates, and observation normalization.
Exp 9 was a completely different (simpler) script with no track switching at all.
Exp 10 vs Wave 4 Trial 9 — Why Did Trial 9 Work?
Wave 4 Trial 9 used nearly identical hyperparameters to Exp 10 and the SAME multitrack_runner.py code, yet Trial 9 scored 1435 on mini_monaco (zero-shot) while Exp 10 crashes on every track with a mean episode length under 180 steps.
Wave 4 Trial 9 parameters:
- lr=0.000725, steps_per_switch=6,851, total_timesteps=89,893
- Trained on generated_track + mountain_track (same as Exp 10)
- Used multitrack_runner.py via CLI subprocess (same close_and_switch logic)
Exp 10 parameters:
- lr=0.000725, steps_per_switch=6,000, total_timesteps=90,000
- Nearly identical to Trial 9
But Wave 4 was mostly failures too:
| Metric | Value |
|---|---|
| Total Wave 4 trials | 25 |
| Scores > 500 | 4 / 25 (16%) |
| Scores > 200 | 5 / 25 (20%) |
| Median score | 111.3 |
| Mean score | 343.8 |
| Std deviation | 566.2 |
The top 4 scores (1943, 1573, 1543, 1435) are massive outliers — 80% of trials scored below 200. Trial 0 (score 1943) was later found to be pre-exploit-patch and Trial 14 (1573) and Trial 25 (1543) showed inconsistent driving when re-tested (see STATE.md).
The real conclusion: Trial 9's success was likely due to lucky random initialization of CNN weights. With 80% of trials failing under the same training methodology, the multitrack round-robin approach via close_and_switch is fundamentally unreliable. The few successes are random seed lottery winners, not evidence that the method works.
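A rough way to quantify the seed lottery: if each run is treated as an independent draw with the Wave 4 success rate, the chance of seeing at least one good model grows only slowly with the number of runs. This is a simplification. Wave 4 trials also varied hyperparameters, so the per-run rate used here is an estimate, not a measured property of any single configuration.

```python
def p_at_least_one_success(p_single: float, n_runs: int) -> float:
    """Chance that at least one of n independent runs succeeds."""
    return 1.0 - (1.0 - p_single) ** n_runs


P_SINGLE = 5 / 25  # Wave 4: 5 of 25 trials scored above 200

for n in (1, 3, 5, 10):
    chance = p_at_least_one_success(P_SINGLE, n)
    print(f"{n:>2} runs: {chance:.0%} chance of at least one score above 200")
```

On this reading a single failed run such as Exp 10 is the expected outcome (80% of the time), so by itself it cannot distinguish a bug in the script from ordinary bad luck.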
Wave 5 reproduction attempt: We tried training on generated_track only
(single track, no switching, same lr=0.000725, 90k steps) to test whether
the track-switching was the problem. Result stored in models/wave5-gentrack-only/.
(Results were poor — could not reproduce Trial 9's quality.)
Open question: Is there a reliable way to do multi-track training, or should we focus on single-track training with domain randomization (lighting, camera angle) to achieve generalization instead?
Exp 11 — Parallel DummyVecEnv, v5 reward (ABORTED)
- Date: 2026-04-19
- Change from Exp10: Two sim instances (port 9091 + 9093), DummyVecEnv wraps both. PPO sees both tracks in every rollout batch. No close_and_switch.
- Tracks: generated_track (9091) + mountain_track (9093)
- Reward: v5 (speed × CTE) — same as Exp 9/10
- Result: ABORTED at 66k/90k steps. Circular driving observed on generated_track. v5 reward has no efficiency term → circles at CTE≈0 earn positive reward.
- Positive: Parallel env infrastructure works! Both sims connected, PPO trained stably with no env switching issues. Consistent improvement 14.7→67.8 combined.
- Negative: Circular driving exploit returned because v5 dropped efficiency.
Exp 11b — Parallel DummyVecEnv, v6 reward (anti-circle gate)
- Date: 2026-04-19
- Change from Exp11: Reward v6 (speed × CTE + efficiency gate ≥ 0.15). Also stuck_steps 80→40 (faster stuck termination).
- Tracks: generated_track (9091) + mountain_track (9093)
- Total steps: 90,000 | lr=0.000725 | throttle_min=0.2
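A sketch of the v5 to v6 change, assuming v5 is speed times CTE quality and that the v6 gate zeroes the reward whenever an efficiency measure falls below 0.15. How efficiency is computed is not specified here, so it is passed in as a number in [0, 1]. The function names and the sample values are hypothetical.

```python
EFFICIENCY_GATE = 0.15  # gate threshold quoted for v6


def reward_v5(speed: float, cte_quality: float) -> float:
    """Assumed v5 form: speed times CTE quality, both normalised to [0, 1]."""
    return speed * cte_quality


def reward_v6(speed: float, cte_quality: float, efficiency: float,
              gate: float = EFFICIENCY_GATE) -> float:
    """Assumed v6 form: v5 reward, zeroed whenever efficiency is below the gate."""
    if efficiency < gate:
        return 0.0
    return reward_v5(speed, cte_quality)


# Tight circle on the centre line: fast, CTE near zero, almost no net progress.
circling = dict(speed=0.6, cte_quality=0.95, efficiency=0.02)
# Normal driving along the track.
driving = dict(speed=0.6, cte_quality=0.95, efficiency=0.90)

print(reward_v5(circling["speed"], circling["cte_quality"]))  # v5 still pays for circling
print(reward_v6(**circling))                                   # gated to zero
print(reward_v6(**driving))                                    # unchanged from v5
```

The gate only removes reward. It does not add a penalty, so normal driving scores exactly as it did under v5.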
Training progress (eval at each 6k checkpoint; values suffixed s are steps survived, r is combined reward):
| Steps | gen_track | mountain | Combined | Note |
|---|---|---|---|---|
| 6k | 91s | 130s | 10.7r | Early |
| 18k | 100s | 100s | 15.9r | Improving |
| 36k | 161s | 160s | 26.2r | ⭐ |
| 42k | 160s | 159s | 28.9r | ⭐ |
| 60k | 164s | 163s | — | Plateau |
| 78k | 169s | 168s | 29.2r | ⭐ |
| 90k | 173s | 172s | — | End |
Evaluation results (best_model, 3 sets per track):
| Track | Set 1 | Set 2 | Set 3 | Mean | Verdict |
|---|---|---|---|---|---|
| mountain_track (trained) | 195 | 196 | 192 | 194 | ❌ |
| generated_track (trained) | 192 | 194 | 192 | 193 | ❌ |
| generated_road (zero-shot) | 192 | 196 | 194 | 194 | ❌ |
| mini_monaco (zero-shot) | 194 | 192 | 196 | 194 | ❌ |
Analysis:
- ✅ No circular driving (efficiency gate works)
- ✅ Remarkably consistent: all tracks ~194 steps, very low variance
- ✅ Parallel env infrastructure is stable and reliable
- ❌ Model plateaus at ~170-195 steps and never improves past that
- ❌ Much worse than Exp 9 (mountain only: 2000/2000) or Wave 4 Trial 9 (2000/2000)
- The consistency across all 4 tracks (including zero-shot) suggests the model learned a generic short-drive policy, not track-specific features
- Possible cause: 90k steps may be insufficient for 2-env parallel training (effective steps per track = 45k each), or the efficiency gate may be suppressing early exploration
Key findings:
- Parallel DummyVecEnv works mechanically — this is the right infrastructure
- v6 reward prevents circular driving
- But 90k steps with 2 parallel envs may not be enough training budget
- Compare: Exp 9 (single track, 90k steps, v5) → 2000 steps. Exp 11b (2 tracks, 90k steps, v6) → 194 steps. The training budget per track is halved AND the reward is harder to exploit.
Next experiments to consider:
- Increase total_timesteps to 180k-250k (restore per-track budget)
- Try v6 reward on single track first to isolate reward vs multi-track effects
- Try v5 reward with parallel envs but longer training (accept some circling)
- Check if efficiency gate triggers too aggressively during normal cornering
Exp 14b — Mountain finetune from exp14 champion (2026-04-19)
- Script: agent/experiments/exp14_finetune_v5.py
- Warm start: agent/models/exp14-mountain-v5/best_model.zip
- Schedule:
  - phase 1: runtime throttle floor 0.4
  - phase 2: runtime throttle floor 0.2
- Goal: improve hill climbing, robustness, and lap time on mountain_track
Important outcome
The finetune run did not improve monotonically. It briefly improved, then later degraded badly. This means the final/latest checkpoint is not the model we want to keep.
Candidate checkpoint comparison
We ran a fresh deterministic comparison on mountain only:
- 9 episodes per model
- 2000 step cap
- Results saved to: agent/outerloop-results/mountain_candidate_eval_2026-04-19.jsonl and agent/outerloop-results/mountain_candidate_eval_2026-04-19.md
| Model | Floor | Success eps | Full 2k eps | Avg laps/ep | Total laps | Mean lap | Best lap | Avg steps | Verdict |
|---|---|---|---|---|---|---|---|---|---|
| exp14_base | 0.2 | 7/9 | 3/9 | 1.78 | 16 | 29.24s | 27.02s | 1332 | Original champion |
| ft_006k | 0.4 | 1/9 | 0/9 | 0.11 | 1 | 21.36s | 21.36s | 335 | Very fast but unusably fragile |
| ft_024k | 0.4 | 4/9 | 0/9 | 0.56 | 5 | 21.58s | 20.53s | 575 | Fast but fragile |
| ft_030k | 0.4 | 1/9 | 0/9 | 0.22 | 2 | 21.53s | 20.72s | 317 | Very fast but unusably fragile |
| ft_036k | 0.2 | 9/9 | 6/9 | 2.78 | 25 | 27.93s | 26.16s | 1841 | Best overall balance |
| ft_042k | 0.2 | 8/9 | 4/9 | 1.89 | 17 | 29.25s | 27.09s | 1404 | Decent, but worse than 36k |
| ft_048k | 0.2 | 6/9 | 3/9 | 1.44 | 13 | 31.15s | 28.31s | 1127 | Degraded |
Best model captured
Best overall checkpoint from the finetune:
agent/models/exp14-mountain-v5-finetune/checkpoint_0036000.zip
Promoted copy saved as:
agent/models/exp14-mountain-v5-finetune/best_robust_model_0036000.zip
Key learning
- Early 0.4-floor checkpoints can produce very fast laps, but are too fragile to trust.
- The best mountain finetune model is the 36k checkpoint after switching back to 0.2 floor, not the later checkpoints.
- Later finetune checkpoints collapsed badly, matching the user's visual observation of wheelspin / poor driving.
Exp 15 — Generated track warm-start from mountain champion (2026-04-19)
- Script: agent/experiments/exp15_gentrack_from_mountain.py
- Warm start: agent/models/exp14-mountain-v5-finetune/best_robust_model_0036000.zip
- Target track: generated_track
- Target setup: Exp 13-style v4 generated-track training
- Result: ❌ Failed
Observed behavior:
- Model tried exploit-like behavior near the start / first corner
- Did not learn clean generated-track driving
- By ~25k steps, it was clearly far behind the known-good scratch run
Log evidence:
- [20,000] reward=45.0 steps=47 laps=0
- [25,000] reward=23.4 steps=30 laps=0
- Short exploit laps appeared in the log (6.5s, 4.91s)
Conclusion:
- Mountain → generated warm-start transfer is poor in this direct setup
- The mountain policy prior seems to bias the agent toward bad local behavior instead of helping generated-track learning
Exp 16 — Mountain track warm-start from generated champion (2026-04-19)
- Script: agent/experiments/exp16_mountain_from_gentrack.py
- Warm start: agent/models/exp13-gentrack-v4/best_model.zip
- Target track: mountain_track
- Target setup: Exp 14-style v5 mountain training
- Result: ❌ Failed
Observed behavior:
- No meaningful mountain learning
- Repeated short crash pattern
- Never developed lap-completing mountain behavior
Log evidence:
- [210,000] reward=10.2 steps=195 laps=0
- [215,000] reward=10.1 steps=193 laps=0
Conclusion:
- Generated → mountain warm-start transfer is also poor in this direct setup
- The generated-track champion does not bootstrap mountain hill learning effectively here
Transfer-learning takeaway (current evidence)
Direct cross-track warm starts failed in both directions:
- mountain → generated: failed / exploit-prone
- generated → mountain: failed / short-crash plateau
Current interpretation:
- the single-track policies are too specialized for naive direct transfer, and/or
- the mountain sim physics differences are large enough to break transfer
For now:
- keep the single-track champions as separate specialists
- do not assume direct cross-track warm starts are beneficial
Mountain Track Friction Fix (2026-04-27)
Root cause
WheelPhys.cs scales wheel grip by the static friction of whatever surface the
wheel is touching: fFriction.stiffness = hit.collider.material.staticFriction * originalForwardStiffness.
mountain_track.unity assigned the Slippery physics material (staticFriction=0.1)
to 4 track surface colliders from the long_road prefab. This gave the car 1/5
the normal grip on the hill, causing visible wheelspin even at full throttle.
The Slippery material is intentional on genuinely icy surfaces (thunderhill) but was incorrect on mountain_track's asphalt hill.
Fix applied
Replaced all 4 Slippery material assignments with Road material (staticFriction=0.5)
in sdsim/Assets/Scenes/mountain_track.unity.
| Material | staticFriction | GUID |
|---|---|---|
| Slippery (removed) | 0.1 | c0e12c099c364af4e9e311a43d0f12c4 |
| Road (applied) | 0.5 | 7884193b0ead347a38a13a67f294dfb5 |
To activate
The training setup uses the pre-built Windows executable (DonkeySimWin/donkey_sim.exe),
not a locally-compiled build. The scene file edit in sdsandbox/ has no effect on the
running binary — it only matters if the sim is ever rebuilt from source in Unity Editor.
This fix is deferred. Proceed with Exp 17 using the existing executable. If mountain hill training in Exp 17 specifically struggles (short episodes that plateau and never improve), that is the signal to pursue a Unity Editor rebuild.
The scene file change is committed in sdsandbox/ and will apply automatically if the sim is rebuilt for any other reason. No Python code changes needed.
Expected effect
- Hill wheelspin should stop or greatly reduce
- Throttle_min=0.2 + v5 reward should be even more effective on the hill
- All future mountain experiments benefit; no code changes needed
Strategy Review and Exp 17 Plan (2026-04-27)
Where the project stands
After 16 experiments and 4 autoresearch phases, the core problem is clear: multi-track training is needed for generalisation, but the training method has been unreliable. Here is the summary of what each approach found:
| Approach | Outcome |
|---|---|
| Round-robin close-and-switch (Wave 4, Exp 10) | 80% failure. PPO rollout buffer disrupted on env swap. Lucky seed (Trial 9) worked once but cannot be reproduced. |
| Parallel DummyVecEnv 90k steps (Exp 11b) | Infrastructure valid, no catastrophic forgetting, but 90k steps / 2 tracks = ~45k effective per track. Not enough. |
| Cross-track warm starts (Exp 15, 16) | Both directions failed. Single-track specialists do not transfer cleanly. |
| Single-track PPO (Exp 9, 13, 14) | Reliable but no generalisation. |
The conclusion: parallel DummyVecEnv is the right architecture; the main suspected limitation is training budget. Exp 11b was mechanically sound but appears to have been starved of steps.
Exp 17 — Parallel DummyVecEnv, 400k–500k steps
This is the primary next experiment.
| Parameter | Value | Reason |
|---|---|---|
| Architecture | DummyVecEnv([generated_track:9091, mountain_track:9093]) | Validated in Exp 11b; no PPO disruption |
| Total timesteps | 400,000–500,000 | ~200k–250k effective per track; Exp 11b suggested 90k total is insufficient |
| Reward | v6 on both envs (efficiency gate + CTE patience terminator) | Blocks circular exploit on generated_track; gate threshold may be tuned |
| throttle_min | 0.2 both envs (or 0.5 mountain, 0.2 generated — see ADR-020) | v5/v6 gradient non-zero on hills at 0.2 |
| learning_rate | 0.000725 | From Trial 9 and Exp 9 — consistent with best results |
| Checkpoint | every 20,000 steps + best_model.zip tracked throughout | ADR-017: best model ≠ final model |
| Eval | mini_monaco zero-shot at every checkpoint | Detect the peak before policy drifts |
| Warm start | None — train from random weights | ADR-024: cross-track warm starts failed |
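The architecture row can be sketched as follows. The env ids and the `conf` keys follow gym-donkeycar conventions as far as we know them, and should be treated as assumptions to verify against the installed version. The project's reward and throttle wrappers are left out. `env_specs`, `make_env_fn` and `build_vec_env` are hypothetical names.

```python
def env_specs(host: str = "127.0.0.1"):
    """(env_id, conf) per simulator instance; env ids and conf keys are assumptions."""
    return [
        ("donkey-generated-track-v0", {"host": host, "port": 9091}),
        ("donkey-mountain-track-v0", {"host": host, "port": 9093}),
    ]


def make_env_fn(env_id: str, conf: dict):
    """Return a zero-argument factory with env_id and conf bound at creation time."""
    def _init():
        # Imported lazily so the spec list can be inspected without a simulator.
        import gym
        import gym_donkeycar  # noqa: F401  (registers the donkey-* env ids)
        return gym.make(env_id, conf=conf)
    return _init


def build_vec_env():
    """One DummyVecEnv stepping both simulators together; needs both sims running."""
    from stable_baselines3.common.vec_env import DummyVecEnv, VecTransposeImage
    factories = [make_env_fn(env_id, conf) for env_id, conf in env_specs()]
    return VecTransposeImage(DummyVecEnv(factories))
```

Using a factory function rather than a lambda inside the list comprehension avoids the late-binding pitfall where every factory would end up pointing at the last port. With two envs in the vector, each PPO rollout contains steps from both tracks, and `total_timesteps` is shared between them.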
Setup checklist before running:
- Two sim instances running: one on port 9091, one on port 9093
- Each instance loaded with its configured track (generated_track on 9091, mountain_track on 9093)
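- Simulator rebuild with the mountain friction fix is deferred: run on the existing executable, and rebuild only if mountain training plateaus at short episodes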
- Verify throughput: run 2-minute timing benchmark, set step cap accordingly (ADR-014)
Success criterion: mini_monaco zero-shot score > 500 (at least 25% of a full 2000-step episode) reliably across 3 evaluation sets, reproducible across 2+ runs.
Fallback: Curriculum training (if Exp 17 plateaus below 200)
If Exp 17 cannot get past ~200 steps on mini_monaco:
- Phase A: generated_track only, 150k steps (establish road-following)
- Phase B: add mountain_track to DummyVecEnv, continue 250k more steps
- Rationale: gives the policy a foundation before the harder mountain physics
Fallback: v6 efficiency gate tuning (if gate is too aggressive)
Log what fraction of steps are gated (reward zeroed) in the first 100k steps. If >40%, lower the gate threshold from 0.15 to 0.10 for the first 150k steps, then raise it back to 0.15. Prevents the gate from suppressing early exploration.
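The gate-tuning rule above can be sketched as two small helpers. This reads the rule as applying to a follow-up run: the gated fraction is measured over the first 100k steps of one run, and the relaxed threshold is then used for the first 150k steps of the next. Both function names are hypothetical.

```python
def gated_fraction(efficiencies, gate: float = 0.15) -> float:
    """Fraction of logged steps whose efficiency fell below the gate (reward zeroed)."""
    values = list(efficiencies)
    if not values:
        return 0.0
    return sum(e < gate for e in values) / len(values)


def gate_for_step(step: int, early_gated_fraction: float) -> float:
    """Gate threshold to use at a given training step.

    early_gated_fraction is the gated fraction measured over the first 100k steps.
    """
    if early_gated_fraction > 0.40 and step < 150_000:
        return 0.10  # relaxed gate while the policy is still exploring
    return 0.15      # default gate


# Example: five logged efficiency values, two of them below the 0.15 gate.
print(gated_fraction([0.05, 0.20, 0.10, 0.30, 0.50]))
```

Logging per-step efficiency alongside the reward is enough to compute the fraction after the fact, so no change to the reward function is needed to collect it.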