# Test History — DonkeyCar RL Autoresearch

Last updated: 2026-04-27

This document records every significant training experiment: what was
changed, what was observed, and what was learned. Use it to make
methodical decisions rather than random changes.

---

## Baseline Models (Phase 1 & 2)

### Phase 2 Champion
- **Model:** `models/champion/model.zip`
- **Track trained on:** generated_road only
- **Steps:** 13,328
- **Hyperparams:** lr=0.000225, PPO continuous actions, ThrottleClamp(0.2), v4 reward
- **Result:** ✅ Drives generated_road perfectly, stays in right lane
- **Zero-shot:** ❌ Fails on generated_track (confirmed), ❌ Fails on mini_monaco
- **Notes:** Single track, simple road, model converged cleanly. Final model = best model (no divergence in 13k steps)

---

## Mountain Track Experiments

All experiments: mountain_track only, lr=0.000725, throttle_min varies, 90k steps unless noted

### Exp 1 — Mountain track, old v4 reward, throttle_min=0.2
- **Reward:** v4 (CTE × efficiency × speed)
- **throttle_min:** 0.2
- **Key observation:** Car gets partway up the hill, slows, stops, and rolls back. Always crashes at about the same step (~153-166). Logged throttle on the hill was 0.200, which is not enough power to climb
- **Root cause:** v4 reward gives zero gradient signal on the hill: efficiency→0, speed→0, and reward→0 simultaneously, so nothing points toward "apply more throttle"
- **Learned:** v4 reward is broken for inclined terrain

### Exp 2 — Mountain track, old v4 reward, throttle_min=0.2, continued to 200k
- **Reward:** v4
- **throttle_min:** 0.2
- **Key observation:** Only 2 behaviors: turn left and hit barrier, or go straight and hit barrier at turn
- **Result:** ❌ Killed early — no improvement
- **Learned:** More steps alone cannot fix a broken reward signal

### Exp 3 — Mountain track, old v4 reward, throttle_min=0.5
- **Reward:** v4
- **throttle_min:** 0.5 (increased to overcome hill)
- **Key observation:** Circle exploit dominated entire run — 0.5-1.75 second laps throughout
- **Lap times logged:** All short (exploit)
- **Result:** ❌ Model useless (reward=4.99 after 90k steps)
- **Learned:** Higher throttle got car over hill but circle exploit took over because v4 has no efficiency penalty when throttle is high

### Exp 4 — Continued from Exp 3 (200k total), old v4 reward, throttle_min=0.5
- **Reward:** v4
- **throttle_min:** 0.5
- **Key observation:** Killed early — same 2 behaviors (left into barrier, straight into barrier)
- **Result:** ❌ Killed
- **Learned:** Continuing bad training does not help

### Exp 5 — Mountain track, v5 reward, throttle_min=0.5 ⭐ KEY EXPERIMENT
- **Reward:** v5 (speed × CTE-quality) — NEW reward that directly incentivises throttle on hills
- **throttle_min:** 0.5
- **Method:** Direct model.learn() — NO train_multitrack(), ONE connection throughout
- **Key observation:** Genuine 20-22 second laps appearing from step ~30,000 onward
- **Lap times:** 19-22 seconds (genuine), consistently for 60k steps
- **Result:** ❌ Final model poor — best model was at step ~30k but we only saved final (step 90k) model
- **Root cause of failure:** No best-model saving. Policy peaked at 30k, diverged by 90k
- **Learned:**
  1. v5 reward WORKS for mountain track
  2. throttle_min=0.5 WORKS for hill
  3. Direct model.learn() (no track switching) avoids phantom car issues
  4. MUST save best model during training, not just final

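The "zero gradient on the hill" argument behind the switch from v4 to v5 can be checked numerically. The sketch below is illustrative only: the exact reward code is not reproduced in this log, so the `cte_quality` shape, the `max_cte` value, and treating efficiency as an independent input are all assumptions. It estimates ∂reward/∂speed for a car that is centred but stalled on the hill.

```python
def cte_quality(cte: float, max_cte: float = 2.0) -> float:
    """Assumed CTE-quality term: 1.0 on the centre line, 0.0 at max_cte or beyond."""
    return max(0.0, 1.0 - abs(cte) / max_cte)


def reward_v4(cte: float, efficiency: float, speed: float) -> float:
    """v4 as summarised in this log: CTE x efficiency x speed (assumed form)."""
    return cte_quality(cte) * efficiency * speed


def reward_v5(cte: float, speed: float) -> float:
    """v5 as summarised in this log: speed x CTE-quality (assumed form)."""
    return speed * cte_quality(cte)


def d_reward_d_speed(reward_of_speed, speed: float, eps: float = 1e-3) -> float:
    """Forward finite-difference estimate of d(reward)/d(speed)."""
    return (reward_of_speed(speed + eps) - reward_of_speed(speed)) / eps


# Stalled on the hill: centred (cte=0), speed=0, efficiency already collapsed to 0.
grad_v4 = d_reward_d_speed(lambda s: reward_v4(0.0, 0.0, s), speed=0.0)
grad_v5 = d_reward_d_speed(lambda s: reward_v5(0.0, s), speed=0.0)

print(f"v4 d(reward)/d(speed) at stall: {grad_v4:.3f}")
print(f"v5 d(reward)/d(speed) at stall: {grad_v5:.3f}")
```

Under these assumptions the v4 slope is zero because the collapsed efficiency factor multiplies it away, while the v5 slope equals the CTE-quality term, so a small gain in speed is still rewarded.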
### Exp 6 — Mountain track, v5 reward, throttle_min=0.5, train_multitrack (1 segment)
- **Reward:** v5
- **throttle_min:** 0.5 (first segment only — close_and_switch used 0.2 for subsequent segments)
- **Method:** train_multitrack() with steps_per_switch=90000 (one giant segment = one checkpoint)
- **Key observation:** Circle exploit dominated — only 0.5-1.75 second laps throughout
- **Result:** ❌ Only 1 checkpoint saved (at step 90k). Best reward=4.99
- **Root cause:** Using steps_per_switch=TOTAL_STEPS defeated checkpointing (one segment = one save). Circle exploit reappeared (different from Exp5 — random seed variation)
- **Learned:** steps_per_switch=TOTAL_STEPS is WRONG for single-track training with checkpointing

### Exp 7 — Mountain track, v5 reward + episode termination on short lap, throttle_min mixed
- **Reward:** v5 + short-lap now TERMINATES episode (not just penalty)
- **throttle_min:** 0.5 initial, 0.2 after segment 1 (bug: close_and_switch used module default)
- **Method:** train_multitrack() with steps_per_switch=6000 (15 segments)
- **Key observation:** Car in LEFT lane, sitting doing nothing. Not normal spawn position.
- **Hypothesis:** A phantom car left over from Exp 6 was still in the sim. Two TCP connections spawned two cars. The user watched the phantom (left lane, no commands) while training drove a different car.
- **Result:** ❌ Killed — phantom car issue
- **Learned:**
  1. close_and_switch() between segments creates phantom car risk for single-track training
  2. throttle_min MUST be passed consistently — module default is 0.2, not 0.5
  3. For single-track training: do NOT use close_and_switch() at all

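The short-lap rule introduced in Exp 7 (terminate instead of only penalising) can be sketched as a small pure function. This is a hypothetical illustration, not the project's wrapper: the 10-second cutoff and the -1.0 penalty are assumed values, chosen only because the exploit laps logged here were 0.5-1.75 seconds and genuine laps were 19-22 seconds.

```python
from typing import Optional, Tuple

MIN_GENUINE_LAP_S = 10.0  # assumed cutoff between exploit laps and genuine laps
SHORT_LAP_PENALTY = -1.0  # assumed penalty value


def apply_short_lap_rule(reward: float, lap_time_s: Optional[float]) -> Tuple[float, bool]:
    """Return (reward, terminated) for the current step.

    lap_time_s is None on steps where no lap was completed. A completed lap
    shorter than the cutoff replaces the step reward with the penalty AND ends
    the episode, so the policy cannot keep collecting reward between circles.
    """
    if lap_time_s is not None and lap_time_s < MIN_GENUINE_LAP_S:
        return SHORT_LAP_PENALTY, True
    return reward, False


print(apply_short_lap_rule(0.8, 1.2))   # circle-exploit lap
print(apply_short_lap_rule(0.8, 20.5))  # genuine lap
```

Ending the episode matters more than the size of the penalty: with a penalty alone, the reward earned between short laps can still outweigh it.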
### Exp 8 — Mountain track, v5 reward + episode termination, throttle_min=0.5 consistently (RUNNING NOW)
- **Reward:** v5 + short-lap terminates episode
- **throttle_min:** 0.5 throughout (no close_and_switch = no module default override)
- **Method:** Direct model.learn() in loop — ONE connection throughout entire run
- **Checkpoints:** 15 numbered saves (every 6,000 steps) + best_model.zip
- **PID:** 2941877, log: /tmp/exp8.log
- **Status:** Running since 11:17, ~1h45m total
- **Watch:** `tail -f /tmp/exp8.log`
- **Success criteria:** Genuine 19-22 second laps appearing during training AND best_model.zip drives cleanly in deterministic eval

---

## Wave 4 Multi-Track Experiments (generated_track + mountain_track)

### Trial 9 ⭐ BEST OVERALL MODEL
- **Model:** `models/wave4-trial-0009/model.zip`
- **Tracks:** generated_track + mountain_track (round-robin, switch every 6,851 steps)
- **Steps:** 89,893 total (~45k per track)
- **Hyperparams:** lr=0.000725, switch=6,851
- **Reward:** v4 (old — before exploit patches)
- **Result:**
  - ✅ Drives generated_track (3/3 episodes, 13-16 second genuine laps)
  - ✅ Drives mini_monaco zero-shot (2000 steps, 40-second genuine laps — never seen in training)
  - ❌ Crashes on mountain_track (~200 steps — hill + corner)
  - ❌ Crashes on generated_road (~46 steps — turns right immediately)
- **Notes:** Only 1 of 25 Wave 4 trials succeeded. Suspected random seed luck. Same hyperparameters repeated in Exp2 (overnight) produced useless model.

### Wave 4 Other Trials (1-25 except Trial 9)
- **Result:** All crashed on mini_monaco within 20-265 steps
- **Median mini_monaco score:** ~112 (crashes at ~130 steps)
- **Trials 14, 25:** Scored 1573, 1543 — suspected shuttle exploit (car going back and forth on straight)
- **Learned:** Multi-track training is highly sensitive to random seed. GP+UCB did not converge reliably.

---

## Key Decisions Made (What We Keep)

| Decision | Reason |
|---|---|
| v5 reward: `speed × CTE-quality` | Directly incentivises throttle on hills. v4 gave zero gradient on inclines. |
| throttle_min=0.5 for mountain_track | Overcomes hill. Car can now reach first corner. |
| Short-lap penalty + EPISODE TERMINATION | Penalty alone insufficient — model stayed alive and accumulated rewards between laps. Termination makes circling strictly unprofitable. |
| Numbered checkpoints every segment | Never lose a good mid-training model again (ADR-017) |
| best_model.zip updated on new best segment score | Final model ≠ best model. Peak can be at 30k even if final is at 90k. |
| Single TCP connection for single-track training | Avoids phantom car problem from close_and_switch() |
| lr=0.000725 | From Trial 9 (best model). Consistent with good results. |

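The checkpointing decisions in the table (numbered saves per segment, `best_model.zip` updated on a new best segment score, one persistent connection) could be combined roughly as follows. This is a minimal sketch, not the project's training script: `train_with_checkpoints` and `evaluate` are hypothetical names, and `model` is assumed to be a Stable-Baselines3-style object whose `learn(total_timesteps=..., reset_num_timesteps=False)` continues the step counter and whose `save(path)` appends `.zip`.

```python
import os
from typing import Callable, Tuple


def train_with_checkpoints(
    model,                              # assumed SB3-style model (e.g. PPO) bound to ONE env
    evaluate: Callable[[object], float],  # hypothetical: returns a segment eval score
    total_steps: int,
    segment_steps: int,
    save_dir: str,
) -> Tuple[int, float]:
    """Train in fixed segments, saving a numbered checkpoint after each one
    and overwriting best_model whenever the segment eval score improves.

    Returns (step_of_best_checkpoint, best_score).
    """
    os.makedirs(save_dir, exist_ok=True)
    best_score, best_step = float("-inf"), 0

    for step in range(segment_steps, total_steps + 1, segment_steps):
        # reset_num_timesteps=False keeps the timestep counter running across
        # calls, so the run behaves like one continuous training session.
        model.learn(total_timesteps=segment_steps, reset_num_timesteps=False)
        model.save(os.path.join(save_dir, f"checkpoint_{step:07d}"))

        score = evaluate(model)
        if score > best_score:
            best_score, best_step = score, step
            model.save(os.path.join(save_dir, "best_model"))

    return best_step, best_score
```

With `total_steps=90_000` and `segment_steps=6_000` this yields 15 numbered checkpoints, matching the layout used from Exp 8 onward. The env is never closed or swapped inside the loop, which is the point of the single-connection decision.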
## Key Problems Still Open

| Problem | Status |
|---|---|
| Mountain track circle exploit | Partially fixed — episode termination added. Exp8 will show if it holds. |
| Mountain track — car can't navigate first corner reliably | Still being investigated. Exp5 showed genuine laps so it IS solvable. |
| Multi-track generalization is random-seed dependent | No reliable solution yet. Trial 9 was lucky. |
| Mountain track model doesn't generalise to other tracks | Expected — single track training generalises poorly. Next step after Exp8 succeeds. |

---

## Next Steps (Proposed, Not Yet Run)

1. **Exp 8 result:** If best_model.zip drives mountain_track reliably → proceed to Step 2
2. **Combine mountain_track + generated_track** using v5 reward, throttle_min=0.5, proper checkpointing
3. **Test combined model** on all 4 tracks — can it generalise to mini_monaco like Trial 9 did?
4. **If yes:** We have reproduced Trial 9 reliably with a better reward function


### Exp 8 — Mountain track, v5 reward, throttle_min=0.5, CORRECT checkpointing ✅ COMPLETED
- **Reward:** v5 (speed × CTE-quality)
- **throttle_min:** 0.5
- **Method:** Direct model.learn() loop, single TCP connection, NO close_and_switch
- **Steps:** 90,000 total | 6,000 per segment | 15 checkpoints
- **Circle exploit fix:** Short-lap terminates episode immediately
- **Peak segment:** Seg 10 (step 60,000) — 567 reward / 2000 steps (FULL EVAL on mountain_track!)
- **Policy diverged:** Seg 11-15 (31, 20 reward) — best_model.zip captured the peak correctly
- **Checkpoints saved:** checkpoint_0006000.zip through checkpoint_0090000.zip + best_model.zip
- **Final eval results using best_model.zip (step 60k weights):**

| Track | Ep1 | Ep2 | Ep3 | Mean steps | Result |
|---|---|---|---|---|---|
| mountain_track (training) | 382 | 529 | 182 | 364 | ❌ crashes |
| generated_track (zero-shot) | 63 | 61 | 61 | 62 | ❌ crashes |
| mini_monaco (zero-shot) | 154 | 155 | 104 | 138 | ❌ crashes at one corner |
| generated_road (zero-shot) | 41 | 42 | 41 | 41 | ❌ crashes |

- **Throttle test:** mini_monaco at throttle_min=0.5 over 5 episodes: 93/94/79/95/94 steps (mean=91, very consistent = same corner every time). throttle_min=0.2 test impossible — action space baked in at training time.
- **Key findings:**
  1. ✅ Circle exploit fully eliminated — no short laps observed
  2. ✅ Best model saving worked — captured step 60k peak, not step 90k drift
  3. ✅ Genuine 20-22 second laps during training from step ~18k onward
  4. ❌ Model crashes at exactly the same corner on mini_monaco every time (too fast)
  5. ❌ throttle_min=0.5 baked into action space — model cannot output throttle < 0.5, cannot slow for corners
  6. 🔑 INSIGHT: v4 + 0.2 failed because v4 gradient = 0 on hill. v5 gradient is non-zero — model CAN learn to apply high throttle when needed even with 0.2 floor

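Finding 5 (the throttle floor limits what the policy can ever command) can be illustrated with a tiny stand-in for the throttle clamp. The real `ThrottleClamp` wrapper is not shown in this log, so whether it clips or rescales is unknown; the sketch below assumes simple clipping into `[throttle_min, 1.0]`, and `clamp_throttle` is a hypothetical name.

```python
from typing import Tuple


def clamp_throttle(steering: float, throttle: float, throttle_min: float) -> Tuple[float, float]:
    """Clip the requested throttle into [throttle_min, 1.0]; steering passes through.

    Assumed behaviour (clipping), used only to show the effect of the floor.
    """
    return steering, min(1.0, max(throttle_min, throttle))


# The policy asks for gentle throttle (0.3) going into a corner.
print(clamp_throttle(0.4, 0.3, throttle_min=0.5))  # floor 0.5: forced up to 0.5
print(clamp_throttle(0.4, 0.3, throttle_min=0.2))  # floor 0.2: request honoured
```

With a 0.5 floor the car cannot slow below half throttle for a corner, which is consistent with the repeatable mini_monaco crash; a 0.2 floor leaves the range [0.2, 1.0] available.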
### Exp 9 — Mountain track, v5 reward, throttle_min=0.2 (RUNNING)
- **Change from Exp8:** throttle_min: 0.5 → **0.2** (only change)
- **Reward:** v5 (speed × CTE-quality) — UNCHANGED
- **Hypothesis:** v5 reward provides non-zero gradient signal on hill (∂reward/∂speed is non-zero).
  Model CAN learn to output high throttle on hill. With 0.2 floor, model has full range [0.2, 1.0]
  and can apply lower throttle on corners — potentially solving mini_monaco corner crash.
- **What we never tested:** (0.2, v4) failed. (0.5, v5) worked. (0.2, v5) was never tried.
- **Risk:** Model may still stall on hill if gradient convergence is slow in early training.
  StuckTermination (-1.0) + v5 speed gradient together should push toward higher throttle.
- **Next test (Exp10):** Add track_progress bonus to reward (v6) — one variable at a time.
- **Save dir:** models/exp9-mountain-v5-throttle02/
- **Watch:** tail -f /tmp/exp9.log


### Exp 9 — Evaluation Results (3-set test, 1 run per track per set)

**Model tested:** `models/exp9-mountain-v5-throttle02/best_model.zip`
**Date:** 2026-04-18
**Test setup:** 3 independent sets, lighting randomises each run (no fixed seed)

| Track | Set 1 | Set 2 | Set 3 | Mean | Pattern |
|---|---|---|---|---|---|
| mountain_track (trained) | ✅ 2000 | ✅ 2000 | ✅ 2000 | **2000** | Rock solid |
| generated_track (zero-shot) | ❌ 79 | ❌ 61 | ❌ 82 | **74** | Always fails — can't make first corner |
| generated_road (zero-shot) | ❌ 651 | ✅ 2000 | ❌ 1203 | **1285** | Highly variable — lighting dependent |
| mini_monaco (zero-shot) | ❌ 32 | ❌ 60 | ❌ 34 | **42** | Always fails — veers right immediately |

**User observations:**
- mountain_track: 80-90% of time on or near centre yellow line. Solid driving.
- generated_road: Driving looks good when it works, but goes off course. Lighting variation causes inconsistency.
- generated_track: Cannot make first corner at all. Model sees nothing it recognises.
- mini_monaco: Veers right immediately at start before any visible driving. Crashes before reaching the road.

**Key finding — Lighting effect confirmed:**
Generated_road varies 651→2000→1203 with identical model and track. ONLY lighting changes.
Mountain_track is immune because it trained under many random lighting conditions.
Generated_track and mini_monaco fail regardless of lighting — visual domain too different.

**What this tells us about next steps:**
Train on mountain_track + generated_track together (v5 reward, throttle_min=0.2).
Both tracks have random lighting each episode → model forced to learn lighting-invariant features.
Goal: model that is reliable on both training tracks, then test generalisation to generated_road and mini_monaco.


### Exp 10 — Two tracks: generated_track + mountain_track, v5 reward, throttle_min=0.2
- **Change from Exp9:** Added generated_track as second training track
- **Reward:** v5 (speed × CTE) — unchanged
- **throttle_min:** 0.2 — unchanged from Exp9
- **Training tracks:** generated_track + mountain_track (round-robin, switch every 6,000 steps)
- **Total steps:** 90,000 | Steps per switch: 6,000 | ~7.5 rotations through both tracks
- **lr:** 0.000725 — unchanged
- **Hypothesis:** Adding generated_track visual diversity forces model to learn
  lighting-invariant road-following features. Mountain_track teaches hill throttle.
  Together should generalise better to generated_road and potentially mini_monaco.
- **Expected results:** mountain_track reliable, generated_track reliable,
  generated_road improved, mini_monaco TBD
- **This is essentially Trial 9 repeated with:** v5 reward + throttle_min=0.2 + proper checkpointing + exploit fix

### Exp 10 — Evaluation Results (3-set test, 2026-04-19)

**Model tested:** `models/exp10-two-tracks/best_model.zip`
**Result: TOTAL FAILURE — crashes on every track, every set.**

| Track | Set 1 | Set 2 | Set 3 | Mean | Verdict |
|---|---|---|---|---|---|
| mountain_track (trained) | 178 | 179 | 179 | **179** | ❌ Crashes at same spot every time |
| generated_track (trained) | 99 | 82 | 88 | **90** | ❌ Crashes almost immediately |
| generated_road (zero-shot) | 135 | 223 | 105 | **154** | ❌ Crashes early |
| mini_monaco (zero-shot) | 111 | 133 | 129 | **124** | ❌ Crashes early |

**Comparison to previous best models:**
- Exp 9 (mountain only): mountain_track was 2000/2000 every time → now 179. **91% regression.**
- Wave 4 Trial 9 (generated_track + mountain_track via autoresearch): generated_track 2000/2000, mini_monaco 2000/2000 → now 90 and 124.

**Analysis:**
- The round-robin track switching every 6,000 steps via `multitrack_runner.train_multitrack()`
  produced a model that learned NEITHER track. This is catastrophic interference.
- Wave 4 Trial 9 used the same two tracks but via the autoresearch controller with slightly
  different hyperparameters (switch=6,851 instead of 6,000; lr=0.000725 and ~90k steps are
  the same). The key difference is likely in HOW the environment switching works:
  `multitrack_runner` closes and reopens envs, potentially disrupting PPO's rollout buffer
  and value function estimates.
- Mountain_track crashes at exactly step 178-179 in all 3 sets, which suggests the model has
  learned a fixed degenerate policy (always turn one direction) rather than responding to vision.

**Key question:** Why did Wave 4 Trial 9 succeed with similar parameters but Exp 10 failed?
Possible causes: (1) env close/reopen resets PPO internal state, (2) `best_model` selection
criteria differs, (3) multitrack_runner wrapping chain differs from autoresearch controller.

**Full log:** `agent/test-results/2026-04-19_10-15_exp10-two-tracks.log`

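The per-track means and the 91% mountain_track regression quoted in the Exp 10 evaluation can be reproduced from the raw set scores. The snippet uses only numbers from the Exp 9 and Exp 10 result tables; the variable names are illustrative.

```python
from statistics import mean

# Steps survived per evaluation set in Exp 10 (cap: 2000 steps).
exp10_steps = {
    "mountain_track": [178, 179, 179],
    "generated_track": [99, 82, 88],
    "generated_road": [135, 223, 105],
    "mini_monaco": [111, 133, 129],
}

# Rounded mean steps per track, as reported in the table.
means = {track: round(mean(steps)) for track, steps in exp10_steps.items()}

# Exp 9 reached the 2000-step cap on mountain_track in all three sets.
exp9_mountain_mean = 2000
regression_pct = 100 * (exp9_mountain_mean - means["mountain_track"]) / exp9_mountain_mean

for track, value in means.items():
    print(f"{track}: mean {value} steps")
print(f"mountain_track regression vs Exp 9: {regression_pct:.0f}%")
```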
### Exp 9 vs Exp 10 — Root Cause Analysis

| Aspect | Exp 9 (worked ✅) | Exp 10 (failed ❌) |
|---|---|---|
| **Tracks** | mountain_track **only** | generated_track + mountain_track (round-robin) |
| **Env setup** | `VecTransposeImage(DummyVecEnv([make_env]))` — created ONCE, never closed | `wrap_env(raw)` passed to PPO, which auto-wraps; **closed and reopened** every 6k steps |
| **Track switching** | None — single env for entire 90k steps | `close_and_switch()` — close env, exit_scene, sleep, gym.make new track |
| **PPO continuity** | Repeated `model.learn()` calls with `reset_num_timesteps=False`, same env | `model.learn()` + `model.set_env(new_env)` after each switch |
| **Eval between segments** | Direct `env.reset()` + predict loop on same env | Same, but env may be a different track than what was just trained |
| **Best model selection** | Based on eval reward on mountain_track | Based on segment reward — could be from either track |

**Conclusion:** Exp 9 kept a single persistent env connection for all 90k steps.
Exp 10 closed and reopened the env every 6k steps with `model.set_env()`.
This likely disrupts PPO's rollout buffer, value estimates, and observation normalization.
Exp 9 was a completely different (simpler) script with no track switching at all.

### Exp 10 vs Wave 4 Trial 9 — Why Did Trial 9 Work?

Wave 4 Trial 9 used nearly identical hyperparameters to Exp 10 and the SAME
`multitrack_runner.py` code, yet Trial 9 scored 1435 on mini_monaco (zero-shot)
while Exp 10 crashes on every track, averaging under 180 steps per track.

**Wave 4 Trial 9 parameters:**
- lr=0.000725, steps_per_switch=6,851, total_timesteps=89,893
- Trained on generated_track + mountain_track (same as Exp 10)
- Used `multitrack_runner.py` via CLI subprocess (same close_and_switch logic)

**Exp 10 parameters:**
- lr=0.000725, steps_per_switch=6,000, total_timesteps=90,000
- Nearly identical to Trial 9

**But Wave 4 was mostly failures too:**

| Metric | Value |
|---|---|
| Total Wave 4 trials | 25 |
| Scores > 500 | 4 / 25 (16%) |
| Scores > 200 | 5 / 25 (20%) |
| Median score | 111.3 |
| Mean score | 343.8 |
| Std deviation | 566.2 |

The top 4 scores (1943, 1573, 1543, 1435) are massive outliers — 80% of trials
scored below 200. Trial 0 (score 1943) was later found to be pre-exploit-patch
and Trial 14 (1573) and Trial 25 (1543) showed inconsistent driving when
re-tested (see STATE.md).

**The real conclusion:** Trial 9's success was likely due to **lucky random
initialization of CNN weights**. With 80% of trials failing under the same
training methodology, the multitrack round-robin approach via close_and_switch
is fundamentally unreliable. The few successes are random seed lottery winners,
not evidence that the method works.

**Wave 5 reproduction attempt:** We tried training on generated_track only
(single track, no switching, same lr=0.000725, 90k steps) to test whether
the track-switching was the problem. Result stored in `models/wave5-gentrack-only/`.
(Results were poor — could not reproduce Trial 9's quality.)

**Open question:** Is there a reliable way to do multi-track training, or
should we focus on single-track training with domain randomization (lighting,
camera angle) to achieve generalization instead?

### Exp 11 — Parallel DummyVecEnv, v5 reward (ABORTED)
- **Date:** 2026-04-19
- **Change from Exp10:** Two sim instances (port 9091 + 9093), DummyVecEnv wraps both.
  PPO sees both tracks in every rollout batch. No close_and_switch.
- **Tracks:** generated_track (9091) + mountain_track (9093)
- **Reward:** v5 (speed × CTE) — same as Exp 9/10
- **Result:** ABORTED at 66k/90k steps. Circular driving observed on generated_track.
  v5 reward has no efficiency term → circles at CTE≈0 earn positive reward.
- **Positive:** Parallel env infrastructure works! Both sims connected, PPO trained
  stably with no env switching issues. Consistent improvement 14.7→67.8 combined.
- **Negative:** Circular driving exploit returned because v5 dropped efficiency.

### Exp 11b — Parallel DummyVecEnv, v6 reward (anti-circle gate)
- **Date:** 2026-04-19
- **Change from Exp11:** Reward v6 (speed × CTE + efficiency gate ≥ 0.15).
  Also stuck_steps 80→40 (faster stuck termination).
- **Tracks:** generated_track (9091) + mountain_track (9093)
- **Total steps:** 90,000 | lr=0.000725 | throttle_min=0.2

**Training progress (eval at each 6k checkpoint; `s` = steps survived, `r` = combined reward):**

| Steps | gen_track | mountain | Combined | Note |
|---|---|---|---|---|
| 6k | 91s | 130s | 10.7r | Early |
| 18k | 100s | 100s | 15.9r | Improving |
| 36k | 161s | 160s | 26.2r | ⭐ |
| 42k | 160s | 159s | 28.9r | ⭐ |
| 60k | 164s | 163s | — | Plateau |
| 78k | 169s | 168s | 29.2r | ⭐ |
| 90k | 173s | 172s | — | End |

**Evaluation results (best_model, 3 sets per track):**

| Track | Set 1 | Set 2 | Set 3 | Mean | Verdict |
|---|---|---|---|---|---|
| mountain_track (trained) | 195 | 196 | 192 | **194** | ❌ |
| generated_track (trained) | 192 | 194 | 192 | **193** | ❌ |
| generated_road (zero-shot) | 192 | 196 | 194 | **194** | ❌ |
| mini_monaco (zero-shot) | 194 | 192 | 196 | **194** | ❌ |

**Analysis:**
- ✅ No circular driving (efficiency gate works)
- ✅ Remarkably consistent: all tracks ~194 steps, very low variance
- ✅ Parallel env infrastructure is stable and reliable
- ❌ Model plateaus at ~170-195 steps and never improves past that
- ❌ Much worse than Exp 9 (mountain only: 2000/2000) or Wave 4 Trial 9 (2000/2000)
- The consistency across all 4 tracks (including zero-shot) suggests the model
  learned a generic short-drive policy, not track-specific features
- Possible cause: 90k steps may be insufficient for 2-env parallel training
  (effective steps per track = 45k each), or the efficiency gate may be
  suppressing early exploration

**Key findings:**
1. Parallel DummyVecEnv works mechanically — this is the right infrastructure
2. v6 reward prevents circular driving
3. But 90k steps with 2 parallel envs may not be enough training budget
4. Compare: Exp 9 (single track, 90k steps, v5) → 2000 steps. Exp 11b
   (2 tracks, 90k steps, v6) → 194 steps. The training budget per track
   is halved AND the reward is harder to exploit.

**Next experiments to consider:**
- Increase total_timesteps to 180k-250k (restore per-track budget)
- Try v6 reward on single track first to isolate reward vs multi-track effects
- Try v5 reward with parallel envs but longer training (accept some circling)
- Check if efficiency gate triggers too aggressively during normal cornering

---

## Exp 14b — Mountain finetune from exp14 champion (2026-04-19)

- **Script:** `agent/experiments/exp14_finetune_v5.py`
- **Warm start:** `agent/models/exp14-mountain-v5/best_model.zip`
- **Schedule:**
  - phase 1: runtime throttle floor `0.4`
  - phase 2: runtime throttle floor `0.2`
- **Goal:** improve hill climbing, robustness, and lap time on `mountain_track`

### Important outcome
The finetune run **did not improve monotonically**. It briefly improved, then later degraded badly.
This means the final/latest checkpoint is **not** the model we want to keep.

### Candidate checkpoint comparison
We ran a fresh deterministic comparison on mountain only:
- **9 episodes per model**
- **2000 step cap**
- Results saved to:
  - `agent/outerloop-results/mountain_candidate_eval_2026-04-19.jsonl`
  - `agent/outerloop-results/mountain_candidate_eval_2026-04-19.md`

| Model | Floor | Success eps | Full 2k eps | Avg laps/ep | Total laps | Mean lap | Best lap | Avg steps | Verdict |
|---|---:|---:|---:|---:|---:|---:|---:|---:|---|
| exp14_base | 0.2 | 7/9 | 3/9 | 1.78 | 16 | 29.24s | 27.02s | 1332 | Original champion |
| ft_006k | 0.4 | 1/9 | 0/9 | 0.11 | 1 | 21.36s | 21.36s | 335 | Very fast but unusably fragile |
| ft_024k | 0.4 | 4/9 | 0/9 | 0.56 | 5 | 21.58s | 20.53s | 575 | Fast but fragile |
| ft_030k | 0.4 | 1/9 | 0/9 | 0.22 | 2 | 21.53s | 20.72s | 317 | Very fast but unusably fragile |
| **ft_036k** | **0.2** | **9/9** | **6/9** | **2.78** | **25** | **27.93s** | **26.16s** | **1841** | **Best overall balance** |
| ft_042k | 0.2 | 8/9 | 4/9 | 1.89 | 17 | 29.25s | 27.09s | 1404 | Decent, but worse than 36k |
| ft_048k | 0.2 | 6/9 | 3/9 | 1.44 | 13 | 31.15s | 28.31s | 1127 | Degraded |

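One way to make the "best overall balance" verdict mechanical is to rank the candidates by robustness before speed. The ordering below (successful episodes, then full 2k-step episodes, then average steps) is an assumed criterion for illustration, not necessarily the rule used for the comparison; the numbers are copied from the table.

```python
from typing import NamedTuple, Tuple


class Candidate(NamedTuple):
    name: str
    success_eps: int  # episodes (out of 9) with at least one completed lap
    full_eps: int     # episodes (out of 9) that reached the 2000-step cap
    avg_steps: int


candidates = [
    Candidate("exp14_base", 7, 3, 1332),
    Candidate("ft_006k", 1, 0, 335),
    Candidate("ft_024k", 4, 0, 575),
    Candidate("ft_030k", 1, 0, 317),
    Candidate("ft_036k", 9, 6, 1841),
    Candidate("ft_042k", 8, 4, 1404),
    Candidate("ft_048k", 6, 3, 1127),
]


def robustness_key(c: Candidate) -> Tuple[int, int, int]:
    """Assumed ranking: reliability first, raw survival length as tie-breaker."""
    return (c.success_eps, c.full_eps, c.avg_steps)


ranked = sorted(candidates, key=robustness_key, reverse=True)
best = ranked[0]
print("best:", best.name)
print("top three:", [c.name for c in ranked[:3]])
```

Lap time is deliberately left out of the key: the 0.4-floor checkpoints have the fastest laps in the table but the lowest success counts.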
### Best model captured
Best overall checkpoint from the finetune:
- `agent/models/exp14-mountain-v5-finetune/checkpoint_0036000.zip`

Promoted copy saved as:
- `agent/models/exp14-mountain-v5-finetune/best_robust_model_0036000.zip`

### Key learning
- Early `0.4`-floor checkpoints can produce very fast laps, but are too fragile to trust.
- The best mountain finetune model is the **36k checkpoint after switching back to 0.2 floor**, not the later checkpoints.
- Later finetune checkpoints collapsed badly, matching the user's visual observation of wheelspin / poor driving.

---

## Exp 15 — Generated track warm-start from mountain champion (2026-04-19)

- **Script:** `agent/experiments/exp15_gentrack_from_mountain.py`
- **Warm start:** `agent/models/exp14-mountain-v5-finetune/best_robust_model_0036000.zip`
- **Target track:** `generated_track`
- **Target setup:** Exp 13-style v4 generated-track training
- **Result:** ❌ Failed

**Observed behavior:**
- Model tried exploit-like behavior near the start / first corner
- Did not learn clean generated-track driving
- By ~25k steps, it was clearly far behind the known-good scratch run

**Log evidence:**
- `[20,000] reward=45.0 steps=47 laps=0`
- `[25,000] reward=23.4 steps=30 laps=0`
- Short exploit laps appeared in the log (`6.5s`, `4.91s`)

**Conclusion:**
- Mountain → generated warm-start transfer is poor in this direct setup
- The mountain policy prior seems to bias the agent toward bad local behavior instead of helping generated-track learning

---

## Exp 16 — Mountain track warm-start from generated champion (2026-04-19)

- **Script:** `agent/experiments/exp16_mountain_from_gentrack.py`
- **Warm start:** `agent/models/exp13-gentrack-v4/best_model.zip`
- **Target track:** `mountain_track`
- **Target setup:** Exp 14-style v5 mountain training
- **Result:** ❌ Failed

**Observed behavior:**
- No meaningful mountain learning
- Repeated short crash pattern
- Never developed lap-completing mountain behavior

**Log evidence:**
- `[210,000] reward=10.2 steps=195 laps=0`
- `[215,000] reward=10.1 steps=193 laps=0`

**Conclusion:**
- Generated → mountain warm-start transfer is also poor in this direct setup
- The generated-track champion does not bootstrap mountain hill learning effectively here

---

## Transfer-learning takeaway (current evidence)

Direct cross-track warm starts failed in **both** directions:
- mountain → generated: failed / exploit-prone
- generated → mountain: failed / short-crash plateau

Current interpretation:
- the single-track policies are too specialized for naive direct transfer, and/or
- the mountain sim physics differences are large enough to break transfer

For now:
- keep the single-track champions as separate specialists
- do **not** assume direct cross-track warm starts are beneficial

---

## Mountain Track Friction Fix (2026-04-27)

### Root cause

`WheelPhys.cs` scales wheel grip by the static friction of whatever surface the
wheel is touching: `fFriction.stiffness = hit.collider.material.staticFriction * originalForwardStiffness`.

`mountain_track.unity` assigned the Slippery physics material (staticFriction=0.1)
to 4 track surface colliders from the `long_road` prefab. This gave the car 1/5
the normal grip on the hill, causing visible wheelspin even at full throttle.

The Slippery material is intentional on genuinely icy surfaces (thunderhill) but
was incorrect on mountain_track's asphalt hill.

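The "1/5 the normal grip" figure follows directly from the stiffness formula quoted from `WheelPhys.cs`. The Python sketch below mirrors that one line; `original_forward_stiffness=1.0` is a placeholder assumption, since the actual value is not given here and it cancels out of the ratio anyway.

```python
def forward_stiffness(static_friction: float, original_forward_stiffness: float = 1.0) -> float:
    """Mirror of the quoted formula: stiffness = staticFriction * originalForwardStiffness.

    original_forward_stiffness defaults to a placeholder value of 1.0.
    """
    return static_friction * original_forward_stiffness


slippery = forward_stiffness(0.1)  # Slippery material (what mountain_track had)
road = forward_stiffness(0.5)      # Road material (the intended surface)
grip_ratio = slippery / road

print(f"grip on Slippery relative to Road: {grip_ratio:.2f}")
```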
### Fix applied

Replaced all 4 Slippery material assignments with Road material (staticFriction=0.5)
in `sdsim/Assets/Scenes/mountain_track.unity`.

| Material | staticFriction | GUID |
|---|---|---|
| Slippery (removed) | 0.1 | c0e12c099c364af4e9e311a43d0f12c4 |
| Road (applied) | 0.5 | 7884193b0ead347a38a13a67f294dfb5 |

### To activate

The training setup uses the pre-built Windows executable (`DonkeySimWin/donkey_sim.exe`),
not a locally-compiled build. The scene file edit in sdsandbox/ has no effect on the
running binary — it only matters if the sim is ever rebuilt from source in Unity Editor.

**This fix is deferred.** Proceed with Exp 17 using the existing executable.
If mountain hill training in Exp 17 specifically struggles (short episodes that plateau
and never improve), that is the signal to pursue a Unity Editor rebuild.

The scene file change is committed in sdsandbox/ and will apply automatically if the
sim is rebuilt for any other reason. No Python code changes needed.

### Expected effect

- Hill wheelspin should stop or greatly reduce
- Throttle_min=0.2 + v5 reward should be even more effective on the hill
- All future mountain experiments benefit; no code changes needed

---

## Strategy Review and Exp 17 Plan (2026-04-27)

### Where the project stands

After 16 experiments and 4 autoresearch phases, the core problem is clear:
multi-track training is needed for generalisation, but the training method has
been unreliable. Here is the summary of what each approach found:

| Approach | Outcome |
|---|---|
| Round-robin close-and-switch (Wave 4, Exp 10) | 80% failure. PPO rollout buffer disrupted on env swap. Lucky seed (Trial 9) worked once but cannot be reproduced. |
| Parallel DummyVecEnv 90k steps (Exp 11b) | Infrastructure valid, no catastrophic forgetting, but 90k steps / 2 tracks = ~45k effective per track. Not enough. |
| Cross-track warm starts (Exp 15, 16) | Both directions failed. Single-track specialists do not transfer cleanly. |
| Single-track PPO (Exp 9, 13, 14) | Reliable but no generalisation. |

The conclusion: **parallel DummyVecEnv is the right architecture; the only known
failure mode is training budget**. Exp 11b was mechanically sound but starved of steps.

### Exp 17 — Parallel DummyVecEnv, 400k–500k steps

**This is the primary next experiment.**

| Parameter | Value | Reason |
|---|---|---|
| Architecture | DummyVecEnv([generated_track:9091, mountain_track:9093]) | Validated in Exp 11b; no PPO disruption |
| Total timesteps | 400,000–500,000 | ~200k–250k effective per track; Exp 11b suggests 90k is insufficient |
| Reward | v6 on both envs (efficiency gate + CTE patience terminator) | Blocks circular exploit on generated_track; gate threshold may be tuned |
| throttle_min | 0.2 both envs (or 0.5 mountain, 0.2 generated — see ADR-020) | v5/v6 gradient non-zero on hills at 0.2 |
| learning_rate | 0.000725 | From Trial 9 and Exp 9 — consistent with best results |
| Checkpoint | every 20,000 steps + best_model.zip tracked throughout | ADR-017: best model ≠ final model |
| Eval | mini_monaco zero-shot at every checkpoint | Detect the peak before policy drifts |
| Warm start | None — train from random weights | ADR-024: cross-track warm starts failed |

**Setup checklist before running:**
1. Two sim instances running: one on port 9091, one on port 9093
2. Each instance has its configured track loaded (generated_track on 9091, mountain_track on 9093)
3. Optional: rebuild the simulator with the mountain friction fix active (currently deferred; see the friction fix section above)
4. Verify throughput: run 2-minute timing benchmark, set step cap accordingly (ADR-014)

**Success criterion:** mini_monaco zero-shot score > 500 (at least 25% of a full
2000-step episode) reliably across 3 evaluation sets, reproducible across 2+ runs.

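The budget figures in the Exp 17 plan can be sanity-checked with simple arithmetic. The sketch assumes the total timestep count is shared evenly across the parallel envs (one timestep counted per env per vectorised step); the helper names are illustrative.

```python
def per_track_steps(total_timesteps: int, n_envs: int) -> int:
    """Effective training steps each env sees, assuming an even split."""
    return total_timesteps // n_envs


def n_checkpoints(total_timesteps: int, every: int) -> int:
    """Number of numbered checkpoints written at a fixed step interval."""
    return total_timesteps // every


for total in (90_000, 400_000, 500_000):
    print(
        f"total={total:>7,}  per-track={per_track_steps(total, 2):>7,}  "
        f"checkpoints@20k={n_checkpoints(total, 20_000)}"
    )
```

At 90k total (Exp 11b) each track saw about 45k steps, half of what the single-track Exp 9 run used; 400k-500k total gives each track 200k-250k.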
### Fallback: Curriculum training (if Exp 17 plateaus below 200)

If Exp 17 cannot get past ~200 steps on mini_monaco:
- Phase A: generated_track only, 150k steps (establish road-following)
- Phase B: add mountain_track to DummyVecEnv, continue 250k more steps
- Rationale: gives the policy a foundation before the harder mountain physics

### Fallback: v6 efficiency gate tuning (if gate is too aggressive)

Log what fraction of steps are gated (reward zeroed) in the first 100k steps.
If >40%, lower the gate threshold from 0.15 to 0.10 for the first 150k steps,
then raise it back to 0.15. Prevents the gate from suppressing early exploration.
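The gate-tuning fallback can be written down as two small helpers: one to measure the gated fraction from logged per-step efficiency values, and one to pick the threshold for a given training step. This is a sketch under assumptions: the function names are hypothetical, and it assumes the v6 gate zeroes the reward whenever a per-step efficiency value falls below the threshold.

```python
from typing import Sequence

BASE_GATE = 0.15            # normal v6 gate threshold
RELAXED_GATE = 0.10         # temporary threshold if the gate is too aggressive
RELAX_UNTIL_STEP = 150_000  # relaxed threshold applies before this step
MAX_GATED_FRACTION = 0.40   # trigger level for relaxing the gate


def gated_fraction(efficiencies: Sequence[float], threshold: float = BASE_GATE) -> float:
    """Fraction of logged steps whose efficiency is below the gate (reward zeroed)."""
    if not efficiencies:
        return 0.0
    return sum(e < threshold for e in efficiencies) / len(efficiencies)


def gate_threshold(step: int, observed_gated_fraction: float) -> float:
    """Relax the gate early in training only if it was firing on too many steps."""
    if observed_gated_fraction > MAX_GATED_FRACTION and step < RELAX_UNTIL_STEP:
        return RELAXED_GATE
    return BASE_GATE


logged = [0.05, 0.10, 0.20, 0.30, 0.12]  # made-up efficiency samples for illustration
fraction = gated_fraction(logged)
print(f"gated fraction: {fraction:.0%}")
print("threshold at 50k steps:", gate_threshold(50_000, fraction))
print("threshold at 200k steps:", gate_threshold(200_000, fraction))
```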