Test History — DonkeyCar RL Autoresearch

Last updated: 2026-04-19

This document records every significant training experiment, what was changed, what was observed, and what was learned. Use this to make methodical decisions rather than random changes.

Baseline Models (Phase 1 & 2)

Phase 2 Champion

Model: models/champion/model.zip
Track trained on: generated_road only
Steps: 13,328
Hyperparams: lr=0.000225, PPO continuous actions, ThrottleClamp(0.2), v4 reward
Result: ✅ Drives generated_road perfectly, stays in right lane
Zero-shot: ❌ Fails on generated_track (confirmed), ❌ Fails on mini_monaco
Notes: Single track, simple road, model converged cleanly. Final model = best model (no divergence in 13k steps)

Mountain Track Experiments

All experiments: mountain_track only, lr=0.000725, throttle_min varies, 90k steps unless noted (Exp 2 and Exp 4 were continuations toward 200k)

Exp 1 — Mountain track, old v4 reward, throttle_min=0.2

Reward: v4 (CTE × efficiency × speed)
throttle_min: 0.2
Key observation: Car gets partway up the hill, slows, stops, and rolls back. Always crashes at about the same step (~153-166). Logged throttle on the hill was 0.200 (the floor), which is not enough power to climb.
Root cause: v4 reward gives zero gradient signal on hill (efficiency→0, speed→0, reward→0 simultaneously, no direction for "apply more throttle")
Learned: v4 reward is broken for inclined terrain

Exp 2 — Mountain track, old v4 reward, throttle_min=0.2, continued to 200k

Reward: v4
throttle_min: 0.2
Key observation: Only 2 behaviors: turn left and hit barrier, or go straight and hit barrier at turn
Result: ❌ Killed early — no improvement
Learned: More steps alone cannot fix a broken reward signal

Exp 3 — Mountain track, old v4 reward, throttle_min=0.5

Reward: v4
throttle_min: 0.5 (increased to overcome hill)
Key observation: Circle exploit dominated entire run — 0.5-1.75 second laps throughout
Lap times logged: All short (exploit)
Result: ❌ Model useless (reward=4.99 after 90k steps)
Learned: Higher throttle got car over hill but circle exploit took over because v4 has no efficiency penalty when throttle is high

Exp 4 — Continued from Exp 3 (200k total), old v4 reward, throttle_min=0.5

Reward: v4
throttle_min: 0.5
Key observation: Killed early — same 2 behaviors (left into barrier, straight into barrier)
Result: ❌ Killed
Learned: Continuing bad training does not help

Exp 5 — Mountain track, v5 reward, throttle_min=0.5 ⭐ KEY EXPERIMENT

Reward: v5 (speed × CTE-quality) — NEW reward that directly incentivises throttle on hills
throttle_min: 0.5
Method: Direct model.learn() — NO train_multitrack(), ONE connection throughout
Key observation: Genuine 20-22 second laps appearing from step ~30,000 onward
Lap times: 19-22 seconds (genuine), consistently for 60k steps
Result: ❌ Final model poor — best model was at step ~30k but we only saved final (step 90k) model
Root cause of failure: No best-model saving. Policy peaked at 30k, diverged by 90k
Learned:
1. v5 reward WORKS for mountain track
2. throttle_min=0.5 WORKS for hill
3. Direct model.learn() (no track switching) avoids phantom car issues
4. MUST save best model during training, not just final
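
The difference between the two rewards on a stalled car can be checked numerically. The sketch below is an illustration only: it assumes v4 and v5 are plain products of normalised factors (the real implementations may scale or clip terms), and the names `reward_v4`, `reward_v5`, and `d_reward_d_speed` are hypothetical.

```python
def reward_v4(cte_quality: float, efficiency: float, speed: float) -> float:
    """Three-factor product, following the v4 description (CTE x efficiency x speed)."""
    return cte_quality * efficiency * speed


def reward_v5(cte_quality: float, speed: float) -> float:
    """Two-factor product, following the v5 description (speed x CTE-quality)."""
    return speed * cte_quality


def d_reward_d_speed(reward_fn, speed: float, eps: float = 1e-3, **fixed) -> float:
    """Central finite-difference estimate of d(reward)/d(speed)."""
    hi = reward_fn(speed=speed + eps, **fixed)
    lo = reward_fn(speed=speed - eps, **fixed)
    return (hi - lo) / (2 * eps)


# Car stalled on the hill but still centred in its lane (hypothetical values).
stalled = dict(cte_quality=0.9)
grad_v4 = d_reward_d_speed(reward_v4, speed=0.0, efficiency=0.0, **stalled)
grad_v5 = d_reward_d_speed(reward_v5, speed=0.0, **stalled)
print(f"v4 d(reward)/d(speed) = {grad_v4:.3f}")
print(f"v5 d(reward)/d(speed) = {grad_v5:.3f}")
```

Under these assumptions the v4 slope is multiplied by the efficiency factor, so it vanishes exactly when the car stalls, while the v5 slope equals the CTE-quality term and stays positive.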

Exp 6 — Mountain track, v5 reward, throttle_min=0.5, train_multitrack (1 segment)

Reward: v5
throttle_min: 0.5 (first segment only — close_and_switch used 0.2 for subsequent segments)
Method: train_multitrack() with steps_per_switch=90000 (one giant segment = one checkpoint)
Key observation: Circle exploit dominated — only 0.5-1.75 second laps throughout
Result: ❌ Only 1 checkpoint saved (at step 90k). Best reward=4.99
Root cause: Setting steps_per_switch=TOTAL_STEPS defeated checkpointing (one segment = one save). The circle exploit also reappeared, unlike in Exp 5; the likely cause is random seed variation.
Learned: steps_per_switch=TOTAL_STEPS is WRONG for single-track training with checkpointing

Exp 7 — Mountain track, v5 reward + episode termination on short lap, throttle_min mixed

Reward: v5 + short-lap now TERMINATES episode (not just penalty)
throttle_min: 0.5 initial, 0.2 after segment 1 (bug: close_and_switch used module default)
Method: train_multitrack() with steps_per_switch=6000 (15 segments)
Key observation: Car in LEFT lane, sitting doing nothing. Not normal spawn position.
Hypothesis: A phantom car left over from Exp 6 was still in the sim. Two TCP connections spawned two cars: the user watched the phantom (left lane, receiving no commands) while training drove a different car.
Result: ❌ Killed — phantom car issue
Learned:
1. close_and_switch() between segments creates phantom car risk for single-track training
2. throttle_min MUST be passed consistently — module default is 0.2, not 0.5
3. For single-track training: do NOT use close_and_switch() at all

Exp 8 — Mountain track, v5 reward + episode termination, throttle_min=0.5 consistently (RUNNING NOW)

Reward: v5 + short-lap terminates episode
throttle_min: 0.5 throughout (no close_and_switch = no module default override)
Method: Direct model.learn() in loop — ONE connection throughout entire run
Checkpoints: 15 numbered saves (every 6,000 steps) + best_model.zip
PID: 2941877, log: /tmp/exp8.log
Status: Running since 11:17, ~1h45m total
Watch: tail -f /tmp/exp8.log
Success criteria: Genuine 19-22 second laps appearing during training AND best_model.zip drives cleanly in deterministic eval
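
The segment loop described above can be sketched as follows. This is a minimal illustration, not the project's script: `train_with_checkpoints` and the `evaluate` callback are hypothetical names, and `model` is only assumed to expose Stable-Baselines3-style `learn(total_timesteps=..., reset_num_timesteps=False)` and `save(path)` methods.

```python
from pathlib import Path
from typing import Callable, Tuple


def train_with_checkpoints(model, evaluate: Callable[[object], float],
                           save_dir: str, total_steps: int = 90_000,
                           segment_steps: int = 6_000) -> Tuple[int, float]:
    """Train in fixed segments on one persistent env.

    Saves a numbered checkpoint after every segment and overwrites
    best_model.zip whenever the eval score improves.
    Returns (step_of_best, best_score).
    """
    out = Path(save_dir)
    out.mkdir(parents=True, exist_ok=True)
    best_score, best_step = float("-inf"), 0
    for step in range(segment_steps, total_steps + 1, segment_steps):
        # reset_num_timesteps=False continues the same run instead of restarting it.
        model.learn(total_timesteps=segment_steps, reset_num_timesteps=False)
        model.save(str(out / f"checkpoint_{step:07d}.zip"))
        score = evaluate(model)
        if score > best_score:
            best_score, best_step = score, step
            model.save(str(out / "best_model.zip"))
    return best_step, best_score
```

With 90,000 total steps and 6,000 per segment this yields 15 numbered saves, and a peak at any segment survives later drift because best_model.zip is only overwritten on improvement.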

Wave 4 Multi-Track Experiments (generated_track + mountain_track)

Trial 9 ⭐ BEST OVERALL MODEL

Model: models/wave4-trial-0009/model.zip
Tracks: generated_track + mountain_track (round-robin, switch every 6,851 steps)
Steps: 89,893 total (~45k per track)
Hyperparams: lr=0.000725, switch=6,851
Reward: v4 (old — before exploit patches)
Result:
- ✅ Drives generated_track (3/3 episodes, 13-16 second genuine laps)
- ✅ Drives mini_monaco zero-shot (2000 steps, 40-second genuine laps — never seen in training)
- ❌ Crashes on mountain_track (~200 steps — hill + corner)
- ❌ Crashes on generated_road (~46 steps — turns right immediately)
Notes: Only 1 of 25 Wave 4 trials succeeded. Suspected random seed luck. Same hyperparameters repeated in Exp2 (overnight) produced useless model.

Wave 4 Other Trials (1-25 except Trial 9)

Result: All crashed on mini_monaco within 20-265 steps
Median mini_monaco score: ~112 (crashes at ~130 steps)
Trials 14, 25: Scored 1573, 1543 — suspected shuttle exploit (car going back and forth on straight)
Learned: Multi-track training is highly sensitive to random seed. GP+UCB did not converge reliably.

Key Decisions Made (What We Keep)

Decision	Reason
v5 reward: `speed × CTE-quality`	Directly incentivises throttle on hills. v4 gave zero gradient on inclines.
throttle_min=0.5 for mountain_track	Overcomes hill. Car can now reach first corner.
Short-lap penalty + EPISODE TERMINATION	Penalty alone insufficient — model stayed alive and accumulated rewards between laps. Termination makes circling strictly unprofitable.
Numbered checkpoints every segment	Never lose a good mid-training model again (ADR-017)
best_model.zip updated on new best segment score	Final model ≠ best model. Peak can be at 30k even if final is at 90k.
Single TCP connection for single-track training	Avoids phantom car problem from close_and_switch()
lr=0.000725	From Trial 9 (best model). Consistent with good results.
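
To make the penalty-versus-termination decision concrete, here is a small sketch. It assumes the rule is applied as a post-processing step on each env transition; the 10-second cut-off, the penalty value, and the function names are hypothetical and not taken from the project code.

```python
from typing import Optional, Sequence, Tuple

MIN_GENUINE_LAP_S = 10.0  # hypothetical cut-off; the project's real value is not shown
SHORT_LAP_PENALTY = -1.0  # hypothetical penalty


def apply_short_lap_rule(reward: float, done: bool, lap_time_s: Optional[float],
                         terminate: bool = True) -> Tuple[float, bool]:
    """Post-process one step. lap_time_s is None unless a lap just finished."""
    if lap_time_s is not None and lap_time_s < MIN_GENUINE_LAP_S:
        return SHORT_LAP_PENALTY, (True if terminate else done)
    return reward, done


def episode_return(lap_events: Sequence[Optional[float]], step_reward: float = 1.0,
                   terminate: bool = True) -> float:
    """Sum rewards over a sequence of steps, stopping when the rule ends the episode."""
    total = 0.0
    for lap_time_s in lap_events:
        reward, done = apply_short_lap_rule(step_reward, False, lap_time_s, terminate)
        total += reward
        if done:
            break
    return total


# A circling car: three ordinary steps, then a 1.2 s "lap", repeated five times.
circling = [None, None, None, 1.2] * 5
penalty_only = episode_return(circling, terminate=False)
with_termination = episode_return(circling, terminate=True)
print(penalty_only, with_termination)
```

In this toy episode the penalty-only variant keeps collecting reward between short laps, whereas termination ends the episode at the first short lap, so circling can no longer accumulate return.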

Key Problems Still Open

Problem	Status
Mountain track circle exploit	Partially fixed — episode termination added. Exp8 will show if it holds.
Mountain track — car can't navigate first corner reliably	Still being investigated. Exp5 showed genuine laps so it IS solvable.
Multi-track generalization is random-seed dependent	No reliable solution yet. Trial 9 was lucky.
Mountain track model doesn't generalise to other tracks	Expected — single track training generalises poorly. Next step after Exp8 succeeds.

Next Steps (Proposed, Not Yet Run)

1. Exp 8 result: If best_model.zip drives mountain_track reliably → proceed to step 2
2. Combine mountain_track + generated_track using v5 reward, throttle_min=0.5, proper checkpointing
3. Test combined model on all 4 tracks: can it generalise to mini_monaco like Trial 9 did?
4. If yes: We have reproduced Trial 9 reliably with a better reward function

Exp 8 — Mountain track, v5 reward, throttle_min=0.5, CORRECT checkpointing ✅ COMPLETED

Reward: v5 (speed × CTE-quality)
throttle_min: 0.5
Method: Direct model.learn() loop, single TCP connection, NO close_and_switch
Steps: 90,000 total | 6,000 per segment | 15 checkpoints
Circle exploit fix: Short-lap terminates episode immediately
Peak segment: Seg 10 (step 60,000) — 567 reward / 2000 steps (FULL EVAL on mountain_track!)
Policy diverged: Seg 11-15 (31, 20 reward) — best_model.zip captured the peak correctly
Checkpoints saved: checkpoint_0006000.zip through checkpoint_0090000.zip + best_model.zip
Final eval results using best_model.zip (step 60k weights):

Track	Ep1	Ep2	Ep3	Mean steps	Result
mountain_track (training)	382	529	182	364	❌ crashes
generated_track (zero-shot)	63	61	61	62	❌ crashes
mini_monaco (zero-shot)	154	155	104	138	❌ crashes at one corner
generated_road (zero-shot)	41	42	41	41	❌ crashes

Throttle test: mini_monaco at throttle_min=0.5 over 5 episodes: 93/94/79/95/94 steps (mean=91, very consistent = same corner every time). throttle_min=0.2 test impossible — action space baked in at training time.
Key findings:
1. ✅ Circle exploit fully eliminated — no short laps observed
2. ✅ Best model saving worked — captured step 60k peak, not step 90k drift
3. ✅ Genuine 20-22 second laps during training from step ~18k onward
4. ❌ Model crashes at exactly the same corner on mini_monaco every time (too fast)
5. ❌ throttle_min=0.5 baked into action space — model cannot output throttle < 0.5, cannot slow for corners
6. 🔑 INSIGHT: v4 + 0.2 failed because v4 gradient = 0 on hill. v5 gradient is non-zero — model CAN learn to apply high throttle when needed even with 0.2 floor
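
Finding 5 can be illustrated with a tiny sketch of a throttle floor. It assumes the floor acts as a simple clamp on the requested throttle; the project's ThrottleClamp wrapper may instead rescale the action range, so treat `clamp_throttle` as a hypothetical stand-in.

```python
def clamp_throttle(raw_throttle: float, throttle_min: float,
                   throttle_max: float = 1.0) -> float:
    """Limit the requested throttle to [throttle_min, throttle_max]."""
    return max(throttle_min, min(throttle_max, raw_throttle))


requested = [0.0, 0.1, 0.3, 0.6, 1.0]
applied_floor_05 = [clamp_throttle(t, 0.5) for t in requested]
applied_floor_02 = [clamp_throttle(t, 0.2) for t in requested]
print("floor 0.5:", applied_floor_05)
print("floor 0.2:", applied_floor_02)
```

With a 0.5 floor every request below 0.5 collapses to the same applied throttle, so the policy has no way to slow for a corner; a 0.2 floor leaves the range between 0.2 and 0.5 available.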

Exp 9 — Mountain track, v5 reward, throttle_min=0.2 (RUNNING)

Change from Exp8: throttle_min: 0.5 → 0.2 (only change)
Reward: v5 (speed × CTE-quality) — UNCHANGED
Hypothesis: v5 reward provides non-zero gradient signal on hill (∂reward/∂speed is non-zero). Model CAN learn to output high throttle on hill. With 0.2 floor, model has full range [0.2, 1.0] and can apply lower throttle on corners — potentially solving mini_monaco corner crash.
What we never tested: (0.2, v4) failed. (0.5, v5) worked. (0.2, v5) was never tried.
Risk: Model may still stall on hill if gradient convergence is slow in early training. StuckTermination (-1.0) + v5 speed gradient together should push toward higher throttle.
Next test (Exp10): Add track_progress bonus to reward (v6) — one variable at a time.
Save dir: models/exp9-mountain-v5-throttle02/
Watch: tail -f /tmp/exp9.log

Exp 9 — Evaluation Results (3-set test, 1 run per track per set)

Model tested: models/exp9-mountain-v5-throttle02/best_model.zip
Date: 2026-04-18
Test setup: 3 independent sets, lighting randomises each run (no fixed seed)

Track	Set 1	Set 2	Set 3	Mean	Pattern
mountain_track (trained)	✅ 2000	✅ 2000	✅ 2000	2000	Rock solid
generated_track (zero-shot)	❌ 79	❌ 61	❌ 82	74	Always fails — can't make first corner
generated_road (zero-shot)	❌ 651	✅ 2000	❌ 1203	1285	Highly variable — lighting dependent
mini_monaco (zero-shot)	❌ 32	❌ 60	❌ 34	42	Always fails — veers right immediately

User observations:

mountain_track: 80-90% of time on or near centre yellow line. Solid driving.
generated_road: Driving looks good when it works, but goes off course. Lighting variation causes inconsistency.
generated_track: Cannot make first corner at all. Model sees nothing it recognises.
mini_monaco: Veers right immediately at start before any visible driving. Crashes before reaching the road.

Key finding — Lighting effect confirmed: Generated_road varies 651→2000→1203 with identical model and track. ONLY lighting changes. Mountain_track is immune because it trained under many random lighting conditions. Generated_track and mini_monaco fail regardless of lighting — visual domain too different.
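
The variability claim can be quantified directly from the Exp 9 table. The sketch below uses only the step counts recorded above; the `summarise` helper and the choice of max-minus-min as the spread measure are illustrative assumptions.

```python
from statistics import mean

STEP_CAP = 2000

# Steps survived per evaluation set, copied from the Exp 9 results table.
exp9_eval = {
    "mountain_track": [2000, 2000, 2000],
    "generated_track": [79, 61, 82],
    "generated_road": [651, 2000, 1203],
    "mini_monaco": [32, 60, 34],
}


def summarise(steps):
    """Mean, max-min spread, and number of full-length runs for one track."""
    return {
        "mean": round(mean(steps)),
        "spread": max(steps) - min(steps),
        "full_runs": sum(s >= STEP_CAP for s in steps),
    }


summary = {track: summarise(steps) for track, steps in exp9_eval.items()}
for track, row in summary.items():
    print(f"{track:16s} mean={row['mean']:5d} spread={row['spread']:5d} "
          f"full={row['full_runs']}/3")
```

A large spread with at least one full run (generated_road) points to a condition-dependent failure, while a small spread with no full runs (generated_track, mini_monaco) points to a failure that does not depend on the run.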

What this tells us about next steps: Train on mountain_track + generated_track together (v5 reward, throttle_min=0.2). Both tracks have random lighting each episode → model forced to learn lighting-invariant features. Goal: model that is reliable on both training tracks, then test generalisation to generated_road and mini_monaco.

Exp 10 — Two tracks: generated_track + mountain_track, v5 reward, throttle_min=0.2

Change from Exp9: Added generated_track as second training track
Reward: v5 (speed × CTE) — unchanged
throttle_min: 0.2 — unchanged from Exp9
Training tracks: generated_track + mountain_track (round-robin, switch every 6,000 steps)
Total steps: 90,000 | Steps per switch: 6,000 | ~7.5 rotations through both tracks
lr: 0.000725 — unchanged
Hypothesis: Adding generated_track visual diversity forces model to learn lighting-invariant road-following features. Mountain_track teaches hill throttle. Together should generalise better to generated_road and potentially mini_monaco.
Expected results: mountain_track reliable, generated_track reliable, generated_road improved, mini_monaco TBD
This is essentially Trial 9 repeated with: v5 reward + throttle_min=0.2 + proper checkpointing + exploit fix
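
The round-robin arithmetic above can be written out as a short sketch. The track order (generated_track first) and the helper name `round_robin_schedule` are assumptions made for illustration.

```python
from collections import Counter

TOTAL_STEPS = 90_000
STEPS_PER_SWITCH = 6_000
TRACKS = ["generated_track", "mountain_track"]  # order is an assumption


def round_robin_schedule(tracks, total_steps, steps_per_switch):
    """Return [(start_step, track), ...], one entry per training segment."""
    schedule = []
    for i, start in enumerate(range(0, total_steps, steps_per_switch)):
        schedule.append((start, tracks[i % len(tracks)]))
    return schedule


schedule = round_robin_schedule(TRACKS, TOTAL_STEPS, STEPS_PER_SWITCH)
steps_per_track = Counter()
for start, track in schedule:
    steps_per_track[track] += min(STEPS_PER_SWITCH, TOTAL_STEPS - start)

rotations = len(schedule) / len(TRACKS)
print(len(schedule), "segments,", rotations, "rotations")
print(dict(steps_per_track))
```

Because 15 segments do not divide evenly between two tracks, one track receives an extra 6,000 steps and is also the track trained last, which matters if the best model is selected on segment score alone.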

Exp 10 — Evaluation Results (3-set test, 2026-04-19)

Model tested: models/exp10-two-tracks/best_model.zip
Result: TOTAL FAILURE. Crashes on every track, every set.

Track	Set 1	Set 2	Set 3	Mean	Verdict
mountain_track (trained)	178	179	179	179	❌ Crashes at same spot every time
generated_track (trained)	99	82	88	90	❌ Crashes almost immediately
generated_road (zero-shot)	135	223	105	154	❌ Crashes early
mini_monaco (zero-shot)	111	133	129	124	❌ Crashes early

Comparison to previous best models:

Exp 9 (mountain only): mountain_track was 2000/2000 every time → now 179. 91% regression.
Wave 4 Trial 9 (generated_track + mountain_track via autoresearch): generated_track 2000/2000, mini_monaco 2000/2000 → now 90 and 124.

Analysis:

The round-robin track switching every 6,000 steps via multitrack_runner.train_multitrack() produced a model that learned NEITHER track. This is catastrophic interference.
Wave 4 Trial 9 used the same two tracks via the autoresearch controller with near-identical hyperparameters (switch=6,851 instead of 6,000; lr=0.000725; ~90k steps). The key difference is likely in HOW the environment switching works: multitrack_runner closes and reopens envs, potentially disrupting PPO's rollout buffer and value function estimates.
Mountain_track crashes at exactly step 178-179 in all 3 sets — suggests the model has learned a fixed degenerate policy (always turn one direction) rather than responding to vision.

Key question: Why did Wave 4 Trial 9 succeed with similar parameters but Exp 10 failed? Possible causes: (1) env close/reopen resets PPO internal state, (2) best_model selection criteria differs, (3) multitrack_runner wrapping chain differs from autoresearch controller.

Full log: agent/test-results/2026-04-19_10-15_exp10-two-tracks.log

Exp 9 vs Exp 10 — Root Cause Analysis

Aspect	Exp 9 (worked ✅)	Exp 10 (failed ❌)
Tracks	mountain_track only	generated_track + mountain_track (round-robin)
Env setup	`VecTransposeImage(DummyVecEnv([make_env]))` — created ONCE, never closed	`wrap_env(raw)` passed to PPO, which auto-wraps; closed and reopened every 6k steps
Track switching	None — single env for entire 90k steps	`close_and_switch()` — close env, exit_scene, sleep, gym.make new track
PPO continuity	Repeated `model.learn()` calls with `reset_num_timesteps=False`, same env	`model.learn()` + `model.set_env(new_env)` after each switch
Eval between segments	Direct `env.reset()` + predict loop on same env	Same, but env may be a different track than what was just trained
Best model selection	Based on eval reward on mountain_track	Based on segment reward — could be from either track

Conclusion: Exp 9 kept a single persistent env connection for all 90k steps. Exp 10 closed and reopened the env every 6k steps with model.set_env(). This likely disrupts PPO's rollout buffer, value estimates, and observation normalization. Exp 9 was a completely different (simpler) script with no track switching at all.

Exp 10 vs Wave 4 Trial 9 — Why Did Trial 9 Work?

Wave 4 Trial 9 used nearly identical hyperparameters to Exp 10 and the SAME multitrack_runner.py code — yet Trial 9 scored 1435 on mini_monaco (zero-shot) while Exp 10 crashes on every track at <180 steps.

Wave 4 Trial 9 parameters:

lr=0.000725, steps_per_switch=6,851, total_timesteps=89,893
Trained on generated_track + mountain_track (same as Exp 10)
Used multitrack_runner.py via CLI subprocess (same close_and_switch logic)

Exp 10 parameters:

lr=0.000725, steps_per_switch=6,000, total_timesteps=90,000
Nearly identical to Trial 9

But Wave 4 was mostly failures too:

Metric	Value
Total Wave 4 trials	25
Scores > 500	4 / 25 (16%)
Scores > 200	5 / 25 (20%)
Median score	111.3
Mean score	343.8
Std deviation	566.2

The top 4 scores (1943, 1573, 1543, 1435) are massive outliers — 80% of trials scored below 200. Trial 0 (score 1943) was later found to be pre-exploit-patch and Trial 14 (1573) and Trial 25 (1543) showed inconsistent driving when re-tested (see STATE.md).

The real conclusion: Trial 9's success was likely due to lucky random initialization of CNN weights. With 80% of trials failing under the same training methodology, the multitrack round-robin approach via close_and_switch is fundamentally unreliable. The few successes are random seed lottery winners, not evidence that the method works.

Wave 5 reproduction attempt: We tried training on generated_track only (single track, no switching, same lr=0.000725, 90k steps) to test whether the track-switching was the problem. Result stored in models/wave5-gentrack-only/. (Results were poor — could not reproduce Trial 9's quality.)

Open question: Is there a reliable way to do multi-track training, or should we focus on single-track training with domain randomization (lighting, camera angle) to achieve generalization instead?

Exp 11 — Parallel DummyVecEnv, v5 reward (ABORTED)

Date: 2026-04-19
Change from Exp10: Two sim instances (port 9091 + 9093), DummyVecEnv wraps both. PPO sees both tracks in every rollout batch. No close_and_switch.
Tracks: generated_track (9091) + mountain_track (9093)
Reward: v5 (speed × CTE) — same as Exp 9/10
Result: ABORTED at 66k/90k steps. Circular driving observed on generated_track. v5 reward has no efficiency term → circles at CTE≈0 earn positive reward.
Positive: Parallel env infrastructure works! Both sims connected, PPO trained stably with no env switching issues. Consistent improvement 14.7→67.8 combined.
Negative: Circular driving exploit returned because v5 dropped efficiency.
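
A parallel two-sim setup of this kind might be wired up roughly as below. This is a sketch under assumptions: it uses the gym-donkeycar env ids `donkey-generated-track-v0` and `donkey-mountain-track-v0` with a `conf` dict carrying `host` and `port`, omits the project's reward and throttle wrappers, and assumes a Stable-Baselines3 version compatible with the installed gym. `make_env_fn` and `build_model` are hypothetical names.

```python
def make_env_fn(env_id: str, port: int, host: str = "127.0.0.1"):
    """Return a zero-argument factory, which is what DummyVecEnv expects."""
    def _init():
        # Imported lazily so this module loads without the simulator stack.
        import gym
        import gym_donkeycar  # noqa: F401  (registers the donkey-* env ids)
        return gym.make(env_id, conf={"host": host, "port": port})
    return _init


# One simulator instance per track, each on its own port.
TRACKS = [
    ("donkey-generated-track-v0", 9091),
    ("donkey-mountain-track-v0", 9093),
]
env_fns = [make_env_fn(env_id, port) for env_id, port in TRACKS]


def build_model(learning_rate: float = 0.000725):
    """Create PPO over both envs so every rollout batch mixes both tracks."""
    from stable_baselines3 import PPO
    from stable_baselines3.common.vec_env import DummyVecEnv, VecTransposeImage

    venv = VecTransposeImage(DummyVecEnv(env_fns))
    return PPO("CnnPolicy", venv, learning_rate=learning_rate)
```

Both envs are created once and stay open for the whole run, so there is no close-and-reopen step between segments.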

Exp 11b — Parallel DummyVecEnv, v6 reward (anti-circle gate)

Date: 2026-04-19
Change from Exp11: Reward v6 (speed × CTE + efficiency gate ≥ 0.15). Also stuck_steps 80→40 (faster stuck termination).
Tracks: generated_track (9091) + mountain_track (9093)
Total steps: 90,000 | lr=0.000725 | throttle_min=0.2
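
A minimal sketch of how an efficiency gate can block circling, for illustration only. It assumes efficiency means net displacement divided by distance travelled over a recent window of positions, and that a gated step simply receives zero reward; the project's actual v6 definition may differ, and `path_efficiency` and `reward_v6` are hypothetical names.

```python
import math

GATE_THRESHOLD = 0.15  # gate value quoted for v6


def path_efficiency(positions):
    """Net displacement divided by distance travelled over a window of (x, z) points."""
    travelled = sum(math.dist(a, b) for a, b in zip(positions, positions[1:]))
    if travelled == 0.0:
        return 0.0
    return math.dist(positions[0], positions[-1]) / travelled


def reward_v6(speed, cte_quality, efficiency, gate=GATE_THRESHOLD):
    """v5-style product, zeroed when efficiency falls below the gate."""
    return speed * cte_quality if efficiency >= gate else 0.0


# Twenty steps along a straight line versus twenty steps around a closed circle.
straight = [(float(i), 0.0) for i in range(21)]
circle = [(math.cos(2 * math.pi * i / 20), math.sin(2 * math.pi * i / 20))
          for i in range(21)]

eff_straight = path_efficiency(straight)
eff_circle = path_efficiency(circle)
print(reward_v6(1.0, 0.9, eff_straight), reward_v6(1.0, 0.9, eff_circle))
```

Under this definition a closed loop has near-zero efficiency and earns nothing, while forward progress passes the gate and is rewarded exactly as in v5.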

Training progress (eval at each 6k checkpoint):

Steps	gen_track	mountain	Combined	Note
6k	91s	130s	10.7r	Early
18k	100s	100s	15.9r	Improving
36k	161s	160s	26.2r	⭐
42k	160s	159s	28.9r	⭐
60k	164s	163s	—	Plateau
78k	169s	168s	29.2r	⭐
90k	173s	172s	—	End

Evaluation results (best_model, 3 sets per track):

Track	Set 1	Set 2	Set 3	Mean	Verdict
mountain_track (trained)	195	196	192	194	❌
generated_track (trained)	192	194	192	193	❌
generated_road (zero-shot)	192	196	194	194	❌
mini_monaco (zero-shot)	194	192	196	194	❌

Analysis:

✅ No circular driving (efficiency gate works)
✅ Remarkably consistent: all tracks ~194 steps, very low variance
✅ Parallel env infrastructure is stable and reliable
❌ Model plateaus at ~170-195 steps and never improves past that
❌ Much worse than Exp 9 (mountain only: 2000/2000) or Wave 4 Trial 9 (2000/2000)
The consistency across all 4 tracks (including zero-shot) suggests the model learned a generic short-drive policy, not track-specific features
Possible cause: 90k steps may be insufficient for 2-env parallel training (effective steps per track = 45k each), or the efficiency gate may be suppressing early exploration

Key findings:

Parallel DummyVecEnv works mechanically — this is the right infrastructure
v6 reward prevents circular driving
But 90k steps with 2 parallel envs may not be enough training budget
Compare: Exp 9 (single track, 90k steps, v5) → 2000 steps. Exp 11b (2 tracks, 90k steps, v6) → 194 steps. The training budget per track is halved AND the reward is harder to exploit.

Next experiments to consider:

Increase total_timesteps to 180k-250k (restore per-track budget)
Try v6 reward on single track first to isolate reward vs multi-track effects
Try v5 reward with parallel envs but longer training (accept some circling)
Check if efficiency gate triggers too aggressively during normal cornering

Exp 14b — Mountain finetune from exp14 champion (2026-04-19)

Script: agent/experiments/exp14_finetune_v5.py
Warm start: agent/models/exp14-mountain-v5/best_model.zip
Schedule:
- phase 1: runtime throttle floor 0.4
- phase 2: runtime throttle floor 0.2
Goal: improve hill climbing, robustness, and lap time on mountain_track

Important outcome

The finetune run did not improve monotonically. It briefly improved, then later degraded badly. This means the final/latest checkpoint is not the model we want to keep.

Candidate checkpoint comparison

We ran a fresh deterministic comparison on mountain only:

9 episodes per model
2000 step cap
Results saved to:
- agent/outerloop-results/mountain_candidate_eval_2026-04-19.jsonl
- agent/outerloop-results/mountain_candidate_eval_2026-04-19.md

Model	Floor	Success eps	Full 2k eps	Avg laps/ep	Total laps	Mean lap	Best lap	Avg steps	Verdict
exp14_base	0.2	7/9	3/9	1.78	16	29.24s	27.02s	1332	Original champion
ft_006k	0.4	1/9	0/9	0.11	1	21.36s	21.36s	335	Very fast but unusably fragile
ft_024k	0.4	4/9	0/9	0.56	5	21.58s	20.53s	575	Fast but fragile
ft_030k	0.4	1/9	0/9	0.22	2	21.53s	20.72s	317	Very fast but unusably fragile
ft_036k	0.2	9/9	6/9	2.78	25	27.93s	26.16s	1841	Best overall balance
ft_042k	0.2	8/9	4/9	1.89	17	29.25s	27.09s	1404	Decent, but worse than 36k
ft_048k	0.2	6/9	3/9	1.44	13	31.15s	28.31s	1127	Degraded
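
The verdict column can be reproduced with a simple ranking over the table. The ordering below (successful episodes, then full 2k episodes, then total laps, then faster mean lap) is an assumed robustness-first criterion chosen for illustration, not necessarily the rule used for the original verdicts.

```python
# (model, floor, success_eps, full_2k_eps, total_laps, mean_lap_s, avg_steps)
# Values copied from the candidate comparison table; 9 episodes per model.
candidates = [
    ("exp14_base", 0.2, 7, 3, 16, 29.24, 1332),
    ("ft_006k", 0.4, 1, 0, 1, 21.36, 335),
    ("ft_024k", 0.4, 4, 0, 5, 21.58, 575),
    ("ft_030k", 0.4, 1, 0, 2, 21.53, 317),
    ("ft_036k", 0.2, 9, 6, 25, 27.93, 1841),
    ("ft_042k", 0.2, 8, 4, 17, 29.25, 1404),
    ("ft_048k", 0.2, 6, 3, 13, 31.15, 1127),
]


def robustness_key(row):
    """Sort key: reliability first, lap speed only as a tie-breaker."""
    _, _, success, full, laps, mean_lap, _ = row
    return (success, full, laps, -mean_lap)


ranked = sorted(candidates, key=robustness_key, reverse=True)
most_robust = ranked[0][0]
fastest = min(candidates, key=lambda row: row[5])[0]
print("most robust:", most_robust)
print("fastest mean lap:", fastest)
```

Ranking by reliability and ranking by lap time pick different checkpoints, which is why the fast 0.4-floor checkpoints were not promoted.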

Best model captured

Best overall checkpoint from the finetune:

agent/models/exp14-mountain-v5-finetune/checkpoint_0036000.zip

Promoted copy saved as:

agent/models/exp14-mountain-v5-finetune/best_robust_model_0036000.zip

Key learning

Early 0.4-floor checkpoints can produce very fast laps, but are too fragile to trust.
The best mountain finetune model is the 36k checkpoint after switching back to 0.2 floor, not the later checkpoints.
Later finetune checkpoints collapsed badly, matching the user's visual observation of wheelspin / poor driving.

Exp 15 — Generated track warm-start from mountain champion (2026-04-19)

Script: agent/experiments/exp15_gentrack_from_mountain.py
Warm start: agent/models/exp14-mountain-v5-finetune/best_robust_model_0036000.zip
Target track: generated_track
Target setup: Exp 13-style v4 generated-track training
Result: ❌ Failed

Observed behavior:

Model tried exploit-like behavior near the start / first corner
Did not learn clean generated-track driving
By ~25k steps, it was clearly far behind the known-good scratch run

Log evidence:

[20,000] reward=45.0 steps=47 laps=0
[25,000] reward=23.4 steps=30 laps=0
Short exploit laps appeared in the log (6.5s, 4.91s)

Conclusion:

Mountain → generated warm-start transfer is poor in this direct setup
The mountain policy prior seems to bias the agent toward bad local behavior instead of helping generated-track learning

Exp 16 — Mountain track warm-start from generated champion (2026-04-19)

Script: agent/experiments/exp16_mountain_from_gentrack.py
Warm start: agent/models/exp13-gentrack-v4/best_model.zip
Target track: mountain_track
Target setup: Exp 14-style v5 mountain training
Result: ❌ Failed

Observed behavior:

No meaningful mountain learning
Repeated short crash pattern
Never developed lap-completing mountain behavior

Log evidence:

[210,000] reward=10.2 steps=195 laps=0
[215,000] reward=10.1 steps=193 laps=0

Conclusion:

Generated → mountain warm-start transfer is also poor in this direct setup
The generated-track champion does not bootstrap mountain hill learning effectively here

Transfer-learning takeaway (current evidence)

Direct cross-track warm starts failed in both directions:

mountain → generated: failed / exploit-prone
generated → mountain: failed / short-crash plateau

Current interpretation:

the single-track policies are too specialized for naive direct transfer, and/or
the mountain sim physics differences are large enough to break transfer

For now:

keep the single-track champions as separate specialists
do not assume direct cross-track warm starts are beneficial

Mountain Track Friction Fix (2026-04-27)

Root cause

WheelPhys.cs scales wheel grip by the static friction of whatever surface the wheel is touching: fFriction.stiffness = hit.collider.material.staticFriction * originalForwardStiffness.

mountain_track.unity assigned the Slippery physics material (staticFriction=0.1) to 4 track surface colliders from the long_road prefab. This gave the car 1/5 the normal grip on the hill, causing visible wheelspin even at full throttle.

The Slippery material is intentional on genuinely icy surfaces (thunderhill) but was incorrect on mountain_track's asphalt hill.

Fix applied

Replaced all 4 Slippery material assignments with Road material (staticFriction=0.5) in sdsim/Assets/Scenes/mountain_track.unity.

Material	staticFriction	GUID
Slippery (removed)	0.1	c0e12c099c364af4e9e311a43d0f12c4
Road (applied)	0.5	7884193b0ead347a38a13a67f294dfb5

To activate

The training setup uses the pre-built Windows executable (DonkeySimWin/donkey_sim.exe), not a locally-compiled build. The scene file edit in sdsandbox/ has no effect on the running binary — it only matters if the sim is ever rebuilt from source in Unity Editor.

This fix is deferred. Proceed with Exp 17 using the existing executable. If mountain hill training in Exp 17 specifically struggles (short episodes that plateau and never improve), that is the signal to pursue a Unity Editor rebuild.

The scene file change is committed in sdsandbox/ and will apply automatically if the sim is rebuilt for any other reason. No Python code changes needed.

Expected effect

Hill wheelspin should stop or greatly reduce
throttle_min=0.2 + v5 reward should be even more effective on the hill
All future mountain experiments benefit; no code changes needed

Strategy Review and Exp 17 Plan (2026-04-27)

Where the project stands

After 16 experiments and 4 autoresearch phases, the core problem is clear: multi-track training is needed for generalisation, but the training method has been unreliable. Here is the summary of what each approach found:

Approach	Outcome
Round-robin close-and-switch (Wave 4, Exp 10)	80% failure. PPO rollout buffer disrupted on env swap. Lucky seed (Trial 9) worked once but cannot be reproduced.
Parallel DummyVecEnv 90k steps (Exp 11b)	Infrastructure valid, no catastrophic forgetting, but 90k steps / 2 tracks = ~45k effective per track. Not enough.
Cross-track warm starts (Exp 15, 16)	Both directions failed. Single-track specialists do not transfer cleanly.
Single-track PPO (Exp 9, 13, 14)	Reliable but no generalisation.

The conclusion: parallel DummyVecEnv is the right architecture; the only known failure mode is training budget. Exp 11b was mechanically sound but starved of steps.

Exp 17 — Parallel DummyVecEnv, 400k–500k steps

This is the primary next experiment.

Parameter	Value	Reason
Architecture	DummyVecEnv([generated_track:9091, mountain_track:9093])	Validated in Exp 11b; no PPO disruption
Total timesteps	400,000–500,000	~200k–250k effective per track; Exp 11b suggests 90k is insufficient
Reward	v6 on both envs (efficiency gate + CTE patience terminator)	Blocks circular exploit on generated_track; gate threshold may be tuned
throttle_min	0.2 both envs (or 0.5 mountain, 0.2 generated — see ADR-020)	v5/v6 gradient non-zero on hills at 0.2
learning_rate	0.000725	From Trial 9 and Exp 9 — consistent with best results
Checkpoint	every 20,000 steps + best_model.zip tracked throughout	ADR-017: best model ≠ final model
Eval	mini_monaco zero-shot at every checkpoint	Detect the peak before policy drifts
Warm start	None — train from random weights	ADR-024: cross-track warm starts failed

Setup checklist before running:

Two sim instances running: one on port 9091, one on port 9093
Each instance loaded with its configured track (generated_track on 9091, mountain_track on 9093)
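Mountain friction fix: deferred, so use the existing executable; rebuild the simulator only if hill training plateaus (see Mountain Track Friction Fix)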
Verify throughput: run 2-minute timing benchmark, set step cap accordingly (ADR-014)

Success criterion: mini_monaco zero-shot score > 500 (at least 25% of a full 2000-step episode) reliably across 3 evaluation sets, reproducible across 2+ runs.

Fallback: Curriculum training (if Exp 17 plateaus below 200)

If Exp 17 cannot get past ~200 steps on mini_monaco:

Phase A: generated_track only, 150k steps (establish road-following)
Phase B: add mountain_track to DummyVecEnv, continue 250k more steps
Rationale: gives the policy a foundation before the harder mountain physics

Fallback: v6 efficiency gate tuning (if gate is too aggressive)

Log what fraction of steps are gated (reward zeroed) in the first 100k steps. If >40%, lower the gate threshold from 0.15 to 0.10 for the first 150k steps, then raise it back to 0.15. Prevents the gate from suppressing early exploration.
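
The gate-tuning fallback can be sketched as two small helpers. The thresholds (0.15, 0.10), the 40% trigger, and the 150k-step window come from the plan above; the function names and the idea of passing per-step efficiency values as a list are assumptions for illustration.

```python
def gated_fraction(efficiencies, threshold=0.15):
    """Fraction of logged steps whose efficiency fell below the gate."""
    if not efficiencies:
        return 0.0
    return sum(e < threshold for e in efficiencies) / len(efficiencies)


def should_relax_gate(efficiencies, threshold=0.15, max_fraction=0.40):
    """True when more than max_fraction of steps were gated."""
    return gated_fraction(efficiencies, threshold) > max_fraction


def gate_threshold(step, relaxed=False, relaxed_until=150_000,
                   low=0.10, normal=0.15):
    """Gate value at a given training step, with an optional relaxed warm-up."""
    return low if relaxed and step < relaxed_until else normal


# Hypothetical log: 45 of 100 early steps fall below the 0.15 gate.
early_log = [0.05] * 45 + [0.5] * 55
relax = should_relax_gate(early_log)
print(gated_fraction(early_log), relax)
print(gate_threshold(10_000, relaxed=relax), gate_threshold(150_000, relaxed=relax))
```

Keeping the decision and the schedule as separate functions means the 40% check can be run once on the first 100k steps of logs before committing to a relaxed run.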
