Test History — DonkeyCar RL Autoresearch

Last updated: 2026-04-18

This document records every significant training experiment, what was changed, what was observed, and what was learned. Use this to make methodical decisions rather than random changes.

Baseline Models (Phase 1 & 2)

Phase 2 Champion

Model: models/champion/model.zip
Track trained on: generated_road only
Steps: 13,328
Hyperparams: lr=0.000225, PPO continuous actions, ThrottleClamp(0.2), v4 reward
Result: ✅ Drives generated_road perfectly, stays in right lane
Zero-shot: ❌ Fails on generated_track (confirmed), ❌ Fails on mini_monaco
Notes: Single track, simple road, model converged cleanly. Final model = best model (no divergence in 13k steps)

Mountain Track Experiments

All experiments: mountain_track only, lr=0.000725, throttle_min varies, 90k steps unless noted (Exp 2 and Exp 4 were continuations toward 200k)

Exp 1 — Mountain track, old v4 reward, throttle_min=0.2

Reward: v4 (CTE × efficiency × speed)
throttle_min: 0.2
Key observation: Car gets partway up the hill, slows, stops, and rolls back. Always crashes at about the same step (~153-166). Logged throttle on the hill was 0.200, which is not enough power
Root cause: v4 reward gives zero gradient signal on hill (efficiency→0, speed→0, reward→0 simultaneously, no direction for "apply more throttle")
Learned: v4 reward is broken for inclined terrain

Exp 2 — Mountain track, old v4 reward, throttle_min=0.2, continued to 200k

Reward: v4
throttle_min: 0.2
Key observation: Only 2 behaviors: turn left and hit barrier, or go straight and hit barrier at turn
Result: ❌ Killed early — no improvement
Learned: More steps alone cannot fix a broken reward signal

Exp 3 — Mountain track, old v4 reward, throttle_min=0.5

Reward: v4
throttle_min: 0.5 (increased to overcome hill)
Key observation: Circle exploit dominated entire run — 0.5-1.75 second laps throughout
Lap times logged: All short (exploit)
Result: ❌ Model useless (reward=4.99 after 90k steps)
Learned: Higher throttle got car over hill but circle exploit took over because v4 has no efficiency penalty when throttle is high

Exp 4 — Continued from Exp 3 (200k total), old v4 reward, throttle_min=0.5

Reward: v4
throttle_min: 0.5
Key observation: Killed early — same 2 behaviors (left into barrier, straight into barrier)
Result: ❌ Killed
Learned: Continuing bad training does not help

Exp 5 — Mountain track, v5 reward, throttle_min=0.5 ⭐ KEY EXPERIMENT

Reward: v5 (speed × CTE-quality) — NEW reward that directly incentivises throttle on hills
throttle_min: 0.5
Method: Direct model.learn() — NO train_multitrack(), ONE connection throughout
Key observation: Genuine 20-22 second laps appearing from step ~30,000 onward
Lap times: 19-22 seconds (genuine), consistently for 60k steps
Result: ❌ Final model poor — best model was at step ~30k but we only saved final (step 90k) model
Root cause of failure: No best-model saving. Policy peaked at 30k, diverged by 90k
Learned:
1. v5 reward WORKS for mountain track
2. throttle_min=0.5 WORKS for hill
3. Direct model.learn() (no track switching) avoids phantom car issues
4. MUST save best model during training, not just final
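
A minimal sketch of lesson 4, assuming a segment loop that saves a numbered checkpoint after every segment and overwrites best_model.zip whenever an eval score improves. `train_in_segments`, `StubModel` and `stub_evaluate` are hypothetical names for illustration. The stub only mimics the `learn()` / `save()` calls so the sketch runs without a simulator, and its score curve (peak at 30k, then decline) is made up to mirror what Exp 5 showed.

```python
import os
import tempfile


def train_in_segments(model, evaluate, total_steps, segment_steps, save_dir):
    """Train in fixed-size segments; keep every checkpoint plus the best one."""
    os.makedirs(save_dir, exist_ok=True)
    best_score, best_step = float("-inf"), None
    for step in range(segment_steps, total_steps + 1, segment_steps):
        # reset_num_timesteps=False keeps one continuous timestep counter.
        model.learn(total_timesteps=segment_steps, reset_num_timesteps=False)
        model.save(os.path.join(save_dir, f"checkpoint_{step:07d}.zip"))
        score = evaluate(model)
        if score > best_score:
            best_score, best_step = score, step
            model.save(os.path.join(save_dir, "best_model.zip"))
    return best_step, best_score


class StubModel:
    """Hypothetical stand-in for a PPO model (no simulator needed)."""

    def __init__(self):
        self.steps = 0

    def learn(self, total_timesteps, reset_num_timesteps=False):
        self.steps += total_timesteps
        return self

    def save(self, path):
        with open(path, "w") as f:
            f.write(str(self.steps))


def stub_evaluate(model):
    # Made-up score curve: peaks at 30k steps, then declines.
    return -abs(model.steps - 30_000)


save_dir = tempfile.mkdtemp()
best_step, best_score = train_in_segments(
    StubModel(), stub_evaluate,
    total_steps=90_000, segment_steps=6_000, save_dir=save_dir,
)
saved = sorted(os.listdir(save_dir))
print(best_step, len(saved))
```

With this shape, the weights written at the 30k peak survive even though training continues to 90k.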

Exp 6 — Mountain track, v5 reward, throttle_min=0.5, train_multitrack (1 segment)

Reward: v5
throttle_min: 0.5 (first segment only — close_and_switch used 0.2 for subsequent segments)
Method: train_multitrack() with steps_per_switch=90000 (one giant segment = one checkpoint)
Key observation: Circle exploit dominated — only 0.5-1.75 second laps throughout
Result: ❌ Only 1 checkpoint saved (at step 90k). Best reward=4.99
Root cause: Using steps_per_switch=TOTAL_STEPS defeated checkpointing (one segment = one save). Circle exploit reappeared (different from Exp5 — random seed variation)
Learned: steps_per_switch=TOTAL_STEPS is WRONG for single-track training with checkpointing
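
The arithmetic behind this, as a quick sketch. It assumes one checkpoint is written at the end of each segment; `checkpoint_steps` is a hypothetical helper, not a function from the training code.

```python
def checkpoint_steps(total_steps, steps_per_switch):
    """Step counts at which a segment ends (one checkpoint per segment)."""
    return list(range(steps_per_switch, total_steps + 1, steps_per_switch))


exp6 = checkpoint_steps(90_000, 90_000)  # steps_per_switch == TOTAL_STEPS
exp7 = checkpoint_steps(90_000, 6_000)   # the setting used from Exp 7 onward
print(len(exp6), len(exp7))
```

One segment means the only save lands at step 90k, after any mid-run peak has already been lost.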

Exp 7 — Mountain track, v5 reward + episode termination on short lap, throttle_min mixed

Reward: v5 + short-lap now TERMINATES episode (not just penalty)
throttle_min: 0.5 initial, 0.2 after segment 1 (bug: close_and_switch used module default)
Method: train_multitrack() with steps_per_switch=6000 (15 segments)
Key observation: Car in LEFT lane, sitting doing nothing. Not normal spawn position.
Hypothesis: A phantom car left over from Exp6 was still in the sim. Two TCP connections spawned two cars. The user was watching the phantom (left lane, receiving no commands) while training controlled the other car.
Result: ❌ Killed — phantom car issue
Learned:
1. close_and_switch() between segments creates phantom car risk for single-track training
2. throttle_min MUST be passed consistently — module default is 0.2, not 0.5
3. For single-track training: do NOT use close_and_switch() at all
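
A sketch of the throttle_min pitfall from lesson 2. It assumes the throttle floor is applied by clipping into [throttle_min, 1.0]; the real ThrottleClamp may rescale instead. `clamp_throttle` and `make_segment_config` are hypothetical names. The point is only that a silent module default wins whenever a caller forgets the argument, whereas a required argument fails loudly.

```python
DEFAULT_THROTTLE_MIN = 0.2  # module-level default, as described above


def clamp_throttle(throttle, throttle_min=DEFAULT_THROTTLE_MIN):
    """Clip a raw throttle command into [throttle_min, 1.0] (assumed behaviour)."""
    return max(throttle_min, min(1.0, throttle))


def make_segment_config(track, throttle_min):
    """throttle_min has no default here: omitting it raises TypeError."""
    return {"track": track, "throttle_min": throttle_min}


# Silent fallback: the caller forgot throttle_min, so the floor drops to 0.2.
silent = clamp_throttle(0.1)

# Explicit pass-through: the floor stays at the intended 0.5.
cfg = make_segment_config("mountain_track", throttle_min=0.5)
explicit = clamp_throttle(0.1, throttle_min=cfg["throttle_min"])
print(silent, explicit)
```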

Exp 8 — Mountain track, v5 reward + episode termination, throttle_min=0.5 consistently (RUNNING NOW)

Reward: v5 + short-lap terminates episode
throttle_min: 0.5 throughout (no close_and_switch = no module default override)
Method: Direct model.learn() in loop — ONE connection throughout entire run
Checkpoints: 15 numbered saves (every 6,000 steps) + best_model.zip
PID: 2941877, log: /tmp/exp8.log
Status: Running since 11:17, ~1h45m total
Watch: tail -f /tmp/exp8.log
Success criteria: Genuine 19-22 second laps appearing during training AND best_model.zip drives cleanly in deterministic eval

Wave 4 Multi-Track Experiments (generated_track + mountain_track)

Trial 9 ⭐ BEST OVERALL MODEL

Model: models/wave4-trial-0009/model.zip
Tracks: generated_track + mountain_track (round-robin, switch every 6,851 steps)
Steps: 89,893 total (~45k per track)
Hyperparams: lr=0.000725, switch=6,851
Reward: v4 (old — before exploit patches)
Result:
- ✅ Drives generated_track (3/3 episodes, 13-16 second genuine laps)
- ✅ Drives mini_monaco zero-shot (2000 steps, 40-second genuine laps — never seen in training)
- ❌ Crashes on mountain_track (~200 steps — hill + corner)
- ❌ Crashes on generated_road (~46 steps — turns right immediately)
Notes: Only 1 of 25 Wave 4 trials succeeded. Suspected random seed luck. Same hyperparameters repeated in Exp2 (overnight) produced useless model.

Wave 4 Other Trials (1-25 except Trial 9)

Result: All crashed on mini_monaco within 20-265 steps
Median mini_monaco score: ~112 (crashes at ~130 steps)
Trials 14, 25: Scored 1573, 1543 — suspected shuttle exploit (car going back and forth on straight)
Learned: Multi-track training is highly sensitive to random seed. GP+UCB did not converge reliably.

Key Decisions Made (What We Keep)

Decision	Reason
v5 reward: `speed × CTE-quality`	Directly incentivises throttle on hills. v4 gave zero gradient on inclines.
throttle_min=0.5 for mountain_track	Overcomes hill. Car can now reach first corner.
Short-lap penalty + EPISODE TERMINATION	Penalty alone insufficient — model stayed alive and accumulated rewards between laps. Termination makes circling strictly unprofitable.
Numbered checkpoints every segment	Never lose a good mid-training model again (ADR-017)
best_model.zip updated on new best segment score	Final model ≠ best model. Peak can be at 30k even if final is at 90k.
Single TCP connection for single-track training	Avoids phantom car problem from close_and_switch()
lr=0.000725	From Trial 9 (best model). Consistent with good results.
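
Why the short-lap penalty alone was not enough can be sketched with a toy episode. All numbers here (per-step reward, lap length in steps, penalty size) are made up for illustration, and `episode_return` is a hypothetical helper, not the project's reward code.

```python
def episode_return(step_reward, steps_per_lap, max_steps, lap_penalty,
                   terminate_on_short_lap):
    """Total reward for a car that completes an exploit lap every steps_per_lap steps."""
    total = 0.0
    for step in range(1, max_steps + 1):
        total += step_reward
        if step % steps_per_lap == 0:  # a short (exploit) lap just completed
            total += lap_penalty
            if terminate_on_short_lap:
                break
    return total


# Circling with a penalty only: the car keeps collecting reward between laps.
penalty_only = episode_return(0.5, 20, 2000, -5.0, terminate_on_short_lap=False)
# Circling with termination: the episode ends at the first short lap.
with_termination = episode_return(0.5, 20, 2000, -5.0, terminate_on_short_lap=True)
# Genuine driving: no short lap ever completes within the episode.
genuine = episode_return(0.5, 10_000, 2000, -5.0, terminate_on_short_lap=True)
print(penalty_only, with_termination, genuine)
```

Under these toy numbers circling still earns a large positive return with the penalty alone, while termination caps it far below what genuine driving earns.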

Key Problems Still Open

Problem	Status
Mountain track circle exploit	Partially fixed — episode termination added. Exp8 will show if it holds.
Mountain track — car can't navigate first corner reliably	Still being investigated. Exp5 showed genuine laps so it IS solvable.
Multi-track generalization is random-seed dependent	No reliable solution yet. Trial 9 was lucky.
Mountain track model doesn't generalise to other tracks	Expected — single track training generalises poorly. Next step after Exp8 succeeds.

Next Steps (Proposed, Not Yet Run)

1. Exp 8 result: If best_model.zip drives mountain_track reliably → proceed to Step 2
2. Combine mountain_track + generated_track using v5 reward, throttle_min=0.5, proper checkpointing
3. Test combined model on all 4 tracks: can it generalise to mini_monaco like Trial 9 did?
4. If yes: We have reproduced Trial 9 reliably with a better reward function

Exp 8 — Mountain track, v5 reward, throttle_min=0.5, CORRECT checkpointing ✅ COMPLETED

Reward: v5 (speed × CTE-quality)
throttle_min: 0.5
Method: Direct model.learn() loop, single TCP connection, NO close_and_switch
Steps: 90,000 total | 6,000 per segment | 15 checkpoints
Circle exploit fix: Short-lap terminates episode immediately
Peak segment: Seg 10 (step 60,000) — 567 reward / 2000 steps (FULL EVAL on mountain_track!)
Policy diverged: Seg 11-15 (31, 20 reward) — best_model.zip captured the peak correctly
Checkpoints saved: checkpoint_0006000.zip through checkpoint_0090000.zip + best_model.zip
Final eval results using best_model.zip (step 60k weights):

Track	Ep1	Ep2	Ep3	Mean steps	Result
mountain_track (training)	382	529	182	364	❌ crashes
generated_track (zero-shot)	63	61	61	62	❌ crashes
mini_monaco (zero-shot)	154	155	104	138	❌ crashes at one corner
generated_road (zero-shot)	41	42	41	41	❌ crashes

Throttle test: mini_monaco at throttle_min=0.5 over 5 episodes: 93/94/79/95/94 steps (mean=91, very consistent = same corner every time). throttle_min=0.2 test impossible — action space baked in at training time.
Key findings:
1. ✅ Circle exploit fully eliminated — no short laps observed
2. ✅ Best model saving worked — captured step 60k peak, not step 90k drift
3. ✅ Genuine 20-22 second laps during training from step ~18k onward
4. ❌ Model crashes at exactly the same corner on mini_monaco every time (too fast)
5. ❌ throttle_min=0.5 baked into action space — model cannot output throttle < 0.5, cannot slow for corners
6. 🔑 INSIGHT: v4 + 0.2 failed because v4 gradient = 0 on hill. v5 gradient is non-zero — model CAN learn to apply high throttle when needed even with 0.2 floor
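
The insight in point 6 can be checked numerically. This is a sketch that takes the product forms literally (v4 = CTE-quality × efficiency × speed, v5 = speed × CTE-quality); the exact reward code may differ, and the input values are illustrative. Note that speed is not an action itself, so this only shows whether the reward responds to a speed change at all.

```python
def reward_v4(cte_quality, efficiency, speed):
    return cte_quality * efficiency * speed


def reward_v5(cte_quality, speed):
    return speed * cte_quality


def d_dspeed(f, speed, eps=1e-6):
    """Central finite-difference estimate of d(reward)/d(speed)."""
    return (f(speed + eps) - f(speed - eps)) / (2 * eps)


# Illustrative stalled-on-hill state: well centred, but not moving.
cte_quality = 0.9
efficiency = 0.0
speed = 0.0

g4 = d_dspeed(lambda s: reward_v4(cte_quality, efficiency, s), speed)
g5 = d_dspeed(lambda s: reward_v5(cte_quality, s), speed)
print(g4, g5)
```

With efficiency at zero, v4 does not move when speed increases, while v5 rises in proportion to CTE-quality.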

Exp 9 — Mountain track, v5 reward, throttle_min=0.2 (RUNNING)

Change from Exp8: throttle_min: 0.5 → 0.2 (only change)
Reward: v5 (speed × CTE-quality) — UNCHANGED
Hypothesis: v5 reward provides non-zero gradient signal on hill (∂reward/∂speed is non-zero). Model CAN learn to output high throttle on hill. With 0.2 floor, model has full range [0.2, 1.0] and can apply lower throttle on corners — potentially solving mini_monaco corner crash.
What we never tested: (0.2, v4) failed. (0.5, v5) worked. (0.2, v5) was never tried.
Risk: Model may still stall on hill if gradient convergence is slow in early training. StuckTermination (-1.0) + v5 speed gradient together should push toward higher throttle.
Next test (Exp10): Add track_progress bonus to reward (v6) — one variable at a time.
Save dir: models/exp9-mountain-v5-throttle02/
Watch: tail -f /tmp/exp9.log
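
A sketch of how a stuck-termination rule like the one mentioned under Risk might work, assuming it counts consecutive low-speed steps. The -1.0 penalty is from the note above and the 80/40 step limits are the values recorded for Exp 11b; `StuckDetector` and `speed_threshold` are hypothetical.

```python
class StuckDetector:
    """Terminate the episode after too many consecutive low-speed steps."""

    def __init__(self, stuck_steps=80, speed_threshold=0.1, penalty=-1.0):
        self.stuck_steps = stuck_steps
        self.speed_threshold = speed_threshold  # hypothetical value
        self.penalty = penalty
        self.slow_count = 0

    def update(self, speed):
        """Return (reward_adjustment, terminate) for the current step."""
        if speed < self.speed_threshold:
            self.slow_count += 1
        else:
            self.slow_count = 0  # any real movement resets the counter
        if self.slow_count >= self.stuck_steps:
            return self.penalty, True
        return 0.0, False


detector = StuckDetector(stuck_steps=40)
outcomes = [detector.update(speed=0.0) for _ in range(40)]
print(outcomes[38], outcomes[39])
```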

Exp 9 — Evaluation Results (3-set test, 1 run per track per set)

Model tested: models/exp9-mountain-v5-throttle02/best_model.zip
Date: 2026-04-18
Test setup: 3 independent sets, lighting randomises each run (no fixed seed)

Track	Set 1	Set 2	Set 3	Mean	Pattern
mountain_track (trained)	✅ 2000	✅ 2000	✅ 2000	2000	Rock solid
generated_track (zero-shot)	❌ 79	❌ 61	❌ 82	74	Always fails — can't make first corner
generated_road (zero-shot)	❌ 651	✅ 2000	❌ 1203	1285	Highly variable — lighting dependent
mini_monaco (zero-shot)	❌ 32	❌ 60	❌ 34	42	Always fails — veers right immediately
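
The Mean column can be reproduced from the per-set step counts in the table. This sketch assumes 2000 steps is the eval cap, so a run that reaches it counts as a pass.

```python
FULL_EPISODE = 2000  # assumed eval cap: reaching it counts as a pass

exp9_steps = {
    "mountain_track": [2000, 2000, 2000],
    "generated_track": [79, 61, 82],
    "generated_road": [651, 2000, 1203],
    "mini_monaco": [32, 60, 34],
}

summary = {
    track: {
        "mean": round(sum(runs) / len(runs)),
        "passes": sum(r >= FULL_EPISODE for r in runs),
    }
    for track, runs in exp9_steps.items()
}

for track, stats in summary.items():
    print(f"{track}: mean={stats['mean']} passes={stats['passes']}/3")
```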

User observations:

mountain_track: 80-90% of time on or near centre yellow line. Solid driving.
generated_road: Driving looks good when it works, but goes off course. Lighting variation causes inconsistency.
generated_track: Cannot make first corner at all. Model sees nothing it recognises.
mini_monaco: Veers right immediately at start before any visible driving. Crashes before reaching the road.

Key finding — Lighting effect confirmed: Generated_road varies 651→2000→1203 with identical model and track. ONLY lighting changes. Mountain_track is immune because it trained under many random lighting conditions. Generated_track and mini_monaco fail regardless of lighting — visual domain too different.

What this tells us about next steps: Train on mountain_track + generated_track together (v5 reward, throttle_min=0.2). Both tracks have random lighting each episode → model forced to learn lighting-invariant features. Goal: model that is reliable on both training tracks, then test generalisation to generated_road and mini_monaco.

Exp 10 — Two tracks: generated_track + mountain_track, v5 reward, throttle_min=0.2

Change from Exp9: Added generated_track as second training track
Reward: v5 (speed × CTE) — unchanged
throttle_min: 0.2 — unchanged from Exp9
Training tracks: generated_track + mountain_track (round-robin, switch every 6,000 steps)
Total steps: 90,000 | Steps per switch: 6,000 | ~7.5 rotations through both tracks
lr: 0.000725 — unchanged
Hypothesis: Adding generated_track visual diversity forces model to learn lighting-invariant road-following features. Mountain_track teaches hill throttle. Together should generalise better to generated_road and potentially mini_monaco.
Expected results: mountain_track reliable, generated_track reliable, generated_road improved, mini_monaco TBD
This is essentially Trial 9 repeated with: v5 reward + throttle_min=0.2 + proper checkpointing + exploit fix
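
The round-robin schedule described above can be sketched as follows. `round_robin_schedule` is a hypothetical helper, and starting on generated_track is an assumption, since the log does not say which track goes first.

```python
from itertools import cycle


def round_robin_schedule(tracks, total_steps, steps_per_switch):
    """Return (start_step, end_step, track) tuples covering total_steps."""
    schedule = []
    track_iter = cycle(tracks)
    start = 0
    while start < total_steps:
        end = min(start + steps_per_switch, total_steps)
        schedule.append((start, end, next(track_iter)))
        start = end
    return schedule


exp10 = round_robin_schedule(
    ["generated_track", "mountain_track"], total_steps=90_000, steps_per_switch=6_000
)
steps_per_track = {}
for start, end, track in exp10:
    steps_per_track[track] = steps_per_track.get(track, 0) + (end - start)

rotations = len(exp10) / 2
print(len(exp10), rotations, steps_per_track)
```

With 15 segments over two tracks the split is uneven: whichever track goes first gets one more segment than the other.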

Exp 10 — Evaluation Results (3-set test, 2026-04-19)

Model tested: models/exp10-two-tracks/best_model.zip
Result: TOTAL FAILURE. Crashes on every track, every set.

Track	Set 1	Set 2	Set 3	Mean	Verdict
mountain_track (trained)	178	179	179	179	❌ Crashes at same spot every time
generated_track (trained)	99	82	88	90	❌ Crashes almost immediately
generated_road (zero-shot)	135	223	105	154	❌ Crashes early
mini_monaco (zero-shot)	111	133	129	124	❌ Crashes early

Comparison to previous best models:

Exp 9 (mountain only): mountain_track was 2000/2000 every time → now 179. 91% regression.
Wave 4 Trial 9 (generated_track + mountain_track via autoresearch): generated_track 2000/2000, mini_monaco 2000/2000 → now 90 and 124.

Analysis:

The round-robin track switching every 6,000 steps via multitrack_runner.train_multitrack() produced a model that learned NEITHER track. This is catastrophic interference.
Wave 4 Trial 9 used the same two tracks, but via the autoresearch controller and with a slightly different switch interval (switch=6,851 vs 6,000; lr=0.000725 and ~90k total steps were the same). The key difference is likely in HOW the environment switching works: multitrack_runner closes and reopens envs, potentially disrupting PPO's rollout buffer and value function estimates.
Mountain_track crashes at exactly step 178-179 in all 3 sets — suggests the model has learned a fixed degenerate policy (always turn one direction) rather than responding to vision.

Key question: Why did Wave 4 Trial 9 succeed with similar parameters while Exp 10 failed? Possible causes: (1) env close/reopen resets PPO internal state, (2) best_model selection criteria differ, (3) the multitrack_runner wrapping chain differs from the autoresearch controller's.

Full log: agent/test-results/2026-04-19_10-15_exp10-two-tracks.log

Exp 9 vs Exp 10 — Root Cause Analysis

Aspect	Exp 9 (worked ✅)	Exp 10 (failed ❌)
Tracks	mountain_track only	generated_track + mountain_track (round-robin)
Env setup	`VecTransposeImage(DummyVecEnv([make_env]))` — created ONCE, never closed	`wrap_env(raw)` passed to PPO, which auto-wraps; closed and reopened every 6k steps
Track switching	None — single env for entire 90k steps	`close_and_switch()` — close env, exit_scene, sleep, gym.make new track
PPO continuity	Repeated `model.learn()` calls with `reset_num_timesteps=False`, same env	`model.learn()` + `model.set_env(new_env)` after each switch
Eval between segments	Direct `env.reset()` + predict loop on same env	Same, but env may be a different track than what was just trained
Best model selection	Based on eval reward on mountain_track	Based on segment reward — could be from either track

Conclusion: Exp 9 kept a single persistent env connection for all 90k steps. Exp 10 closed and reopened the env every 6k steps with model.set_env(). This likely disrupts PPO's rollout buffer, value estimates, and observation normalization. Exp 9 was a completely different (simpler) script with no track switching at all.

Exp 10 vs Wave 4 Trial 9 — Why Did Trial 9 Work?

Wave 4 Trial 9 used nearly identical hyperparameters to Exp 10 and the SAME multitrack_runner.py code — yet Trial 9 scored 1435 on mini_monaco (zero-shot) while Exp 10 crashes on every track at <180 steps.

Wave 4 Trial 9 parameters:

lr=0.000725, steps_per_switch=6,851, total_timesteps=89,893
Trained on generated_track + mountain_track (same as Exp 10)
Used multitrack_runner.py via CLI subprocess (same close_and_switch logic)

Exp 10 parameters:

lr=0.000725, steps_per_switch=6,000, total_timesteps=90,000
Nearly identical to Trial 9

But Wave 4 was mostly failures too:

Metric	Value
Total Wave 4 trials	25
Scores > 500	4 / 25 (16%)
Scores > 200	5 / 25 (20%)
Median score	111.3
Mean score	343.8
Std deviation	566.2

The top 4 scores (1943, 1573, 1543, 1435) are massive outliers — 80% of trials scored below 200. Trial 0 (score 1943) was later found to be pre-exploit-patch and Trial 14 (1573) and Trial 25 (1543) showed inconsistent driving when re-tested (see STATE.md).

The real conclusion: Trial 9's success was likely due to lucky random initialization of CNN weights. With 80% of trials failing under the same training methodology, the multitrack round-robin approach via close_and_switch is fundamentally unreliable. The few successes are random seed lottery winners, not evidence that the method works.

Wave 5 reproduction attempt: We tried training on generated_track only (single track, no switching, same lr=0.000725, 90k steps) to test whether the track-switching was the problem. Result stored in models/wave5-gentrack-only/. (Results were poor — could not reproduce Trial 9's quality.)

Open question: Is there a reliable way to do multi-track training, or should we focus on single-track training with domain randomization (lighting, camera angle) to achieve generalization instead?

Exp 11 — Parallel DummyVecEnv, v5 reward (ABORTED)

Date: 2026-04-19
Change from Exp10: Two sim instances (port 9091 + 9093), DummyVecEnv wraps both. PPO sees both tracks in every rollout batch. No close_and_switch.
Tracks: generated_track (9091) + mountain_track (9093)
Reward: v5 (speed × CTE) — same as Exp 9/10
Result: ABORTED at 66k/90k steps. Circular driving observed on generated_track. v5 reward has no efficiency term → circles at CTE≈0 earn positive reward.
Positive: Parallel env infrastructure works! Both sims connected, PPO trained stably with no env switching issues. Consistent improvement 14.7→67.8 combined.
Negative: Circular driving exploit returned because v5 dropped efficiency.
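
A sketch of the parallel setup's shape, assuming one zero-argument factory per sim instance, each bound to its own port. `make_env_factory` is hypothetical and returns a plain dict instead of a real env so the sketch runs without the simulator. The `gym.make` and `DummyVecEnv` lines are left as comments because the actual wrapper chain is not recorded here.

```python
def make_env_factory(track, port):
    """Return a zero-argument callable, the form a vectorised env wrapper expects."""

    def _init():
        conf = {"host": "127.0.0.1", "port": port}  # host is an assumption
        # In a real run this would build and wrap the env, e.g.:
        #   env = gym.make(<env id for track>, conf=conf)
        #   return env
        return {"track": track, "conf": conf}

    return _init


factories = [
    make_env_factory("generated_track", 9091),
    make_env_factory("mountain_track", 9093),
]
# With stable-baselines3 installed: vec_env = DummyVecEnv(factories)

specs = [factory() for factory in factories]
print([(s["track"], s["conf"]["port"]) for s in specs])
```

Building each factory through a function call, rather than a lambda inside a loop, ensures every env keeps its own track and port.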

Exp 11b — Parallel DummyVecEnv, v6 reward (anti-circle gate)

Date: 2026-04-19
Change from Exp11: Reward v6 (speed × CTE + efficiency gate ≥ 0.15). Also stuck_steps 80→40 (faster stuck termination).
Tracks: generated_track (9091) + mountain_track (9093)
Total steps: 90,000 | lr=0.000725 | throttle_min=0.2
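
A sketch of what an efficiency gate on top of v5 might look like. The log gives only the gate value (0.15); the definition of efficiency used here (net displacement divided by path length over a recent window of positions) is an assumption, and both function names are hypothetical.

```python
import math


def path_efficiency(positions):
    """Net displacement / distance travelled over a window of (x, y) points."""
    path_length = sum(math.dist(a, b) for a, b in zip(positions, positions[1:]))
    if path_length == 0.0:
        return 0.0
    return math.dist(positions[0], positions[-1]) / path_length


def reward_v6(speed, cte_quality, efficiency, efficiency_gate=0.15):
    """v5 reward (speed x CTE-quality), zeroed when efficiency is below the gate."""
    if efficiency < efficiency_gate:
        return 0.0
    return speed * cte_quality


straight = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
loop = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0), (0.0, 0.0)]

r_straight = reward_v6(1.0, 0.9, path_efficiency(straight))
r_loop = reward_v6(1.0, 0.9, path_efficiency(loop))
print(r_straight, r_loop)
```

Under this definition a closed loop has near-zero efficiency, so circling at CTE≈0 earns nothing even at high speed.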

Training progress (eval at each 6k checkpoint):

Steps	gen_track	mountain	Combined	Note
6k	91s	130s	10.7r	Early
18k	100s	100s	15.9r	Improving
36k	161s	160s	26.2r	⭐
42k	160s	159s	28.9r	⭐
60k	164s	163s	—	Plateau
78k	169s	168s	29.2r	⭐
90k	173s	172s	—	End

Evaluation results (best_model, 3 sets per track):

Track	Set 1	Set 2	Set 3	Mean	Verdict
mountain_track (trained)	195	196	192	194	❌
generated_track (trained)	192	194	192	193	❌
generated_road (zero-shot)	192	196	194	194	❌
mini_monaco (zero-shot)	194	192	196	194	❌