donkeycar-rl-autoresearch/docs/TEST_HISTORY.md

Test History — DonkeyCar RL Autoresearch

Last updated: 2026-04-18

This document records every significant training experiment, what was changed, what was observed, and what was learned. Use this to make methodical decisions rather than random changes.


Baseline Models (Phase 1 & 2)

Phase 2 Champion

  • Model: models/champion/model.zip
  • Track trained on: generated_road only
  • Steps: 13,328
  • Hyperparams: lr=0.000225, PPO continuous actions, ThrottleClamp(0.2), v4 reward
  • Result: Drives generated_road perfectly, stays in right lane
  • Zero-shot: Fails on generated_track (confirmed), Fails on mini_monaco
  • Notes: Single track, simple road, model converged cleanly. Final model = best model (no divergence in 13k steps)

Mountain Track Experiments

All experiments: mountain_track only, lr=0.000725, 90k steps unless noted otherwise; throttle_min varies.

Exp 1 — Mountain track, old v4 reward, throttle_min=0.2

  • Reward: v4 (CTE × efficiency × speed)
  • throttle_min: 0.2
  • Key observation: Car gets partway up the hill, slows, stops, and rolls back. It always crashes at about the same step (~153-166). Logged throttle on the hill was 0.200, which is not enough power.
  • Root cause: v4 reward gives zero gradient signal on hill (efficiency→0, speed→0, reward→0 simultaneously, no direction for "apply more throttle")
  • Learned: v4 reward is broken for inclined terrain
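The zero-gradient argument can be illustrated with a minimal sketch. The functional forms are assumptions: this log only describes v4 as CTE × efficiency × speed and the later v5 reward as speed × CTE-quality, so `cte_quality`, `max_cte`, and the scaling below are hypothetical stand-ins rather than the project's actual reward code.

```python
def cte_quality(cte: float, max_cte: float = 2.0) -> float:
    """Hypothetical lane-keeping term: 1.0 on the centre line, 0.0 at max_cte."""
    return max(0.0, 1.0 - abs(cte) / max_cte)


def reward_v4(cte: float, efficiency: float, speed: float) -> float:
    """Assumed v4 shape: CTE term x efficiency x speed."""
    return cte_quality(cte) * efficiency * speed


def reward_v5(cte: float, speed: float) -> float:
    """Assumed v5 shape: speed x CTE-quality, with no efficiency factor."""
    return speed * cte_quality(cte)


def speed_sensitivity(reward_fn, speed: float, eps: float = 1e-3) -> float:
    """Finite-difference estimate of d(reward)/d(speed)."""
    return (reward_fn(speed + eps) - reward_fn(speed)) / eps


# Car stalled on the hill: centred (cte = 0), no progress (efficiency = 0), speed = 0.
v4_slope = speed_sensitivity(lambda s: reward_v4(0.0, 0.0, s), speed=0.0)
v5_slope = speed_sensitivity(lambda s: reward_v5(0.0, s), speed=0.0)
print(f"v4 slope: {v4_slope:.3f}, v5 slope: {v5_slope:.3f}")
```

Under these assumptions, a small speed increase earns nothing under v4 while efficiency is zero, whereas the v5 form still rewards it. That matches the "no direction for more throttle" description above.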

Exp 2 — Mountain track, old v4 reward, throttle_min=0.2, continued to 200k

  • Reward: v4
  • throttle_min: 0.2
  • Key observation: Only 2 behaviors: turn left and hit barrier, or go straight and hit barrier at turn
  • Result: Killed early — no improvement
  • Learned: More steps alone cannot fix a broken reward signal

Exp 3 — Mountain track, old v4 reward, throttle_min=0.5

  • Reward: v4
  • throttle_min: 0.5 (increased to overcome hill)
  • Key observation: Circle exploit dominated entire run — 0.5-1.75 second laps throughout
  • Lap times logged: All short (exploit)
  • Result: Model useless (reward=4.99 after 90k steps)
  • Learned: Higher throttle got car over hill but circle exploit took over because v4 has no efficiency penalty when throttle is high

Exp 4 — Continued from Exp 3 (200k total), old v4 reward, throttle_min=0.5

  • Reward: v4
  • throttle_min: 0.5
  • Key observation: Killed early — same 2 behaviors (left into barrier, straight into barrier)
  • Result: Killed
  • Learned: Continuing bad training does not help

Exp 5 — Mountain track, v5 reward, throttle_min=0.5 (KEY EXPERIMENT)

  • Reward: v5 (speed × CTE-quality) — NEW reward that directly incentivises throttle on hills
  • throttle_min: 0.5
  • Method: Direct model.learn() — NO train_multitrack(), ONE connection throughout
  • Key observation: Genuine 20-22 second laps appearing from step ~30,000 onward
  • Lap times: 19-22 seconds (genuine), consistently for 60k steps
  • Result: Final model poor — best model was at step ~30k but we only saved final (step 90k) model
  • Root cause of failure: No best-model saving. Policy peaked at 30k, diverged by 90k
  • Learned:
    1. v5 reward WORKS for mountain track
    2. throttle_min=0.5 WORKS for hill
    3. Direct model.learn() (no track switching) avoids phantom car issues
    4. MUST save best model during training, not just final
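A minimal sketch of the "save the best model, not just the final one" lesson. Everything here is hypothetical scaffolding: `train_fn`, `eval_fn`, and `save_fn` stand in for the real training, evaluation, and save calls, and the scores are made up to mimic a policy that peaks mid-run and then diverges.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class BestTracker:
    """Remembers the highest evaluation score seen so far and where it occurred."""
    best_score: float = float("-inf")
    best_step: Optional[int] = None

    def update(self, step: int, score: float) -> bool:
        """Record the score; return True if it is a new best."""
        if score > self.best_score:
            self.best_score = score
            self.best_step = step
            return True
        return False


def run_segments(total_steps: int, segment_steps: int,
                 train_fn: Callable[[int], None],
                 eval_fn: Callable[[], float],
                 save_fn: Callable[[str], None]) -> BestTracker:
    """Train in fixed-size segments, saving a numbered checkpoint after each one
    and overwriting best_model.zip whenever the evaluation score improves."""
    tracker = BestTracker()
    for step in range(segment_steps, total_steps + 1, segment_steps):
        train_fn(segment_steps)
        save_fn(f"checkpoint_{step:07d}.zip")
        if tracker.update(step, eval_fn()):
            save_fn("best_model.zip")
    return tracker


# Toy run with made-up scores that peak mid-training and then diverge.
toy_scores = iter([10.0, 40.0, 90.0, 60.0, 20.0])
saved: List[str] = []
tracker = run_segments(
    total_steps=30_000,
    segment_steps=6_000,
    train_fn=lambda n: None,           # stand-in for the real training call
    eval_fn=lambda: next(toy_scores),  # stand-in for a deterministic eval
    save_fn=saved.append,              # stand-in for model.save(path)
)
print(tracker.best_step, saved)
```

In the toy run the best score lands on the third segment, so `best_model.zip` keeps those weights even though two worse segments follow.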

Exp 6 — Mountain track, v5 reward, throttle_min=0.5, train_multitrack (1 segment)

  • Reward: v5
  • throttle_min: 0.5 (first segment only — close_and_switch used 0.2 for subsequent segments)
  • Method: train_multitrack() with steps_per_switch=90000 (one giant segment = one checkpoint)
  • Key observation: Circle exploit dominated — only 0.5-1.75 second laps throughout
  • Result: Only 1 checkpoint saved (at step 90k). Best reward=4.99
  • Root cause: Using steps_per_switch=TOTAL_STEPS defeated checkpointing (one segment = one save). The circle exploit also reappeared, unlike in Exp 5, probably because of random seed variation.
  • Learned: steps_per_switch=TOTAL_STEPS is WRONG for single-track training with checkpointing
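The checkpoint count follows directly from the segment size. The helper below is a hypothetical illustration of that arithmetic, assuming one save at the end of every segment; it is not the actual train_multitrack() code.

```python
from typing import List


def checkpoint_steps(total_steps: int, steps_per_switch: int) -> List[int]:
    """Step counts at which a checkpoint is written (one per completed segment)."""
    return list(range(steps_per_switch, total_steps + 1, steps_per_switch))


one_segment = checkpoint_steps(90_000, 90_000)   # Exp 6 setting
many_segments = checkpoint_steps(90_000, 6_000)  # setting used from Exp 7 onward
print(len(one_segment), len(many_segments))
```

With steps_per_switch equal to the total, the only save happens at the very end, which is exactly the final-model-only situation that Exp 5 showed to be a problem.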

Exp 7 — Mountain track, v5 reward + episode termination on short lap, throttle_min mixed

  • Reward: v5 + short-lap now TERMINATES episode (not just penalty)
  • throttle_min: 0.5 initial, 0.2 after segment 1 (bug: close_and_switch used module default)
  • Method: train_multitrack() with steps_per_switch=6000 (15 segments)
  • Key observation: Car in LEFT lane, sitting doing nothing. Not normal spawn position.
  • Hypothesis: A phantom car left over from Exp 6 was still in the sim. Two TCP connections spawned two cars. The user watched the phantom (left lane, no commands) while training drove a different car.
  • Result: Killed — phantom car issue
  • Learned:
    1. close_and_switch() between segments creates phantom car risk for single-track training
    2. throttle_min MUST be passed consistently — module default is 0.2, not 0.5
    3. For single-track training: do NOT use close_and_switch() at all
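The throttle_min inconsistency is a common default-argument pattern. The sketch below reproduces it with hypothetical names (`make_env_config`, `switch_track_buggy`, `switch_track_fixed`); the real close_and_switch() signature is not shown in this log, so treat this as an illustration of the bug class only.

```python
DEFAULT_THROTTLE_MIN = 0.2  # module-level default, as described for Exp 7


def make_env_config(track: str, throttle_min: float = DEFAULT_THROTTLE_MIN) -> dict:
    """Hypothetical env factory: falls back to the module default when not told otherwise."""
    return {"track": track, "throttle_min": throttle_min}


def switch_track_buggy(next_track: str, throttle_min: float) -> dict:
    # Bug pattern: the caller's throttle_min is accepted but never forwarded,
    # so the new env silently uses the module default.
    return make_env_config(next_track)


def switch_track_fixed(next_track: str, throttle_min: float) -> dict:
    # Fix: forward the value explicitly on every env (re)creation.
    return make_env_config(next_track, throttle_min=throttle_min)


buggy = switch_track_buggy("mountain_track", throttle_min=0.5)
fixed = switch_track_fixed("mountain_track", throttle_min=0.5)
print(buggy["throttle_min"], fixed["throttle_min"])
```

One way to guard against this, if the code allows it, is to remove the default entirely so that every call site must state its throttle_min.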

Exp 8 — Mountain track, v5 reward + episode termination, throttle_min=0.5 consistently (RUNNING NOW)

  • Reward: v5 + short-lap terminates episode
  • throttle_min: 0.5 throughout (no close_and_switch = no module default override)
  • Method: Direct model.learn() in loop — ONE connection throughout entire run
  • Checkpoints: 15 numbered saves (every 6,000 steps) + best_model.zip
  • PID: 2941877, log: /tmp/exp8.log
  • Status: Running since 11:17, ~1h45m total
  • Watch: tail -f /tmp/exp8.log
  • Success criteria: Genuine 19-22 second laps appearing during training AND best_model.zip drives cleanly in deterministic eval

Wave 4 Multi-Track Experiments (generated_track + mountain_track)

Trial 9 (BEST OVERALL MODEL)

  • Model: models/wave4-trial-0009/model.zip
  • Tracks: generated_track + mountain_track (round-robin, switch every 6,851 steps)
  • Steps: 89,893 total (~45k per track)
  • Hyperparams: lr=0.000725, switch=6,851
  • Reward: v4 (old — before exploit patches)
  • Result:
    • Drives generated_track (3/3 episodes, 13-16 second genuine laps)
    • Drives mini_monaco zero-shot (2000 steps, 40-second genuine laps — never seen in training)
    • Crashes on mountain_track (~200 steps — hill + corner)
    • Crashes on generated_road (~46 steps — turns right immediately)
  • Notes: Only 1 of 25 Wave 4 trials succeeded. Suspected random seed luck. Same hyperparameters repeated in Exp2 (overnight) produced useless model.

Wave 4 Other Trials (1-25 except Trial 9)

  • Result: All crashed on mini_monaco within 20-265 steps
  • Median mini_monaco score: ~112 (crashes at ~130 steps)
  • Trials 14, 25: Scored 1573, 1543 — suspected shuttle exploit (car going back and forth on straight)
  • Learned: Multi-track training is highly sensitive to random seed. GP+UCB did not converge reliably.

Key Decisions Made (What We Keep)

| Decision | Reason |
| --- | --- |
| v5 reward: speed × CTE-quality | Directly incentivises throttle on hills. v4 gave zero gradient on inclines. |
| throttle_min=0.5 for mountain_track | Overcomes hill. Car can now reach first corner. |
| Short-lap penalty + EPISODE TERMINATION | Penalty alone insufficient: model stayed alive and accumulated rewards between laps. Termination makes circling strictly unprofitable. |
| Numbered checkpoints every segment | Never lose a good mid-training model again (ADR-017) |
| best_model.zip updated on new best segment score | Final model ≠ best model. Peak can be at 30k even if final is at 90k. |
| Single TCP connection for single-track training | Avoids phantom car problem from close_and_switch() |
| lr=0.000725 | From Trial 9 (best model). Consistent with good results. |
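Why a short-lap penalty alone was not enough can be sketched with toy numbers. The per-step reward, lap length, and penalty below are made up for illustration and are not values from the project; only the 2000-step episode length matches the full-eval length used in this log.

```python
def circling_return(step_reward: float, lap_steps: int, lap_penalty: float,
                    max_steps: int, terminate_on_short_lap: bool) -> float:
    """Episode return for an agent that does nothing but drive tight circles."""
    total = 0.0
    for step in range(1, max_steps + 1):
        total += step_reward
        if step % lap_steps == 0:       # a short (exploit) lap just completed
            total -= lap_penalty
            if terminate_on_short_lap:  # episode ends at the first short lap
                break
    return total


penalty_only = circling_return(0.5, 20, 5.0, 2000, terminate_on_short_lap=False)
with_termination = circling_return(0.5, 20, 5.0, 2000, terminate_on_short_lap=True)
honest_driving = 0.5 * 2000  # same per-step reward, no short laps, full episode
print(penalty_only, with_termination, honest_driving)
```

In this toy setting the reward collected between laps outweighs the penalty, so circling still pays. Ending the episode at the first short lap caps the circling return far below what a full episode of genuine driving collects.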

Key Problems Still Open

| Problem | Status |
| --- | --- |
| Mountain track circle exploit | Partially fixed: episode termination added. Exp8 will show if it holds. |
| Mountain track: car can't navigate first corner reliably | Still being investigated. Exp5 showed genuine laps so it IS solvable. |
| Multi-track generalization is random-seed dependent | No reliable solution yet. Trial 9 was lucky. |
| Mountain track model doesn't generalise to other tracks | Expected: single track training generalises poorly. Next step after Exp8 succeeds. |

Next Steps (Proposed, Not Yet Run)

  1. Exp 8 result: If best_model.zip drives mountain_track reliably → proceed to Step 2
  2. Combine mountain_track + generated_track using v5 reward, throttle_min=0.5, proper checkpointing
  3. Test combined model on all 4 tracks — can it generalise to mini_monaco like Trial 9 did?
  4. If yes: We have reproduced Trial 9 reliably with a better reward function

Exp 8 — Mountain track, v5 reward, throttle_min=0.5, CORRECT checkpointing (COMPLETED)

  • Reward: v5 (speed × CTE-quality)
  • throttle_min: 0.5
  • Method: Direct model.learn() loop, single TCP connection, NO close_and_switch
  • Steps: 90,000 total | 6,000 per segment | 15 checkpoints
  • Circle exploit fix: Short-lap terminates episode immediately
  • Peak segment: Seg 10 (step 60,000) — 567 reward / 2000 steps (FULL EVAL on mountain_track!)
  • Policy diverged: Seg 11-15 (31, 20 reward) — best_model.zip captured the peak correctly
  • Checkpoints saved: checkpoint_0006000.zip through checkpoint_0090000.zip + best_model.zip
  • Final eval results using best_model.zip (step 60k weights):
| Track | Ep1 | Ep2 | Ep3 | Mean steps | Result |
| --- | --- | --- | --- | --- | --- |
| mountain_track (training) | 382 | 529 | 182 | 364 | crashes |
| generated_track (zero-shot) | 63 | 61 | 61 | 62 | crashes |
| mini_monaco (zero-shot) | 154 | 155 | 104 | 138 | crashes at one corner |
| generated_road (zero-shot) | 41 | 42 | 41 | 41 | crashes |
  • Throttle test: mini_monaco at throttle_min=0.5 over 5 episodes: 93/94/79/95/94 steps (mean=91, very consistent = same corner every time). throttle_min=0.2 test impossible — action space baked in at training time.
  • Key findings:
    1. Circle exploit fully eliminated — no short laps observed
    2. Best model saving worked — captured step 60k peak, not step 90k drift
    3. Genuine 20-22 second laps during training from step ~18k onward
    4. Model crashes at exactly the same corner on mini_monaco every time (too fast)
    5. throttle_min=0.5 baked into action space — model cannot output throttle < 0.5, cannot slow for corners
    6. 🔑 INSIGHT: v4 + 0.2 failed because v4 gradient = 0 on hill. v5 gradient is non-zero — model CAN learn to apply high throttle when needed even with 0.2 floor
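Finding 5 can be illustrated with a small sketch of a throttle floor. This assumes the floor is a simple clamp applied to the policy's raw throttle output; the actual ThrottleClamp wrapper may instead rescale the action range, so `clamp_throttle` is a hypothetical stand-in.

```python
def clamp_throttle(raw_throttle: float, throttle_min: float,
                   throttle_max: float = 1.0) -> float:
    """Force the throttle into [throttle_min, throttle_max]."""
    return min(throttle_max, max(throttle_min, raw_throttle))


# The policy "asks" for a gentle 0.1 throttle to slow down before a corner.
requested = 0.1
at_floor_05 = clamp_throttle(requested, throttle_min=0.5)
at_floor_02 = clamp_throttle(requested, throttle_min=0.2)

# A high throttle request for the hill passes through unchanged either way.
hill_05 = clamp_throttle(0.9, throttle_min=0.5)
hill_02 = clamp_throttle(0.9, throttle_min=0.2)
print(at_floor_05, at_floor_02, hill_05, hill_02)
```

Under this assumption a 0.2 floor does not prevent high throttle on the hill; it only widens the range available for slowing down, which is the motivation for the (0.2, v5) combination.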

Exp 9 — Mountain track, v5 reward, throttle_min=0.2 (RUNNING)

  • Change from Exp8: throttle_min: 0.5 → 0.2 (only change)
  • Reward: v5 (speed × CTE-quality) — UNCHANGED
  • Hypothesis: v5 reward provides non-zero gradient signal on hill (∂reward/∂speed is non-zero). Model CAN learn to output high throttle on hill. With 0.2 floor, model has full range [0.2, 1.0] and can apply lower throttle on corners — potentially solving mini_monaco corner crash.
  • What we never tested: (0.2, v4) failed. (0.5, v5) worked. (0.2, v5) was never tried.
  • Risk: Model may still stall on hill if gradient convergence is slow in early training. StuckTermination (-1.0) + v5 speed gradient together should push toward higher throttle.
  • Next test (Exp10): Add track_progress bonus to reward (v6) — one variable at a time.
  • Save dir: models/exp9-mountain-v5-throttle02/
  • Watch: tail -f /tmp/exp9.log

Exp 9 — Evaluation Results (3-set test, 1 run per track per set)

  • Model tested: models/exp9-mountain-v5-throttle02/best_model.zip
  • Date: 2026-04-18
  • Test setup: 3 independent sets, lighting randomises each run (no fixed seed)

| Track | Set 1 | Set 2 | Set 3 | Mean | Pattern |
| --- | --- | --- | --- | --- | --- |
| mountain_track (trained) | 2000 | 2000 | 2000 | 2000 | Rock solid |
| generated_track (zero-shot) | 79 | 61 | 82 | 74 | Always fails: can't make first corner |
| generated_road (zero-shot) | 651 | 2000 | 1203 | 1285 | Highly variable: lighting dependent |
| mini_monaco (zero-shot) | 32 | 60 | 34 | 42 | Always fails: veers right immediately |

User observations:

  • mountain_track: 80-90% of time on or near centre yellow line. Solid driving.
  • generated_road: Driving looks good when it works, but goes off course. Lighting variation causes inconsistency.
  • generated_track: Cannot make first corner at all. Model sees nothing it recognises.
  • mini_monaco: Veers right immediately at start before any visible driving. Crashes before reaching the road.

Key finding — Lighting effect confirmed: Generated_road varies 651→2000→1203 with identical model and track. ONLY lighting changes. Mountain_track is immune because it trained under many random lighting conditions. Generated_track and mini_monaco fail regardless of lighting — visual domain too different.
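The per-track means and the run-to-run spread can be recomputed from the three evaluation sets above. This is a small sketch using only the step counts from the Exp 9 table; `summarise` is a hypothetical helper name.

```python
from statistics import mean

# Steps survived per evaluation set, copied from the Exp 9 results table.
exp9_steps = {
    "mountain_track": [2000, 2000, 2000],
    "generated_track": [79, 61, 82],
    "generated_road": [651, 2000, 1203],
    "mini_monaco": [32, 60, 34],
}


def summarise(runs):
    """Rounded mean plus the max-min spread as a crude variability measure."""
    return {"mean": round(mean(runs)), "spread": max(runs) - min(runs)}


summary = {track: summarise(runs) for track, runs in exp9_steps.items()}
for track, stats in summary.items():
    print(f"{track:16s} mean={stats['mean']:5d} spread={stats['spread']:5d}")
```

The spread column makes the pattern explicit: generated_road is the only track whose outcome swings widely between sets, while the other three are consistent in either success or failure.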

What this tells us about next steps: Train on mountain_track + generated_track together (v5 reward, throttle_min=0.2). Both tracks have random lighting each episode → model forced to learn lighting-invariant features. Goal: model that is reliable on both training tracks, then test generalisation to generated_road and mini_monaco.

Exp 10 — Two tracks: generated_track + mountain_track, v5 reward, throttle_min=0.2

  • Change from Exp9: Added generated_track as second training track
  • Reward: v5 (speed × CTE) — unchanged
  • throttle_min: 0.2 — unchanged from Exp9
  • Training tracks: generated_track + mountain_track (round-robin, switch every 6,000 steps)
  • Total steps: 90,000 | Steps per switch: 6,000 | ~7.5 rotations through both tracks
  • lr: 0.000725 — unchanged
  • Hypothesis: Adding generated_track visual diversity forces model to learn lighting-invariant road-following features. Mountain_track teaches hill throttle. Together should generalise better to generated_road and potentially mini_monaco.
  • Expected results: mountain_track reliable, generated_track reliable, generated_road improved, mini_monaco TBD
  • This is essentially Trial 9 repeated with: v5 reward + throttle_min=0.2 + proper checkpointing + exploit fix
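The round-robin budget can be laid out explicitly. The sketch below is an assumption about how segments alternate (strict alternation, generated_track first); `round_robin_schedule` is a hypothetical helper, not the multitrack_runner implementation.

```python
from collections import Counter
from typing import List, Sequence, Tuple


def round_robin_schedule(tracks: Sequence[str], total_steps: int,
                         steps_per_switch: int) -> List[Tuple[str, int]]:
    """List of (track, steps) segments, cycling through tracks in order."""
    schedule: List[Tuple[str, int]] = []
    done = 0
    while done < total_steps:
        segment = min(steps_per_switch, total_steps - done)
        schedule.append((tracks[len(schedule) % len(tracks)], segment))
        done += segment
    return schedule


schedule = round_robin_schedule(["generated_track", "mountain_track"], 90_000, 6_000)
steps_per_track: Counter = Counter()
for track, steps in schedule:
    steps_per_track[track] += steps
print(len(schedule), dict(steps_per_track))
```

With 15 segments and two tracks, the split cannot be even: under this ordering the first track receives one more segment than the second, which is what "~7.5 rotations" implies.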

Exp 10 — Evaluation Results (3-set test, 2026-04-19)

  • Model tested: models/exp10-two-tracks/best_model.zip
  • Result: TOTAL FAILURE. Crashes on every track, every set.

| Track | Set 1 | Set 2 | Set 3 | Mean | Verdict |
| --- | --- | --- | --- | --- | --- |
| mountain_track (trained) | 178 | 179 | 179 | 179 | Crashes at same spot every time |
| generated_track (trained) | 99 | 82 | 88 | 90 | Crashes almost immediately |
| generated_road (zero-shot) | 135 | 223 | 105 | 154 | Crashes early |
| mini_monaco (zero-shot) | 111 | 133 | 129 | 124 | Crashes early |

Comparison to previous best models:

  • Exp 9 (mountain only): mountain_track was 2000/2000 every time → now 179. 91% regression.
  • Wave 4 Trial 9 (generated_track + mountain_track via autoresearch): generated_track 2000/2000, mini_monaco 2000/2000 → now 90 and 124.

Analysis:

  • The round-robin track switching every 6,000 steps via multitrack_runner.train_multitrack() produced a model that learned NEITHER track. This is catastrophic interference.
  • Wave 4 Trial 9 used the same two tracks, run via the autoresearch controller, with only slightly different hyperparameters (switch=6,851, lr=0.000725, 90k steps). The key difference is likely in HOW the environment switching works: multitrack_runner closes and reopens envs, potentially disrupting PPO's rollout buffer and value function estimates.
  • Mountain_track crashes at exactly step 178-179 in all 3 sets — suggests the model has learned a fixed degenerate policy (always turn one direction) rather than responding to vision.

Key question: Why did Wave 4 Trial 9 succeed with similar parameters but Exp 10 failed? Possible causes: (1) env close/reopen resets PPO internal state, (2) best_model selection criteria differs, (3) multitrack_runner wrapping chain differs from autoresearch controller.

Full log: agent/test-results/2026-04-19_10-15_exp10-two-tracks.log

Exp 9 vs Exp 10 — Root Cause Analysis

| Aspect | Exp 9 (worked) | Exp 10 (failed) |
| --- | --- | --- |
| Tracks | mountain_track only | generated_track + mountain_track (round-robin) |
| Env setup | VecTransposeImage(DummyVecEnv([make_env])), created ONCE, never closed | wrap_env(raw) passed to PPO, which auto-wraps; closed and reopened every 6k steps |
| Track switching | None: single env for entire 90k steps | close_and_switch(): close env, exit_scene, sleep, gym.make new track |
| PPO continuity | model.learn() calls with reset_num_timesteps=False, same env | model.learn() + model.set_env(new_env) after each switch |
| Eval between segments | Direct env.reset() + predict loop on same env | Same, but env may be a different track than what was just trained |
| Best model selection | Based on eval reward on mountain_track | Based on segment reward, which could be from either track |

Conclusion: Exp 9 kept a single persistent env connection for all 90k steps. Exp 10 closed and reopened the env every 6k steps with model.set_env(). This likely disrupts PPO's rollout buffer, value estimates, and observation normalization. Exp 9 was a completely different (simpler) script with no track switching at all.

Exp 10 vs Wave 4 Trial 9 — Why Did Trial 9 Work?

Wave 4 Trial 9 used nearly identical hyperparameters to Exp 10 and the SAME multitrack_runner.py code — yet Trial 9 scored 1435 on mini_monaco (zero-shot) while Exp 10 crashes on every track at <180 steps.

Wave 4 Trial 9 parameters:

  • lr=0.000725, steps_per_switch=6,851, total_timesteps=89,893
  • Trained on generated_track + mountain_track (same as Exp 10)
  • Used multitrack_runner.py via CLI subprocess (same close_and_switch logic)

Exp 10 parameters:

  • lr=0.000725, steps_per_switch=6,000, total_timesteps=90,000
  • Nearly identical to Trial 9

But Wave 4 was mostly failures too:

| Metric | Value |
| --- | --- |
| Total Wave 4 trials | 25 |
| Scores > 500 | 4 / 25 (16%) |
| Scores > 200 | 5 / 25 (20%) |
| Median score | 111.3 |
| Mean score | 343.8 |
| Std deviation | 566.2 |

The top 4 scores (1943, 1573, 1543, 1435) are massive outliers — 80% of trials scored below 200. Trial 0 (score 1943) was later found to be pre-exploit-patch and Trial 14 (1573) and Trial 25 (1543) showed inconsistent driving when re-tested (see STATE.md).

The real conclusion: Trial 9's success was likely due to lucky random initialization of CNN weights. With 80% of trials failing under the same training methodology, the multitrack round-robin approach via close_and_switch is fundamentally unreliable. The few successes are random seed lottery winners, not evidence that the method works.
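The "seed lottery" reading can be made concrete with a back-of-envelope calculation. It assumes each training run is an independent draw with the Wave 4 rate of scores above 500 (4 of 25) as its success probability. Both the independence and the choice of 500 as the success threshold are assumptions made here for illustration.

```python
import math


def p_at_least_one_success(p_single: float, n_runs: int) -> float:
    """Chance that at least one of n independent runs succeeds."""
    return 1.0 - (1.0 - p_single) ** n_runs


def runs_needed(p_single: float, target: float) -> int:
    """Smallest number of independent runs whose combined success chance reaches target."""
    return math.ceil(math.log(1.0 - target) / math.log(1.0 - p_single))


p_single = 4 / 25  # Wave 4: 4 of 25 trials scored above 500
for n in (1, 5, 10):
    print(n, round(p_at_least_one_success(p_single, n), 3))
print("runs for a 90% chance:", runs_needed(p_single, 0.90))
```

Under these assumptions a single run such as Exp 10 fails far more often than it succeeds, so one failure says little about the method on its own. It takes many seeds before a success becomes likely.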

Wave 5 reproduction attempt: To test whether track switching was the problem, we trained on generated_track only (single track, no switching, same lr=0.000725, 90k steps). Results were poor and did not reproduce Trial 9's quality. Results are stored in models/wave5-gentrack-only/.

Open question: Is there a reliable way to do multi-track training, or should we focus on single-track training with domain randomization (lighting, camera angle) to achieve generalization instead?

Exp 11 — Parallel DummyVecEnv, v5 reward (ABORTED)

  • Date: 2026-04-19
  • Change from Exp10: Two sim instances (port 9091 + 9093), DummyVecEnv wraps both. PPO sees both tracks in every rollout batch. No close_and_switch.
  • Tracks: generated_track (9091) + mountain_track (9093)
  • Reward: v5 (speed × CTE) — same as Exp 9/10
  • Result: ABORTED at 66k/90k steps. Circular driving observed on generated_track. v5 reward has no efficiency term → circles at CTE≈0 earn positive reward.
  • Positive: Parallel env infrastructure works! Both sims connected, PPO trained stably with no env switching issues. Consistent improvement 14.7→67.8 combined.
  • Negative: Circular driving exploit returned because v5 dropped efficiency.

Exp 11b — Parallel DummyVecEnv, v6 reward (anti-circle gate)

  • Date: 2026-04-19
  • Change from Exp11: Reward v6 (speed × CTE + efficiency gate ≥ 0.15). Also stuck_steps 80→40 (faster stuck termination).
  • Tracks: generated_track (9091) + mountain_track (9093)
  • Total steps: 90,000 | lr=0.000725 | throttle_min=0.2
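A minimal sketch of how an efficiency gate can block circling. The definition of efficiency used here (net displacement divided by distance travelled over a window of positions) and the hard zero below the gate are assumptions; this log only states that v6 is speed × CTE with an efficiency gate at 0.15.

```python
import math


def path_efficiency(points):
    """Net displacement divided by distance travelled over a window of (x, y) points."""
    travelled = sum(math.dist(a, b) for a, b in zip(points, points[1:]))
    if travelled == 0.0:
        return 0.0
    return math.dist(points[0], points[-1]) / travelled


def reward_v6(speed: float, cte_q: float, efficiency: float, gate: float = 0.15) -> float:
    """Assumed v6 shape: the v5 product, zeroed while efficiency is below the gate."""
    return speed * cte_q if efficiency >= gate else 0.0


def arc(fraction: float, n: int = 40):
    """Points along a unit circle covering the given fraction of a full turn."""
    return [(math.cos(2 * math.pi * fraction * i / n),
             math.sin(2 * math.pi * fraction * i / n)) for i in range(n + 1)]


straight = [(0.1 * i, 0.0) for i in range(41)]
full_circle_eff = path_efficiency(arc(1.0))
quarter_turn_eff = path_efficiency(arc(0.25))
straight_eff = path_efficiency(straight)
print(full_circle_eff < 0.15, quarter_turn_eff > 0.15, straight_eff > 0.15)
```

Under this particular definition a full circle ends where it started and falls below the gate, while a quarter-turn corner stays well above it. Whether the real gate behaves the same way during cornering depends on the window length and on how efficiency is actually computed.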

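Training progress (eval at each 6k checkpoint; s = steps survived, r = combined reward; blank cells were not recorded):

| Steps | gen_track | mountain | Combined | Note |
| --- | --- | --- | --- | --- |
| 6k | 91s | 130s | 10.7r | Early |
| 18k | 100s | 100s | 15.9r | Improving |
| 36k | 161s | 160s | 26.2r | |
| 42k | 160s | 159s | 28.9r | |
| 60k | 164s | 163s | | Plateau |
| 78k | 169s | 168s | 29.2r | |
| 90k | 173s | 172s | | End |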

Evaluation results (best_model, 3 sets per track):

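| Track | Set 1 | Set 2 | Set 3 | Mean | Verdict |
| --- | --- | --- | --- | --- | --- |
| mountain_track (trained) | 195 | 196 | 192 | 194 | |
| generated_track (trained) | 192 | 194 | 192 | 193 | |
| generated_road (zero-shot) | 192 | 196 | 194 | 194 | |
| mini_monaco (zero-shot) | 194 | 192 | 196 | 194 | |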

Analysis:

  • No circular driving (efficiency gate works)
  • Remarkably consistent: all tracks ~194 steps, very low variance
  • Parallel env infrastructure is stable and reliable
  • Model plateaus at ~170-195 steps and never improves past that
  • Much worse than Exp 9 (mountain only: 2000/2000) or Wave 4 Trial 9 (2000/2000)
  • The consistency across all 4 tracks (including zero-shot) suggests the model learned a generic short-drive policy, not track-specific features
  • Possible cause: 90k steps may be insufficient for 2-env parallel training (effective steps per track = 45k each), or the efficiency gate may be suppressing early exploration

Key findings:

  1. Parallel DummyVecEnv works mechanically — this is the right infrastructure
  2. v6 reward prevents circular driving
  3. But 90k steps with 2 parallel envs may not be enough training budget
  4. Compare: Exp 9 (single track, 90k steps, v5) → 2000 steps. Exp 11b (2 tracks, 90k steps, v6) → 194 steps. The training budget per track is halved AND the reward is harder to exploit.
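The budget arithmetic behind findings 3 and 4 is short enough to write down. The sketch assumes that the total step budget is shared evenly across the parallel envs, which is how the "45k each" figure above is obtained; the helper names are hypothetical.

```python
def per_env_steps(total_timesteps: int, n_envs: int) -> int:
    """Steps each env contributes when the total budget is split evenly."""
    return total_timesteps // n_envs


def total_for_per_env_budget(per_env_budget: int, n_envs: int) -> int:
    """Total budget needed so that every env gets per_env_budget steps."""
    return per_env_budget * n_envs


exp11b_per_track = per_env_steps(90_000, n_envs=2)
restore_exp9_budget = total_for_per_env_budget(90_000, n_envs=2)
print(exp11b_per_track, restore_exp9_budget)
```

Matching the per-track exposure of Exp 9 with two parallel envs therefore needs about double the total budget, before accounting for v6 being harder to exploit.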

Next experiments to consider:

  • Increase total_timesteps to 180k-250k (restore per-track budget)
  • Try v6 reward on single track first to isolate reward vs multi-track effects
  • Try v5 reward with parallel envs but longer training (accept some circling)
  • Check if efficiency gate triggers too aggressively during normal cornering