Test History — DonkeyCar RL Autoresearch

Last updated: 2026-04-18

This document records every significant training experiment, what was changed, what was observed, and what was learned. Use this to make methodical decisions rather than random changes.


Baseline Models (Phase 1 & 2)

Phase 2 Champion

  • Model: models/champion/model.zip
  • Track trained on: generated_road only
  • Steps: 13,328
  • Hyperparams: lr=0.000225, PPO continuous actions, ThrottleClamp(0.2), v4 reward
  • Result: Drives generated_road perfectly, stays in right lane
  • Zero-shot: Fails on generated_track (confirmed) and fails on mini_monaco
  • Notes: Single track, simple road, model converged cleanly. Final model = best model (no divergence in 13k steps)

Mountain Track Experiments

All experiments: mountain_track only, lr=0.000725, 90k steps unless noted. throttle_min varies by experiment.

Exp 1 — Mountain track, old v4 reward, throttle_min=0.2

  • Reward: v4 (CTE × efficiency × speed)
  • throttle_min: 0.2
  • Key observation: Car gets partway up the hill, slows, stops, and rolls back. It always crashes at about the same step (~153-166). Logged throttle at the hill was 0.200, which is not enough power to climb.
  • Root cause: v4 reward gives zero gradient signal on the hill: efficiency→0, speed→0, and reward→0 simultaneously, so nothing tells the policy to apply more throttle
  • Learned: v4 reward is broken for inclined terrain

Exp 2 — Mountain track, old v4 reward, throttle_min=0.2, continued to 200k

  • Reward: v4
  • throttle_min: 0.2
  • Key observation: Only 2 behaviors: turn left and hit barrier, or go straight and hit barrier at turn
  • Result: Killed early — no improvement
  • Learned: More steps alone cannot fix a broken reward signal

Exp 3 — Mountain track, old v4 reward, throttle_min=0.5

  • Reward: v4
  • throttle_min: 0.5 (increased to overcome hill)
  • Key observation: Circle exploit dominated entire run — 0.5-1.75 second laps throughout
  • Lap times logged: All short (exploit)
  • Result: Model useless (reward=4.99 after 90k steps)
  • Learned: Higher throttle got car over hill but circle exploit took over because v4 has no efficiency penalty when throttle is high

Exp 4 — Continued from Exp 3 (200k total), old v4 reward, throttle_min=0.5

  • Reward: v4
  • throttle_min: 0.5
  • Key observation: Killed early — same 2 behaviors (left into barrier, straight into barrier)
  • Result: Killed
  • Learned: Continuing bad training does not help

Exp 5 — Mountain track, v5 reward, throttle_min=0.5 KEY EXPERIMENT

  • Reward: v5 (speed × CTE-quality) — NEW reward that directly incentivises throttle on hills
  • throttle_min: 0.5
  • Method: Direct model.learn() — NO train_multitrack(), ONE connection throughout
  • Key observation: Genuine 20-22 second laps appearing from step ~30,000 onward
  • Lap times: 19-22 seconds (genuine), consistently for 60k steps
  • Result: Final model poor. The best policy was at step ~30k, but only the final (step 90k) model was saved
  • Root cause of failure: No best-model saving. Policy peaked at 30k and had diverged by 90k
  • Learned:
    1. v5 reward WORKS for mountain track
    2. throttle_min=0.5 WORKS for hill
    3. Direct model.learn() (no track switching) avoids phantom car issues
    4. MUST save best model during training, not just final
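
The difference between the two rewards can be illustrated with a small numerical sketch. The log only names the factors (v4: CTE × efficiency × speed, v5: speed × CTE-quality); the exact functional forms, the [0, 1] ranges, and the example values below are assumptions made for illustration, not the project's implementation. The point is that in a pure product, the sensitivity to one factor is the product of the others, so when efficiency is near zero a small gain in speed earns nothing.

```python
def reward_v4(cte_term: float, efficiency: float, speed: float) -> float:
    """Assumed v4 shape: product of three factors, each taken to lie in [0, 1]."""
    return cte_term * efficiency * speed


def reward_v5(speed: float, cte_quality: float) -> float:
    """Assumed v5 shape: speed times a CTE-quality factor in [0, 1]."""
    return speed * cte_quality


def sensitivity_to_speed(reward_fn, speed: float, eps: float = 1e-3, **fixed) -> float:
    """Finite-difference estimate of d(reward)/d(speed) with other factors held fixed."""
    higher = reward_fn(speed=speed + eps, **fixed)
    base = reward_fn(speed=speed, **fixed)
    return (higher - base) / eps


# Illustrative state: stalled on the hill but still centred in the lane.
v4_slope = sensitivity_to_speed(reward_v4, speed=0.0, cte_term=1.0, efficiency=0.0)
v5_slope = sensitivity_to_speed(reward_v5, speed=0.0, cte_quality=1.0)
print(f"v4 slope: {v4_slope}, v5 slope: {v5_slope}")
```

Under these assumptions the v4 slope is zero while the v5 slope equals the CTE-quality factor, which matches the observation that v5 rewards applying throttle on the hill.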

Exp 6 — Mountain track, v5 reward, throttle_min=0.5, train_multitrack (1 segment)

  • Reward: v5
  • throttle_min: 0.5 (first segment only — close_and_switch used 0.2 for subsequent segments)
  • Method: train_multitrack() with steps_per_switch=90000 (one giant segment = one checkpoint)
  • Key observation: Circle exploit dominated — only 0.5-1.75 second laps throughout
  • Result: Only 1 checkpoint saved (at step 90k). Best reward=4.99
  • Root cause: Setting steps_per_switch=TOTAL_STEPS defeated checkpointing (one segment = one save). The circle exploit also reappeared, unlike in Exp 5; suspected random seed variation
  • Learned: steps_per_switch=TOTAL_STEPS is WRONG for single-track training with checkpointing
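
The checkpoint arithmetic behind this lesson can be sketched as follows, assuming exactly one save at the end of each segment (as the log describes). The function name `checkpoint_steps` is hypothetical and is not part of the project code.

```python
def checkpoint_steps(total_steps: int, steps_per_switch: int) -> list[int]:
    """Return the step counts at which a segment ends (one save per segment)."""
    if steps_per_switch <= 0:
        raise ValueError("steps_per_switch must be positive")
    steps = list(range(steps_per_switch, total_steps + 1, steps_per_switch))
    if not steps or steps[-1] != total_steps:
        steps.append(total_steps)  # a shorter final segment still ends the run
    return steps


print(len(checkpoint_steps(90_000, 90_000)))  # Exp 6 setting: a single save
print(len(checkpoint_steps(90_000, 6_000)))   # Exp 7 setting: one save per 6,000 steps
```

With steps_per_switch equal to the total, the only save lands at the final step, so a mid-run peak cannot be recovered.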

Exp 7 — Mountain track, v5 reward + episode termination on short lap, throttle_min mixed

  • Reward: v5 + short-lap now TERMINATES episode (not just penalty)
  • throttle_min: 0.5 initial, 0.2 after segment 1 (bug: close_and_switch used module default)
  • Method: train_multitrack() with steps_per_switch=6000 (15 segments)
  • Key observation: Car sat idle in the LEFT lane, which is not the normal spawn position.
  • Hypothesis: A phantom car left over from Exp 6 was still in the sim. Two TCP connections spawned two cars. The user watched the phantom (left lane, receiving no commands) while training drove a different car.
  • Result: Killed — phantom car issue
  • Learned:
    1. close_and_switch() between segments creates phantom car risk for single-track training
    2. throttle_min MUST be passed consistently — module default is 0.2, not 0.5
    3. For single-track training: do NOT use close_and_switch() at all
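
The throttle_min bug is an instance of a silent default argument. A minimal sketch of the pattern and one way to guard against it is shown below. All names here (`EnvConfig`, `make_config_buggy`, `make_config_strict`, `clamp_throttle`) are hypothetical and do not reflect the project's actual functions; only the 0.2 module default and the 0.5 target come from the log, and the clamp behaviour is an assumption about what ThrottleClamp does.

```python
from dataclasses import dataclass

DEFAULT_THROTTLE_MIN = 0.2  # module-level default, as described in the log


@dataclass(frozen=True)
class EnvConfig:
    track: str
    throttle_min: float


def make_config_buggy(track: str, throttle_min: float = DEFAULT_THROTTLE_MIN) -> EnvConfig:
    """Silently falls back to 0.2 when a caller forgets to pass throttle_min."""
    return EnvConfig(track, throttle_min)


def make_config_strict(track: str, *, throttle_min: float) -> EnvConfig:
    """Keyword-only and without a default, so a forgotten value raises TypeError."""
    return EnvConfig(track, throttle_min)


def clamp_throttle(throttle: float, throttle_min: float) -> float:
    """Raise throttle to at least throttle_min (assumed ThrottleClamp behaviour)."""
    return max(throttle, throttle_min)


print(make_config_buggy("mountain_track"))                     # throttle_min silently 0.2
print(make_config_strict("mountain_track", throttle_min=0.5))  # value must be explicit
```

Removing the default turns a silent mismatch between segments into an immediate error at the call site.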

Exp 8 — Mountain track, v5 reward + episode termination, throttle_min=0.5 consistently (RUNNING NOW)

  • Reward: v5 + short-lap terminates episode
  • throttle_min: 0.5 throughout (no close_and_switch = no module default override)
  • Method: Direct model.learn() in loop — ONE connection throughout entire run
  • Checkpoints: 15 numbered saves (every 6,000 steps) + best_model.zip
  • PID: 2941877, log: /tmp/exp8.log
  • Status: Running since 11:17, ~1h45m total
  • Watch: tail -f /tmp/exp8.log
  • Success criteria: Genuine 19-22 second laps appearing during training AND best_model.zip drives cleanly in deterministic eval
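
The Exp 8 method (segmented learning on one connection, numbered saves, plus a best-model save) could look roughly like the sketch below. It is not the project's training script: the function name, the callback signatures, the checkpoint filename pattern, and the use of a per-segment evaluation score are all assumptions. The callbacks keep the sketch independent of any particular RL library.

```python
from typing import Callable


def train_with_checkpoints(
    learn: Callable[[int], None],
    evaluate: Callable[[], float],
    save: Callable[[str], None],
    total_steps: int = 90_000,
    segment_steps: int = 6_000,
) -> tuple[int, float]:
    """Train in segments without reconnecting; return (best_step, best_score)."""
    best_step, best_score = 0, float("-inf")
    steps_done = 0
    while steps_done < total_steps:
        chunk = min(segment_steps, total_steps - steps_done)
        learn(chunk)  # continue training the same model on the same env
        steps_done += chunk
        save(f"checkpoint_{steps_done:06d}.zip")  # numbered save for every segment
        score = evaluate()
        if score > best_score:  # keep the peak even if the policy later diverges
            best_step, best_score = steps_done, score
            save("best_model.zip")
    return best_step, best_score
```

If the model is a Stable-Baselines3 PPO instance (which the log suggests but does not state), `learn` could be `lambda n: model.learn(total_timesteps=n, reset_num_timesteps=False)` and `save` could be `model.save`.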

Wave 4 Multi-Track Experiments (generated_track + mountain_track)

Trial 9 BEST OVERALL MODEL

  • Model: models/wave4-trial-0009/model.zip
  • Tracks: generated_track + mountain_track (round-robin, switch every 6,851 steps)
  • Steps: 89,893 total (~45k per track)
  • Hyperparams: lr=0.000725, switch=6,851
  • Reward: v4 (old — before exploit patches)
  • Result:
    • Drives generated_track (3/3 episodes, 13-16 second genuine laps)
    • Drives mini_monaco zero-shot (2000 steps, 40-second genuine laps — never seen in training)
    • Crashes on mountain_track (~200 steps — hill + corner)
    • Crashes on generated_road (~46 steps — turns right immediately)
  • Notes: Only 1 of 25 Wave 4 trials succeeded. Suspected random seed luck. The same hyperparameters, repeated in Exp2 (overnight), produced a useless model.
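
For reference, a round-robin schedule with a fixed switch interval can be sketched as below. The track order and the handling of the final partial segment are assumptions (the first listed track goes first, and the last segment is simply cut short), so the per-track totals this produces are only approximately even and may differ from what the actual run did.

```python
from collections import Counter
from itertools import cycle


def round_robin_schedule(tracks, total_steps, steps_per_switch):
    """Yield (track, steps) segments, cycling through tracks in list order."""
    remaining = total_steps
    for track in cycle(tracks):
        if remaining <= 0:
            break
        chunk = min(steps_per_switch, remaining)  # last segment may be shorter
        yield track, chunk
        remaining -= chunk


# Trial 9 settings from the log: 89,893 total steps, switch every 6,851 steps.
per_track = Counter()
for track, steps in round_robin_schedule(
    ["generated_track", "mountain_track"], 89_893, 6_851
):
    per_track[track] += steps
print(dict(per_track))
```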

Wave 4 Other Trials (1-25 except Trial 9)

  • Result: All crashed on mini_monaco within 20-265 steps
  • Median mini_monaco score: ~112 (crashes at ~130 steps)
  • Trials 14, 25: Scored 1573, 1543 — suspected shuttle exploit (car going back and forth on straight)
  • Learned: Multi-track training is highly sensitive to random seed. GP+UCB did not converge reliably.

Key Decisions Made (What We Keep)

| Decision | Reason |
| --- | --- |
| v5 reward: speed × CTE-quality | Directly incentivises throttle on hills. v4 gave zero gradient on inclines. |
| throttle_min=0.5 for mountain_track | Overcomes hill. Car can now reach first corner. |
| Short-lap penalty + EPISODE TERMINATION | Penalty alone insufficient: model stayed alive and accumulated rewards between laps. Termination makes circling strictly unprofitable. |
| Numbered checkpoints every segment | Never lose a good mid-training model again (ADR-017) |
| best_model.zip updated on new best segment score | Final model ≠ best model. Peak can be at 30k even if final is at 90k. |
| Single TCP connection for single-track training | Avoids phantom car problem from close_and_switch() |
| lr=0.000725 | From Trial 9 (best model). Consistent with good results. |
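
The short-lap termination decision can be sketched as a small post-processing step on the environment's reward and done flag. The function name, the 5-second threshold, and the penalty value are illustrative assumptions; the log only establishes that exploit laps took 0.5-1.75 seconds, genuine laps took roughly 19-22 seconds, and that a short lap now ends the episode.

```python
from typing import Optional


def apply_short_lap_rule(
    reward: float,
    done: bool,
    lap_time: Optional[float],
    min_lap_time: float = 5.0,  # assumed threshold between exploit and genuine laps
    penalty: float = -10.0,     # assumed penalty value
) -> tuple[float, bool]:
    """Return (reward, done) after checking a lap that finished on this step.

    lap_time is None when no lap was completed on this step.
    """
    if lap_time is not None and lap_time < min_lap_time:
        # Ending the episode removes any future reward from further circling.
        return penalty, True
    return reward, done


print(apply_short_lap_rule(1.0, False, 1.2))   # exploit-length lap
print(apply_short_lap_rule(1.0, False, 20.0))  # genuine-length lap
```

With a penalty alone the episode would continue and the policy could keep collecting per-step reward, which is the failure the table row describes.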

Key Problems Still Open

| Problem | Status |
| --- | --- |
| Mountain track circle exploit | Partially fixed: episode termination added. Exp8 will show if it holds. |
| Mountain track: car can't navigate first corner reliably | Still being investigated. Exp5 showed genuine laps so it IS solvable. |
| Multi-track generalization is random-seed dependent | No reliable solution yet. Trial 9 was lucky. |
| Mountain track model doesn't generalise to other tracks | Expected: single track training generalises poorly. Next step after Exp8 succeeds. |

Next Steps (Proposed, Not Yet Run)

  1. Exp 8 result: If best_model.zip drives mountain_track reliably → proceed to Step 2
  2. Combine mountain_track + generated_track using v5 reward, throttle_min=0.5, proper checkpointing
  3. Test combined model on all 4 tracks — can it generalise to mini_monaco like Trial 9 did?
  4. If yes: We have reproduced Trial 9 reliably with a better reward function