donkeycar-rl-autoresearch/docs/SESSION_LOG_2026-04-19.md


Session Log — 2026-04-19

Key Discovery: Why Multi-Track Training Fails

The Problem

Our multi-track training uses close_and_switch() which:

  1. Closes the TCP connection to the sim
  2. Sends exit_scene to go back to menu
  3. Opens a NEW connection on a different track
  4. Calls model.set_env(new_env) to swap the environment

This disrupts PPO's training because:

  • PPO's rollout buffer contains partial experience from the old track
  • The value function estimates become wrong for the new track
  • The advantage calculations (which drive PPO's policy updates) are corrupted
  • Every switch is like ripping out a student's notebook mid-lesson

Evidence

  • Wave 4: 25 trials with this methodology. Only 4/25 (16%) scored >500. Median score 111. Trial 9 scored 1435 but was a lucky outlier.
  • Exp 10: Same code, nearly identical hyperparameters to Trial 9. Total failure — crashes on all tracks at <180 steps.
  • Conclusion: Trial 9's success was random weight initialization luck, not evidence the method works.

The Fix: Parallel Environments (DummyVecEnv)

SB3's DummyVecEnv can wrap multiple gym environments. PPO collects experience from ALL environments in every rollout batch. No switching, no closing, no disruption.

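import gym  # or `import gymnasium as gym`, depending on the installed SB3 version
import gym_donkeycar  # importing registers the donkey-* environments
from stable_baselines3 import PPO
from stable_baselines3.common.vec_env import DummyVecEnv, VecTransposeImage

# wrap_env is this project's own wrapper helper, defined elsewhere in the repo.
env = DummyVecEnv([
    lambda: wrap_env(gym.make('donkey-generated-track-v0', conf={"port": 9091})),
    lambda: wrap_env(gym.make('donkey-mountain-track-v0', conf={"port": 9093})),
])
env = VecTransposeImage(env)  # channel-last images -> channel-first for CnnPolicy
model = PPO('CnnPolicy', env, ...)  # hyperparameters elided in this log
model.learn(total_timesteps=90000)  # both tracks in EVERY batch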

This requires two sim instances on different ports (one track per sim), but gives PPO a stable, consistent training setup — exactly how SB3 is designed to work with multiple environments.

How DummyVecEnv Works (for future reference)

PPO training loop (simplified):

for each rollout batch:
    for each of N steps in rollout:
        for each env in DummyVecEnv:     ← env[0]=generated_track, env[1]=mountain_track
            action = policy(observation)
            next_obs, reward, done = env.step(action)
            store (obs, action, reward, done) in buffer
    
    compute advantages using value function
    update policy using all experience from ALL envs

Key insight: the model doesn't "know" which track it's on. It just sees images and learns a policy that works across all the images it sees. Both tracks contribute to every policy update. This prevents catastrophic forgetting because the model never stops seeing either track.

With close_and_switch: model trains on track A for 6000 steps, completely forgets track A while training on track B for 6000 steps, etc. Classic catastrophic interference.

With DummyVecEnv: model sees both tracks simultaneously in every batch. Like a human alternating laps between two courses — never forgets either one.
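
The difference in batch composition can be illustrated with a toy sketch that only counts which track each transition comes from. It does not touch SB3 or the sim. The 6000 steps per track come from the text above; the batch size of 2000 and the 1000 steps per rollout are made-up numbers for illustration.

```python
from collections import Counter


def sequential_batches(tracks, steps_per_track, batch_size):
    """close_and_switch style: all steps on one track, then all on the next."""
    stream = [t for t in tracks for _ in range(steps_per_track)]
    return [
        Counter(stream[i:i + batch_size])
        for i in range(0, len(stream), batch_size)
    ]


def vectorized_batches(tracks, n_steps, n_batches):
    """DummyVecEnv style: each rollout step stores one transition per env."""
    batches = []
    for _ in range(n_batches):
        batch = Counter()
        for _ in range(n_steps):
            for track in tracks:
                batch[track] += 1
        batches.append(batch)
    return batches


tracks = ["generated_track", "mountain_track"]
seq = sequential_batches(tracks, steps_per_track=6000, batch_size=2000)
vec = vectorized_batches(tracks, n_steps=1000, n_batches=6)

# Every sequential batch contains a single track; every vectorized batch
# contains both tracks in equal proportion.
print([sorted(b) for b in seq])
print([dict(b) for b in vec])
```

Both schemes collect the same 12000 transitions in total. Only the vectorized one gives each policy update gradients from both tracks.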

Alternative: Same Env, Switch Track Scene

Theoretically possible: keep TCP connection open, send exit_scene then load_scene(new_track) without closing the gym env. The observation and action spaces are identical across tracks so SB3 wouldn't notice.

Concerns:

  • gym_donkeycar's DonkeyEnv initializes the scene in __init__ and is not designed for mid-session scene changes
  • The viewer/sim controller state machine may not handle re-loading cleanly
  • Still sequential (not parallel) so still has the forgetting problem, just without the env close/reopen disruption
  • Untested — could introduce subtle bugs

Hardware Options

  • Two sim instances on same machine (different ports: 9091, 9093)
    • Risk: GPU memory pressure from two Unity instances
  • Second sim on remote machine
    • gym_donkeycar supports host parameter in conf
    • Previous connection issues to remote host need debugging
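
Either option amounts to giving each env its own conf dict. A minimal sketch, assuming the conf keys are "host" and "port" as used above: sim_conf and check_unique_endpoints are hypothetical helpers, and 192.0.2.10 is a placeholder address, not the real remote machine.

```python
def sim_conf(port, host="127.0.0.1"):
    """Build a conf dict for one sim instance (hypothetical helper)."""
    return {"host": host, "port": port}


def check_unique_endpoints(confs):
    """Raise if two envs would connect to the same (host, port) pair."""
    endpoints = [(c["host"], c["port"]) for c in confs]
    if len(set(endpoints)) != len(endpoints):
        raise ValueError(f"duplicate sim endpoint in {endpoints}")
    return endpoints


# Option 1: two sims on the same machine, different ports.
same_machine = [sim_conf(9091), sim_conf(9093)]

# Option 2: one local sim plus one on a remote host (placeholder address).
with_remote = [sim_conf(9091), sim_conf(9091, host="192.0.2.10")]

check_unique_endpoints(same_machine)
check_unique_endpoints(with_remote)
```

Checking the endpoints before constructing the envs catches the case where both envs would point at the same sim.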

Image Augmentation (complementary, not primary)

DonkeyCar sim has built-in augmentation options:

  • Gaussian blur, image flipping, cropping
  • Other donkeycar users use these for generalization
  • Solves visual robustness (lighting, noise) but NOT track geometry diversity
  • Best used TOGETHER with parallel multi-track training

Warm Start Failure Re-Analysis

Previously tried warm-starting from generated_road champion onto multi-track training. This failed — but it used the broken close_and_switch methodology. The warm start itself may not have been the problem. Worth retrying once parallel envs are working.

Exp 10 Evaluation Results (re-run 2026-04-19)

| Track | Set 1 | Set 2 | Set 3 | Mean | Verdict |
|---|---|---|---|---|---|
| mountain_track (trained) | 178 | 179 | 179 | 179 | Crashes at same spot |
| generated_track (trained) | 99 | 82 | 88 | 90 | Crashes immediately |
| generated_road (zero-shot) | 135 | 223 | 105 | 154 | Crashes early |
| mini_monaco (zero-shot) | 111 | 133 | 129 | 124 | Crashes early |
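
As a quick sanity check, the Mean column can be recomputed from the three per-set values. The snippet below does that, assuming the means were rounded to the nearest integer.

```python
# Per-set results copied from the Exp 10 evaluation table.
results = {
    "mountain_track": [178, 179, 179],
    "generated_track": [99, 82, 88],
    "generated_road": [135, 223, 105],
    "mini_monaco": [111, 133, 129],
}

# Mean of the three sets, rounded to the nearest integer.
means = {track: round(sum(sets) / len(sets)) for track, sets in results.items()}

for track, mean in means.items():
    print(f"{track}: {mean}")
```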

Next Steps

  • Exp 11: Tested parallel DummyVecEnv with two sim instances (ports 9091 + 9093)
    • Exp 11 (v5 reward): aborted due to circular driving on generated_track
    • Exp 11b (v6 reward): completed, no circles, but plateaus at ~194 steps on all tracks
    • Exp 11c (v6 reward, 250k): aborted — grass exploit found on generated_track
    • Exp 11d: pending fixes before re-run

Mountain Track Finetune + Physics Investigation (2026-04-19 late session)

Finetune outcome summary

  • Created and ran agent/experiments/exp14_finetune_v5.py
  • Warm-start source: agent/models/exp14-mountain-v5/best_model.zip
  • Schedule used:
    • phase 1: runtime throttle floor 0.4
    • phase 2: runtime throttle floor 0.2
  • Training later degraded badly; later checkpoints became poor / unstable
  • Best usable finetune checkpoint was not the final model

Robust checkpoint comparison on mountain_track

We ran a deterministic mountain-only comparison over 9 episodes per candidate. Results saved to:

  • agent/outerloop-results/mountain_candidate_eval_2026-04-19.jsonl
  • agent/outerloop-results/mountain_candidate_eval_2026-04-19.md

Winner:

  • agent/models/exp14-mountain-v5-finetune/checkpoint_0036000.zip
  • promoted copy:
    • agent/models/exp14-mountain-v5-finetune/best_robust_model_0036000.zip

Key result:

  • ft_036k achieved:
    • 9/9 successful episodes
    • 25 total laps across 9 episodes
    • mean lap 27.93s
    • best lap 26.16s
  • This beat:
    • original mountain champion for robustness
    • earlier 0.4-floor checkpoints for robustness
    • later finetune checkpoints, which had degraded badly

Mountain physics discovery in Unity sim

Unity source path confirmed:

  • /mnt/c/Users/Paul/Documents/projects/sdsandbox

We found a likely real root cause for hill wheelspin:

  • sdsim/Assets/Scripts/WheelPhys.cs scales wheel friction by the hit collider's physics material:
    • hit.collider.material.staticFriction * originalForwardStiffness
  • mountain_track.unity contains 4 explicit Slippery physics-material assignments on the imported long_road FBX instance
  • Slippery.staticFriction = 0.1
  • Road.staticFriction = 0.5
  • Grippy.staticFriction = 0.66

Interpretation:

  • mountain road traction is likely much lower than normal road tracks
  • this matches observed wheelspin / poor uphill progress / getting stuck on hills
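
To put a number on "much lower", here is a small Python sketch of the scaling rule quoted from WheelPhys.cs. The friction values are the ones listed above. originalForwardStiffness is not recorded in this log, so the sketch only computes traction relative to the Road material, where that factor cancels out.

```python
# staticFriction values of the three physics materials listed above.
STATIC_FRICTION = {"Slippery": 0.1, "Road": 0.5, "Grippy": 0.66}


def forward_stiffness(material, original_forward_stiffness=1.0):
    """Scaled stiffness: staticFriction * originalForwardStiffness.

    The default of 1.0 is a placeholder; the real value is not in this log.
    """
    return STATIC_FRICTION[material] * original_forward_stiffness


# Ratio against Road; originalForwardStiffness cancels in the division.
relative_to_road = {
    material: forward_stiffness(material) / forward_stiffness("Road")
    for material in STATIC_FRICTION
}

for material, ratio in relative_to_road.items():
    print(f"{material}: {ratio:.2f}x Road")
```

Under this rule, a Slippery surface gets one fifth of the forward stiffness of a Road surface.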

We created a dedicated Unity investigation branch before changing anything:

  • repo: /mnt/c/Users/Paul/Documents/projects/sdsandbox
  • branch: investigate-mountain-friction

Critical Known Facts (DO NOT LOSE)

throttle_min history (from Exp 1-9)

  • throttle_min=0.2 alone: car cannot get over mountain_track hill (not enough power)
  • throttle_min=0.5: car gets over hill BUT throttle is baked into action space, model CANNOT output throttle < 0.5, crashes on tight corners (mini_monaco ~91 steps)
  • throttle_min=0.2 + v5 reward (speed×CTE): car CAN learn to self-select high throttle on hill. Proved in Exp 9 (mountain only, 90k steps) → 2000/2000 steps.
  • KEY INSIGHT: Exp 9 worked because 90k steps were ALL on mountain. In parallel setup (Exp 11b/11c), each track gets only ~45k effective steps AND the grass exploit contaminated training. Mountain failure in parallel runs is NOT purely a throttle issue — fix the grass exploit first, THEN see if mountain learns.
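
One plausible way a throttle floor gets "baked into" the action space is a linear rescale of the policy output. The sketch below assumes that mapping, which is not confirmed against the project's wrapper, to show why throttle_min=0.5 leaves no way to go slower. It also reproduces the ~45k figure as an even split of 90k steps across two envs.

```python
def apply_throttle_floor(raw_throttle, throttle_min, throttle_max=1.0):
    """Map a policy output in [0, 1] onto [throttle_min, throttle_max].

    Hypothetical mapping for illustration; the real wrapper may differ.
    """
    clipped = min(max(raw_throttle, 0.0), 1.0)
    return throttle_min + clipped * (throttle_max - throttle_min)


# With a 0.5 floor, even the lowest possible policy output yields 0.5.
lowest_with_high_floor = apply_throttle_floor(0.0, throttle_min=0.5)

# With a 0.2 floor, the policy can still choose full throttle on the hill.
highest_with_low_floor = apply_throttle_floor(1.0, throttle_min=0.2)

# Total timesteps are shared across envs in a vectorized setup.
total_timesteps = 90_000
n_envs = 2
steps_per_track = total_timesteps // n_envs

print(lowest_with_high_floor, highest_with_low_floor, steps_per_track)
```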

The grass exploit root cause (found 2026-04-19)

  • generated_track has a physical gap in the boundary mesh at the first turn
  • Car drives through the gap, CTE exceeds 8.0m → sim should terminate
  • BUT: determine_episode_over() in donkey_sim.py has this code:
    if math.fabs(self.cte) > 2 * self.max_cte:  # > 16.0m
        pass   # ← INTENTIONALLY DOES NOTHING
    elif math.fabs(self.cte) > self.max_cte:     # 8.0–16.0m
        self.over = True
    
  • Car quickly exceeds 16m (> 2×max_cte), hits the pass case — episode never ends
  • Fix: Python-side CTE patience wrapper that terminates when CTE > 4.0m for 20 steps (catches the car BEFORE it blows past 16m)
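
A minimal sketch of both pieces, kept free of gym and sim imports so it runs standalone. sim_episode_over mirrors only the two CTE branches quoted above. CtePatience is a hypothetical name for the patience logic, and treating "for 20 steps" as 20 consecutive steps is an assumption.

```python
import math


def sim_episode_over(cte, max_cte=8.0):
    """Mirror of the two CTE branches quoted above, nothing else from the sim."""
    if math.fabs(cte) > 2 * max_cte:
        return False  # the `pass` branch: the episode is not ended
    elif math.fabs(cte) > max_cte:
        return True
    return False


class CtePatience:
    """Signal termination after `patience` consecutive steps with |CTE| > limit."""

    def __init__(self, cte_limit=4.0, patience=20):
        self.cte_limit = cte_limit
        self.patience = patience
        self.count = 0

    def reset(self):
        self.count = 0

    def update(self, cte):
        """Feed one step's CTE; return True when the episode should end."""
        if abs(cte) > self.cte_limit:
            self.count += 1
        else:
            self.count = 0  # back near the track: start counting again
        return self.count >= self.patience


# The dead zone: 10 m ends the episode, 20 m does not.
print(sim_episode_over(10.0), sim_episode_over(20.0))

# The patience check has no dead zone: it fires on the 20th step over 4.0 m.
guard = CtePatience()
fired = [guard.update(5.0) for _ in range(20)]
print(fired.index(True) + 1)
```

In a gym wrapper, update() would be called with info['cte'] inside step() and reset() inside the env's reset(), with the returned flag OR-ed into done.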

Parallel env episode asymmetry

  • DummyVecEnv runs both envs in every step (sequential, not truly parallel)
  • When mountain episode ends quickly, VecEnv auto-resets mountain and starts new episode
  • Meanwhile generated_track episode continues
  • During training (model.learn()): PPO collects experience from both and auto-resets independently — this is fine and correct
  • During eval: our eval loop uses done_mask, so short mountain episodes auto-reset and start new episodes that we ignore (waiting for generated_track to finish)
  • User observation: 'car waits at start line for generated_track episode to end' — correct
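
The eval-side bookkeeping can be sketched as follows. This is an assumed reading of the done_mask approach, not the project's actual eval loop: each env's first episode is recorded, and anything an auto-reset env does afterwards is ignored until every env has finished once. The step counts in the example are made up.

```python
def first_episode_lengths(done_stream):
    """Return the length of each env's first episode.

    done_stream yields one tuple of done flags per vectorized step,
    with one flag per env.
    """
    finished = None
    lengths = None
    for dones in done_stream:
        if finished is None:
            finished = [False] * len(dones)
            lengths = [0] * len(dones)
        for i, done in enumerate(dones):
            if finished[i]:
                continue  # env i already auto-reset; ignore its later episodes
            lengths[i] += 1
            if done:
                finished[i] = True
        if all(finished):
            break
    return lengths


# env[0] = generated_track ends at step 7.
# env[1] = mountain_track ends at step 3, auto-resets, and ends again at step 6.
stream = [(step == 7, step in (3, 6)) for step in range(1, 11)]
lengths = first_episode_lengths(stream)
print(lengths)
```

The second mountain episode never enters the result, which matches the observed behaviour of that env idling through extra episodes while generated_track finishes.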

DO NOT confuse mountain rollback with stuck issue

  • Mountain rollback (car goes up, slows, rolls back) is a LEARNING/REWARD issue
  • It is NOT a stuck issue — the car is moving (rolling back = speed > 0)
  • StuckTerminationWrapper correctly does NOT fire (car IS moving)
  • Root fix: ensure training is not contaminated by other exploits, then the v5/v6 speed gradient teaches the model to apply high throttle on the hill (proved to work in Exp 9)
  • DO NOT add termination conditions for rollback — they interfere with valid slow hill-climbing learning

speed vs forward_vel in reward

  • info['speed'] comes from Unity — scalar magnitude, always ≥ 0
  • info['forward_vel'] computed in Python — dot(heading, velocity), negative when reversing
  • Our reward uses info['speed'] — car rolling backward gets positive reward
  • Sim's own reward correctly uses forward_vel with if forward_vel > 0.0 check
  • This is a known issue but NOT the primary cause of current problems: when the car rolls back, net displacement ≈ 0, so the efficiency gate gives 0 reward
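
The difference between the two signals is easy to see on a car rolling backward. The sketch below uses made-up heading and velocity vectors; speed and forward_vel follow the definitions above (vector magnitude and dot(heading, velocity)), and the gated term mirrors the sim's forward_vel > 0.0 check.

```python
import math


def speed(velocity):
    """Scalar magnitude of the velocity vector; always >= 0."""
    return math.sqrt(sum(v * v for v in velocity))


def forward_vel(heading, velocity):
    """dot(heading, velocity); negative when the car moves against its heading."""
    return sum(h * v for h, v in zip(heading, velocity))


# Hypothetical values: car faces +z and rolls backward at 1.5 m/s.
heading = (0.0, 0.0, 1.0)
rolling_back = (0.0, 0.0, -1.5)

s = speed(rolling_back)
fv = forward_vel(heading, rolling_back)

# A reward term gated on forward motion, as in the sim's own check.
gated = fv if fv > 0.0 else 0.0

print(s, fv, gated)
```

A reward proportional to speed sees 1.5 here, while one based on forward_vel sees -1.5, or 0 once gated.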