Session Log — 2026-04-19

Key Discovery: Why Multi-Track Training Fails

The Problem

Our multi-track training uses close_and_switch() which:

  1. Closes the TCP connection to the sim
  2. Sends exit_scene to go back to menu
  3. Opens a NEW connection on a different track
  4. Calls model.set_env(new_env) to swap the environment

This disrupts PPO's training because:

  • PPO's rollout buffer contains partial experience from the old track
  • The value function estimates become wrong for the new track
  • The advantage calculations (which drive PPO's policy updates) are corrupted
  • Every switch is like ripping out a student's notebook mid-lesson
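
To make the advantage problem concrete, here is a minimal pure-Python sketch of generalized advantage estimation (GAE), the estimator SB3's PPO uses. The rewards and value numbers are made up for illustration and are not taken from any trial. The point: with calibrated values the advantages are zero for average behaviour, while a value function tuned to a different track produces large advantages that reflect its own error, not action quality.

```python
def discounted_returns(rewards, gamma=0.99):
    """Exact discounted return from each step, assuming the episode ends after the last reward."""
    out, running = [], 0.0
    for r in reversed(rewards):
        running = r + gamma * running
        out.append(running)
    return out[::-1]


def gae(rewards, values, last_value, gamma=0.99, lam=0.95):
    """Generalized advantage estimation over one rollout (no episode boundary inside it)."""
    advantages = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        next_value = last_value if t == len(rewards) - 1 else values[t + 1]
        delta = rewards[t] + gamma * next_value - values[t]  # one-step TD error
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages


rewards = [1.0, 1.0, 1.0, 1.0]                 # hypothetical per-step rewards
calibrated_values = discounted_returns(rewards)  # value estimates that match this track
stale_values = [10.0, 10.0, 10.0, 10.0]          # hypothetical estimates left over from another track

calibrated_adv = gae(rewards, calibrated_values, last_value=0.0)
stale_adv = gae(rewards, stale_values, last_value=10.0)
```

Same rewards, same actions, but `stale_adv` is strongly positive at every step while `calibrated_adv` is zero. PPO would push the policy based on that spurious signal.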

Evidence

  • Wave 4: 25 trials with this methodology. Only 4/25 (16%) scored >500. Median score 111. Trial 9 scored 1435 but was a lucky outlier.
  • Exp 10: Same code, nearly identical hyperparameters to Trial 9. Total failure — crashes on all tracks at <180 steps.
  • Conclusion: Trial 9's success was random weight initialization luck, not evidence the method works.

The Fix: Parallel Environments (DummyVecEnv)

SB3's DummyVecEnv wraps multiple gym environments behind a single vectorized interface. It steps them one after another in the same process, so PPO collects experience from ALL environments in every rollout batch. No switching, no closing, no disruption.

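import gym
import gym_donkeycar  # registers the donkey-* environment ids
from stable_baselines3 import PPO
from stable_baselines3.common.vec_env import DummyVecEnv, VecTransposeImage

# wrap_env is the project's own wrapper helper; one sim instance per port
env = DummyVecEnv([
    lambda: wrap_env(gym.make('donkey-generated-track-v0', conf={"port": 9091})),
    lambda: wrap_env(gym.make('donkey-mountain-track-v0', conf={"port": 9093})),
])
env = VecTransposeImage(env)  # HWC -> CHW images for CnnPolicy
model = PPO('CnnPolicy', env, ...)
model.learn(total_timesteps=90000)  # both tracks in EVERY batch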

This requires two sim instances on different ports (one track per sim), but gives PPO a stable, consistent training setup — exactly how SB3 is designed to work with multiple environments.
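
If the track list grows and the env constructors get built in a loop, bare lambdas run into Python's late-binding closures: every lambda ends up seeing the last loop value. A small factory avoids that. In this sketch `fake_build_env` is a hypothetical stand-in for `wrap_env(gym.make(...))` so it runs without a simulator, and `make_env_fn` is an illustrative name, not something from the project.

```python
def make_env_fn(build_env, env_id, port):
    """Return a zero-argument callable, which is what DummyVecEnv expects per env."""
    def _init():
        return build_env(env_id, {"port": port})
    return _init


def fake_build_env(env_id, conf):
    # Hypothetical stand-in for wrap_env(gym.make(env_id, conf=conf)).
    return (env_id, conf["port"])


def late_binding_demo(build_env, tracks):
    """Builds the callables with bare lambdas in a loop, to show the pitfall."""
    fns = []
    for env_id, port in tracks:
        fns.append(lambda: build_env(env_id, {"port": port}))
    return [fn() for fn in fns]


TRACKS = [
    ("donkey-generated-track-v0", 9091),
    ("donkey-mountain-track-v0", 9093),
]

env_fns = [make_env_fn(fake_build_env, env_id, port) for env_id, port in TRACKS]
built = [fn() for fn in env_fns]                  # one distinct (track, port) pair each
buggy = late_binding_demo(fake_build_env, TRACKS)  # both entries point at the last track
```

The hand-written two-lambda list above is fine as is; this only matters once the list is generated.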

How DummyVecEnv Works (for future reference)

PPO training loop (simplified):

for each rollout batch:
    for each of N steps in rollout:
        for each env in DummyVecEnv:     ← env[0]=generated_track, env[1]=mountain_track
            action = policy(observation)
            next_obs, reward, done = env.step(action)
            store (obs, action, reward, done) in buffer
    
    compute advantages using value function
    update policy using all experience from ALL envs
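
The collection loop above can be tried without a simulator. This is a toy sketch with stub environments (`StubTrackEnv` and `collect_rollout` are illustrative names), not SB3's implementation. One simplification to be aware of: SB3 calls the policy once on the stacked observations of all envs, whereas this sketch calls it per env.

```python
class StubTrackEnv:
    """Hypothetical stand-in for a sim-backed env; observations are (track, step) labels."""

    def __init__(self, name, episode_len):
        self.name = name
        self.episode_len = episode_len
        self.t = 0

    def reset(self):
        self.t = 0
        return (self.name, self.t)

    def step(self, action):
        self.t += 1
        done = self.t >= self.episode_len
        return (self.name, self.t), 1.0, done


def collect_rollout(envs, policy, n_steps):
    """Step every env once per rollout step, resetting an env as soon as its episode ends."""
    obs = [env.reset() for env in envs]
    buffer = []
    for _ in range(n_steps):
        for i, env in enumerate(envs):
            action = policy(obs[i])
            next_obs, reward, done = env.step(action)
            buffer.append((i, obs[i], action, reward, done))
            obs[i] = env.reset() if done else next_obs
    return buffer


envs = [StubTrackEnv("generated_track", 5), StubTrackEnv("mountain_track", 7)]
buffer = collect_rollout(envs, policy=lambda obs: 0.0, n_steps=8)
per_env = [sum(1 for row in buffer if row[0] == i) for i in range(len(envs))]
```

Each env contributes exactly `n_steps` transitions to the buffer regardless of how often it crashes and resets, so a track with short episodes is not under-represented in the update.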

Key insight: the model doesn't "know" which track it's on. It just sees images and learns a policy that works across all the images it sees. Both tracks contribute to every policy update. This prevents catastrophic forgetting because the model never stops seeing either track.

With close_and_switch: model trains on track A for 6000 steps, completely forgets track A while training on track B for 6000 steps, etc. Classic catastrophic interference.

With DummyVecEnv: model sees both tracks simultaneously in every batch. Like a human alternating laps between two courses — never forgets either one.
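
The difference between the two schedules is easy to quantify. The sketch below labels each env step with its track and measures what fraction of each update batch comes from track A. The 6000-step block comes from the description above; the batch size of 2000 and the 24000-step horizon are arbitrary values chosen for illustration.

```python
def sequential_schedule(tracks, block_steps, total_steps):
    """Track label per env step when training in blocks (close_and_switch style)."""
    return [tracks[(step // block_steps) % len(tracks)] for step in range(total_steps)]


def interleaved_schedule(tracks, total_steps):
    """Track label per env step when all envs are stepped in turn (vec env style)."""
    return [tracks[step % len(tracks)] for step in range(total_steps)]


def track_share_per_batch(schedule, batch_size, track):
    """Fraction of each consecutive batch that came from `track`."""
    shares = []
    for start in range(0, len(schedule), batch_size):
        batch = schedule[start:start + batch_size]
        shares.append(batch.count(track) / len(batch))
    return shares


tracks = ["A", "B"]
seq_shares = track_share_per_batch(sequential_schedule(tracks, 6000, 24000), 2000, "A")
vec_shares = track_share_per_batch(interleaved_schedule(tracks, 24000), 2000, "A")
```

With block switching, every batch is either all track A or none of it, so three consecutive updates see no data from one track. With interleaving, every batch is an even split.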

Alternative: Same Env, Switch Track Scene

Theoretically possible: keep TCP connection open, send exit_scene then load_scene(new_track) without closing the gym env. The observation and action spaces are identical across tracks so SB3 wouldn't notice.

Concerns:

  • gym_donkeycar's DonkeyEnv initializes the scene in __init__ and is not designed for mid-session scene changes
  • The viewer/sim controller state machine may not handle re-loading cleanly
  • Still sequential (not parallel) so still has the forgetting problem, just without the env close/reopen disruption
  • Untested — could introduce subtle bugs

Hardware Options

  • Two sim instances on same machine (different ports: 9091, 9093)
    • Risk: GPU memory pressure from two Unity instances
  • Second sim on remote machine
    • gym_donkeycar supports host parameter in conf
    • Previous connection issues to remote host need debugging
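
For either option, a quick pre-flight check can separate "sim not reachable" from "gym env misbehaving" when debugging connections. This is a sketch using only the standard library. `sim_conf` and `sim_reachable` are hypothetical helper names, the 127.0.0.1 default is an assumption, and 192.0.2.10 is a placeholder address from the documentation range, not a real host.

```python
import socket


def sim_conf(port, host="127.0.0.1"):
    """Build the conf dict passed to gym.make(..., conf=...), using the host and port keys."""
    return {"host": host, "port": port}


def sim_reachable(host, port, timeout=2.0):
    """Return True if a plain TCP connection to the sim port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


local_conf = sim_conf(9091)
remote_conf = sim_conf(9093, host="192.0.2.10")  # placeholder address, replace with the real host

# Before building the envs, something like:
#   for conf in (local_conf, remote_conf):
#       print(conf, sim_reachable(conf["host"], conf["port"]))
```

A failed check here points at networking or firewall setup rather than at gym_donkeycar itself.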

Image Augmentation (complementary, not primary)

DonkeyCar sim has built-in augmentation options:

  • Gaussian blur, image flipping, cropping
  • Other donkeycar users use these for generalization
  • Solves visual robustness (lighting, noise) but NOT track geometry diversity
  • Best used TOGETHER with parallel multi-track training

Warm Start Failure Re-Analysis

Previously tried warm-starting from generated_road champion onto multi-track training. This failed — but it used the broken close_and_switch methodology. The warm start itself may not have been the problem. Worth retrying once parallel envs are working.

Exp 10 Evaluation Results (re-run 2026-04-19)

| Track | Set 1 | Set 2 | Set 3 | Mean | Verdict |
|---|---|---|---|---|---|
| mountain_track (trained) | 178 | 179 | 179 | 179 | Crashes at same spot |
| generated_track (trained) | 99 | 82 | 88 | 90 | Crashes immediately |
| generated_road (zero-shot) | 135 | 223 | 105 | 154 | Crashes early |
| mini_monaco (zero-shot) | 111 | 133 | 129 | 124 | Crashes early |
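
The Mean column can be reproduced from the three sets, and the spread between sets is a quick way to see which crashes are deterministic. A small check, using only the numbers from the table:

```python
from statistics import mean

results = {
    "mountain_track": [178, 179, 179],
    "generated_track": [99, 82, 88],
    "generated_road": [135, 223, 105],
    "mini_monaco": [111, 133, 129],
}

# Rounded mean per track, matching the Mean column.
means = {track: round(mean(scores)) for track, scores in results.items()}

# Max minus min across the three sets; a tiny spread means the car fails at the same place every time.
spreads = {track: max(scores) - min(scores) for track, scores in results.items()}
```

The spread of 1 on mountain_track is consistent with the "crashes at same spot" verdict, while generated_road varies by more than 100 between sets.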

Next Steps

  • Exp 11: Tested parallel DummyVecEnv with two sim instances (ports 9091 + 9093)
    • Exp 11 (v5 reward): aborted due to circular driving on generated_track
    • Exp 11b (v6 reward): completed, no circles, but plateaus at ~194 steps on all tracks
  • v6 reward confirmed: efficiency gate prevents circles, tests pass
  • Parallel env confirmed: mechanically sound, stable training
  • Open issue: 90k steps may be insufficient for 2-env training (45k per track)
  • Next experiment ideas:
    • Increase to 180k-250k total steps
    • Test v6 on single track to isolate reward effect
    • Check if efficiency gate fires during normal cornering (false positives)
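
For planning the larger runs, the per-track budget and the number of policy updates can be worked out up front. This sketch assumes SB3's default `n_steps=2048` per env, which may not match what the experiments used. `per_env_budget` is an illustrative helper name.

```python
import math


def per_env_budget(total_timesteps, n_envs, n_steps=2048):
    """Split a total step budget across envs and count rollouts (n_steps is an assumed default)."""
    rollout_size = n_envs * n_steps  # transitions collected before each policy update
    return {
        "steps_per_env": total_timesteps // n_envs,
        "rollout_size": rollout_size,
        # Collection continues until the total is reached, so the last rollout rounds up.
        "n_rollouts": math.ceil(total_timesteps / rollout_size),
    }


budgets = {total: per_env_budget(total, n_envs=2) for total in (90_000, 180_000, 250_000)}
```

Under that assumption, 90k total steps is only 22 rollouts, each split evenly between the two tracks; 250k raises that to 62.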