Session Log — 2026-04-19

Key Discovery: Why Multi-Track Training Fails

The Problem

Our multi-track training uses close_and_switch(), which:

Closes the TCP connection to the sim
Sends exit_scene to go back to menu
Opens a NEW connection on a different track
Calls model.set_env(new_env) to swap the environment

This disrupts PPO's training because:

PPO's rollout buffer contains partial experience from the old track
The value function estimates become wrong for the new track
The advantage calculations (which drive PPO's policy updates) are corrupted
Every switch is like ripping out a student's notebook mid-lesson
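
To make the advantage corruption concrete, here is a minimal NumPy sketch of generalized advantage estimation (GAE), the estimator SB3's PPO uses. The reward and value numbers are invented for illustration, and `gae` is a simplified single-env version rather than SB3's own code. It only shows that a bootstrap value taken from a different track shifts every advantage in the rollout.

```python
import numpy as np

def gae(rewards, values, last_value, dones, gamma=0.99, lam=0.95):
    """Generalized advantage estimation over one rollout of a single env.

    dones[t] is True when the episode ended on step t, so no value is
    bootstrapped across that boundary.
    """
    n = len(rewards)
    adv = np.zeros(n)
    running = 0.0
    for t in reversed(range(n)):
        next_value = last_value if t == n - 1 else values[t + 1]
        not_done = 1.0 - float(dones[t])
        delta = rewards[t] + gamma * next_value * not_done - values[t]
        running = delta + gamma * lam * not_done * running
        adv[t] = running
    return adv

# Illustrative numbers only (not measured data).
rewards = np.array([1.0, 1.0, 1.0, 1.0])
values = np.array([10.0, 10.0, 10.0, 10.0])
dones = np.array([False, False, False, False])

# Bootstrap value consistent with the track the rollout came from...
adv_same_track = gae(rewards, values, last_value=10.0, dones=dones)
# ...versus a bootstrap value computed from the first frame of a different track.
adv_after_switch = gae(rewards, values, last_value=2.0, dones=dones)

print(adv_same_track)
print(adv_after_switch)
```

Because GAE accumulates backwards through the rollout, the single mismatched bootstrap value lowers the advantage of every earlier step, not just the last one.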

Evidence

Wave 4: 25 trials with this methodology. Only 4/25 (16%) scored >500. Median score 111. Trial 9 scored 1435, far above the rest.
Exp 10: Same code, nearly identical hyperparameters to Trial 9. Total failure: crashes on every track, with mean episode length under 180 steps (see Exp 10 Evaluation Results below).
Conclusion: Trial 9's success was most likely run-to-run luck (e.g. random weight initialization), not evidence the method works.

The Fix: Parallel Environments (DummyVecEnv)

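SB3's DummyVecEnv can wrap multiple gym environments. PPO collects experience from ALL environments in every rollout batch. No switching, no closing, no disruption. (DummyVecEnv steps its environments one after another in a single process, so "parallel" here means both tracks feed every rollout, not that the sims are stepped concurrently.)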

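import gym  # or gymnasium, depending on the installed SB3 / gym_donkeycar versions
import gym_donkeycar  # importing registers the donkey-* env ids
from stable_baselines3 import PPO
from stable_baselines3.common.vec_env import DummyVecEnv, VecTransposeImage

# wrap_env: project helper (defined elsewhere). One sim instance per port, one track per sim.
env = DummyVecEnv([
    lambda: wrap_env(gym.make('donkey-generated-track-v0', conf={"port": 9091})),
    lambda: wrap_env(gym.make('donkey-mountain-track-v0', conf={"port": 9093})),
])
env = VecTransposeImage(env)  # HWC images -> CHW for CnnPolicy
model = PPO('CnnPolicy', env, ...)
model.learn(total_timesteps=90000)  # both tracks in EVERY batch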

This requires two sim instances on different ports (one track per sim), but gives PPO a stable, consistent training setup — exactly how SB3 is designed to work with multiple environments.
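
If the track list grows beyond two, building the factory list in a loop needs care: Python lambdas bind loop variables late, so every factory would end up with the last track and port. A minimal sketch of a factory helper follows. `make_env`, `TRACKS` and the stub `fake_make` are hypothetical names for illustration; in real use the inner call would be `wrap_env(gym.make(env_id, conf={"port": port}))` as above.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical track table: (gym env id, sim port). Ports match the example above.
TRACKS: List[Tuple[str, int]] = [
    ("donkey-generated-track-v0", 9091),
    ("donkey-mountain-track-v0", 9093),
]

def fake_make(env_id: str, conf: Dict[str, int]) -> Dict[str, object]:
    """Stand-in for wrap_env(gym.make(env_id, conf=conf)) so the sketch runs without a sim."""
    return {"env_id": env_id, "port": conf["port"]}

def make_env(env_id: str, port: int) -> Callable[[], Dict[str, object]]:
    """Return a zero-argument factory that captures env_id and port by value."""
    def _init() -> Dict[str, object]:
        return fake_make(env_id, conf={"port": port})
    return _init

# Correct: each factory keeps its own (env_id, port).
env_fns = [make_env(env_id, port) for env_id, port in TRACKS]

# Buggy variant for comparison: the lambda looks up env_id/port when it is
# called, after the loop has finished, so every factory sees the last track.
buggy_fns = [lambda: fake_make(env_id, conf={"port": port}) for env_id, port in TRACKS]

print([fn()["port"] for fn in env_fns])
print([fn()["port"] for fn in buggy_fns])
```

A list like `env_fns` is what would be passed to `DummyVecEnv(env_fns)`.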

How DummyVecEnv Works (for future reference)

PPO training loop (simplified):

for each rollout batch:
    for each of N steps in rollout:
        for each env in DummyVecEnv:     ← env[0]=generated_track, env[1]=mountain_track
            action = policy(obs)
            next_obs, reward, done = env.step(action)
            store (obs, action, reward, done) in buffer
            obs = next_obs

    compute advantages using value function
    update policy using all experience from ALL envs

Key insight: the model doesn't "know" which track it's on. It just sees images and learns a policy that works across all the images it sees. Both tracks contribute to every policy update. This prevents catastrophic forgetting because the model never stops seeing either track.

With close_and_switch: model trains on track A for 6000 steps, then overwrites much of what it learned there while training on track B for 6000 steps, and so on. Classic catastrophic interference.

With DummyVecEnv: model sees both tracks in every rollout batch. Like a human alternating laps between two courses, it never goes long enough without either one to forget it.
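
The difference between the two schedules can be sketched without a simulator. The toy code below only tracks which track each collected step came from. `vec_env_rollout` and `switching_rollouts` are hypothetical helpers, and the rollout lengths are illustrative assumptions rather than the project's settings (only the 6000-step switch interval comes from the notes above).

```python
from collections import Counter
from typing import List

def vec_env_rollout(track_names: List[str], n_steps: int) -> List[str]:
    """Label each transition of one rollout collected from a vectorized env.

    A vec env steps every sub-env once per rollout step, so the buffer holds
    n_steps * n_envs transitions, interleaved across tracks.
    """
    buffer = []
    for _ in range(n_steps):
        for name in track_names:  # env[0], env[1], ...
            buffer.append(name)
    return buffer

def switching_rollouts(track_names: List[str], n_steps: int,
                       switch_every: int, total_steps: int) -> List[List[str]]:
    """Label transitions when a single env is swapped every `switch_every` steps."""
    rollouts = []
    for start in range(0, total_steps, n_steps):
        rollout = []
        for step in range(start, min(start + n_steps, total_steps)):
            active = track_names[(step // switch_every) % len(track_names)]
            rollout.append(active)
        rollouts.append(rollout)
    return rollouts

tracks = ["generated_track", "mountain_track"]

# Vectorized: every rollout is an even mix of both tracks.
mixed = Counter(vec_env_rollout(tracks, n_steps=512))
print(mixed)

# Sequential switching every 6000 steps: with these numbers each rollout
# comes from a single track, so each policy update sees only one track.
for i, rollout in enumerate(switching_rollouts(tracks, n_steps=2000,
                                               switch_every=6000, total_steps=12000)):
    print(i, Counter(rollout))
```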

Alternative: Same Env, Switch Track Scene

Theoretically possible: keep TCP connection open, send exit_scene then load_scene(new_track) without closing the gym env. The observation and action spaces are identical across tracks so SB3 wouldn't notice.

Concerns:

gym_donkeycar's DonkeyEnv initializes the scene in __init__ and is not designed for mid-session scene changes
The viewer/sim controller state machine may not handle re-loading cleanly
Still sequential (not parallel) so still has the forgetting problem, just without the env close/reopen disruption
Untested — could introduce subtle bugs

Hardware Options

Two sim instances on same machine (different ports: 9091, 9093)
- Risk: GPU memory pressure from two Unity instances
Second sim on remote machine
- gym_donkeycar supports host parameter in conf
- Previous connection issues to remote host need debugging
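
Before launching a run against either option, it can help to confirm that each sim port actually accepts TCP connections. The sketch below uses only the standard library. `sim_reachable`, `check_all` and `SIMS` are hypothetical names, and `192.0.2.10` is a documentation placeholder address, not a real host. The host and port values correspond to the `host` and `port` entries of `conf`.

```python
import socket
from typing import Dict, List, Tuple

def sim_reachable(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to (host, port) can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts.
        return False

# Hypothetical sim endpoints; replace the second host with the real remote machine.
SIMS: List[Tuple[str, int]] = [
    ("127.0.0.1", 9091),
    ("192.0.2.10", 9093),  # placeholder address (TEST-NET-1)
]

def check_all(sims: List[Tuple[str, int]], timeout: float = 2.0) -> Dict[str, bool]:
    """Map 'host:port' to reachability for each configured sim."""
    return {f"{host}:{port}": sim_reachable(host, port, timeout) for host, port in sims}
```

Calling `check_all(SIMS)` before the environments are built gives a quick pass/fail per sim. It only confirms that the port is open, not that the sim behind it is in a usable state.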

Image Augmentation (complementary, not primary)

DonkeyCar sim has built-in augmentation options:

Gaussian blur, image flipping, cropping
Other donkeycar users use these for generalization
Solves visual robustness (lighting, noise) but NOT track geometry diversity
Best used TOGETHER with parallel multi-track training
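
For reference, the kind of transform involved can be sketched in NumPy. This is a stand-in illustration, not the built-in options themselves: `crop_top` and `hflip_with_steering` are hypothetical helpers, and the (height, width, channels) image layout and the steering sign convention are assumptions.

```python
import numpy as np

def crop_top(image: np.ndarray, rows: int) -> np.ndarray:
    """Drop the top `rows` pixel rows (e.g. sky/horizon) from an HWC image."""
    return image[rows:, :, :]

def hflip_with_steering(image: np.ndarray, steering: float):
    """Mirror an HWC image left-right and negate the steering value to match."""
    return image[:, ::-1, :], -steering

# Tiny synthetic frame: 4 rows, 6 columns, 3 channels.
frame = np.arange(4 * 6 * 3, dtype=np.uint8).reshape(4, 6, 3)

cropped = crop_top(frame, rows=1)
flipped, steer = hflip_with_steering(frame, steering=0.3)

print(cropped.shape)   # (3, 6, 3)
print(steer)           # -0.3
```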

Warm Start Failure Re-Analysis

Previously tried warm-starting from generated_road champion onto multi-track training. This failed — but it used the broken close_and_switch methodology. The warm start itself may not have been the problem. Worth retrying once parallel envs are working.

Exp 10 Evaluation Results (re-run 2026-04-19)

Track                       Set 1  Set 2  Set 3  Mean  Verdict
mountain_track (trained)    178    179    179    179   ❌ Crashes at same spot
generated_track (trained)   99     82     88     90    ❌ Crashes immediately
generated_road (zero-shot)  135    223    105    154   ❌ Crashes early
mini_monaco (zero-shot)     111    133    129    124   ❌ Crashes early

Next Steps

Exp 11: Tested parallel DummyVecEnv with two sim instances (ports 9091 + 9093)
- Exp 11 (v5 reward): aborted due to circular driving on generated_track
- Exp 11b (v6 reward): completed, no circles, but plateaus at ~194 steps on all tracks
v6 reward confirmed: efficiency gate prevents circles, tests pass
Parallel env confirmed: mechanically sound, stable training
Open issue: 90k steps may be insufficient for 2-env training (45k per track)
Next experiment ideas:
- Increase to 180k-250k total steps
- Test v6 on single track to isolate reward effect
- Check if efficiency gate fires during normal cornering (false positives)
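
The step-budget arithmetic behind the open issue can be sketched as below. `budget` is a hypothetical helper, and `n_steps=2048` is SB3's PPO default, which is an assumption about this project's configuration.

```python
import math

def budget(total_timesteps: int, n_envs: int, n_steps: int = 2048):
    """Split a PPO step budget across envs and count rollout/update cycles.

    total_timesteps counts steps summed over all envs, so each env gets
    roughly total_timesteps / n_envs of them.
    """
    per_env = total_timesteps // n_envs
    rollout_size = n_steps * n_envs
    # Training continues until the step counter reaches the total, so round up.
    n_rollouts = math.ceil(total_timesteps / rollout_size)
    return per_env, n_rollouts

for total in (90_000, 180_000, 250_000):
    per_env, n_rollouts = budget(total, n_envs=2)
    print(f"{total:>7} total -> {per_env:>7} per track, {n_rollouts} rollout/update cycles")
```

Under these assumptions, 90k total steps gives each track 45k steps and only about 22 policy-update cycles, which is consistent with the concern that the run may simply be too short.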
