Session Log — 2026-04-19
Key Discovery: Why Multi-Track Training Fails
The Problem
Our multi-track training uses close_and_switch() which:
- Closes the TCP connection to the sim
- Sends exit_scene to go back to menu
- Opens a NEW connection on a different track
- Calls model.set_env(new_env) to swap the environment
This disrupts PPO's training because:
- PPO's rollout buffer contains partial experience from the old track
- The value function estimates become wrong for the new track
- The advantage calculations (which drive PPO's policy updates) are corrupted
- Every switch is like ripping out a student's notebook mid-lesson
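To make the corrupted-advantage point concrete, here is a toy sketch of generalized advantage estimation (GAE), the estimator SB3's PPO uses. The function name `gae`, the reward and value numbers, and the reduction to a single environment are illustrative assumptions, not code from this project.

```python
def gae(rewards, values, last_value, dones, gamma=0.99, lam=0.95):
    """Generalized advantage estimation over one rollout (single env).

    dones[t] is True when the episode ended at step t, which stops
    bootstrapping from the following state.
    """
    advantages = [0.0] * len(rewards)
    next_value = last_value
    next_adv = 0.0
    for t in reversed(range(len(rewards))):
        not_done = 0.0 if dones[t] else 1.0
        delta = rewards[t] + gamma * next_value * not_done - values[t]
        next_adv = delta + gamma * lam * not_done * next_adv
        advantages[t] = next_adv
        next_value = values[t]
    return advantages


rewards = [1.0, 1.0, 1.0]
values = [0.0, 0.0, 0.0]
dones = [False, False, False]

# Bootstrap value that matches the track the rollout was collected on.
good = gae(rewards, values, last_value=0.0, dones=dones, gamma=1.0, lam=1.0)
# Bootstrap value that is badly off (hypothetical: estimated for another track).
bad = gae(rewards, values, last_value=10.0, dones=dones, gamma=1.0, lam=1.0)

print(good)  # [3.0, 2.0, 1.0]
print(bad)   # [13.0, 12.0, 11.0]
```

With gamma = lam = 1, a single wrong bootstrap value shifts every advantage in the rollout by the same error, which is the kind of distortion the list above describes.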
Evidence
- Wave 4: 25 trials with this methodology. Only 4/25 (16%) scored >500. Median score 111. Trial 9 scored 1435 but was a lucky outlier.
- Exp 10: Same code, nearly identical hyperparameters to Trial 9. Total failure — crashes on all tracks at <180 steps.
- Conclusion: Trial 9's success was random weight initialization luck, not evidence the method works.
The Fix: Parallel Environments (DummyVecEnv)
SB3's DummyVecEnv can wrap multiple gym environments. PPO collects
experience from ALL environments in every rollout batch. No switching,
no closing, no disruption.
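    import gym
    import gym_donkeycar  # registers the donkey-* env ids
    from stable_baselines3 import PPO
    from stable_baselines3.common.vec_env import DummyVecEnv, VecTransposeImage

    # wrap_env: project helper, defined elsewhere
    env = DummyVecEnv([
        lambda: wrap_env(gym.make('donkey-generated-track-v0', conf={"port": 9091})),
        lambda: wrap_env(gym.make('donkey-mountain-track-v0', conf={"port": 9093})),
    ])
    env = VecTransposeImage(env)
    model = PPO('CnnPolicy', env, ...)
    model.learn(total_timesteps=90000)  # both tracks in EVERY batch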
This requires two sim instances on different ports (one track per sim), but gives PPO a stable, consistent training setup — exactly how SB3 is designed to work with multiple environments.
How DummyVecEnv Works (for future reference)
PPO training loop (simplified):
    for each rollout batch:
        for each of N steps in rollout:
            for each env in DummyVecEnv:    ← env[0]=generated_track, env[1]=mountain_track
                action = policy(observation)
                next_obs, reward, done = env.step(action)
                store (obs, action, reward, done) in buffer
        compute advantages using value function
        update policy using all experience from ALL envs
Key insight: the model doesn't "know" which track it's on. It just sees images and learns a policy that works across all the images it sees. Both tracks contribute to every policy update. This prevents catastrophic forgetting because the model never stops seeing either track.
With close_and_switch: model trains on track A for 6000 steps, completely forgets track A while training on track B for 6000 steps, etc. Classic catastrophic interference.
With DummyVecEnv: model sees both tracks simultaneously in every batch. Like a human alternating laps between two courses — never forgets either one.
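The loop above can be sketched as a toy in plain Python. `ToyTrack` and `collect_rollout` are hypothetical stand-ins (no sim, no SB3); the point is only that every rollout batch holds transitions from both tracks, and that a finished env is reset in place while the other keeps running.

```python
class ToyTrack:
    """Stand-in for one sim-backed env: an episode lasts `length` steps."""

    def __init__(self, name, length):
        self.name = name
        self.length = length
        self.t = 0

    def reset(self):
        self.t = 0
        return (self.name, self.t)

    def step(self, action):
        self.t += 1
        done = self.t >= self.length
        return (self.name, self.t), 1.0, done


def collect_rollout(envs, obs, n_steps, policy):
    """Step every env once per rollout step, auto-resetting finished ones."""
    buffer = []
    for _ in range(n_steps):
        for i, env in enumerate(envs):
            action = policy(obs[i])
            next_obs, reward, done = env.step(action)
            buffer.append((obs[i], action, reward, done))
            # Auto-reset: the next observation comes from a fresh episode.
            obs[i] = env.reset() if done else next_obs
    return buffer


envs = [ToyTrack("generated_track", 5), ToyTrack("mountain_track", 3)]
obs = [env.reset() for env in envs]
batch = collect_rollout(envs, obs, n_steps=4, policy=lambda o: 0)

per_track = {}
for (name, _), _, _, _ in batch:
    per_track[name] = per_track.get(name, 0) + 1
print(per_track)  # {'generated_track': 4, 'mountain_track': 4}
```

The shorter mountain episode ends inside the batch and restarts immediately, so the batch stays evenly split between the two tracks.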
Alternative: Same Env, Switch Track Scene
Theoretically possible: keep TCP connection open, send exit_scene then
load_scene(new_track) without closing the gym env. The observation and
action spaces are identical across tracks so SB3 wouldn't notice.
Concerns:
- gym_donkeycar's DonkeyEnv initializes the scene in __init__ and is not designed for mid-session scene changes
- The viewer/sim controller state machine may not handle re-loading cleanly
- Still sequential (not parallel) so still has the forgetting problem, just without the env close/reopen disruption
- Untested — could introduce subtle bugs
Hardware Options
- Two sim instances on same machine (different ports: 9091, 9093)
  - Risk: GPU memory pressure from two Unity instances
- Second sim on remote machine
  - gym_donkeycar supports host parameter in conf
  - Previous connection issues to remote host need debugging
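For debugging the remote-host connection issues, a plain TCP reachability check can rule out network problems before gym_donkeycar is involved. This is a minimal sketch using only the standard library; `sim_reachable` is a hypothetical helper name, and it only shows that the port accepts connections, not that the sim behind it is healthy.

```python
import socket


def sim_reachable(host, port, timeout=2.0):
    """Return True if a TCP connection to (host, port) can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


# Ports taken from the notes above; results depend on which sims are running.
for port in (9091, 9093):
    print(f"127.0.0.1:{port} reachable={sim_reachable('127.0.0.1', port)}")
```

The same call with the remote machine's address in place of 127.0.0.1 would show whether the failure is at the network level or inside the env setup.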
Image Augmentation (complementary, not primary)
DonkeyCar sim has built-in augmentation options:
- Gaussian blur, image flipping, cropping
- Other donkeycar users use these for generalization
- Solves visual robustness (lighting, noise) but NOT track geometry diversity
- Best used TOGETHER with parallel multi-track training
Warm Start Failure Re-Analysis
Previously tried warm-starting from generated_road champion onto multi-track training. This failed — but it used the broken close_and_switch methodology. The warm start itself may not have been the problem. Worth retrying once parallel envs are working.
Exp 10 Evaluation Results (re-run 2026-04-19)
| Track | Set 1 | Set 2 | Set 3 | Mean | Verdict |
|---|---|---|---|---|---|
| mountain_track (trained) | 178 | 179 | 179 | 179 | ❌ Crashes at same spot |
| generated_track (trained) | 99 | 82 | 88 | 90 | ❌ Crashes immediately |
| generated_road (zero-shot) | 135 | 223 | 105 | 154 | ❌ Crashes early |
| mini_monaco (zero-shot) | 111 | 133 | 129 | 124 | ❌ Crashes early |
Next Steps
- Exp 11: Tested parallel DummyVecEnv with two sim instances (ports 9091 + 9093)
- Exp 11 (v5 reward): aborted due to circular driving on generated_track
- Exp 11b (v6 reward): completed, no circles, but plateaus at ~194 steps on all tracks
- Exp 11c (v6 reward, 250k): aborted — grass exploit found on generated_track
- Exp 11d: pending fixes before re-run
Critical Known Facts (DO NOT LOSE)
throttle_min history (from Exp 1-9)
- throttle_min=0.2 alone: car cannot get over mountain_track hill (not enough power)
- throttle_min=0.5: car gets over hill BUT throttle is baked into action space, model CANNOT output throttle < 0.5, crashes on tight corners (mini_monaco ~91 steps)
- throttle_min=0.2 + v5 reward (speed×CTE): car CAN learn to self-select high throttle on hill. Proved in Exp 9 (mountain only, 90k steps) → 2000/2000 steps.
- KEY INSIGHT: Exp 9 worked because 90k steps were ALL on mountain. In parallel setup (Exp 11b/11c), each track gets only ~45k effective steps AND the grass exploit contaminated training. Mountain failure in parallel runs is NOT purely a throttle issue. Fix the grass exploit first, THEN see if mountain learns.
The grass exploit root cause (found 2026-04-19)
- generated_track has a physical gap in the boundary mesh at the first turn
- Car drives through the gap, CTE exceeds 8.0m → sim should terminate
- BUT: determine_episode_over() in donkey_sim.py has this code:

      if math.fabs(self.cte) > 2 * self.max_cte:    # > 16.0m
          pass                                       # ← INTENTIONALLY DOES NOTHING
      elif math.fabs(self.cte) > self.max_cte:       # 8.0–16.0m
          self.over = True

- Car quickly exceeds 16m (> 2×max_cte), hits the pass case, so the episode never ends
- Fix: Python-side CTE patience wrapper that terminates when CTE > 4.0m for 20 steps (catches the car BEFORE it blows past 16m)
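A minimal sketch of what such a patience wrapper could look like. Everything here is an assumption for illustration: the class name `CtePatienceWrapper`, reading CTE from `info["cte"]`, the 4-tuple `step()` return, counting consecutive (not cumulative) off-track steps, and the `cte_patience_exceeded` info key. It is duck-typed so it runs without gym; in the project it would more likely subclass gym.Wrapper.

```python
class CtePatienceWrapper:
    """Ends the episode after `patience` consecutive steps with |cte| > limit."""

    def __init__(self, env, cte_limit=4.0, patience=20):
        self.env = env
        self.cte_limit = cte_limit
        self.patience = patience
        self.off_track_steps = 0

    def reset(self):
        self.off_track_steps = 0
        return self.env.reset()

    def step(self, action):
        obs, reward, done, info = self.env.step(action)
        if abs(info.get("cte", 0.0)) > self.cte_limit:
            self.off_track_steps += 1
        else:
            self.off_track_steps = 0  # back near the centre line
        if self.off_track_steps >= self.patience:
            done = True
            info["cte_patience_exceeded"] = True  # hypothetical info key
        return obs, reward, done, info


class DriftingEnv:
    """Toy env whose CTE grows by 0.5 m per step and never terminates itself."""

    def reset(self):
        self.cte = 0.0
        return None

    def step(self, action):
        self.cte += 0.5
        return None, 0.0, False, {"cte": self.cte}


env = CtePatienceWrapper(DriftingEnv())
env.reset()
steps = 0
done = False
while not done:
    _, _, done, info = env.step(0)
    steps += 1
print(steps, info["cte"])  # 28 14.0
```

In this toy the wrapper fires at 14 m, below the 16 m threshold where the sim's own check goes silent. A faster-drifting car would need a shorter patience or a lower limit to stay under that threshold.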
Parallel env episode asymmetry
- DummyVecEnv runs both envs in every step (sequential, not truly parallel)
- When mountain episode ends quickly, VecEnv auto-resets mountain and starts new episode
- Meanwhile generated_track episode continues
- During training (model.learn()): PPO collects experience from both and auto-resets independently — this is fine and correct
- During eval: our eval loop uses done_mask, so short mountain episodes auto-reset and start new episodes that we ignore (waiting for generated_track to finish)
- User observation: 'car waits at start line for generated_track episode to end' — correct
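The done_mask behaviour during eval can be sketched as follows. This is a hypothetical reconstruction, not the project's eval loop: `first_episode_lengths` and `make_toy_vec` are illustrative names, and the toy vec env only mimics auto-reset.

```python
def first_episode_lengths(step_all, n_envs, max_steps=10_000):
    """Step all envs together; record only each env's first episode length.

    step_all() must return one `done` flag per env and is assumed to
    auto-reset finished envs, as a VecEnv does.
    """
    done_mask = [False] * n_envs
    lengths = [0] * n_envs
    for _ in range(max_steps):
        dones = step_all()
        for i, done in enumerate(dones):
            if done_mask[i]:
                continue  # env i already finished; ignore its later episodes
            lengths[i] += 1
            if done:
                done_mask[i] = True
        if all(done_mask):
            break
    return lengths


def make_toy_vec(episode_lengths):
    """Toy auto-resetting vec env: env i is done every episode_lengths[i] steps."""
    counters = [0] * len(episode_lengths)

    def step_all():
        dones = []
        for i, length in enumerate(episode_lengths):
            counters[i] += 1
            done = counters[i] >= length
            if done:
                counters[i] = 0  # auto-reset
            dones.append(done)
        return dones

    return step_all


# Hypothetical lengths: generated_track runs 7 steps, mountain only 3.
lengths = first_episode_lengths(make_toy_vec([7, 3]), n_envs=2)
print(lengths)  # [7, 3]
```

The mountain env keeps stepping through new episodes after step 3, but the mask keeps those extra steps out of the reported score.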
DO NOT confuse mountain rollback with stuck issue
- Mountain rollback (car goes up, slows, rolls back) is a LEARNING/REWARD issue
- It is NOT a stuck issue — the car is moving (rolling back = speed > 0)
- StuckTerminationWrapper correctly does NOT fire (car IS moving)
- Root fix: ensure training is not contaminated by other exploits, then the v5/v6 speed gradient teaches the model to apply high throttle on the hill (proved to work in Exp 9)
- DO NOT add termination conditions for rollback — they interfere with valid slow hill-climbing learning
speed vs forward_vel in reward
- info['speed'] comes from Unity — scalar magnitude, always ≥ 0
- info['forward_vel'] computed in Python — dot(heading, velocity), negative when reversing
- Our reward uses info['speed'] — car rolling backward gets positive reward
- Sim's own reward correctly uses forward_vel with if forward_vel > 0.0 check
- This is a known issue but NOT the primary cause of current problems (efficiency gate gives 0 reward when rolling back → net displacement ≈ 0)
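The difference between the two signals is easy to see on a toy vector. The sketch below assumes a 3-component heading and velocity in an arbitrary frame; `speed_and_forward_vel` is a hypothetical helper, not the sim's or the project's code.

```python
import math


def speed_and_forward_vel(heading, velocity):
    """Return (scalar speed, signed velocity along the heading)."""
    speed = math.sqrt(sum(v * v for v in velocity))
    norm = math.sqrt(sum(h * h for h in heading))
    forward_vel = sum(h * v for h, v in zip(heading, velocity)) / norm
    return speed, forward_vel


heading = (0.0, 0.0, 1.0)        # car points along +z (illustrative frame)
rolling_back = (0.0, 0.0, -1.5)  # but moves along -z, e.g. rolling downhill

speed, forward_vel = speed_and_forward_vel(heading, rolling_back)
print(speed, forward_vel)  # 1.5 -1.5
```

A reward term built on speed sees 1.5 here, while one gated on forward_vel > 0.0 sees nothing to reward.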