Session Log — 2026-04-19

Key Discovery: Why Multi-Track Training Fails

The Problem

Our multi-track training uses close_and_switch() which:

Closes the TCP connection to the sim
Sends exit_scene to go back to menu
Opens a NEW connection on a different track
Calls model.set_env(new_env) to swap the environment

This disrupts PPO's training because:

PPO's rollout buffer contains partial experience from the old track
The value function estimates become wrong for the new track
The advantage calculations (which drive PPO's policy updates) are corrupted
Every switch is like ripping out a student's notebook mid-lesson
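To make the advantage point concrete, here is a minimal pure-Python sketch of Generalized Advantage Estimation (the estimator PPO uses). It only illustrates how a single mismatched bootstrap value shifts every advantage in a rollout; whether this is the exact mechanism behind our failures is an assumption, not something we measured. The function name `gae` and the toy numbers are hypothetical.

```python
def gae(rewards, values, last_value, dones, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation over one rollout.

    values[t] is V(s_t); last_value is the bootstrap V(s_T) for the state
    after the final stored step; dones[t] is True if step t ended an episode.
    """
    advantages = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        next_value = last_value if t == len(rewards) - 1 else values[t + 1]
        not_done = 0.0 if dones[t] else 1.0
        delta = rewards[t] + gamma * next_value * not_done - values[t]
        running = delta + gamma * lam * not_done * running
        advantages[t] = running
    return advantages


# Toy rollout: constant reward, constant value estimate, no episode end.
rewards = [1.0] * 5
values = [10.0] * 5
dones = [False] * 5

# Bootstrap value consistent with the rest of the rollout.
matched = gae(rewards, values, last_value=10.0, dones=dones)
# Bootstrap value that does not match (assumed stand-in for a state from a different track).
mismatched = gae(rewards, values, last_value=0.0, dones=dones)

for t, (a, b) in enumerate(zip(matched, mismatched)):
    print(f"t={t}  matched={a:+.3f}  mismatched={b:+.3f}")
```

The error enters at the last step and is propagated backwards with weight (gamma * lam) per step, so every stored transition gets a biased advantage, not just the final one.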

Evidence

Wave 4: 25 trials with this methodology. Only 4/25 (16%) scored >500. Median score 111. Trial 9 scored 1435 but was a lucky outlier.
Exp 10: Same code, nearly identical hyperparameters to Trial 9. Total failure — crashes on all tracks at <180 steps.
Conclusion: Trial 9's success was random weight initialization luck, not evidence the method works.

The Fix: Parallel Environments (DummyVecEnv)

SB3's DummyVecEnv can wrap multiple gym environments. PPO collects experience from ALL environments in every rollout batch. No switching, no closing, no disruption.

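import gym
import gym_donkeycar  # noqa: F401 (importing registers the donkey-* env ids)
from stable_baselines3 import PPO
from stable_baselines3.common.vec_env import DummyVecEnv, VecTransposeImage

# wrap_env is our own wrapper helper; one sim instance per port.
env = DummyVecEnv([
    lambda: wrap_env(gym.make('donkey-generated-track-v0', conf={"port": 9091})),
    lambda: wrap_env(gym.make('donkey-mountain-track-v0', conf={"port": 9093})),
])
env = VecTransposeImage(env)
model = PPO('CnnPolicy', env, ...)
model.learn(total_timesteps=90000)  # both tracks in EVERY batch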

This requires two sim instances on different ports (one track per sim), but gives PPO a stable, consistent training setup — exactly how SB3 is designed to work with multiple environments.

How DummyVecEnv Works (for future reference)

PPO training loop (simplified):

for each rollout batch:
    for each of N steps in rollout:
        for each env in DummyVecEnv:     ← env[0]=generated_track, env[1]=mountain_track
            action = policy(observation)
            next_obs, reward, done = env.step(action)
            store (obs, action, reward, done) in buffer
    
    compute advantages using value function
    update policy using all experience from ALL envs
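The loop above can be mimicked with toy environments to check the bookkeeping. This is a simplified sketch, not SB3's implementation: `ToyTrackEnv` and `collect_rollout` are hypothetical names, episode lengths are made up, and real SB3 calls the policy once on the batched observations rather than once per env.

```python
class ToyTrackEnv:
    """Stand-in for one sim connection; every episode lasts `episode_len` steps."""

    def __init__(self, name, episode_len):
        self.name = name
        self.episode_len = episode_len
        self.t = 0

    def reset(self):
        self.t = 0
        return (self.name, self.t)

    def step(self, action):
        self.t += 1
        done = self.t >= self.episode_len
        return (self.name, self.t), 1.0, done, {}


def collect_rollout(envs, policy, n_steps):
    """Step every env once per rollout step; reset an env as soon as its episode ends."""
    obs = [env.reset() for env in envs]
    buffer = []
    for _ in range(n_steps):
        for i, env in enumerate(envs):
            action = policy(obs[i])
            next_obs, reward, done, _ = env.step(action)
            buffer.append((i, obs[i], action, reward, done))
            obs[i] = env.reset() if done else next_obs
    return buffer


envs = [ToyTrackEnv("generated_track", 7), ToyTrackEnv("mountain_track", 3)]
buffer = collect_rollout(envs, policy=lambda obs: 0.0, n_steps=16)

# Transitions and finished episodes contributed by each env.
per_env = [sum(1 for row in buffer if row[0] == i) for i in range(len(envs))]
episodes = [sum(1 for row in buffer if row[0] == i and row[4]) for i in range(len(envs))]
print(per_env, episodes)
```

Each env contributes the same number of transitions to the buffer regardless of how often its episodes end; only the number of episodes differs.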

Key insight: the model doesn't "know" which track it's on. It just sees images and learns a policy that works across all the images it sees. Both tracks contribute to every policy update. This prevents catastrophic forgetting because the model never stops seeing either track.

With close_and_switch: model trains on track A for 6000 steps, completely forgets track A while training on track B for 6000 steps, etc. Classic catastrophic interference.

With DummyVecEnv: model sees both tracks simultaneously in every batch. Like a human alternating laps between two courses — never forgets either one.

Alternative: Same Env, Switch Track Scene

Theoretically possible: keep TCP connection open, send exit_scene then load_scene(new_track) without closing the gym env. The observation and action spaces are identical across tracks so SB3 wouldn't notice.

Concerns:

gym_donkeycar's DonkeyEnv initializes the scene in __init__ and is not designed for mid-session scene changes
The viewer/sim controller state machine may not handle re-loading cleanly
Still sequential (not parallel) so still has the forgetting problem, just without the env close/reopen disruption
Untested — could introduce subtle bugs

Hardware Options

Two sim instances on same machine (different ports: 9091, 9093)
- Risk: GPU memory pressure from two Unity instances
Second sim on remote machine
- gym_donkeycar supports host parameter in conf
- Previous connection issues to remote host need debugging

Image Augmentation (complementary, not primary)

DonkeyCar sim has built-in augmentation options:

Gaussian blur, image flipping, cropping
Other donkeycar users use these for generalization
Solves visual robustness (lighting, noise) but NOT track geometry diversity
Best used TOGETHER with parallel multi-track training

Warm Start Failure Re-Analysis

Previously tried warm-starting from generated_road champion onto multi-track training. This failed — but it used the broken close_and_switch methodology. The warm start itself may not have been the problem. Worth retrying once parallel envs are working.

Exp 10 Evaluation Results (re-run 2026-04-19)

Track	Set 1	Set 2	Set 3	Mean	Verdict
mountain_track (trained)	178	179	179	179	❌ Crashes at same spot
generated_track (trained)	99	82	88	90	❌ Crashes immediately
generated_road (zero-shot)	135	223	105	154	❌ Crashes early
mini_monaco (zero-shot)	111	133	129	124	❌ Crashes early

Next Steps

Exp 11: Tested parallel DummyVecEnv with two sim instances (ports 9091 + 9093)
- Exp 11 (v5 reward): aborted due to circular driving on generated_track
- Exp 11b (v6 reward): completed, no circles, but plateaus at ~194 steps on all tracks
- Exp 11c (v6 reward, 250k): aborted — grass exploit found on generated_track
- Exp 11d: pending fixes before re-run

Critical Known Facts (DO NOT LOSE)

throttle_min history (from Exp 1-9)

throttle_min=0.2 alone: car cannot get over mountain_track hill (not enough power)
throttle_min=0.5: car gets over hill BUT throttle is baked into action space, model CANNOT output throttle < 0.5, crashes on tight corners (mini_monaco ~91 steps)
throttle_min=0.2 + v5 reward (speed×CTE): car CAN learn to self-select high throttle on hill. Proved in Exp 9 (mountain only, 90k steps) → 2000/2000 steps.
KEY INSIGHT: Exp 9 worked because 90k steps were ALL on mountain. In parallel setup (Exp 11b/11c), each track gets only ~45k effective steps AND the grass exploit contaminated training. Mountain failure in parallel runs is NOT purely a throttle issue — fix the grass exploit first, THEN see if mountain learns.
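A small sketch of why throttle_min=0.5 removes low-throttle actions entirely. It assumes throttle_min is the lower bound of the throttle dimension of the action space and that out-of-range policy outputs are clipped to the bounds; `clip_action` is a hypothetical helper, not project code.

```python
def clip_action(steering, throttle, throttle_min, throttle_max=1.0):
    """Clamp a raw policy output into the assumed action bounds."""
    steering = max(-1.0, min(1.0, steering))
    throttle = max(throttle_min, min(throttle_max, throttle))
    return steering, throttle


# The policy "wants" to slow down to 0.1 for a tight corner.
raw_steering, raw_throttle = 0.3, 0.1

print(clip_action(raw_steering, raw_throttle, throttle_min=0.5))  # throttle forced up to 0.5
print(clip_action(raw_steering, raw_throttle, throttle_min=0.2))  # throttle forced up to 0.2
print(clip_action(0.0, 0.9, throttle_min=0.2))                    # high throttle still available
```

With throttle_min=0.2 the model keeps the full range from 0.2 upward, so it can still choose high throttle on the hill, which matches the Exp 9 result.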

The grass exploit root cause (found 2026-04-19)

generated_track has a physical gap in the boundary mesh at the first turn
Car drives through the gap, CTE exceeds 8.0m → sim should terminate

BUT: determine_episode_over() in donkey_sim.py has this code:

if math.fabs(self.cte) > 2 * self.max_cte:  # > 16.0m
    pass   # ← INTENTIONALLY DOES NOTHING
elif math.fabs(self.cte) > self.max_cte:     # 8.0–16.0m
    self.over = True

Car quickly exceeds 16m (> 2×max_cte), hits the pass case — episode never ends
Fix: Python-side CTE patience wrapper that terminates when CTE > 4.0m for 20 steps (catches the car BEFORE it blows past 16m)
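A minimal sketch of the patience idea, written without gym so it runs standalone. Assumptions: the env's info dict exposes the cross-track error under a "cte" key, "for 20 steps" means 20 consecutive steps, and the old 4-tuple step API is in use. `CtePatienceWrapper` and `DriftingEnv` are illustrative names; in the project this would presumably subclass gym.Wrapper.

```python
class CtePatienceWrapper:
    """End the episode once |cte| stays above `cte_limit` for `patience` consecutive steps."""

    def __init__(self, env, cte_limit=4.0, patience=20):
        self.env = env
        self.cte_limit = cte_limit
        self.patience = patience
        self.off_track_steps = 0

    def reset(self):
        self.off_track_steps = 0
        return self.env.reset()

    def step(self, action):
        obs, reward, done, info = self.env.step(action)
        if abs(info.get("cte", 0.0)) > self.cte_limit:
            self.off_track_steps += 1
        else:
            self.off_track_steps = 0  # back within the limit: start counting again
        if self.off_track_steps >= self.patience:
            done = True
            info["cte_patience_exceeded"] = True
        return obs, reward, done, info


class DriftingEnv:
    """Fake env: CTE grows by 0.25 m per step and the env never terminates itself."""

    def __init__(self):
        self.cte = 0.0

    def reset(self):
        self.cte = 0.0
        return None

    def step(self, action):
        self.cte += 0.25
        return None, 0.0, False, {"cte": self.cte}


env = CtePatienceWrapper(DriftingEnv())
env.reset()
steps, done = 0, False
while not done:
    _, _, done, info = env.step(None)
    steps += 1
print(steps, info["cte"])
```

In this toy run the wrapper fires while CTE is still well below 16 m. Whether 20 steps is short enough in the real sim depends on how fast the car leaves the track, which the toy drift rate does not model.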

Parallel env episode asymmetry

DummyVecEnv runs both envs in every step (sequential, not truly parallel)
When mountain episode ends quickly, VecEnv auto-resets mountain and starts new episode
Meanwhile generated_track episode continues
During training (model.learn()): PPO collects experience from both and auto-resets independently — this is fine and correct
During eval: our eval loop uses done_mask, so short mountain episodes auto-reset and start new episodes that we ignore (waiting for generated_track to finish)
User observation: 'car waits at start line for generated_track episode to end' — correct
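The eval behaviour described above can be reproduced with two toy auto-resetting envs. This is a sketch of the masking idea only; `FixedLengthEnv` and `first_episode_lengths` are hypothetical names and our actual eval loop may differ in detail.

```python
class FixedLengthEnv:
    """Toy env that auto-resets when an episode ends, as a VecEnv does for its sub-envs."""

    def __init__(self, episode_len):
        self.episode_len = episode_len
        self.t = 0
        self.total_steps = 0

    def reset(self):
        self.t = 0

    def step(self):
        self.t += 1
        self.total_steps += 1
        done = self.t >= self.episode_len
        if done:
            self.t = 0  # auto-reset: the next step starts a new episode
        return done


def first_episode_lengths(envs):
    """Record only each env's first episode; later auto-reset episodes are ignored."""
    for env in envs:
        env.reset()
    done_mask = [False] * len(envs)
    lengths = [0] * len(envs)
    while not all(done_mask):
        for i, env in enumerate(envs):
            done = env.step()  # every env is stepped, finished or not
            if not done_mask[i]:
                lengths[i] += 1
                done_mask[i] = done
    return lengths


generated, mountain = FixedLengthEnv(12), FixedLengthEnv(3)
lengths = first_episode_lengths([generated, mountain])
print(lengths, mountain.total_steps)
```

The short env keeps being stepped until the long one finishes, which is the "car keeps restarting at the start line" behaviour seen during eval; only its first episode is counted.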

DO NOT confuse mountain rollback with stuck issue

Mountain rollback (car goes up, slows, rolls back) is a LEARNING/REWARD issue
It is NOT a stuck issue — the car is moving (rolling back = speed > 0)
StuckTerminationWrapper correctly does NOT fire (car IS moving)
Root fix: ensure training is not contaminated by other exploits, then the v5/v6 speed gradient teaches the model to apply high throttle on the hill (proved to work in Exp 9)
DO NOT add termination conditions for rollback — they interfere with valid slow hill-climbing learning

speed vs forward_vel in reward

info['speed'] comes from Unity — scalar magnitude, always ≥ 0
info['forward_vel'] computed in Python — dot(heading, velocity), negative when reversing
Our reward uses info['speed'] — car rolling backward gets positive reward
Sim's own reward correctly uses forward_vel with if forward_vel > 0.0 check
This is a known issue but NOT the primary cause of the current problems: the efficiency gate already gives 0 reward when the car rolls back, because its net displacement is ≈ 0
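The difference between the two quantities, as a standalone sketch. The vectors are made-up values and the helper names are illustrative; only the definitions (magnitude vs. dot product with heading) come from the notes above.

```python
import math


def speed(velocity):
    """Scalar magnitude of the velocity vector; never negative."""
    return math.sqrt(sum(v * v for v in velocity))


def forward_vel(heading, velocity):
    """Velocity component along the heading; negative when the car moves backward."""
    return sum(h * v for h, v in zip(heading, velocity))


heading = (0.0, 0.0, 1.0)        # unit vector: car faces +z
rolling_back = (0.0, 0.0, -2.0)  # car moves 2 m/s in -z (rolling down the hill)

print(speed(rolling_back))                 # positive
print(forward_vel(heading, rolling_back))  # negative
```

A reward built on speed cannot tell rolling back from driving forward; one built on forward_vel with a `> 0.0` check can.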
