# Session Log: 2026-04-19

## Key Discovery: Why Multi-Track Training Fails

### The Problem

Our multi-track training uses `close_and_switch()`, which:

1. Closes the TCP connection to the sim
2. Sends `exit_scene` to go back to menu
3. Opens a NEW connection on a different track
4. Calls `model.set_env(new_env)` to swap the environment

This disrupts PPO's training because:

- PPO's rollout buffer contains partial experience from the old track
- The value function estimates become wrong for the new track
- The advantage calculations (which drive PPO's policy updates) are corrupted
- Every switch is like ripping out a student's notebook mid-lesson

### Evidence

- **Wave 4:** 25 trials with this methodology. Only 4/25 (16%) scored >500. Median score 111. Trial 9 scored 1435 but was a lucky outlier.
- **Exp 10:** Same code, nearly identical hyperparameters to Trial 9. Total failure: crashes on all tracks at <180 steps.
- **Conclusion:** Trial 9's success was random weight initialization luck, not evidence the method works.

### The Fix: Parallel Environments (DummyVecEnv)

SB3's `DummyVecEnv` can wrap multiple gym environments. PPO collects experience from ALL environments in every rollout batch. No switching, no closing, no disruption.

```python
import gym
import gym_donkeycar  # registers the donkey-* environments with gym
from stable_baselines3 import PPO
from stable_baselines3.common.vec_env import DummyVecEnv, VecTransposeImage

# wrap_env: wrapper helper, assumed to be defined elsewhere in the project
env = DummyVecEnv([
    lambda: wrap_env(gym.make('donkey-generated-track-v0', conf={"port": 9091})),
    lambda: wrap_env(gym.make('donkey-mountain-track-v0', conf={"port": 9093})),
])
env = VecTransposeImage(env)
model = PPO('CnnPolicy', env, ...)
model.learn(total_timesteps=90000)  # both tracks in EVERY batch
```

This requires two sim instances on different ports (one track per sim), but gives PPO a stable, consistent training setup, which is exactly how SB3 is designed to work with multiple environments.
### How DummyVecEnv Works (for future reference)

PPO training loop (simplified):

```
for each rollout batch:
    for each of N steps in rollout:
        for each env in DummyVecEnv:   ← env[0]=generated_track, env[1]=mountain_track
            action = policy(observation)
            next_obs, reward, done = env.step(action)
            store (obs, action, reward, done) in buffer
    compute advantages using value function
    update policy using all experience from ALL envs
```

Key insight: the model doesn't "know" which track it's on. It just sees images and learns a policy that works across all the images it sees. Both tracks contribute to every policy update. This prevents catastrophic forgetting because the model never stops seeing either track.

With `close_and_switch`: model trains on track A for 6000 steps, completely forgets track A while training on track B for 6000 steps, etc. Classic catastrophic interference.

With DummyVecEnv: model sees both tracks simultaneously in every batch. Like a human alternating laps between two courses, it never forgets either one.

### Alternative: Same Env, Switch Track Scene

Theoretically possible: keep TCP connection open, send `exit_scene` then `load_scene(new_track)` without closing the gym env. The observation and action spaces are identical across tracks so SB3 wouldn't notice.

Concerns:

- gym_donkeycar's DonkeyEnv initializes scene in `__init__`, not designed for mid-session scene changes
- The viewer/sim controller state machine may not handle re-loading cleanly
- Still sequential (not parallel) so still has the forgetting problem, just without the env close/reopen disruption
- Untested; could introduce subtle bugs

### Hardware Options

- Two sim instances on same machine (different ports: 9091, 9093)
  - Risk: GPU memory pressure from two Unity instances
- Second sim on remote machine
  - gym_donkeycar supports `host` parameter in conf
  - Previous connection issues to remote host need debugging

### Image Augmentation (complementary, not primary)

DonkeyCar sim has built-in augmentation options:

- Gaussian blur, image flipping, cropping
- Other donkeycar users use these for generalization
- Solves visual robustness (lighting, noise) but NOT track geometry diversity
- Best used TOGETHER with parallel multi-track training

### Warm Start Failure Re-Analysis

Previously tried warm-starting from generated_road champion onto multi-track training. This failed, but it used the broken `close_and_switch` methodology. The warm start itself may not have been the problem. Worth retrying once parallel envs are working.

## Exp 10 Evaluation Results (re-run 2026-04-19)

| Track | Set 1 | Set 2 | Set 3 | Mean | Verdict |
|---|---|---|---|---|---|
| mountain_track (trained) | 178 | 179 | 179 | **179** | ❌ Crashes at same spot |
| generated_track (trained) | 99 | 82 | 88 | **90** | ❌ Crashes immediately |
| generated_road (zero-shot) | 135 | 223 | 105 | **154** | ❌ Crashes early |
| mini_monaco (zero-shot) | 111 | 133 | 129 | **124** | ❌ Crashes early |

## Next Steps

- **Exp 11:** Tested parallel DummyVecEnv with two sim instances (ports 9091 + 9093)
  - Exp 11 (v5 reward): aborted due to circular driving on generated_track
  - Exp 11b (v6 reward): completed, no circles, but plateaus at ~194 steps on all tracks
  - Exp 11c (v6 reward, 250k): aborted; grass exploit found on generated_track
  - Exp 11d: pending fixes before re-run

## Mountain Track Finetune + Physics Investigation (2026-04-19 late session)

### Finetune outcome summary

- Created and ran `agent/experiments/exp14_finetune_v5.py`
- Warm-start source: `agent/models/exp14-mountain-v5/best_model.zip`
- Schedule used:
  - phase 1: runtime throttle floor `0.4`
  - phase 2: runtime throttle floor `0.2`
- Training later degraded badly; later checkpoints became poor / unstable
- Best usable finetune checkpoint was **not** the final model

### Robust checkpoint comparison on mountain_track

We ran a deterministic mountain-only comparison over 9 episodes per candidate.

Results saved to:

- `agent/outerloop-results/mountain_candidate_eval_2026-04-19.jsonl`
- `agent/outerloop-results/mountain_candidate_eval_2026-04-19.md`

Winner:

- `agent/models/exp14-mountain-v5-finetune/checkpoint_0036000.zip`
- promoted copy:
  - `agent/models/exp14-mountain-v5-finetune/best_robust_model_0036000.zip`

Key result:

- **ft_036k** achieved:
  - 9/9 successful episodes
  - 25 total laps across 9 episodes
  - mean lap **27.93s**
  - best lap **26.16s**
- This beat:
  - original mountain champion for robustness
  - earlier `0.4`-floor checkpoints for robustness
  - later finetune checkpoints, which had degraded badly

### Mountain physics discovery in Unity sim

Unity source path confirmed:

- `/mnt/c/Users/Paul/Documents/projects/sdsandbox`

We found a likely real root cause for hill wheelspin:

- `sdsim/Assets/Scripts/WheelPhys.cs` scales wheel friction by the hit collider's physics material:
  - `hit.collider.material.staticFriction * originalForwardStiffness`
- `mountain_track.unity` contains **4 explicit `Slippery` physics-material assignments** on the imported `long_road` FBX instance
- `Slippery.staticFriction = 0.1`
- `Road.staticFriction = 0.5`
- `Grippy.staticFriction = 0.66`

Interpretation:

- mountain road traction is likely much lower than normal road tracks
- this matches observed wheelspin / poor uphill progress / getting stuck on hills

We created a dedicated Unity investigation branch before changing anything:

- repo: `/mnt/c/Users/Paul/Documents/projects/sdsandbox`
- branch: `investigate-mountain-friction`

## Critical Known Facts (DO NOT LOSE)

### throttle_min history (from Exp 1-9)

- `throttle_min=0.2` alone: car cannot get over mountain_track hill (not enough power)
- `throttle_min=0.5`: car gets over hill BUT throttle is baked into action space, model CANNOT output throttle < 0.5, crashes on tight corners (mini_monaco ~91 steps)
- `throttle_min=0.2` + v5 reward (speed×CTE): car CAN learn to self-select high throttle on hill.
  Proved in Exp 9 (mountain only, 90k steps) → 2000/2000 steps.
- KEY INSIGHT: Exp 9 worked because 90k steps were ALL on mountain. In parallel setup (Exp 11b/11c), each track gets only ~45k effective steps AND the grass exploit contaminated training. Mountain failure in parallel runs is NOT purely a throttle issue. Fix the grass exploit first, THEN see if mountain learns.

### The grass exploit root cause (found 2026-04-19)

- generated_track has a physical gap in the boundary mesh at the first turn
- Car drives through the gap, CTE exceeds 8.0m → sim should terminate
- BUT: `determine_episode_over()` in donkey_sim.py has this code:

  ```python
  if math.fabs(self.cte) > 2 * self.max_cte:   # > 16.0m
      pass  # ← INTENTIONALLY DOES NOTHING
  elif math.fabs(self.cte) > self.max_cte:      # 8.0–16.0m
      self.over = True
  ```

- Car quickly exceeds 16m (> 2×max_cte), hits the `pass` case, and the episode never ends
- Fix: Python-side CTE patience wrapper that terminates when CTE > 4.0m for 20 steps (catches the car BEFORE it blows past 16m)

### Parallel env episode asymmetry

- DummyVecEnv runs both envs in every step (sequential, not truly parallel)
- When mountain episode ends quickly, VecEnv auto-resets mountain and starts new episode
- Meanwhile generated_track episode continues
- During training (`model.learn()`): PPO collects experience from both and auto-resets independently. This is fine and correct
- During eval: our eval loop uses done_mask, so short mountain episodes auto-reset and start new episodes that we ignore (waiting for generated_track to finish)
- User observation: 'car waits at start line for generated_track episode to end'. This is correct

### DO NOT confuse mountain rollback with stuck issue

- Mountain rollback (car goes up, slows, rolls back) is a LEARNING/REWARD issue
- It is NOT a stuck issue: the car is moving (rolling back = speed > 0)
- StuckTerminationWrapper correctly does NOT fire (car IS moving)
- Root fix: ensure training is not contaminated by other exploits, then the
  v5/v6 speed gradient teaches the model to apply high throttle on the hill (proved to work in Exp 9)
- DO NOT add termination conditions for rollback; they interfere with valid slow hill-climbing learning

### speed vs forward_vel in reward

- `info['speed']` comes from Unity: scalar magnitude, always ≥ 0
- `info['forward_vel']` computed in Python: dot(heading, velocity), negative when reversing
- Our reward uses `info['speed']`, so a car rolling backward gets positive reward
- Sim's own reward correctly uses forward_vel with `if forward_vel > 0.0` check
- This is a known issue but NOT the primary cause of current problems (efficiency gate gives 0 reward when rolling back → net displacement ≈ 0)
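
For future reference, a minimal sketch of the speed vs forward_vel difference, using made-up heading and velocity vectors. The helper names (`forward_velocity`, `speed`, `speed_term`) are hypothetical and not part of our reward code or gym_donkeycar; only the `info['speed']` and `info['forward_vel']` keys come from the notes above.

```python
import math


def forward_velocity(heading, velocity):
    """Signed speed along the heading: dot(heading, velocity)."""
    return sum(h * v for h, v in zip(heading, velocity))


def speed(velocity):
    """Scalar magnitude of the velocity vector, always >= 0."""
    return math.sqrt(sum(v * v for v in velocity))


def speed_term(info, use_forward_vel=True):
    """Hypothetical helper: choose the speed signal fed into a reward.

    With use_forward_vel=True, backward motion contributes nothing.
    With use_forward_vel=False, it mirrors the current behaviour of
    rewarding raw speed regardless of direction.
    """
    if use_forward_vel:
        return max(0.0, info["forward_vel"])
    return info["speed"]


# Car pointing along +z but rolling backward down the hill at 1.5 m/s
# (illustrative numbers, not measured from the sim).
heading = (0.0, 0.0, 1.0)
rolling_back = (0.0, 0.0, -1.5)
info = {
    "speed": speed(rolling_back),
    "forward_vel": forward_velocity(heading, rolling_back),
}
```

With these numbers the magnitude is positive while the signed forward velocity is negative, so a reward built on `info['speed']` still pays out during rollback and one gated on `forward_vel` does not. Whether to switch signals is a separate decision, since the efficiency gate already zeroes the reward in this case.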