From 86357622e3a7fb73bb106cb56ba905116e37d6d4 Mon Sep 17 00:00:00 2001
From: Paul Huliganga
Date: Sun, 19 Apr 2026 10:50:11 -0400
Subject: [PATCH] =?UTF-8?q?docs:=20session=20log=20+=20ADR-019=20=E2=80=94?=
 =?UTF-8?q?=20parallel=20DummyVecEnv=20for=20multi-track=20training?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

---
 DECISIONS.md                   |  43 ++++++++++++
 docs/SESSION_LOG_2026-04-19.md | 120 +++++++++++++++++++++++++++++++++
 2 files changed, 163 insertions(+)
 create mode 100644 docs/SESSION_LOG_2026-04-19.md

diff --git a/DECISIONS.md b/DECISIONS.md
index a5be01e..892a704 100644
--- a/DECISIONS.md
+++ b/DECISIONS.md
@@ -373,3 +373,46 @@ positional progress, not collision contact. This is the correct signal.

 **Tuning note:** stuck_steps=80 (~5 seconds at 16 steps/sec). Could be reduced
 to 40 (~2.5 seconds) if stuck periods are observably long.
+
+---
+
+## ADR-019: Parallel DummyVecEnv for Multi-Track Training (Not Close-and-Switch)
+
+**Date:** 2026-04-19
+**Status:** Proposed (to be validated by Exp 11)
+
+**Context:** Multi-track training via close_and_switch() — closing the env,
+reopening on a new track, calling model.set_env() — produced unreliable
+results. Wave 4 had 25 trials: only 4/25 scored >500, median 111.
+Exp 10 used nearly identical hyperparameters to the best Wave 4 trial
+and failed completely (crashes <180 steps on all tracks).
+
+Root cause: PPO is an on-policy algorithm. Its rollout buffer, value
+function estimates, and advantage calculations are disrupted when the
+environment is swapped mid-training. The model catastrophically forgets
+one track while training on another.
+
+**Decision:** Use SB3's DummyVecEnv with one env per track, each connected
+to a separate sim instance on a different port. PPO collects experience
+from ALL tracks in every rollout batch — no switching, no forgetting.
+
+```python
+env = DummyVecEnv([
+    lambda: wrap_env(gym.make('donkey-generated-track-v0', conf={"port": 9091})),
+    lambda: wrap_env(gym.make('donkey-mountain-track-v0', conf={"port": 9093})),
+])
+```
+
+**Consequences:**
+- Requires multiple sim instances (one per training track)
+- More GPU/CPU usage — can be mitigated by running sims on separate machines
+- PPO sees both tracks in every batch — no catastrophic forgetting
+- No env close/reopen — stable training throughout
+- This is how SB3 is designed to work with multiple environments
+
+**Rejected alternatives:**
+- close_and_switch (current) — disrupts PPO, 80% failure rate
+- Same-connection scene switching — untested, still sequential, fragile
+
+**Validation:** Exp 11 will test this approach. If results are consistent
+across multiple runs (not lottery), this ADR is confirmed.
diff --git a/docs/SESSION_LOG_2026-04-19.md b/docs/SESSION_LOG_2026-04-19.md
new file mode 100644
index 0000000..f947630
--- /dev/null
+++ b/docs/SESSION_LOG_2026-04-19.md
@@ -0,0 +1,120 @@
+# Session Log — 2026-04-19
+
+## Key Discovery: Why Multi-Track Training Fails
+
+### The Problem
+Our multi-track training uses `close_and_switch()` which:
+1. Closes the TCP connection to the sim
+2. Sends `exit_scene` to go back to menu
+3. Opens a NEW connection on a different track
+4. Calls `model.set_env(new_env)` to swap the environment
+
+This disrupts PPO's training because:
+- PPO's rollout buffer contains partial experience from the old track
+- The value function estimates become wrong for the new track
+- The advantage calculations (which drive PPO's policy updates) are corrupted
+- Every switch is like ripping out a student's notebook mid-lesson
+
+### Evidence
+- **Wave 4:** 25 trials with this methodology. Only 4/25 (16%) scored >500.
+  Median score 111. Trial 9 scored 1435 but was a lucky outlier.
+- **Exp 10:** Same code, nearly identical hyperparameters to Trial 9.
+  Total failure — crashes on all tracks at <180 steps.
+- **Conclusion:** Trial 9's success was random weight initialization luck,
+  not evidence the method works.
+
+### The Fix: Parallel Environments (DummyVecEnv)
+
+SB3's `DummyVecEnv` can wrap multiple gym environments. PPO collects
+experience from ALL environments in every rollout batch. No switching,
+no closing, no disruption.
+
+```python
+env = DummyVecEnv([
+    lambda: wrap_env(gym.make('donkey-generated-track-v0', conf={"port": 9091})),
+    lambda: wrap_env(gym.make('donkey-mountain-track-v0', conf={"port": 9093})),
+])
+env = VecTransposeImage(env)
+model = PPO('CnnPolicy', env, ...)
+model.learn(total_timesteps=90000)  # both tracks in EVERY batch
+```
+
+This requires two sim instances on different ports (one track per sim),
+but gives PPO a stable, consistent training setup — exactly how SB3 is
+designed to work with multiple environments.
+
+### How DummyVecEnv Works (for future reference)
+
+PPO training loop (simplified):
+```
+for each rollout batch:
+  for each of N steps in rollout:
+    for each env in DummyVecEnv:  ← env[0]=generated_track, env[1]=mountain_track
+      action = policy(observation)
+      next_obs, reward, done = env.step(action)
+      store (obs, action, reward, done) in buffer
+
+  compute advantages using value function
+  update policy using all experience from ALL envs
+```
+
+Key insight: the model doesn't "know" which track it's on. It just sees
+images and learns a policy that works across all the images it sees.
+Both tracks contribute to every policy update. This prevents catastrophic
+forgetting because the model never stops seeing either track.
+
+With close_and_switch: model trains on track A for 6000 steps, completely
+forgets track A while training on track B for 6000 steps, etc. Classic
+catastrophic interference.
+
+With DummyVecEnv: model sees both tracks simultaneously in every batch.
+Like a human alternating laps between two courses — never forgets either one.
+
+### Alternative: Same Env, Switch Track Scene
+
+Theoretically possible: keep TCP connection open, send `exit_scene` then
+`load_scene(new_track)` without closing the gym env. The observation and
+action spaces are identical across tracks so SB3 wouldn't notice.
+
+Concerns:
+- gym_donkeycar's DonkeyEnv initializes scene in __init__, not designed
+  for mid-session scene changes
+- The viewer/sim controller state machine may not handle re-loading cleanly
+- Still sequential (not parallel) so still has the forgetting problem,
+  just without the env close/reopen disruption
+- Untested — could introduce subtle bugs
+
+### Hardware Options
+- Two sim instances on same machine (different ports: 9091, 9093)
+  - Risk: GPU memory pressure from two Unity instances
+- Second sim on remote machine
+  - gym_donkeycar supports `host` parameter in conf
+  - Previous connection issues to remote host need debugging
+
+### Image Augmentation (complementary, not primary)
+DonkeyCar sim has built-in augmentation options:
+- Gaussian blur, image flipping, cropping
+- Other donkeycar users use these for generalization
+- Solves visual robustness (lighting, noise) but NOT track geometry diversity
+- Best used TOGETHER with parallel multi-track training
+
+### Warm Start Failure Re-Analysis
+Previously tried warm-starting from generated_road champion onto multi-track
+training. This failed — but it used the broken close_and_switch methodology.
+The warm start itself may not have been the problem. Worth retrying once
+parallel envs are working.
+
+## Exp 10 Evaluation Results (re-run 2026-04-19)
+
+| Track | Set 1 | Set 2 | Set 3 | Mean | Verdict |
+|---|---|---|---|---|---|
+| mountain_track (trained) | 178 | 179 | 179 | **179** | ❌ Crashes at same spot |
+| generated_track (trained) | 99 | 82 | 88 | **90** | ❌ Crashes immediately |
+| generated_road (zero-shot) | 135 | 223 | 105 | **154** | ❌ Crashes early |
+| mini_monaco (zero-shot) | 111 | 133 | 129 | **124** | ❌ Crashes early |
+
+## Next Steps
+- **Exp 11:** Test parallel DummyVecEnv with two sim instances (ports 9091 + 9093)
+- First: verify we can connect to both sims simultaneously
+- Then: train with both tracks in parallel, same hyperparameters as Trial 9
+- Goal: consistent results (not lottery), measured over multiple runs
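Note on the "verify we can connect to both sims" step: a pre-flight check could be sketched with the standard library alone. This is a minimal sketch and not part of the patch. The helper names `port_is_open` and `check_sims`, the host `127.0.0.1`, and the 2-second timeout are assumptions; only ports 9091 and 9093 come from the session log. It tests TCP reachability only and does not confirm which track a sim has loaded. Whether a short probe connection has side effects inside the sim is untested.

```python
import socket


def port_is_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to (host, port) succeeds within timeout."""
    try:
        # create_connection raises OSError (refused, unreachable, timed out) on failure.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def check_sims(host: str, ports: list[int]) -> dict[int, bool]:
    """Probe each port once and report which ones accepted a connection."""
    return {port: port_is_open(host, port) for port in ports}


if __name__ == "__main__":
    # Ports are taken from the session log; the host is an assumption.
    status = check_sims("127.0.0.1", [9091, 9093])
    for port, ok in status.items():
        print(f"port {port}: {'reachable' if ok else 'NOT reachable'}")
```

Running this before building the `DummyVecEnv` would separate "sim not listening" failures from training failures. For a remote sim, the same call with a different `host` value would apply.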