From 86357622e3a7fb73bb106cb56ba905116e37d6d4 Mon Sep 17 00:00:00 2001
From: Paul Huliganga <paje0101@gmail.com>
Date: Sun, 19 Apr 2026 10:50:11 -0400
Subject: [PATCH] =?UTF-8?q?docs:=20session=20log=20+=20ADR-019=20=E2=80=94?=
 =?UTF-8?q?=20parallel=20DummyVecEnv=20for=20multi-track=20training?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

---
 DECISIONS.md                   |  43 ++++++++++++
 docs/SESSION_LOG_2026-04-19.md | 120 +++++++++++++++++++++++++++++++++
 2 files changed, 163 insertions(+)
 create mode 100644 docs/SESSION_LOG_2026-04-19.md

diff --git a/DECISIONS.md b/DECISIONS.md
index a5be01e..892a704 100644
--- a/DECISIONS.md
+++ b/DECISIONS.md
@@ -373,3 +373,46 @@ positional progress, not collision contact. This is the correct signal.
 
 **Tuning note:** stuck_steps=80 (~5 seconds at 16 steps/sec). Could be
 reduced to 40 (~2.5 seconds) if stuck periods are observably long.
+
+---
+
+## ADR-019: Parallel DummyVecEnv for Multi-Track Training (Not Close-and-Switch)
+
+**Date:** 2026-04-19
+**Status:** Proposed (to be validated by Exp 11)
+
+**Context:** Multi-track training via close_and_switch() — closing the env,
+reopening on a new track, calling model.set_env() — produced unreliable
+results. Wave 4 had 25 trials: only 4/25 scored >500, median 111.
+Exp 10 used nearly identical hyperparameters to the best Wave 4 trial
+and failed completely (crashed in under 180 steps on all tracks).
+
+Suspected root cause: PPO is an on-policy algorithm. Its rollout buffer,
+value function estimates, and advantage calculations are disrupted when
+the environment is swapped mid-training, and the model catastrophically
+forgets one track while training on another.
+
+**Decision:** Use SB3's DummyVecEnv with one env per track, each connected
+to a separate sim instance on a different port. PPO collects experience
+from ALL tracks in every rollout batch — no switching, no forgetting.
+
+```python
+env = DummyVecEnv([
+    lambda: wrap_env(gym.make('donkey-generated-track-v0', conf={"port": 9091})),
+    lambda: wrap_env(gym.make('donkey-mountain-track-v0', conf={"port": 9093})),
+])
+```
+
+**Consequences:**
+- Requires multiple sim instances (one per training track)
+- More GPU/CPU usage — can be mitigated by running sims on separate machines
+- PPO sees both tracks in every batch — no catastrophic forgetting
+- No env close/reopen — stable training throughout
+- This is how SB3 is designed to work with multiple environments
+
+**Rejected alternatives:**
+- close_and_switch (current): disrupts PPO; 21/25 Wave 4 trials (84%) scored ≤500
+- Same-connection scene switching: untested, still sequential, fragile
+
+**Validation:** Exp 11 will test this approach. If results are consistent
+across multiple runs (not lottery), this ADR is confirmed.
diff --git a/docs/SESSION_LOG_2026-04-19.md b/docs/SESSION_LOG_2026-04-19.md
new file mode 100644
index 0000000..f947630
--- /dev/null
+++ b/docs/SESSION_LOG_2026-04-19.md
@@ -0,0 +1,120 @@
+# Session Log — 2026-04-19
+
+## Key Discovery: Why Multi-Track Training Fails
+
+### The Problem
+Our multi-track training uses `close_and_switch()`, which:
+1. Closes the TCP connection to the sim
+2. Sends `exit_scene` to go back to menu
+3. Opens a NEW connection on a different track
+4. Calls `model.set_env(new_env)` to swap the environment
+
+This disrupts PPO's training because:
+- PPO's rollout buffer contains partial experience from the old track
+- The value function estimates become wrong for the new track
+- The advantage calculations (which drive PPO's policy updates) are corrupted
+- Every switch is like ripping out a student's notebook mid-lesson
+
+### Evidence
+- **Wave 4:** 25 trials with this methodology. Only 4/25 (16%) scored >500.
+  Median score 111. Trial 9 scored 1435 but was a lucky outlier.
+- **Exp 10:** Same code, nearly identical hyperparameters to Trial 9.
+  Total failure — crashes on all tracks at <180 steps.
+- **Conclusion:** Trial 9's success most likely came down to a lucky seed
+  (random weight initialization), not evidence that the method works.
+
+### The Fix: Parallel Environments (DummyVecEnv)
+
+SB3's `DummyVecEnv` can wrap multiple gym environments. PPO collects
+experience from ALL environments in every rollout batch. No switching,
+no closing, no disruption.
+
+```python
+env = DummyVecEnv([
+    lambda: wrap_env(gym.make('donkey-generated-track-v0', conf={"port": 9091})),
+    lambda: wrap_env(gym.make('donkey-mountain-track-v0', conf={"port": 9093})),
+])
+env = VecTransposeImage(env)
+model = PPO('CnnPolicy', env, ...)
+model.learn(total_timesteps=90000)  # both tracks in EVERY batch
+```
+
+This requires two sim instances on different ports (one track per sim),
+but gives PPO a stable, consistent training setup — exactly how SB3 is
+designed to work with multiple environments.
+
+### How DummyVecEnv Works (for future reference)
+
+PPO training loop (simplified):
+```
+for each rollout batch:
+    for each of N steps in rollout:
+        actions = policy(observations)   ← one batched forward pass for all envs
+        for each env in DummyVecEnv:     ← env[0]=generated_track, env[1]=mountain_track
+            next_obs, reward, done = env.step(its action)   ← envs stepped one after another
+            store (obs, action, reward, done) in buffer
+
+    compute advantages using value function
+    update policy using all experience from ALL envs
+```
+
+Key insight: the model doesn't "know" which track it's on. It just sees
+images and learns a policy that works across all the images it sees.
+Both tracks contribute to every policy update. This prevents catastrophic
+forgetting because the model never stops seeing either track.
+
+With close_and_switch: model trains on track A for 6000 steps, completely
+forgets track A while training on track B for 6000 steps, etc. Classic
+catastrophic interference.
+
+With DummyVecEnv: model sees both tracks simultaneously in every batch.
+Like a human alternating laps between two courses — never forgets either one.
+
+### Alternative: Same Env, Switch Track Scene
+
+Theoretically possible: keep TCP connection open, send `exit_scene` then
+`load_scene(new_track)` without closing the gym env. The observation and
+action spaces are identical across tracks so SB3 wouldn't notice.
+
+Concerns:
+- gym_donkeycar's DonkeyEnv initializes the scene in `__init__` and is not
+  designed for mid-session scene changes
+- The viewer/sim controller state machine may not handle re-loading cleanly
+- Still sequential (not parallel) so still has the forgetting problem,
+  just without the env close/reopen disruption
+- Untested — could introduce subtle bugs
+
+### Hardware Options
+- Two sim instances on same machine (different ports: 9091, 9093)
+  - Risk: GPU memory pressure from two Unity instances
+- Second sim on remote machine
+  - gym_donkeycar supports `host` parameter in conf
+  - Previous connection issues to remote host need debugging
+
+### Image Augmentation (complementary, not primary)
+DonkeyCar sim has built-in augmentation options:
+- Gaussian blur, image flipping, cropping
+- Other donkeycar users use these for generalization
+- Solves visual robustness (lighting, noise) but NOT track geometry diversity
+- Best used TOGETHER with parallel multi-track training
+
+### Warm Start Failure Re-Analysis
+Previously tried warm-starting from generated_road champion onto multi-track
+training. This failed — but it used the broken close_and_switch methodology.
+The warm start itself may not have been the problem. Worth retrying once
+parallel envs are working.
+
+## Exp 10 Evaluation Results (re-run 2026-04-19)
+
+| Track | Set 1 | Set 2 | Set 3 | Mean | Verdict |
+|---|---|---|---|---|---|
+| mountain_track (trained) | 178 | 179 | 179 | **179** | ❌ Crashes at same spot |
+| generated_track (trained) | 99 | 82 | 88 | **90** | ❌ Crashes immediately |
+| generated_road (zero-shot) | 135 | 223 | 105 | **154** | ❌ Crashes early |
+| mini_monaco (zero-shot) | 111 | 133 | 129 | **124** | ❌ Crashes early |
+
+## Next Steps
+- **Exp 11:** Test parallel DummyVecEnv with two sim instances (ports 9091 + 9093)
+- First: verify we can connect to both sims simultaneously
+- Then: train with both tracks in parallel, same hyperparameters as Trial 9
+- Goal: consistent results (not lottery), measured over multiple runs