docs: session log + ADR-019 — parallel DummyVecEnv for multi-track training
parent db1274174f
commit 86357622e3

DECISIONS.md
@ -373,3 +373,46 @@ positional progress, not collision contact. This is the correct signal.

**Tuning note:** stuck_steps=80 (~5 seconds at 16 steps/sec). Could be
reduced to 40 (~2.5 seconds) if stuck periods are observably long.

---

## ADR-019: Parallel DummyVecEnv for Multi-Track Training (Not Close-and-Switch)

**Date:** 2026-04-19
**Status:** Proposed (to be validated by Exp 11)

**Context:** Multi-track training via close_and_switch() (closing the env,
reopening on a new track, then calling model.set_env()) produced unreliable
results. Wave 4 had 25 trials: only 4/25 (16%) scored >500, median 111.
Exp 10 used nearly identical hyperparameters to the best Wave 4 trial
and failed completely (crashes on every track, mean <180 steps).

Likely root cause: PPO is an on-policy algorithm. Its rollout buffer, value
function estimates, and advantage calculations are disrupted when the
environment is swapped mid-training. The model catastrophically forgets
one track while training on another.

**Decision:** Use SB3's DummyVecEnv with one env per track, each connected
to a separate sim instance on its own port. DummyVecEnv steps its envs one
after another in a single process, so "parallel" here means that both tracks
are live at once: PPO collects experience from ALL tracks in every rollout
batch, with no switching and no forgetting.

```python
import gym
import gym_donkeycar  # registers the donkey-* env ids
from stable_baselines3.common.vec_env import DummyVecEnv

env = DummyVecEnv([  # wrap_env is this project's own wrapper
    lambda: wrap_env(gym.make('donkey-generated-track-v0', conf={"port": 9091})),
    lambda: wrap_env(gym.make('donkey-mountain-track-v0', conf={"port": 9093})),
])
```
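
If the track list grows, the env constructors will probably be built in a loop rather than written out by hand. A minimal sketch of that, assuming a hypothetical helper `make_env_thunks` and a stand-in `fake_build` in place of `wrap_env(gym.make(...))` so it runs without a simulator (neither name exists in the project):

```python
from typing import Callable, Dict, List


def make_env_thunks(
    track_ports: Dict[str, int],
    build_env: Callable[[str, int], object],
) -> List[Callable[[], object]]:
    """Return one zero-argument constructor per (env_id, port) pair."""
    thunks = []
    for env_id, port in track_ports.items():
        # Bind the loop variables as default arguments. A bare
        # `lambda: build_env(env_id, port)` would capture the variables,
        # not their values, so every thunk would build the LAST track.
        thunks.append(lambda env_id=env_id, port=port: build_env(env_id, port))
    return thunks


# Hypothetical stand-in for wrap_env(gym.make(env_id, conf={"port": port})),
# used here only so the sketch runs without a simulator.
def fake_build(env_id: str, port: int) -> tuple:
    return (env_id, port)


TRACK_PORTS = {
    "donkey-generated-track-v0": 9091,
    "donkey-mountain-track-v0": 9093,
}
thunks = make_env_thunks(TRACK_PORTS, fake_build)
built = [thunk() for thunk in thunks]
print(built)
```

In real training the list returned by `make_env_thunks` would be passed to `DummyVecEnv`, with `build_env` wrapping the actual `gym.make` call. The default-argument binding is the part that matters: without it, both envs would silently connect to the same track and port.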

**Consequences:**
- Requires multiple sim instances (one per training track)
- More GPU/CPU usage; can be mitigated by running sims on separate machines
- PPO sees both tracks in every batch, which should prevent catastrophic forgetting
- No env close/reopen, so the training setup stays stable throughout
- This is how SB3 is designed to work with multiple environments

**Rejected alternatives:**
- close_and_switch (current): disrupts PPO; 21/25 (84%) of Wave 4 trials scored ≤500
- Same-connection scene switching: untested, still sequential, fragile

**Validation:** Exp 11 will test this approach. If results are consistent
across multiple runs (not a lottery), this ADR is confirmed.

@ -0,0 +1,120 @@
# Session Log — 2026-04-19

## Key Discovery: Why Multi-Track Training Fails

### The Problem
Our multi-track training uses `close_and_switch()`, which:
1. Closes the TCP connection to the sim
2. Sends `exit_scene` to go back to the menu
3. Opens a NEW connection on a different track
4. Calls `model.set_env(new_env)` to swap the environment

This disrupts PPO's training because:
- PPO's rollout buffer contains partial experience from the old track
- The value function estimates become wrong for the new track
- The advantage calculations (which drive PPO's policy updates) are corrupted
- Every switch is like ripping out a student's notebook mid-lesson

### Evidence
- **Wave 4:** 25 trials with this methodology. Only 4/25 (16%) scored >500.
  Median score 111. Trial 9 scored 1435 but was a lucky outlier.
- **Exp 10:** Same code, nearly identical hyperparameters to Trial 9.
  Total failure: crashes on every track, with a mean of <180 steps.
- **Conclusion:** Trial 9's success was most likely random weight-initialization
  luck, not evidence that the method works.
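
To put a rough number on "lucky outlier", the Wave 4 hit rate can be given an uncertainty range. The sketch below computes a Wilson score interval for 4 successes in 25 trials. It assumes each trial is an independent pass/fail draw with ">500" as the pass mark, which is only an approximation, since the Wave 4 trials did not all share the same hyperparameters.

```python
import math


def wilson_interval(successes: int, trials: int, z: float = 1.96) -> tuple:
    """Wilson score interval for a binomial proportion (z=1.96 is ~95%)."""
    p_hat = successes / trials
    z2 = z * z
    denom = 1.0 + z2 / trials
    center = (p_hat + z2 / (2 * trials)) / denom
    half_width = (
        z * math.sqrt(p_hat * (1 - p_hat) / trials + z2 / (4 * trials**2)) / denom
    )
    return center - half_width, center + half_width


# Wave 4: 4 of 25 trials scored >500 (assumption: independent pass/fail trials).
low, high = wilson_interval(4, 25)
print(f"observed 16%, approx. 95% interval {low:.1%} to {high:.1%}")
```

The interval comes out at roughly 6% to 35%. Even at the optimistic end, a single rerun such as Exp 10 was more likely to fail than to succeed, which is consistent with reading Trial 9 as luck.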

### The Fix: Parallel Environments (DummyVecEnv)

SB3's `DummyVecEnv` can wrap multiple gym environments. PPO collects
experience from ALL environments in every rollout batch. No switching,
no closing, no disruption.

```python
import gym
import gym_donkeycar  # registers the donkey-* env ids
from stable_baselines3 import PPO
from stable_baselines3.common.vec_env import DummyVecEnv, VecTransposeImage

env = DummyVecEnv([  # wrap_env is this project's own wrapper
    lambda: wrap_env(gym.make('donkey-generated-track-v0', conf={"port": 9091})),
    lambda: wrap_env(gym.make('donkey-mountain-track-v0', conf={"port": 9093})),
])
env = VecTransposeImage(env)
model = PPO('CnnPolicy', env, ...)  # hyperparameters elided
model.learn(total_timesteps=90000)  # counted across both envs; both tracks in EVERY batch
```

This requires two sim instances on different ports (one track per sim),
but gives PPO a stable, consistent training setup, which is exactly how SB3
is designed to work with multiple environments.

### How DummyVecEnv Works (for future reference)

PPO training loop (simplified):
```
for each rollout batch:
    for each of N steps in rollout:
        actions = policy(observations)    # one batched call for all envs
        for each env i in DummyVecEnv:    # env[0]=generated_track, env[1]=mountain_track
            next_obs[i], reward[i], done[i] = env[i].step(actions[i])
        store (observations, actions, rewards, dones) in buffer
        observations = next_obs

    compute advantages using value function
    update policy using all experience from ALL envs
```

Key insight: the model doesn't "know" which track it's on. It just sees
images and learns a policy that works across all the images it sees.
Both tracks contribute to every policy update. This should prevent catastrophic
forgetting because the model never stops seeing either track.
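
The collection loop can be checked on a toy version with no simulator and no SB3. In this sketch `ToyTrackEnv` and `collect_rollout` are hypothetical names, the "policy" is a constant placeholder, and only the bookkeeping is meant to mirror a vectorized rollout: every env is stepped once per rollout step, and finished episodes are reset automatically.

```python
from collections import Counter


class ToyTrackEnv:
    """Hypothetical stand-in for one simulator connection."""

    def __init__(self, track: str):
        self.track = track
        self.t = 0

    def reset(self):
        self.t = 0
        return (self.track, self.t)

    def step(self, action):
        self.t += 1
        obs = (self.track, self.t)
        reward = 1.0
        done = self.t >= 5  # short fixed-length episodes
        if done:
            obs = self.reset()  # vectorized envs auto-reset finished episodes
        return obs, reward, done


def collect_rollout(envs, n_steps):
    """Step every env once per rollout step; returns n_steps * len(envs) transitions."""
    observations = [env.reset() for env in envs]
    buffer = []
    for _ in range(n_steps):
        actions = [0 for _ in observations]  # placeholder for one batched policy call
        for i, env in enumerate(envs):  # envs are stepped one after another
            next_obs, reward, done = env.step(actions[i])
            buffer.append((observations[i], actions[i], reward, done))
            observations[i] = next_obs
    return buffer


envs = [ToyTrackEnv("generated_track"), ToyTrackEnv("mountain_track")]
rollout = collect_rollout(envs, n_steps=8)
per_track = Counter(obs[0] for obs, _, _, _ in rollout)
print(len(rollout), dict(per_track))
```

With two envs and 8 rollout steps the buffer holds 16 transitions, 8 from each track, so any minibatch drawn from it mixes both tracks.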

With close_and_switch: the model trains on track A for 6000 steps, then
forgets track A while training on track B for 6000 steps, and so on. Classic
catastrophic interference.

With DummyVecEnv: the model sees both tracks in every batch, like a driver
alternating laps between two courses who never goes long enough without
either one to forget it.
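
The difference between the two schedules shows up in what each policy update is trained on. The sketch below is an idealized model, not a measurement: it assumes one global step counter, a rollout length of 2048 (SB3's documented default `n_steps` for PPO; whether this project uses the default is an assumption), and the 6000-step switch interval mentioned above. Both function names are hypothetical.

```python
def track_shares_sequential(rollout_steps, switch_every, n_rollouts):
    """Share of each rollout that comes from track A when tracks alternate in blocks."""
    shares = []
    for r in range(n_rollouts):
        start = r * rollout_steps
        on_a = sum(
            1
            for t in range(start, start + rollout_steps)
            if (t // switch_every) % 2 == 0  # even blocks are track A
        )
        shares.append(on_a / rollout_steps)
    return shares


def track_shares_vectorized(n_rollouts):
    """With one env per track stepped in lockstep, every rollout is half A, half B."""
    return [0.5] * n_rollouts


seq = track_shares_sequential(rollout_steps=2048, switch_every=6000, n_rollouts=6)
vec = track_shares_vectorized(n_rollouts=6)
print("sequential:", [round(s, 2) for s in seq])
print("vectorized:", vec)
```

Under these assumptions the sequential schedule yields updates that are almost entirely one track for several rollouts in a row, then almost entirely the other, while the vectorized schedule gives every update an even split.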

### Alternative: Same Env, Switch Track Scene

Theoretically possible: keep the TCP connection open and send `exit_scene` then
`load_scene(new_track)` without closing the gym env. The observation and
action spaces are identical across tracks, so SB3 wouldn't notice.

Concerns:
- gym_donkeycar's DonkeyEnv initializes the scene in `__init__` and is not
  designed for mid-session scene changes
- The viewer/sim controller state machine may not handle re-loading cleanly
- Still sequential (not parallel), so it still has the forgetting problem,
  just without the env close/reopen disruption
- Untested, so it could introduce subtle bugs

### Hardware Options
- Two sim instances on the same machine (different ports: 9091, 9093)
  - Risk: GPU memory pressure from two Unity instances
- Second sim on a remote machine
  - gym_donkeycar supports a `host` parameter in conf
  - Previous connection issues to the remote host need debugging

### Image Augmentation (complementary, not primary)
DonkeyCar sim has built-in augmentation options:
- Gaussian blur, image flipping, cropping
- Other donkeycar users use these for generalization
- Solves visual robustness (lighting, noise) but NOT track geometry diversity
- Best used TOGETHER with parallel multi-track training

### Warm Start Failure Re-Analysis
Previously tried warm-starting from the generated_road champion onto multi-track
training. This failed, but it used the broken close_and_switch methodology.
The warm start itself may not have been the problem. Worth retrying once
parallel envs are working.

## Exp 10 Evaluation Results (re-run 2026-04-19)

| Track | Set 1 | Set 2 | Set 3 | Mean | Verdict |
|---|---|---|---|---|---|
| mountain_track (trained) | 178 | 179 | 179 | **179** | ❌ Crashes at same spot |
| generated_track (trained) | 99 | 82 | 88 | **90** | ❌ Crashes immediately |
| generated_road (zero-shot) | 135 | 223 | 105 | **154** | ❌ Crashes early |
| mini_monaco (zero-shot) | 111 | 133 | 129 | **124** | ❌ Crashes early |
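
The means in the table can be reproduced, and the spread per track quantified, with a few lines. This is a sketch over the table values only. Using the coefficient of variation (standard deviation divided by mean) as the spread measure is a choice made here, not something the experiment defined.

```python
from statistics import mean, pstdev

# Per-set values copied from the Exp 10 evaluation table.
EXP10 = {
    "mountain_track": [178, 179, 179],
    "generated_track": [99, 82, 88],
    "generated_road": [135, 223, 105],
    "mini_monaco": [111, 133, 129],
}


def summarize(scores):
    """Return (rounded mean, coefficient of variation) for one track."""
    m = mean(scores)
    return round(m), pstdev(scores) / m


summary = {track: summarize(scores) for track, scores in EXP10.items()}
for track, (m, cv) in summary.items():
    print(f"{track:16s} mean={m:4d}  cv={cv:.3f}")
```

mountain_track has almost no spread, which fits the "crashes at same spot" verdict, while generated_road varies by roughly a third of its mean across sets. The same summary could be applied across repeated training runs when judging whether results are consistent.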

## Next Steps
- **Exp 11:** Test parallel DummyVecEnv with two sim instances (ports 9091 + 9093)
  - First: verify we can connect to both sims simultaneously
  - Then: train with both tracks in parallel, same hyperparameters as Trial 9
  - Goal: consistent results (not a lottery), measured over multiple runs
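
For the "First" step, a cheap pre-flight check is to confirm that both ports accept a TCP connection before building any gym env. A minimal sketch using only the standard library; `can_connect` and `check_sims` are hypothetical helpers, and `127.0.0.1` assumes both sims run on the local machine.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to (host, port) can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def check_sims(endpoints):
    """Map each (host, port) endpoint to whether it accepted a connection."""
    return {endpoint: can_connect(*endpoint) for endpoint in endpoints}


if __name__ == "__main__":
    # Ports from the plan above; the host is an assumption (both sims local).
    results = check_sims([("127.0.0.1", 9091), ("127.0.0.1", 9093)])
    for endpoint, ok in results.items():
        print(endpoint, "reachable" if ok else "NOT reachable")
```

This only shows that something is listening on each port. It does not prove that two gym_donkeycar envs can hold sessions at the same time, so the real test is still constructing both envs and stepping each once.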