donkeycar-rl-autoresearch/docs/SESSION_LOG_2026-04-19.md

# Session Log — 2026-04-19

## Key Discovery: Why Multi-Track Training Fails

### The Problem
Our multi-track training uses `close_and_switch()`, which:
1. Closes the TCP connection to the sim
2. Sends `exit_scene` to go back to menu
3. Opens a NEW connection on a different track
4. Calls `model.set_env(new_env)` to swap the environment

This disrupts PPO's training because:
- PPO's rollout buffer contains partial experience from the old track
- The value function estimates become wrong for the new track
- The advantage calculations (which drive PPO's policy updates) are corrupted
- Every switch is like ripping out a student's notebook mid-lesson
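
To make the advantage point concrete, here is a minimal sketch of generalized advantage estimation (GAE), the estimator that SB3's PPO uses. The reward and value numbers are made up for illustration and are not taken from any run in this log. The `stale` list stands in for value estimates that no longer match the track being driven, and the sketch ignores episode boundaries for brevity.

```python
def gae_advantages(rewards, values, last_value, gamma=0.99, lam=0.95):
    """GAE over one rollout segment with no episode ends inside it."""
    advantages = [0.0] * len(rewards)
    next_value = last_value
    running = 0.0
    for t in reversed(range(len(rewards))):
        # One-step TD error: how much better the step was than the critic predicted.
        delta = rewards[t] + gamma * next_value - values[t]
        running = delta + gamma * lam * running
        advantages[t] = running
        next_value = values[t]
    return advantages


rewards = [1.0, 1.0, 1.0, 1.0]
calibrated = [3.9, 2.9, 2.0, 1.0]  # hypothetical: roughly matches the returns
stale = [9.0, 9.0, 9.0, 9.0]       # hypothetical: estimates that no longer fit

adv_calibrated = gae_advantages(rewards, calibrated, last_value=0.0)
adv_stale = gae_advantages(rewards, stale, last_value=0.0)

print("calibrated:", [round(a, 3) for a in adv_calibrated])
print("stale:     ", [round(a, 3) for a in adv_stale])
```

With calibrated values the advantages stay near zero. With the stale values every step looks strongly worse than expected, even though the rewards are identical, so the policy update would push against actions that were fine.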

### Evidence
- **Wave 4:** 25 trials with this methodology. Only 4/25 (16%) scored >500.
  Median score 111. Trial 9 scored 1435 but was a lucky outlier.
- **Exp 10:** Same code, nearly identical hyperparameters to Trial 9.
  Total failure: it crashes on every track, with mean episode length
  under 180 steps (see the evaluation table below).
- **Conclusion:** Trial 9's success was most likely luck in the random
  weight initialization, not evidence the method works.

### The Fix: Parallel Environments (DummyVecEnv)

SB3's `DummyVecEnv` can wrap multiple gym environments. PPO collects
experience from ALL environments in every rollout batch. No switching,
no closing, no disruption.

```python
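import gym
import gym_donkeycar  # importing registers the donkey-* env ids
from stable_baselines3 import PPO
from stable_baselines3.common.vec_env import DummyVecEnv, VecTransposeImage

# wrap_env is this project's own wrapper helper; one sim instance per port
env = DummyVecEnv([
    lambda: wrap_env(gym.make('donkey-generated-track-v0', conf={"port": 9091})),
    lambda: wrap_env(gym.make('donkey-mountain-track-v0', conf={"port": 9093})),
])
env = VecTransposeImage(env)
model = PPO('CnnPolicy', env)  # plus the trial's hyperparameters (omitted here)
model.learn(total_timesteps=90000)  # both tracks in EVERY batch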
```

This requires two sim instances on different ports (one track per sim),
but gives PPO a stable, consistent training setup — exactly how SB3 is
designed to work with multiple environments.
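
If the track list grows and the constructors are built in a loop, a small factory avoids Python's late-binding closure pitfall. The sketch below is an assumption about how this could be organized, not existing project code: `make_env_fn` and `fake_make` are hypothetical names, and `fake_make` stands in for `gym.make` so the example runs without a simulator.

```python
def make_env_fn(env_id, port, make, wrap=lambda env: env):
    """Return a zero-argument constructor with env_id and port bound now."""
    def _init():
        return wrap(make(env_id, conf={"port": port}))
    return _init


TRACKS = [
    ("donkey-generated-track-v0", 9091),
    ("donkey-mountain-track-v0", 9093),
]


def fake_make(env_id, conf):
    # Stand-in for gym.make: returns a tuple instead of a real environment.
    return (env_id, conf["port"])


env_fns = [make_env_fn(env_id, port, make=fake_make) for env_id, port in TRACKS]
built = [fn() for fn in env_fns]

# The pitfall: lambdas created in a loop look up env_id and port when called,
# so every one of them ends up using the values from the last iteration.
late_bound = [lambda: fake_make(env_id, {"port": port}) for env_id, port in TRACKS]
late_built = [fn() for fn in late_bound]

print(built)
print(late_built)
```

With the real stack, `make=gym.make` and `wrap=wrap_env` would be passed instead, and `env_fns` handed to `DummyVecEnv`. The hand-written lambdas in the snippet above are unaffected, since each one contains its own literals.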

### How DummyVecEnv Works (for future reference)

PPO training loop (simplified):
```
for each rollout batch:
    for each of N steps in rollout:
        actions = policy(observations)       ← one batched forward pass for all envs
        for each env in DummyVecEnv:         ← env[0]=generated_track, env[1]=mountain_track
            next_obs, reward, done = env.step(action for this env)
        store (obs, action, reward, done, value, log_prob) for every env in buffer

    compute advantages using value function
    update policy using all experience from ALL envs
```

Key insight: the model doesn't "know" which track it's on. It just sees
images and learns a policy that works across all the images it sees.
Both tracks contribute to every policy update. This prevents catastrophic
forgetting because the model never stops seeing either track.

With `close_and_switch`: the model trains on track A for 6000 steps, then
completely forgets track A while training on track B for 6000 steps, and so
on. Classic catastrophic interference.

With `DummyVecEnv`: the model sees both tracks in every batch.
Like a human alternating laps between two courses, it never forgets either one.
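
The two schedules can be compared with a toy bookkeeping sketch that only counts which track each policy update gets data from. No learning happens here. The 6000-step switch interval comes from the description above; the 1000-step rollout length and the function names are assumptions chosen for illustration.

```python
from collections import Counter

TRACKS = ["generated_track", "mountain_track"]


def parallel_batches(n_updates, n_steps):
    """Each rollout batch holds n_steps transitions from every track."""
    return [Counter({track: n_steps for track in TRACKS}) for _ in range(n_updates)]


def sequential_batches(n_updates, n_steps, switch_every):
    """Each batch comes from one track; the track changes every switch_every steps."""
    batches, steps_done = [], 0
    for _ in range(n_updates):
        track = TRACKS[(steps_done // switch_every) % len(TRACKS)]
        batches.append(Counter({track: n_steps}))
        steps_done += n_steps
    return batches


def longest_gap(batches, track):
    """Longest run of consecutive updates that contain no data from track."""
    longest = current = 0
    for batch in batches:
        current = 0 if batch[track] else current + 1
        longest = max(longest, current)
    return longest


par = parallel_batches(n_updates=12, n_steps=1000)
seq = sequential_batches(n_updates=12, n_steps=1000, switch_every=6000)

for name, batches in (("parallel", par), ("sequential", seq)):
    gaps = {track: longest_gap(batches, track) for track in TRACKS}
    print(name, gaps)
```

In the sequential schedule each track goes several consecutive updates without contributing a single transition, which is the window in which forgetting can happen. In the parallel schedule that gap is zero.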

### Alternative: Same Env, Switch Track Scene

Theoretically possible: keep the TCP connection open, send `exit_scene` then
`load_scene(new_track)` without closing the gym env. The observation and
action spaces are identical across tracks, so SB3 wouldn't notice.

Concerns:
- gym_donkeycar's `DonkeyEnv` initializes the scene in `__init__` and is not
  designed for mid-session scene changes
- The viewer/sim controller state machine may not handle re-loading cleanly
- Still sequential (not parallel) so still has the forgetting problem,
  just without the env close/reopen disruption
- Untested — could introduce subtle bugs

### Hardware Options
- Two sim instances on same machine (different ports: 9091, 9093)
  - Risk: GPU memory pressure from two Unity instances
- Second sim on remote machine
  - gym_donkeycar supports `host` parameter in conf
  - Previous connection issues to remote host need debugging
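
Both layouts come down to giving each env its own `host`/`port` pair in `conf`. The sketch below builds those dicts and checks that no two envs point at the same endpoint. `sim_conf` and `endpoints_unique` are hypothetical helpers, and `192.0.2.10` is a placeholder documentation address, not the real remote host.

```python
def sim_conf(port, host="127.0.0.1"):
    """Build the host/port part of a conf dict for one simulator instance."""
    return {"host": host, "port": port}


def endpoints_unique(confs):
    """True if no two envs would connect to the same (host, port) pair."""
    endpoints = [(c["host"], c["port"]) for c in confs]
    return len(set(endpoints)) == len(endpoints)


same_machine = [sim_conf(9091), sim_conf(9093)]
two_machines = [sim_conf(9091), sim_conf(9091, host="192.0.2.10")]
clash = [sim_conf(9091), sim_conf(9091)]  # both envs would hit one sim

for name, confs in (("same_machine", same_machine),
                    ("two_machines", two_machines),
                    ("clash", clash)):
    print(name, endpoints_unique(confs))
```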

### Image Augmentation (complementary, not primary)
DonkeyCar sim has built-in augmentation options:
- Gaussian blur, image flipping, cropping
- Other donkeycar users use these for generalization
- Solves visual robustness (lighting, noise) but NOT track geometry diversity
- Best used TOGETHER with parallel multi-track training

### Warm Start Failure Re-Analysis
Previously tried warm-starting from generated_road champion onto multi-track
training. This failed — but it used the broken close_and_switch methodology.
The warm start itself may not have been the problem. Worth retrying once
parallel envs are working.

## Exp 10 Evaluation Results (re-run 2026-04-19)

| Track | Set 1 | Set 2 | Set 3 | Mean | Verdict |
|---|---|---|---|---|---|
| mountain_track (trained) | 178 | 179 | 179 | **179** | ❌ Crashes at same spot |
| generated_track (trained) | 99 | 82 | 88 | **90** | ❌ Crashes immediately |
| generated_road (zero-shot) | 135 | 223 | 105 | **154** | ❌ Crashes early |
| mini_monaco (zero-shot) | 111 | 133 | 129 | **124** | ❌ Crashes early |

## Next Steps
- **Exp 11:** Tested parallel DummyVecEnv with two sim instances (ports 9091 + 9093)
  - Exp 11 (v5 reward): aborted due to circular driving on generated_track
  - Exp 11b (v6 reward): completed, no circles, but plateaus at ~194 steps on all tracks
- **v6 reward confirmed:** efficiency gate prevents circles, tests pass
- **Parallel env confirmed:** mechanically sound, stable training
- **Open issue:** 90k steps may be insufficient for 2-env training (45k per track)
- **Next experiment ideas:**
  - Increase to 180k-250k total steps
  - Test v6 on single track to isolate reward effect
  - Check if efficiency gate fires during normal cornering (false positives)
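
For the step-budget question, a rough calculation of per-track experience and the number of policy updates may help when choosing the next total. The sketch assumes `n_steps=2048`, which is SB3's PPO default; the value actually used in these experiments is not recorded in this log, so treat the update counts as indicative only.

```python
def steps_per_env(total_timesteps, n_envs):
    """Transitions each env contributes when the budget is split evenly."""
    return total_timesteps // n_envs


def n_policy_updates(total_timesteps, n_steps, n_envs):
    """Rollout batches collected; each holds n_steps * n_envs transitions."""
    batch = n_steps * n_envs
    return -(-total_timesteps // batch)  # ceiling division


N_ENVS = 2
N_STEPS = 2048  # assumption: SB3 default, not confirmed for this project

for total in (90_000, 180_000, 250_000):
    print(
        f"{total:>7} total steps:",
        f"{steps_per_env(total, N_ENVS)} per track,",
        f"{n_policy_updates(total, N_STEPS, N_ENVS)} rollout batches",
    )
```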