# Session Log — 2026-04-19
## Key Discovery: Why Multi-Track Training Fails
### The Problem
Our multi-track training uses `close_and_switch()` which:
1. Closes the TCP connection to the sim
2. Sends `exit_scene` to go back to menu
3. Opens a NEW connection on a different track
4. Calls `model.set_env(new_env)` to swap the environment
This disrupts PPO's training because:
- PPO's rollout buffer contains partial experience from the old track
- The value function estimates become wrong for the new track
- The advantage calculations (which drive PPO's policy updates) are corrupted
- Every switch is like ripping out a student's notebook mid-lesson
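For reference, the dependence on the value function can be written out. SB3's PPO computes advantages with generalized advantage estimation (GAE). In standard notation (the symbols below are textbook conventions, not names from this project):

```latex
\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t), \qquad
\hat{A}_t = \sum_{l=0}^{T-t-1} (\gamma\lambda)^{l}\, \delta_{t+l}
```

Every term contains $V$, so a value function fitted to the old track biases every advantage computed from experience on the new one.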
### Evidence
- **Wave 4:** 25 trials with this methodology. Only 4/25 (16%) scored >500.
Median score 111. Trial 9 scored 1435 but was a lucky outlier.
- **Exp 10:** Same code, nearly identical hyperparameters to Trial 9.
Total failure — crashes on all tracks at <180 steps.
- **Conclusion:** Trial 9's success was random weight initialization luck,
not evidence the method works.
### The Fix: Parallel Environments (DummyVecEnv)
SB3's `DummyVecEnv` can wrap multiple gym environments. PPO collects
experience from ALL environments in every rollout batch. No switching,
no closing, no disruption.
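```python
import gym
import gym_donkeycar  # registers the donkey-* environments
from stable_baselines3 import PPO
from stable_baselines3.common.vec_env import DummyVecEnv, VecTransposeImage

# wrap_env is the project's own wrapper helper (defined elsewhere in the repo).
env = DummyVecEnv([
    lambda: wrap_env(gym.make('donkey-generated-track-v0', conf={"port": 9091})),
    lambda: wrap_env(gym.make('donkey-mountain-track-v0', conf={"port": 9093})),
])
env = VecTransposeImage(env)
model = PPO('CnnPolicy', env, ...)
model.learn(total_timesteps=90000)  # both tracks in EVERY batch
```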
This requires two sim instances on different ports (one track per sim),
but gives PPO a stable, consistent training setup, which is exactly how SB3 is
designed to work with multiple environments.
### How DummyVecEnv Works (for future reference)
PPO training loop (simplified):
```
for each rollout batch:
    for each of N steps in rollout:
        for each env in DummyVecEnv:    ← env[0]=generated_track, env[1]=mountain_track
            action = policy(observation)
            next_obs, reward, done = env.step(action)
            store (obs, action, reward, done) in buffer
    compute advantages using value function
    update policy using all experience from ALL envs
```
Key insight: the model doesn't "know" which track it's on. It just sees
images and learns a policy that works across all the images it sees.
Both tracks contribute to every policy update. This prevents catastrophic
forgetting because the model never stops seeing either track.
With `close_and_switch`: model trains on track A for 6000 steps, completely
forgets track A while training on track B for 6000 steps, etc. Classic
catastrophic interference.
With DummyVecEnv: model sees both tracks simultaneously in every batch.
Like a human who alternates laps between two courses and never forgets either one.
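The loop above can be imitated in plain Python to show that every rollout step stores one transition per track. This is a toy sketch, not SB3 code: `ToyTrackEnv`, `ToySequentialVecEnv`, and `collect_rollout` are hypothetical stand-ins, and the fixed episode lengths are made up for illustration.

```python
class ToyTrackEnv:
    """Hypothetical stand-in for one sim instance; episodes end after a fixed length."""

    def __init__(self, name, episode_len):
        self.name = name
        self.episode_len = episode_len
        self.t = 0

    def reset(self):
        self.t = 0
        return (self.name, self.t)

    def step(self, action):
        self.t += 1
        done = self.t >= self.episode_len
        return (self.name, self.t), 1.0, done


class ToySequentialVecEnv:
    """Steps each env in turn and auto-resets finished ones."""

    def __init__(self, envs):
        self.envs = envs

    def reset(self):
        return [env.reset() for env in self.envs]

    def step(self, actions):
        obs, rewards, dones = [], [], []
        for env, action in zip(self.envs, actions):
            o, r, d = env.step(action)
            if d:
                o = env.reset()  # auto-reset: returned obs starts a new episode
            obs.append(o)
            rewards.append(r)
            dones.append(d)
        return obs, rewards, dones


def collect_rollout(vec_env, n_steps):
    """Fill a flat buffer with one transition per env at every step."""
    buffer = []
    obs = vec_env.reset()
    for _ in range(n_steps):
        actions = [0.0 for _ in obs]  # placeholder policy
        next_obs, rewards, dones = vec_env.step(actions)
        for o, a, r, d in zip(obs, actions, rewards, dones):
            buffer.append((o[0], a, r, d))  # o[0] is the track name
        obs = next_obs
    return buffer


vec_env = ToySequentialVecEnv([
    ToyTrackEnv("generated_track", episode_len=5),
    ToyTrackEnv("mountain_track", episode_len=3),
])
rollout = collect_rollout(vec_env, n_steps=6)
tracks_in_buffer = {track for track, _, _, _ in rollout}
```

With two envs and six steps the buffer holds twelve transitions, half from each track, regardless of how often either episode ends.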
### Alternative: Same Env, Switch Track Scene
Theoretically possible: keep TCP connection open, send `exit_scene` then
`load_scene(new_track)` without closing the gym env. The observation and
action spaces are identical across tracks so SB3 wouldn't notice.
Concerns:
- gym_donkeycar's `DonkeyEnv` initializes the scene in `__init__` and is not designed
  for mid-session scene changes
- The viewer/sim controller state machine may not handle re-loading cleanly
- Still sequential (not parallel) so still has the forgetting problem,
  just without the env close/reopen disruption
- Untested; could introduce subtle bugs
### Hardware Options
- Two sim instances on same machine (different ports: 9091, 9093)
  - Risk: GPU memory pressure from two Unity instances
- Second sim on remote machine
  - gym_donkeycar supports `host` parameter in conf
  - Previous connection issues to remote host need debugging
### Image Augmentation (complementary, not primary)
DonkeyCar sim has built-in augmentation options:
- Gaussian blur, image flipping, cropping
- Other donkeycar users use these for generalization
- Solves visual robustness (lighting, noise) but NOT track geometry diversity
- Best used TOGETHER with parallel multi-track training
### Warm Start Failure Re-Analysis
Previously tried warm-starting from generated_road champion onto multi-track
training. This failed, but it used the broken `close_and_switch` methodology.
The warm start itself may not have been the problem. Worth retrying once
parallel envs are working.
## Exp 10 Evaluation Results (re-run 2026-04-19)
| Track | Set 1 | Set 2 | Set 3 | Mean | Verdict |
|---|---|---|---|---|---|
| mountain_track (trained) | 178 | 179 | 179 | **179** | Crashes at same spot |
| generated_track (trained) | 99 | 82 | 88 | **90** | Crashes immediately |
| generated_road (zero-shot) | 135 | 223 | 105 | **154** | Crashes early |
| mini_monaco (zero-shot) | 111 | 133 | 129 | **124** | Crashes early |
## Next Steps
- **Exp 11:** Tested parallel DummyVecEnv with two sim instances (ports 9091 + 9093)
  - Exp 11 (v5 reward): aborted due to circular driving on generated_track
  - Exp 11b (v6 reward): completed, no circles, but plateaus at ~194 steps on all tracks
  - Exp 11c (v6 reward, 250k): aborted; grass exploit found on generated_track
  - Exp 11d: pending fixes before re-run
## Critical Known Facts (DO NOT LOSE)
### throttle_min history (from Exp 1-9)
- `throttle_min=0.2` alone: car cannot get over mountain_track hill (not enough power)
- `throttle_min=0.5`: car gets over hill BUT throttle is baked into action space,
model CANNOT output throttle < 0.5, crashes on tight corners (mini_monaco ~91 steps)
- `throttle_min=0.2` + v5 reward (speed×CTE): car CAN learn to self-select high
  throttle on hill. Proved in Exp 9 (mountain only, 90k steps): 2000/2000 steps.
- KEY INSIGHT: Exp 9 worked because 90k steps were ALL on mountain. In parallel setup
  (Exp 11b/11c), each track gets only ~45k effective steps AND the grass exploit
  contaminated training. Mountain failure in parallel runs is NOT purely a throttle
  issue. Fix the grass exploit first, THEN see if mountain learns.
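The two constraints above can be illustrated with a few lines of arithmetic. This is a sketch under assumptions: it treats `throttle_min` as the lower bound of the throttle action and assumes out-of-range actions are clipped to the bounds (as SB3 does for `Box` action spaces); `throttle_max=1.0` and both helper names are hypothetical.

```python
def clip_throttle(raw_throttle, throttle_min, throttle_max=1.0):
    """Clamp a raw policy output to the allowed throttle range (assumed behaviour)."""
    return max(throttle_min, min(throttle_max, raw_throttle))


def steps_per_track(total_timesteps, n_tracks):
    """total_timesteps counts steps summed over all envs, so each track gets a share."""
    return total_timesteps // n_tracks


# The policy asks for a gentle 0.1 throttle on a tight corner:
throttle_with_floor_05 = clip_throttle(0.1, throttle_min=0.5)  # forced up to the floor
throttle_with_floor_02 = clip_throttle(0.1, throttle_min=0.2)  # still forced, but lower
# On the hill the policy can still choose high throttle with the lower floor:
throttle_on_hill = clip_throttle(0.8, throttle_min=0.2)

effective_steps = steps_per_track(90000, n_tracks=2)
```

With a 0.5 floor the slow-corner request is impossible, while a 0.2 floor keeps both the slow corner and the high-throttle hill within reach.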
### The grass exploit root cause (found 2026-04-19)
- generated_track has a physical gap in the boundary mesh at the first turn
- Car drives through the gap and CTE exceeds 8.0m, at which point the sim should terminate
- BUT: `determine_episode_over()` in donkey_sim.py has this code:
```python
if math.fabs(self.cte) > 2 * self.max_cte:  # > 16.0m
    pass  # ← INTENTIONALLY DOES NOTHING
elif math.fabs(self.cte) > self.max_cte:  # 8.0-16.0m
    self.over = True
```
- Car quickly exceeds 16m (> 2×max_cte), hits the `pass` case — episode never ends
- Fix: Python-side CTE patience wrapper that terminates when CTE > 4.0m for 20 steps
(catches the car BEFORE it blows past 16m)
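The patience logic in the fix can be sketched as a small counter. This is a minimal illustration, not the project's wrapper: `CTEPatience` is a hypothetical name, the consecutive-step rule is an assumption about what "for 20 steps" means, and the CTE trace is invented.

```python
class CTEPatience:
    """Counts consecutive steps with |CTE| above a limit (hypothetical helper)."""

    def __init__(self, cte_limit=4.0, patience=20):
        self.cte_limit = cte_limit
        self.patience = patience
        self.off_track_steps = 0

    def reset(self):
        """Call at the start of every episode."""
        self.off_track_steps = 0

    def update(self, cte):
        """Return True once |cte| has exceeded the limit for `patience` steps in a row."""
        if abs(cte) > self.cte_limit:
            self.off_track_steps += 1
        else:
            self.off_track_steps = 0  # back within the limit: start counting again
        return self.off_track_steps >= self.patience


# Invented trace: the car drifts away from the centre line by 0.5 m per step.
patience = CTEPatience(cte_limit=4.0, patience=20)
cte_trace = [0.5 * step for step in range(60)]
terminated_at = next(i for i, cte in enumerate(cte_trace) if patience.update(cte))
cte_at_termination = cte_trace[terminated_at]  # 14.0 m, still below the 16 m `pass` zone
```

Inside an env wrapper's `step`, the same object would be fed the per-step CTE (assumed here to be available in the `info` dict) and the episode marked done when `update` returns True.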
### Parallel env episode asymmetry
- DummyVecEnv runs both envs in every step (sequential, not truly parallel)
- When mountain episode ends quickly, VecEnv auto-resets mountain and starts new episode
- Meanwhile generated_track episode continues
- During training (model.learn()): PPO collects experience from both and auto-resets
independently — this is fine and correct
- During eval: our eval loop uses done_mask, so short mountain episodes auto-reset
and start new episodes that we ignore (waiting for generated_track to finish)
- User observation: 'car waits at start line for generated_track episode to end' — correct
### DO NOT confuse mountain rollback with stuck issue
- Mountain rollback (car goes up, slows, rolls back) is a LEARNING/REWARD issue
- It is NOT a stuck issue — the car is moving (rolling back = speed > 0)
- StuckTerminationWrapper correctly does NOT fire (car IS moving)
- Root fix: ensure training is not contaminated by other exploits, then the
v5/v6 speed gradient teaches the model to apply high throttle on the hill
(proved to work in Exp 9)
- DO NOT add termination conditions for rollback — they interfere with valid
slow hill-climbing learning
### speed vs forward_vel in reward
- `info['speed']` comes from Unity — scalar magnitude, always ≥ 0
- `info['forward_vel']` computed in Python — dot(heading, velocity), negative when reversing
- Our reward uses `info['speed']` — car rolling backward gets positive reward
- Sim's own reward correctly uses `forward_vel` with `if forward_vel > 0.0` check
- This is a known issue but NOT the primary cause of current problems
(efficiency gate gives 0 reward when rolling back → net displacement ≈ 0)
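The difference between the two signals can be checked with a small numeric example. This is a sketch with made-up vectors: `speed_and_forward_vel` is a hypothetical helper, and `heading` is assumed to be a unit vector.

```python
import math


def speed_and_forward_vel(heading, velocity):
    """Return (speed, forward_vel) for a unit heading vector and a velocity vector."""
    speed = math.sqrt(sum(v * v for v in velocity))              # magnitude, never negative
    forward_vel = sum(h * v for h, v in zip(heading, velocity))  # signed projection
    return speed, forward_vel


heading = (0.0, 0.0, 1.0)        # car points along +z
rolling_back = (0.0, 0.0, -2.0)  # but moves along -z at 2 m/s
speed, forward_vel = speed_and_forward_vel(heading, rolling_back)

# A reward term based on speed sees 2.0; one gated on forward_vel > 0.0 sees nothing.
speed_term = speed
gated_term = forward_vel if forward_vel > 0.0 else 0.0
```

While rolling backward, the speed-based term stays positive and only the gated term drops to zero.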