# Session Log: 2026-04-19

## Key Discovery: Why Multi-Track Training Fails

### The Problem
Our multi-track training uses `close_and_switch()`, which:
1. Closes the TCP connection to the sim
2. Sends `exit_scene` to go back to the menu
3. Opens a NEW connection on a different track
4. Calls `model.set_env(new_env)` to swap the environment

This disrupts PPO's training because:
- PPO's rollout buffer contains partial experience from the old track
- The value function's estimates are wrong for the new track
- The advantage calculations (which drive PPO's policy updates) are corrupted as a result
- Every switch is like ripping out a student's notebook mid-lesson

### Evidence
- **Wave 4:** 25 trials with this methodology. Only 4/25 (16%) scored >500.
  Median score 111. Trial 9 scored 1435 but was a lucky outlier.
- **Exp 10:** Same code, nearly identical hyperparameters to Trial 9.
  Total failure: crashes on every track, with a mean episode length under 180 steps
  (see the Exp 10 evaluation table).
- **Conclusion:** Trial 9's success was most likely luck from random weight
  initialization, not evidence that the method works.

### The Fix: Parallel Environments (DummyVecEnv)

SB3's `DummyVecEnv` can wrap multiple gym environments. PPO collects
experience from ALL environments in every rollout batch. No switching,
no closing, no disruption.

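```python
import gym
import gym_donkeycar  # registers the donkey-* environments with gym
from stable_baselines3 import PPO
from stable_baselines3.common.vec_env import DummyVecEnv, VecTransposeImage

# wrap_env is our own helper, defined elsewhere in the project (not shown here).
env = DummyVecEnv([
    lambda: wrap_env(gym.make('donkey-generated-track-v0', conf={"port": 9091})),
    lambda: wrap_env(gym.make('donkey-mountain-track-v0', conf={"port": 9093})),
])
env = VecTransposeImage(env)  # HWC -> CHW images for CnnPolicy
model = PPO('CnnPolicy', env, ...)  # hyperparameters elided
model.learn(total_timesteps=90000)  # both tracks in EVERY batch
```
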
This requires two sim instances on different ports (one track per sim),
but it gives PPO a stable, consistent training setup, which is exactly how SB3 is
designed to work with multiple environments.

### How DummyVecEnv Works (for future reference)

PPO training loop (simplified):
```
for each rollout batch:
    for each of N steps in rollout:
        for each env in DummyVecEnv:   ← env[0]=generated_track, env[1]=mountain_track
            action = policy(observation)
            next_obs, reward, done = env.step(action)
            store (obs, action, reward, done) in buffer

    compute advantages using value function
    update policy using all experience from ALL envs
```

Key insight: the model doesn't "know" which track it's on. It just sees
images and learns a policy that works across all the images it sees.
Both tracks contribute to every policy update. This guards against catastrophic
forgetting because the model never stops seeing either track.

With `close_and_switch`: the model trains on track A for 6000 steps, then
overwrites what it learned there while training on track B for 6000 steps, and
so on. Classic catastrophic interference.

With `DummyVecEnv`: every rollout batch contains experience from both tracks.
Like a human alternating laps between two courses, the model never goes long
without seeing either one.

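The loop above can be sketched in plain Python, without SB3 or the simulator. `ToyTrackEnv`, `collect_rollout`, and the episode lengths are hypothetical stand-ins, not SB3 internals. The sketch only illustrates two points: every rollout step records one transition per track, and an env that finishes early resets on its own while the other keeps going.

```python
import random
from collections import Counter


class ToyTrackEnv:
    """Hypothetical stand-in for one sim instance; NOT the real DonkeyEnv."""

    def __init__(self, name, episode_len):
        self.name = name
        self.episode_len = episode_len  # steps until the toy "crash"
        self.t = 0

    def reset(self):
        self.t = 0
        return (self.name, self.t)  # fake observation

    def step(self, action):
        self.t += 1
        done = self.t >= self.episode_len
        return (self.name, self.t), 1.0, done


def collect_rollout(envs, n_steps, policy):
    """Collect n_steps transitions from EVERY env, resetting each independently."""
    obs = [env.reset() for env in envs]
    buffer = []
    for _ in range(n_steps):
        for i, env in enumerate(envs):  # envs are stepped one after another
            action = policy(obs[i])
            next_obs, reward, done = env.step(action)
            buffer.append((env.name, obs[i], action, reward, done))
            # Auto-reset only the env that finished, as a vectorized env does.
            obs[i] = env.reset() if done else next_obs
    return buffer


envs = [ToyTrackEnv("generated_track", 5), ToyTrackEnv("mountain_track", 3)]
buffer = collect_rollout(envs, n_steps=12, policy=lambda o: random.uniform(-1, 1))
per_track = Counter(name for name, *_ in buffer)
print(per_track)
```

Both tracks end up with the same number of transitions in the buffer, regardless of how often each one crashed and reset during the rollout.
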
### Alternative: Same Env, Switch Track Scene

Theoretically possible: keep the TCP connection open, then send `exit_scene` followed by
`load_scene(new_track)` without closing the gym env. The observation and
action spaces are identical across tracks, so SB3 wouldn't notice.

Concerns:
- gym_donkeycar's `DonkeyEnv` initializes the scene in `__init__` and is not designed
  for mid-session scene changes
- The viewer/sim controller state machine may not handle re-loading cleanly
- Still sequential (not parallel), so it still has the forgetting problem,
  just without the env close/reopen disruption
- Untested, so it could introduce subtle bugs

### Hardware Options
- Two sim instances on same machine (different ports: 9091, 9093)
  - Risk: GPU memory pressure from two Unity instances
- Second sim on remote machine
  - gym_donkeycar supports `host` parameter in conf
  - Previous connection issues to remote host need debugging

### Image Augmentation (complementary, not primary)
DonkeyCar sim has built-in augmentation options:
- Gaussian blur, image flipping, cropping
- Other donkeycar users use these for generalization
- Solves visual robustness (lighting, noise) but NOT track geometry diversity
- Best used TOGETHER with parallel multi-track training

### Warm Start Failure Re-Analysis
We previously tried warm-starting multi-track training from the generated_road
champion. This failed, but the run used the broken `close_and_switch` methodology,
so the warm start itself may not have been the problem. Worth retrying once
parallel envs are working.

## Exp 10 Evaluation Results (re-run 2026-04-19)

| Track | Set 1 | Set 2 | Set 3 | Mean | Verdict |
|---|---|---|---|---|---|
| mountain_track (trained) | 178 | 179 | 179 | **179** | ❌ Crashes at same spot |
| generated_track (trained) | 99 | 82 | 88 | **90** | ❌ Crashes immediately |
| generated_road (zero-shot) | 135 | 223 | 105 | **154** | ❌ Crashes early |
| mini_monaco (zero-shot) | 111 | 133 | 129 | **124** | ❌ Crashes early |

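As a sanity check, the Mean column can be recomputed from the three sets. This is a small sketch that only re-uses the numbers in the table; the variable names are ours.

```python
from statistics import mean

# Values copied from the Exp 10 table above.
results = {
    "mountain_track": [178, 179, 179],
    "generated_track": [99, 82, 88],
    "generated_road": [135, 223, 105],
    "mini_monaco": [111, 133, 129],
}

# Round each mean to the nearest integer, as the table does.
means = {track: round(mean(values)) for track, values in results.items()}
best_single_set = max(max(values) for values in results.values())

for track, m in means.items():
    print(f"{track}: {m}")
print("best single set:", best_single_set)
```

The recomputed means match the table, and the best single set on any track is 223.
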
## Next Steps
- **Exp 11:** Tested parallel DummyVecEnv with two sim instances (ports 9091 + 9093)
  - Exp 11 (v5 reward): aborted due to circular driving on generated_track
  - Exp 11b (v6 reward): completed, no circles, but plateaus at ~194 steps on all tracks
  - Exp 11c (v6 reward, 250k): aborted after a grass exploit was found on generated_track
  - Exp 11d: pending fixes before re-run

## Mountain Track Finetune + Physics Investigation (2026-04-19 late session)

### Finetune outcome summary
- Created and ran `agent/experiments/exp14_finetune_v5.py`
- Warm-start source: `agent/models/exp14-mountain-v5/best_model.zip`
- Schedule used:
  - phase 1: runtime throttle floor `0.4`
  - phase 2: runtime throttle floor `0.2`
- Training later degraded badly; later checkpoints became poor / unstable
- Best usable finetune checkpoint was **not** the final model

### Robust checkpoint comparison on mountain_track
We ran a deterministic mountain-only comparison over 9 episodes per candidate.
Results saved to:
- `agent/outerloop-results/mountain_candidate_eval_2026-04-19.jsonl`
- `agent/outerloop-results/mountain_candidate_eval_2026-04-19.md`

Winner:
- `agent/models/exp14-mountain-v5-finetune/checkpoint_0036000.zip`
- promoted copy:
  - `agent/models/exp14-mountain-v5-finetune/best_robust_model_0036000.zip`

Key result:
- **ft_036k** achieved:
  - 9/9 successful episodes
  - 25 total laps across 9 episodes
  - mean lap **27.93s**
  - best lap **26.16s**
- This beat:
  - original mountain champion for robustness
  - earlier `0.4`-floor checkpoints for robustness
  - later finetune checkpoints, which had degraded badly

### Mountain physics discovery in Unity sim
Unity source path confirmed:
- `/mnt/c/Users/Paul/Documents/projects/sdsandbox`

We found a likely real root cause for hill wheelspin:
- `sdsim/Assets/Scripts/WheelPhys.cs` scales wheel friction by the hit collider's physics material:
  - `hit.collider.material.staticFriction * originalForwardStiffness`
- `mountain_track.unity` contains **4 explicit `Slippery` physics-material assignments** on the imported `long_road` FBX instance
- `Slippery.staticFriction = 0.1`
- `Road.staticFriction = 0.5`
- `Grippy.staticFriction = 0.66`

Interpretation:
- mountain road traction is likely much lower than normal road tracks
- this matches observed wheelspin / poor uphill progress / getting stuck on hills

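To put rough numbers on this, the sketch below applies the `WheelPhys.cs` expression to the three material values. It assumes the effective forward stiffness is simply proportional to `staticFriction`, and it uses a placeholder `original_forward_stiffness` of 1.0 because the sim's real value was not recorded here. The function name is ours, not the sim's.

```python
# staticFriction values as read from the Unity physics materials above.
STATIC_FRICTION = {"Slippery": 0.1, "Road": 0.5, "Grippy": 0.66}


def scaled_forward_stiffness(material, original_forward_stiffness=1.0):
    """Python mirror of the WheelPhys.cs expression:
    hit.collider.material.staticFriction * originalForwardStiffness.
    The default stiffness of 1.0 is a placeholder, not the sim's real value.
    """
    return STATIC_FRICTION[material] * original_forward_stiffness


road = scaled_forward_stiffness("Road")
relative = {name: scaled_forward_stiffness(name) / road for name in STATIC_FRICTION}
for name, ratio in relative.items():
    print(f"{name}: {ratio:.2f}x the Road value")
```

Under that assumption, the `Slippery` sections provide about one fifth of the forward stiffness of a normal `Road` surface, which is consistent with the wheelspin seen on the hill.
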
We created a dedicated Unity investigation branch before changing anything:
- repo: `/mnt/c/Users/Paul/Documents/projects/sdsandbox`
- branch: `investigate-mountain-friction`

### Cross-track warm-start transfer tests (Exp 15 / Exp 16)
We tested whether the best single-track champions could be re-used as warm starts on the other track.

#### Exp 15: mountain → generated
- Script: `agent/experiments/exp15_gentrack_from_mountain.py`
- Warm start:
  - `agent/models/exp14-mountain-v5-finetune/best_robust_model_0036000.zip`
- Target track:
  - `generated_track`
- Result: **failed**
- User-observed behavior:
  - exploit-like behavior near start / first corner
  - not driving proper laps
- Log evidence by ~25k steps:
  - `[20,000] reward=45.0 steps=47 laps=0`
  - `[25,000] reward=23.4 steps=30 laps=0`
  - short exploit laps appeared in log (`6.5s`, `4.91s`)
- Conclusion:
  - mountain policy prior does **not** transfer cleanly to generated-track in this setup

#### Exp 16: generated → mountain
- Script: `agent/experiments/exp16_mountain_from_gentrack.py`
- Warm start:
  - `agent/models/exp13-gentrack-v4/best_model.zip`
- Target track:
  - `mountain_track`
- Result: **failed**
- Behavior:
  - no meaningful hill learning
  - repeated short crash pattern
- Log evidence deep into run:
  - `[210,000] reward=10.2 steps=195 laps=0`
  - `[215,000] reward=10.1 steps=193 laps=0`
- Conclusion:
  - generated-track champion does **not** bootstrap mountain learning effectively in the current setup

Overall takeaway:
- Direct cross-track warm starts failed in **both** directions.
- This suggests the source policies are too specialized, or that mountain physics / reward differences are too large for naive transfer.
- For now, single-track champions remain useful as champions, but not as obvious warm-start initializations for the other track.

## Critical Known Facts (DO NOT LOSE)

### throttle_min history (from Exp 1-9)
- `throttle_min=0.2` alone: car cannot get over the mountain_track hill (not enough power)
- `throttle_min=0.5`: car gets over the hill BUT the throttle floor is baked into the action space,
  so the model CANNOT output throttle < 0.5 and crashes on tight corners (mini_monaco ~91 steps)
- `throttle_min=0.2` + v5 reward (speed×CTE): car CAN learn to self-select high
  throttle on the hill. Shown in Exp 9 (mountain only, 90k steps) → 2000/2000 steps.
- KEY INSIGHT: Exp 9 worked because 90k steps were ALL on mountain. In the parallel setup
  (Exp 11b/11c), each track gets only ~45k effective steps AND the grass exploit
  contaminated training. Mountain failure in parallel runs is NOT purely a throttle
  issue. Fix the grass exploit first, THEN see if mountain learns.

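The `throttle_min=0.5` failure mode can be illustrated with a minimal sketch. It assumes the floor acts as the lower bound of the throttle action dimension and that out-of-range policy outputs are clipped to that bound; `apply_throttle_bounds` is a hypothetical helper, not a gym_donkeycar function.

```python
def apply_throttle_bounds(raw_throttle, throttle_min, throttle_max=1.0):
    """Clip a raw policy output to the [throttle_min, throttle_max] range."""
    return max(throttle_min, min(throttle_max, raw_throttle))


# Whatever the policy outputs, the car never receives less than the floor.
lowest = {
    floor: apply_throttle_bounds(0.0, throttle_min=floor)
    for floor in (0.2, 0.5)
}
for floor, throttle in lowest.items():
    print(f"throttle_min={floor}: lowest achievable throttle = {throttle}")
```

With a 0.5 floor the policy has no way to slow down for a tight corner, while a 0.2 floor leaves room to brake at the cost of needing to learn high throttle on the hill.
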
### The grass exploit root cause (found 2026-04-19)
- generated_track has a physical gap in the boundary mesh at the first turn
- Car drives through the gap, CTE exceeds 8.0m → sim should terminate
- BUT: `determine_episode_over()` in donkey_sim.py has this code:
  ```python
  if math.fabs(self.cte) > 2 * self.max_cte:  # > 16.0m
      pass  # ← INTENTIONALLY DOES NOTHING
  elif math.fabs(self.cte) > self.max_cte:  # 8.0–16.0m
      self.over = True
  ```
- Car quickly exceeds 16m (> 2×max_cte) and hits the `pass` case, so the episode never ends
- Fix: Python-side CTE patience wrapper that terminates when CTE > 4.0m for 20 steps
  (catches the car BEFORE it blows past 16m)

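The patience rule described in the fix can be sketched as a small counter. This is a minimal illustration, not our actual wrapper: `CtePatience` is a hypothetical name, the 20 steps are assumed to be consecutive, and in a real gym wrapper the check would run inside `step()` on the CTE reported by the sim.

```python
class CtePatience:
    """Signals termination once |CTE| stays above a limit for N consecutive steps."""

    def __init__(self, cte_limit=4.0, patience=20):
        self.cte_limit = cte_limit
        self.patience = patience
        self.off_track_steps = 0

    def reset(self):
        self.off_track_steps = 0

    def update(self, cte):
        """Return True when the episode should be terminated."""
        if abs(cte) > self.cte_limit:
            self.off_track_steps += 1
        else:
            self.off_track_steps = 0  # back on track: start counting again
        return self.off_track_steps >= self.patience


# Toy trace: the car drifts away at 0.5 m per step and never comes back.
monitor = CtePatience()
terminated_at = None
for step in range(1, 101):
    cte = 0.5 * step
    if monitor.update(cte):
        terminated_at = (step, cte)
        break
print(terminated_at)
```

In this toy trace the episode is cut at a CTE of 14.0m, below the 16m threshold where the sim's own check stops firing. A car that leaves the track faster than this would need a shorter patience or a lower limit.
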
### Parallel env episode asymmetry
- DummyVecEnv steps both envs on every step (sequentially, not truly in parallel)
- When the mountain episode ends quickly, VecEnv auto-resets mountain and starts a new episode
- Meanwhile the generated_track episode continues
- During training (`model.learn()`): PPO collects experience from both and auto-resets
  independently, which is fine and correct
- During eval: our eval loop uses `done_mask`, so short mountain episodes auto-reset
  and start new episodes that we ignore (waiting for generated_track to finish)
- User observation: 'car waits at start line for generated_track episode to end' is correct

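The eval behavior described above can be sketched with fixed episode lengths in place of real rollouts. This is an illustration under stated assumptions, not our eval script: `evaluate_first_episodes` and the example lengths are hypothetical, and only the `done_mask` idea is taken from the notes.

```python
def evaluate_first_episodes(episode_lengths, max_steps=1000):
    """Step all envs together; record only the FIRST episode of each env.

    episode_lengths: hypothetical fixed episode length per env, standing in
    for real rollouts in which the car crashes after that many steps.
    """
    n_envs = len(episode_lengths)
    done_mask = [False] * n_envs      # True once env i finished its first episode
    steps_in_episode = [0] * n_envs   # counter for the episode currently running
    first_episode_steps = [None] * n_envs

    for _ in range(max_steps):
        for i in range(n_envs):
            steps_in_episode[i] += 1
            if steps_in_episode[i] >= episode_lengths[i]:  # env reports done
                if not done_mask[i]:
                    first_episode_steps[i] = steps_in_episode[i]
                    done_mask[i] = True
                steps_in_episode[i] = 0  # auto-reset; later episodes are ignored
        if all(done_mask):
            break
    return first_episode_steps


# Env 0 runs long (generated_track), env 1 crashes early (mountain_track).
print(evaluate_first_episodes([900, 180]))
```

The short env keeps restarting while the long one finishes, but only its first episode is counted, which is why the mountain car appears to idle and restart during eval.
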
### DO NOT confuse mountain rollback with stuck issue
- Mountain rollback (car goes up, slows, rolls back) is a LEARNING/REWARD issue
- It is NOT a stuck issue: the car is moving (rolling back = speed > 0)
- `StuckTerminationWrapper` correctly does NOT fire (car IS moving)
- Root fix: ensure training is not contaminated by other exploits, then the
  v5/v6 speed gradient teaches the model to apply high throttle on the hill
  (shown to work in Exp 9)
- DO NOT add termination conditions for rollback, because they interfere with valid
  slow hill-climbing learning

### speed vs forward_vel in reward
- `info['speed']` comes from Unity: scalar magnitude, always ≥ 0
- `info['forward_vel']` is computed in Python: dot(heading, velocity), negative when reversing
- Our reward uses `info['speed']`, so a car rolling backward gets positive reward
- Sim's own reward correctly uses `forward_vel` with an `if forward_vel > 0.0` check
- This is a known issue but NOT the primary cause of current problems
  (the efficiency gate gives 0 reward when rolling back → net displacement ≈ 0)
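
The difference between the two signals is easy to see with a small sketch. It uses 2D vectors for brevity (the sim works in 3D), and `speed_and_forward_vel` is a hypothetical helper rather than the gym_donkeycar implementation.

```python
import math


def speed_and_forward_vel(heading, velocity):
    """Return (speed, forward_vel) for 2D heading and velocity vectors.

    speed is the magnitude of velocity (never negative); forward_vel is the
    projection of velocity onto the unit heading (negative when reversing).
    """
    norm = math.hypot(*heading)
    unit_heading = (heading[0] / norm, heading[1] / norm)
    speed = math.hypot(*velocity)
    forward_vel = unit_heading[0] * velocity[0] + unit_heading[1] * velocity[1]
    return speed, forward_vel


heading = (0.0, 1.0)  # car points up the hill
climbing = speed_and_forward_vel(heading, (0.0, 2.0))
rolling_back = speed_and_forward_vel(heading, (0.0, -2.0))
print("climbing:", climbing)
print("rolling back:", rolling_back)
```

Both cases report the same `speed`, and only `forward_vel` changes sign, so a reward built on `speed` alone cannot tell climbing from rolling back.
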