docs: Exp 11 + 11b results — parallel envs work, v6 prevents circles, but plateaus at ~194 steps

Exp 11 (v5 reward): aborted at 66k; circular driving returned without efficiency term
Exp 11b (v6 reward): completed 90k; no circles but plateaus at 170-195 steps
All 4 tracks eval: remarkably consistent ~194 steps (including zero-shot)
Parallel DummyVecEnv infrastructure proven stable.
Next: increase training budget (90k may be insufficient for 2 parallel envs).
2026-04-19 13:26:29 -04:00 · 2026-04-19 13:26:29 -04:00 · 0993d4f1e7
parent 91ce8fc1fa
commit 0993d4f1e7
2 changed files with 76 additions and 4 deletions
--- a/docs/SESSION_LOG_2026-04-19.md
+++ b/docs/SESSION_LOG_2026-04-19.md
@@ -114,7 +114,13 @@ parallel envs are working.
 | mini_monaco (zero-shot) | 111 | 133 | 129 | **124** | ❌ Crashes early |

 ## Next Steps
- **Exp 11:** Test parallel DummyVecEnv with two sim instances (ports 9091 + 9093)
- First: verify we can connect to both sims simultaneously
- Then: train with both tracks in parallel, same hyperparameters as Trial 9
- Goal: consistent results (not lottery), measured over multiple runs
+- **Exp 11:** Tested parallel DummyVecEnv with two sim instances (ports 9091 + 9093)
+  - Exp 11 (v5 reward): aborted due to circular driving on generated_track
+  - Exp 11b (v6 reward): completed, no circles, but plateaus at ~194 steps on all tracks
+- **v6 reward confirmed:** efficiency gate prevents circles, tests pass
+- **Parallel env confirmed:** mechanically sound, stable training
+- **Open issue:** 90k steps may be insufficient for 2-env training (45k per track)
+- **Next experiment ideas:**
+  - Increase to 180k-250k total steps
+  - Test v6 on single track to isolate reward effect
+  - Check if efficiency gate fires during normal cornering (false positives)
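The per-track budget figures in the notes above come from dividing the total step count by the number of parallel envs. A minimal sketch of that arithmetic, assuming (as the log does) that the total is split evenly across the envs; `per_track_steps` is a hypothetical helper name, not part of the project:

```python
def per_track_steps(total_steps: int, n_envs: int) -> int:
    """Steps each env sees when total_steps is split evenly across n_envs."""
    if n_envs <= 0:
        raise ValueError("n_envs must be positive")
    return total_steps // n_envs


# Budgets mentioned in the log, for the 2-env setup of Exp 11/11b.
budgets = {total: per_track_steps(total, 2) for total in (90_000, 180_000, 250_000)}

for total, per_track in budgets.items():
    print(f"{total:>7} total steps -> {per_track:>7} per track")
```

On these numbers, 180k total steps would give each track the same 90k budget that the single-track Exp 9 run had.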
--- a/docs/TEST_HISTORY.md
+++ b/docs/TEST_HISTORY.md
@@ -334,3 +334,69 @@ the track-switching was the problem. Result stored in `models/wave5-gentrack-onl
 should we focus on single-track training with domain randomization (lighting,
 camera angle) to achieve generalization instead?

+### Exp 11 — Parallel DummyVecEnv, v5 reward (ABORTED)
+- **Date:** 2026-04-19
+- **Change from Exp10:** Two sim instances (port 9091 + 9093), DummyVecEnv wraps both.
+  PPO sees both tracks in every rollout batch. No close_and_switch.
+- **Tracks:** generated_track (9091) + mountain_track (9093)
+- **Reward:** v5 (speed × CTE) — same as Exp 9/10
+- **Result:** ABORTED at 66k/90k steps. Circular driving observed on generated_track.
+  v5 reward has no efficiency term → circles at CTE≈0 earn positive reward.
+- **Positive:** Parallel env infrastructure works! Both sims connected, PPO trained
+  stably with no env switching issues. Consistent improvement 14.7→67.8 combined.
+- **Negative:** Circular driving exploit returned because v5 dropped efficiency.
+
+### Exp 11b — Parallel DummyVecEnv, v6 reward (anti-circle gate)
+- **Date:** 2026-04-19
+- **Change from Exp11:** Reward v6 (speed × CTE, gated on efficiency ≥ 0.15).
+  Also stuck_steps 80→40 (faster stuck termination).
+- **Tracks:** generated_track (9091) + mountain_track (9093)
+- **Total steps:** 90,000 | lr=0.000725 | throttle_min=0.2
+
+**Training progress (eval every 6k steps; selected checkpoints shown):**
+
+| Train steps | gen_track (s = episode steps) | mountain (s = episode steps) | Combined | Note |
+|---|---|---|---|---|
+| 6k | 91s | 130s | 10.7r | Early |
+| 18k | 100s | 100s | 15.9r | Improving |
+| 36k | 161s | 160s | 26.2r | ⭐ |
+| 42k | 160s | 159s | 28.9r | ⭐ |
+| 60k | 164s | 163s | — | Plateau |
+| 78k | 169s | 168s | 29.2r | ⭐ |
+| 90k | 173s | 172s | — | End |
+
+**Evaluation results (best_model, 3 sets per track):**
+
+| Track | Set 1 | Set 2 | Set 3 | Mean | Verdict |
+|---|---|---|---|---|---|
+| mountain_track (trained) | 195 | 196 | 192 | **194** | ❌ |
+| generated_track (trained) | 192 | 194 | 192 | **193** | ❌ |
+| generated_road (zero-shot) | 192 | 196 | 194 | **194** | ❌ |
+| mini_monaco (zero-shot) | 194 | 192 | 196 | **194** | ❌ |
+
+**Analysis:**
+- ✅ No circular driving (efficiency gate works)
+- ✅ Remarkably consistent: every track averages 193-194 steps, and all 12 sets fall between 192 and 196
+- ✅ Parallel env infrastructure is stable and reliable
+- ❌ Model plateaus at ~170-195 steps and never improves past that
+- ❌ Much worse than Exp 9 (mountain only: 2000/2000) or Wave 4 Trial 9 (2000/2000)
+- The consistency across all 4 tracks (including zero-shot) suggests the model
+  learned a generic short-drive policy, not track-specific features
+- Possible causes: 90k steps may be insufficient for 2-env parallel training
+  (effective steps per track = 45k each), or the efficiency gate may be
+  suppressing early exploration
+
+**Key findings:**
+1. Parallel DummyVecEnv works mechanically — this is the right infrastructure
+2. v6 reward prevents circular driving
+3. 90k steps across 2 parallel envs (45k per track) may not be enough training budget
+4. Compare: Exp 9 (single track, 90k steps, v5) → 2000 steps. Exp 11b
+   (2 tracks, 90k steps, v6) → 194 steps. Two variables changed at once: the
+   per-track budget is halved and the reward is harder to exploit.
+
+**Next experiments to consider:**
+- Increase total_timesteps to 180k-250k (restore per-track budget)
+- Try v6 reward on single track first to isolate reward vs multi-track effects
+- Try v5 reward with parallel envs but longer training (accept some circling)
+- Check if efficiency gate triggers too aggressively during normal cornering
+
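One way to start on the cornering check without running the sim is to compute an efficiency value for idealized trajectories. The log does not say how v6 measures efficiency, so the sketch below assumes a hypothetical definition: net displacement divided by distance travelled over a window of positions. Only the 0.15 threshold comes from the log; `path_efficiency` and `arc` are illustrative names, not functions from the project.

```python
import math

GATE_THRESHOLD = 0.15  # from the log; everything else here is an assumption


def path_efficiency(points):
    """Hypothetical efficiency: straight-line displacement / path length."""
    path_len = sum(math.dist(a, b) for a, b in zip(points, points[1:]))
    if path_len == 0.0:
        return 0.0
    return math.dist(points[0], points[-1]) / path_len


def arc(radius, angle_rad, n=200):
    """Sample n+1 points along a circular arc starting at (radius, 0)."""
    return [
        (radius * math.cos(angle_rad * i / n), radius * math.sin(angle_rad * i / n))
        for i in range(n + 1)
    ]


corner = path_efficiency(arc(5.0, math.pi / 2))  # 90-degree corner
hairpin = path_efficiency(arc(5.0, math.pi))     # 180-degree hairpin
circle = path_efficiency(arc(5.0, 2 * math.pi))  # full circle

for name, eff in (("corner", corner), ("hairpin", hairpin), ("circle", circle)):
    print(f"{name}: efficiency={eff:.3f} passes_gate={eff >= GATE_THRESHOLD}")
```

Under this assumed definition a full circle scores near zero while a 90-degree corner and a hairpin both score well above 0.15. Whether the real gate fires during cornering would still depend on the window length and on how the project actually computes efficiency, so logged positions from a training run would be needed to confirm it.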