docs: Exp 11 + 11b results — parallel envs work, v6 prevents circles, but plateaus at ~194 steps

Exp 11 (v5 reward): aborted at 66k; circular driving returned without efficiency term
Exp 11b (v6 reward): completed 90k; no circles but plateaus at 170-195 steps
All 4 tracks eval: remarkably consistent ~194 steps (including zero-shot)
Parallel DummyVecEnv infrastructure proven stable.
Next: increase training budget (90k may be insufficient for 2 parallel envs).
2026-04-19 13:26:29 -04:00
parent 91ce8fc1fa
commit 0993d4f1e7
2 changed files with 76 additions and 4 deletions
--- a/docs/SESSION_LOG_2026-04-19.md
+++ b/docs/SESSION_LOG_2026-04-19.md
@@ -114,7 +114,13 @@ parallel envs are working.
 | mini_monaco (zero-shot) | 111 | 133 | 129 | **124** | ❌ Crashes early |
 ## Next Steps
- **Exp 11:** Test parallel DummyVecEnv with two sim instances (ports 9091 + 9093)
+- **Exp 11:** Tested parallel DummyVecEnv with two sim instances (ports 9091 + 9093)
- First: verify we can connect to both sims simultaneously
+  - Exp 11 (v5 reward): aborted due to circular driving on generated_track
- Then: train with both tracks in parallel, same hyperparameters as Trial 9
+  - Exp 11b (v6 reward): completed, no circles, but plateaus at ~194 steps on all tracks
- Goal: consistent results (not lottery), measured over multiple runs
+- **v6 reward confirmed:** efficiency gate prevents circles, tests pass
 - **Parallel env confirmed:** mechanically sound, stable training
 - **Open issue:** 90k steps may be insufficient for 2-env training (45k per track)
 - **Next experiment ideas:**
  - Increase to 180k-250k total steps
  - Test v6 on single track to isolate reward effect
  - Check if efficiency gate fires during normal cornering (false positives)
--- a/docs/TEST_HISTORY.md
+++ b/docs/TEST_HISTORY.md
@@ -334,3 +334,69 @@ the track-switching was the problem. Result stored in `models/wave5-gentrack-onl
 should we focus on single-track training with domain randomization (lighting,
 camera angle) to achieve generalization instead?
 ### Exp 11 — Parallel DummyVecEnv, v5 reward (ABORTED)
 - **Date:** 2026-04-19
 - **Change from Exp10:** Two sim instances (port 9091 + 9093), DummyVecEnv wraps both.
  PPO sees both tracks in every rollout batch. No close_and_switch.
 - **Tracks:** generated_track (9091) + mountain_track (9093)
 - **Reward:** v5 (speed × CTE) — same as Exp 9/10
 - **Result:** ABORTED at 66k/90k steps. Circular driving observed on generated_track.
  v5 reward has no efficiency term → circles at CTE≈0 earn positive reward.
 - **Positive:** Parallel env infrastructure works! Both sims connected, PPO trained
  stably with no env switching issues. Consistent improvement 14.7→67.8 combined.
 - **Negative:** Circular driving exploit returned because v5 dropped efficiency.
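
 As a rough illustration of why both tracks appear in every rollout batch, the sketch below uses toy stand-ins rather than the real simulator or Stable-Baselines3. `ToyTrackEnv`, `ToySequentialVecEnv`, and `collect_rollout` are hypothetical names; only the track names and ports come from this log. The wrapper mimics the general shape of a sequential vectorized env (a list of factory callables, each env stepped once per call in the same process).

 ```python
 class ToyTrackEnv:
     """Hypothetical stand-in for one simulator connection (no real sim involved)."""

     def __init__(self, track, port):
         self.track = track
         self.port = port
         self.t = 0

     def reset(self):
         self.t = 0
         return (self.track, self.t)

     def step(self, action):
         # A real env would send the action to the sim; here we only count steps.
         self.t += 1
         return (self.track, self.t), 0.0, False


 class ToySequentialVecEnv:
     """Steps every wrapped env once per call, in order, in a single process."""

     def __init__(self, env_fns):
         self.envs = [fn() for fn in env_fns]
         self.num_envs = len(self.envs)

     def reset(self):
         return [env.reset() for env in self.envs]

     def step(self, actions):
         results = [env.step(a) for env, a in zip(self.envs, actions)]
         obs, rewards, dones = zip(*results)
         return list(obs), list(rewards), list(dones)


 def collect_rollout(vec_env, n_steps):
     """Gather n_steps vectorized steps; each one adds one observation per env."""
     vec_env.reset()
     batch = []
     for _ in range(n_steps):
         obs, _, _ = vec_env.step([0.0] * vec_env.num_envs)
         batch.extend(obs)
     return batch


 # Track names and ports as listed for Exp 11.
 TRACKS = [("generated_track", 9091), ("mountain_track", 9093)]
 env_fns = [lambda t=t, p=p: ToyTrackEnv(t, p) for t, p in TRACKS]
 vec_env = ToySequentialVecEnv(env_fns)
 batch = collect_rollout(vec_env, n_steps=4)
 ```

 With two envs, a rollout of `n_steps` vectorized steps holds `2 * n_steps` transitions, split evenly between the tracks, so no track switching is needed between updates.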
 ### Exp 11b — Parallel DummyVecEnv, v6 reward (anti-circle gate)
 - **Date:** 2026-04-19
 - **Change from Exp11:** Reward v6 (speed × CTE + efficiency gate ≥ 0.15).
  Also stuck_steps 80→40 (faster stuck termination).
 - **Tracks:** generated_track (9091) + mountain_track (9093)
 - **Total steps:** 90,000 | lr=0.000725 | throttle_min=0.2
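
 The log does not spell out how the efficiency term is computed, so the following is only a sketch under stated assumptions: efficiency is taken to be net displacement divided by path length over a recent window of positions, the CTE term is a linear centering factor with a hypothetical `max_cte`, and a failed gate zeroes the reward. Only the 0.15 threshold and the "speed × CTE" shape come from this entry; `path_efficiency` and `reward_v6_sketch` are illustrative names, not the project's code.

 ```python
 import math

 EFFICIENCY_MIN = 0.15  # gate threshold quoted in this entry


 def path_efficiency(positions):
     """Net displacement / path length (assumed definition).

     1.0 for a straight line, close to 0 for a closed loop.
     """
     pts = list(positions)
     if len(pts) < 2:
         return 1.0  # not enough history to judge; do not gate
     path = sum(math.dist(a, b) for a, b in zip(pts, pts[1:]))
     if path == 0.0:
         return 0.0  # no movement at all counts as inefficient
     return math.dist(pts[0], pts[-1]) / path


 def reward_v6_sketch(speed, cte, positions, max_cte=2.0):
     """Speed x centering, zeroed when the efficiency gate fails (assumption)."""
     centering = max(0.0, 1.0 - abs(cte) / max_cte)  # max_cte is hypothetical
     if path_efficiency(positions) < EFFICIENCY_MIN:
         return 0.0
     return speed * centering


 # Straight driving vs. one full circle, both at CTE = 0.
 straight = [(float(i), 0.0) for i in range(10)]
 circle = [
     (math.cos(2 * math.pi * k / 40), math.sin(2 * math.pi * k / 40))
     for k in range(41)
 ]
 r_straight = reward_v6_sketch(1.5, 0.0, straight)
 r_circle = reward_v6_sketch(1.5, 0.0, circle)
 ```

 Under this assumed definition a full circle scores near zero efficiency and is gated, while a perfect 180° hairpin would score about 0.64, well above the threshold. Whether the real gate behaves the same way during cornering depends on its actual definition and window length.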
 **Training progress (eval at each 6k checkpoint):**
 | Training steps | gen_track (steps) | mountain (steps) | Combined (reward) | Note |
 |---|---|---|---|---|
 | 6k | 91s | 130s | 10.7r | Early |
 | 18k | 100s | 100s | 15.9r | Improving |
 | 36k | 161s | 160s | 26.2r | ⭐ |
 | 42k | 160s | 159s | 28.9r | ⭐ |
 | 60k | 164s | 163s | — | Plateau |
 | 78k | 169s | 168s | 29.2r | ⭐ |
 | 90k | 173s | 172s | — | End |
 **Evaluation results (best_model, 3 sets per track):**
 | Track | Set 1 | Set 2 | Set 3 | Mean | Verdict |
 |---|---|---|---|---|---|
 | mountain_track (trained) | 195 | 196 | 192 | **194** | ❌ |
 | generated_track (trained) | 192 | 194 | 192 | **193** | ❌ |
 | generated_road (zero-shot) | 192 | 196 | 194 | **194** | ❌ |
 | mini_monaco (zero-shot) | 194 | 192 | 196 | **194** | ❌ |
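
 The means in the table can be re-derived from the three sets per track. The snippet below is a small check using the numbers above; the dictionary layout and the `summarize` helper are illustrative, not part of the evaluation scripts.

 ```python
 from statistics import mean

 # Episode lengths (steps) copied from the evaluation table above.
 EVAL_STEPS = {
     "mountain_track": [195, 196, 192],
     "generated_track": [192, 194, 192],
     "generated_road": [192, 196, 194],
     "mini_monaco": [194, 192, 196],
 }


 def summarize(results):
     """Return mean, min, and max episode length per track."""
     return {
         track: {"mean": mean(runs), "min": min(runs), "max": max(runs)}
         for track, runs in results.items()
     }


 summary = summarize(EVAL_STEPS)
 for track, s in summary.items():
     print(f"{track}: mean={s['mean']:.1f} range={s['min']}-{s['max']}")
 ```

 Every individual run falls between 192 and 196 steps, which is what makes the result look like a fixed-length policy rather than track-dependent behavior.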
 **Analysis:**
 - ✅ No circular driving (efficiency gate works)
 - ✅ Remarkably consistent: all tracks ~194 steps, very low variance
 - ✅ Parallel env infrastructure is stable and reliable
 - ❌ Model plateaus at ~170-195 steps and never improves past that
 - ❌ Much worse than Exp 9 (mountain only: 2000/2000) or Wave 4 Trial 9 (2000/2000)
 - The consistency across all 4 tracks (including zero-shot) suggests the model
  learned a generic short-drive policy, not track-specific features
 - Possible cause: 90k steps may be insufficient for 2-env parallel training
  (effective steps per track = 45k each), or the efficiency gate may be
  suppressing early exploration
 **Key findings:**
 1. Parallel DummyVecEnv works mechanically — this is the right infrastructure
 2. v6 reward prevents circular driving
 3. But 90k steps with 2 parallel envs may not be enough training budget
 4. Compare: Exp 9 (single track, 90k training steps, v5) reached 2000-step
    episodes; Exp 11b (2 tracks, 90k training steps, v6) reached only ~194.
    The training budget per track is halved AND the reward is harder to exploit.
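
 The budget arithmetic behind finding 3 can be written out as below. This assumes the total step count is shared evenly across the parallel envs, which matches the 45k-per-track figure in the analysis; the helper names are hypothetical.

 ```python
 def per_env_steps(total_timesteps, n_envs):
     """Steps each env contributes when the total is split evenly."""
     return total_timesteps // n_envs


 def total_for_per_env_budget(per_env_budget, n_envs):
     """Total steps needed so that each env gets per_env_budget steps."""
     return per_env_budget * n_envs


 exp11b_per_track = per_env_steps(90_000, n_envs=2)         # budget used in Exp 11b
 restore_total = total_for_per_env_budget(90_000, n_envs=2)  # match Exp 9 per track
 upper_per_track = per_env_steps(250_000, n_envs=2)          # top of proposed range
 ```

 Matching Exp 9's 90k steps per track therefore needs 180k total, the lower end of the proposed 180k-250k range; the upper end gives 125k per track.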
 **Next experiments to consider:**
 - Increase total_timesteps to 180k-250k (restore per-track budget)
 - Try v6 reward on single track first to isolate reward vs multi-track effects
 - Try v5 reward with parallel envs but longer training (accept some circling)
 - Check if efficiency gate triggers too aggressively during normal cornering