diff --git a/docs/SESSION_LOG_2026-04-19.md b/docs/SESSION_LOG_2026-04-19.md
index f947630..1536438 100644
--- a/docs/SESSION_LOG_2026-04-19.md
+++ b/docs/SESSION_LOG_2026-04-19.md
@@ -114,7 +114,13 @@ parallel envs are working.
 | mini_monaco (zero-shot) | 111 | 133 | 129 | **124** | ❌ Crashes early |
 
 ## Next Steps
-- **Exp 11:** Test parallel DummyVecEnv with two sim instances (ports 9091 + 9093)
-- First: verify we can connect to both sims simultaneously
-- Then: train with both tracks in parallel, same hyperparameters as Trial 9
-- Goal: consistent results (not lottery), measured over multiple runs
+- **Exp 11:** Tested parallel DummyVecEnv with two sim instances (ports 9091 + 9093)
+  - Exp 11 (v5 reward): aborted due to circular driving on generated_track
+  - Exp 11b (v6 reward): completed, no circles, but plateaus at ~194 steps on all tracks
+- **v6 reward confirmed:** efficiency gate prevents circles, tests pass
+- **Parallel env confirmed:** mechanically sound, stable training
+- **Open issue:** 90k steps may be insufficient for 2-env training (45k per track)
+- **Next experiment ideas:**
+  - Increase to 180k-250k total steps
+  - Test v6 on single track to isolate reward effect
+  - Check if efficiency gate fires during normal cornering (false positives)
diff --git a/docs/TEST_HISTORY.md b/docs/TEST_HISTORY.md
index c63440e..fcb6658 100644
--- a/docs/TEST_HISTORY.md
+++ b/docs/TEST_HISTORY.md
@@ -334,3 +334,69 @@ the track-switching was the problem. Result stored in `models/wave5-gentrack-onl
   should we focus on single-track training with domain randomization (lighting, camera
   angle) to achieve generalization instead?
 
+### Exp 11 — Parallel DummyVecEnv, v5 reward (ABORTED)
+- **Date:** 2026-04-19
+- **Change from Exp10:** Two sim instances (port 9091 + 9093), DummyVecEnv wraps both.
+  PPO sees both tracks in every rollout batch. No close_and_switch.
+- **Tracks:** generated_track (9091) + mountain_track (9093)
+- **Reward:** v5 (speed × CTE) — same as Exp 9/10
+- **Result:** ABORTED at 66k/90k steps. Circular driving observed on generated_track.
+  v5 reward has no efficiency term → circles at CTE≈0 earn positive reward.
+- **Positive:** Parallel env infrastructure works! Both sims connected, PPO trained
+  stably with no env switching issues. Consistent improvement 14.7→67.8 combined.
+- **Negative:** Circular driving exploit returned because v5 dropped efficiency.
+
+### Exp 11b — Parallel DummyVecEnv, v6 reward (anti-circle gate)
+- **Date:** 2026-04-19
+- **Change from Exp11:** Reward v6 (speed × CTE + efficiency gate ≥ 0.15).
+  Also stuck_steps 80→40 (faster stuck termination).
+- **Tracks:** generated_track (9091) + mountain_track (9093)
+- **Total steps:** 90,000 | lr=0.000725 | throttle_min=0.2
+
+**Training progress (eval at each 6k checkpoint):**
+
+| Steps | gen_track | mountain | Combined | Note |
+|---|---|---|---|---|
+| 6k | 91s | 130s | 10.7r | Early |
+| 18k | 100s | 100s | 15.9r | Improving |
+| 36k | 161s | 160s | 26.2r | ⭐ |
+| 42k | 160s | 159s | 28.9r | ⭐ |
+| 60k | 164s | 163s | — | Plateau |
+| 78k | 169s | 168s | 29.2r | ⭐ |
+| 90k | 173s | 172s | — | End |
+
+**Evaluation results (best_model, 3 sets per track):**
+
+| Track | Set 1 | Set 2 | Set 3 | Mean | Verdict |
+|---|---|---|---|---|---|
+| mountain_track (trained) | 195 | 196 | 192 | **194** | ❌ |
+| generated_track (trained) | 192 | 194 | 192 | **193** | ❌ |
+| generated_road (zero-shot) | 192 | 196 | 194 | **194** | ❌ |
+| mini_monaco (zero-shot) | 194 | 192 | 196 | **194** | ❌ |
+
+**Analysis:**
+- ✅ No circular driving (efficiency gate works)
+- ✅ Remarkably consistent: all tracks ~194 steps, very low variance
+- ✅ Parallel env infrastructure is stable and reliable
+- ❌ Model plateaus at ~170-195 steps and never improves past that
+- ❌ Much worse than Exp 9 (mountain only: 2000/2000) or Wave 4 Trial 9 (2000/2000)
+- The consistency across all 4 tracks (including zero-shot) suggests the model
+  learned a generic short-drive policy, not track-specific features
+- Possible cause: 90k steps may be insufficient for 2-env parallel training
+  (effective steps per track = 45k each), or the efficiency gate may be
+  suppressing early exploration
+
+**Key findings:**
+1. Parallel DummyVecEnv works mechanically — this is the right infrastructure
+2. v6 reward prevents circular driving
+3. But 90k steps with 2 parallel envs may not be enough training budget
+4. Compare: Exp 9 (single track, 90k steps, v5) → 2000 steps. Exp 11b
+   (2 tracks, 90k steps, v6) → 194 steps. The training budget per track
+   is halved AND the reward is harder to exploit.
+
+**Next experiments to consider:**
+- Increase total_timesteps to 180k-250k (restore per-track budget)
+- Try v6 reward on single track first to isolate reward vs multi-track effects
+- Try v5 reward with parallel envs but longer training (accept some circling)
+- Check if efficiency gate triggers too aggressively during normal cornering
+
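
The per-track budget figures quoted in the log (45k per track at 90k total, and 180k-250k to restore the budget) can be checked with a small sketch. It assumes the total timestep budget is split evenly across the parallel environments, which matches the "effective steps per track = 45k each" statement; the helper names `per_env_budget` and `total_for_per_env_budget` are hypothetical and not part of the repository.

```python
def per_env_budget(total_timesteps: int, n_envs: int) -> int:
    """Timesteps each env contributes, assuming an even split across envs."""
    return total_timesteps // n_envs


def total_for_per_env_budget(target_per_env: int, n_envs: int) -> int:
    """Total timesteps needed so that each env gets target_per_env steps."""
    return target_per_env * n_envs


if __name__ == "__main__":
    n_envs = 2  # generated_track + mountain_track
    # Candidate budgets taken from the log: current run and proposed range.
    for total in (90_000, 180_000, 250_000):
        print(f"total={total:>7} -> per track={per_env_budget(total, n_envs)}")
    # Budget needed to give each track the 90k steps of the single-track run.
    print("restore 90k per track:", total_for_per_env_budget(90_000, n_envs))
```

Under that assumption, 180k total steps gives each track the same 90k budget as the single-track Exp 9 run, and 250k gives each track 125k.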