docs: Exp 11 + 11b results — parallel envs work, v6 prevents circles, but plateaus at ~194 steps

Exp 11 (v5 reward): aborted at 66k — circular driving returned without efficiency term
Exp 11b (v6 reward): completed 90k — no circles but plateaus at 170-195 steps
All 4 tracks eval: remarkably consistent ~194 steps (including zero-shot)
Parallel DummyVecEnv infrastructure proven stable.
Next: increase training budget (90k may be insufficient for 2 parallel envs).
Paul Huliganga 2026-04-19 13:26:29 -04:00
parent 91ce8fc1fa
commit 0993d4f1e7
2 changed files with 76 additions and 4 deletions


@ -114,7 +114,13 @@ parallel envs are working.
| mini_monaco (zero-shot) | 111 | 133 | 129 | **124** | ❌ Crashes early |
## Next Steps
-- **Exp 11:** Test parallel DummyVecEnv with two sim instances (ports 9091 + 9093)
-  - First: verify we can connect to both sims simultaneously
-  - Then: train with both tracks in parallel, same hyperparameters as Trial 9
-  - Goal: consistent results (not lottery), measured over multiple runs
+- **Exp 11:** Tested parallel DummyVecEnv with two sim instances (ports 9091 + 9093)
+  - Exp 11 (v5 reward): aborted due to circular driving on generated_track
+  - Exp 11b (v6 reward): completed, no circles, but plateaus at ~194 steps on all tracks
+- **v6 reward confirmed:** efficiency gate prevents circles, tests pass
+- **Parallel env confirmed:** mechanically sound, stable training
+- **Open issue:** 90k steps may be insufficient for 2-env training (45k per track)
+- **Next experiment ideas:**
+  - Increase to 180k-250k total steps
+  - Test v6 on single track to isolate reward effect
+  - Check if efficiency gate fires during normal cornering (false positives)


@ -334,3 +334,69 @@ the track-switching was the problem. Result stored in `models/wave5-gentrack-onl
should we focus on single-track training with domain randomization (lighting,
camera angle) to achieve generalization instead?
### Exp 11 — Parallel DummyVecEnv, v5 reward (ABORTED)
- **Date:** 2026-04-19
- **Change from Exp10:** Two sim instances (port 9091 + 9093), DummyVecEnv wraps both.
PPO sees both tracks in every rollout batch. No close_and_switch.
- **Tracks:** generated_track (9091) + mountain_track (9093)
- **Reward:** v5 (speed × CTE) — same as Exp 9/10
- **Result:** ABORTED at 66k/90k steps. Circular driving observed on generated_track.
v5 reward has no efficiency term → circles at CTE≈0 earn positive reward.
- **Positive:** Parallel env infrastructure works! Both sims connected, PPO trained
stably with no env switching issues. Consistent improvement 14.7→67.8 combined.
- **Negative:** Circular driving exploit returned because v5 dropped efficiency.
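
The parallel setup described above can be pictured with a toy stand-in. This is a minimal sketch, not the project's training script: `ToyTrackEnv` and `SequentialVecEnv` are hypothetical names, and the placeholder reward and episode length are invented for illustration. It only shows the property the notes rely on: a DummyVecEnv-style wrapper steps each env in turn inside one process, so every rollout step yields one observation per track and no env switching is needed.

```python
from typing import Callable, List, Tuple

Obs = Tuple[str, int]


class ToyTrackEnv:
    """Hypothetical stand-in for one simulator connection (not the real sim API)."""

    def __init__(self, track: str, port: int) -> None:
        self.track = track
        self.port = port
        self.t = 0

    def reset(self) -> Obs:
        self.t = 0
        return (self.track, self.t)

    def step(self, action: float) -> Tuple[Obs, float, bool]:
        self.t += 1
        reward = 1.0        # placeholder reward, not v5 or v6
        done = self.t >= 5  # placeholder episode length
        return (self.track, self.t), reward, done


class SequentialVecEnv:
    """Steps every wrapped env in turn, in one process (DummyVecEnv-style)."""

    def __init__(self, env_fns: List[Callable[[], ToyTrackEnv]]) -> None:
        self.envs = [fn() for fn in env_fns]

    def reset(self) -> List[Obs]:
        return [env.reset() for env in self.envs]

    def step(self, actions: List[float]):
        obs_batch, rewards, dones = [], [], []
        for env, action in zip(self.envs, actions):
            obs, reward, done = env.step(action)
            if done:
                obs = env.reset()  # auto-reset keeps one obs per env in the batch
            obs_batch.append(obs)
            rewards.append(reward)
            dones.append(done)
        return obs_batch, rewards, dones


def collect_rollout(vec_env: SequentialVecEnv, n_steps: int) -> List[List[str]]:
    """Record which tracks contribute to each step of a rollout."""
    vec_env.reset()
    tracks_seen = []
    for _ in range(n_steps):
        obs, _, _ = vec_env.step([0.0] * len(vec_env.envs))
        tracks_seen.append([o[0] for o in obs])
    return tracks_seen


vec_env = SequentialVecEnv([
    lambda: ToyTrackEnv("generated_track", 9091),
    lambda: ToyTrackEnv("mountain_track", 9093),
])
rollout = collect_rollout(vec_env, n_steps=8)
```

Every row of `rollout` contains both track names, which is the sense in which PPO sees both tracks in every rollout batch.
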
### Exp 11b — Parallel DummyVecEnv, v6 reward (anti-circle gate)
- **Date:** 2026-04-19
- **Change from Exp11:** Reward v6 (speed × CTE + efficiency gate ≥ 0.15).
Also stuck_steps 80→40 (faster stuck termination).
- **Tracks:** generated_track (9091) + mountain_track (9093)
- **Total steps:** 90,000 | lr=0.000725 | throttle_min=0.2
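
To make the v5 versus v6 difference concrete, here is a minimal sketch of a gated reward. The notes do not give the formulas, so everything below is an assumption made for illustration: the centering term `1 - |cte| / max_cte`, the definition of efficiency as net displacement divided by distance travelled over a window of positions, and the choice to return zero when the gate fails. Only the 0.15 threshold comes from the notes.

```python
import math
from typing import Sequence, Tuple

Point = Tuple[float, float]


def path_efficiency(positions: Sequence[Point]) -> float:
    """Net displacement / distance travelled over a window (assumed definition)."""
    if len(positions) < 2:
        return 1.0
    travelled = sum(math.dist(a, b) for a, b in zip(positions, positions[1:]))
    if travelled == 0.0:
        return 0.0
    return math.dist(positions[0], positions[-1]) / travelled


def reward_v5(speed: float, cte: float, max_cte: float = 1.0) -> float:
    """Assumed v5 shape: speed scaled by how centred the car is (CTE = cross-track error)."""
    centering = max(0.0, 1.0 - abs(cte) / max_cte)
    return speed * centering


def reward_v6(speed: float, cte: float, positions: Sequence[Point],
              min_efficiency: float = 0.15, max_cte: float = 1.0) -> float:
    """Assumed v6 shape: v5 reward, zeroed when the recent path is too circular."""
    if path_efficiency(positions) < min_efficiency:
        return 0.0
    return reward_v5(speed, cte, max_cte)


# One full lap of a tight circle versus a straight run of similar length.
circle = [(math.cos(2 * math.pi * k / 36), math.sin(2 * math.pi * k / 36))
          for k in range(37)]
straight = [(0.1 * k, 0.0) for k in range(37)]

print(reward_v5(1.0, 0.0))            # circling at CTE = 0 still pays under v5
print(reward_v6(1.0, 0.0, circle))    # gate blocks the circle
print(reward_v6(1.0, 0.0, straight))  # straight driving keeps its reward
```

Under these assumptions the circle ends where it started, so its efficiency is close to zero and the gate removes the reward that v5 would have paid.
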
**Training progress (evaluated every 6k steps; selected checkpoints shown; `s` = episode steps, `r` = combined reward):**
| Steps | gen_track | mountain | Combined | Note |
|---|---|---|---|---|
| 6k | 91s | 130s | 10.7r | Early |
| 18k | 100s | 100s | 15.9r | Improving |
| 36k | 161s | 160s | 26.2r | ⭐ |
| 42k | 160s | 159s | 28.9r | ⭐ |
| 60k | 164s | 163s | — | Plateau |
| 78k | 169s | 168s | 29.2r | ⭐ |
| 90k | 173s | 172s | — | End |
**Evaluation results (best_model, 3 sets per track):**
| Track | Set 1 | Set 2 | Set 3 | Mean | Verdict |
|---|---|---|---|---|---|
| mountain_track (trained) | 195 | 196 | 192 | **194** | ❌ |
| generated_track (trained) | 192 | 194 | 192 | **193** | ❌ |
| generated_road (zero-shot) | 192 | 196 | 194 | **194** | ❌ |
| mini_monaco (zero-shot) | 194 | 192 | 196 | **194** | ❌ |
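
The means in the table can be rechecked, and the spread quantified, with a few lines of Python. The numbers are copied from the table above; the dictionary layout and variable names are just for this sketch.

```python
from statistics import mean, pstdev

# Steps survived per eval set, copied from the table above.
eval_steps = {
    "mountain_track (trained)": [195, 196, 192],
    "generated_track (trained)": [192, 194, 192],
    "generated_road (zero-shot)": [192, 196, 194],
    "mini_monaco (zero-shot)": [194, 192, 196],
}

summary = {
    track: {
        "mean": mean(sets),
        "min": min(sets),
        "max": max(sets),
        "std": pstdev(sets),  # population std dev over the 3 sets
    }
    for track, sets in eval_steps.items()
}

for track, s in summary.items():
    print(f"{track}: mean={s['mean']:.1f} "
          f"range={s['min']}-{s['max']} std={s['std']:.2f}")

# Pool all 12 sets to see the overall spread across tracks.
overall = [v for sets in eval_steps.values() for v in sets]
print(f"overall: min={min(overall)} max={max(overall)}")
```
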
**Analysis:**
- ✅ No circular driving (efficiency gate works)
- ✅ Remarkably consistent: every track averages 193-194 steps, and all 12 eval sets fall within 192-196
- ✅ Parallel env infrastructure is stable and reliable
- ❌ Model plateaus: training evals creep from ~160 to ~173 steps between 36k and 90k, and final evals stop at ~194
- ❌ Much worse than Exp 9 (mountain only: 2000/2000) or Wave 4 Trial 9 (2000/2000)
- The consistency across all 4 tracks (including zero-shot) suggests the model
learned a generic short-drive policy, not track-specific features
- Possible cause: 90k steps may be insufficient for 2-env parallel training
(effective steps per track = 45k each), or the efficiency gate may be
suppressing early exploration
**Key findings:**
1. Parallel DummyVecEnv works mechanically — this is the right infrastructure
2. v6 reward prevents circular driving
3. But 90k steps with 2 parallel envs may not be enough training budget
4. Compare: Exp 9 (single track, 90k steps, v5) → 2000 steps. Exp 11b
(2 tracks, 90k steps, v6) → 194 steps. The training budget per track
is halved AND the reward is harder to exploit, so the drop cannot yet be
attributed to either change alone.
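
The budget arithmetic behind findings 3 and 4 is simple enough to write down. The sketch follows the notes' own accounting, in which the total step count is shared evenly across the parallel envs; `per_env_steps` is a hypothetical helper name.

```python
def per_env_steps(total_timesteps: int, n_envs: int) -> int:
    """Steps each env contributes when the total is split evenly across envs."""
    return total_timesteps // n_envs


# Exp 9 baseline: one track, 90k steps.
baseline = per_env_steps(90_000, n_envs=1)

# Exp 11b and the proposed larger budgets, all with two parallel envs.
for total in (90_000, 180_000, 250_000):
    share = per_env_steps(total, n_envs=2)
    print(f"total={total:>7}  per-track={share:>7}  "
          f"vs Exp 9 baseline: {share / baseline:.2f}x")
```

With two envs, 180k total steps matches the 90k per-track budget of Exp 9, and 250k exceeds it.
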
**Next experiments to consider:**
- Increase total_timesteps to 180k-250k (restore per-track budget)
- Try v6 reward on single track first to isolate reward vs multi-track effects
- Try v5 reward with parallel envs but longer training (accept some circling)
- Check if efficiency gate triggers too aggressively during normal cornering
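
For the last item, a rough geometric check is possible before touching the simulator. The sketch assumes, without confirmation from the notes, that efficiency means net displacement divided by distance travelled over the gate's window. For a constant-radius turn through angle θ this is chord length over arc length, 2·sin(θ/2)/θ, and the radius cancels. `arc_efficiency` and `gate_fires` are hypothetical helpers.

```python
import math


def arc_efficiency(turn_angle_rad: float) -> float:
    """Chord / arc length for a constant-radius turn; the radius cancels out."""
    if turn_angle_rad == 0.0:
        return 1.0  # straight line: displacement equals distance travelled
    return abs(2.0 * math.sin(turn_angle_rad / 2.0) / turn_angle_rad)


def gate_fires(turn_angle_deg: float, threshold: float = 0.15) -> bool:
    """True if the assumed efficiency falls below the gate threshold."""
    return arc_efficiency(math.radians(turn_angle_deg)) < threshold


# Heading change covered by one gate window, in degrees.
for angle in (45, 90, 180, 270, 315, 360):
    eff = arc_efficiency(math.radians(angle))
    print(f"{angle:>3} deg  efficiency={eff:.3f}  gate_fires={gate_fires(angle)}")
```

Under this assumed definition, a 180° hairpin that fits inside one window scores 2/π ≈ 0.64, well above 0.15, and the gate only fires once the window covers most of a full circle. If the real gate measures efficiency differently, or over a window long enough to span several corners, this conclusion may not hold, so logging the gate's value during known-good laps remains the direct test.
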