docs: Exp10 vs Exp9 vs Wave4 Trial 9 root cause analysis — random seed lottery

Paul Huliganga 2026-04-19 10:29:16 -04:00
parent 3d04b53a86
commit db1274174f
1 changed files with 62 additions and 0 deletions

@@ -272,3 +272,65 @@ Goal: model that is reliable on both training tracks, then test generalisation t
**Full log:** `agent/test-results/2026-04-19_10-15_exp10-two-tracks.log`
### Exp 9 vs Exp 10 — Root Cause Analysis
| Aspect | Exp 9 (worked ✅) | Exp 10 (failed ❌) |
|---|---|---|
| **Tracks** | mountain_track **only** | generated_track + mountain_track (round-robin) |
| **Env setup** | `VecTransposeImage(DummyVecEnv([make_env]))` — created ONCE, never closed | `wrap_env(raw)` passed to PPO, which auto-wraps; **closed and reopened** every 6k steps |
| **Track switching** | None — single env for entire 90k steps | `close_and_switch()` — close env, exit_scene, sleep, gym.make new track |
| **PPO continuity** | `model.learn()` calls with `reset_num_timesteps=False`, all on the same env | `model.learn()` + `model.set_env(new_env)` after each switch |
| **Eval between segments** | Direct `env.reset()` + predict loop on same env | Same, but env may be a different track than what was just trained |
| **Best model selection** | Based on eval reward on mountain_track | Based on segment reward — could be from either track |
**Conclusion:** Exp 9 kept a single persistent env connection for all 90k steps.
Exp 10 closed and reopened the env every 6k steps and re-attached it with `model.set_env()`.
Each close/reopen cycle likely disrupts PPO's rollout buffer, value estimates, and observation normalization.
Exp 9 also ran from a completely different (simpler) script with no track switching at all.
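For reference, the Exp 10 column of the table can be read as the control flow sketched below. This is a reconstruction from the table only, not the code in `multitrack_runner.py`: the function name `run_round_robin` and the `make_env` callable are hypothetical, the `exit_scene` and sleep steps are omitted, and `model` is only assumed to expose Stable-Baselines3-style `set_env(env)` and `learn(total_timesteps=..., reset_num_timesteps=False)` methods.

```python
from typing import Callable, Iterable, List, Tuple


def run_round_robin(
    model,
    make_env: Callable[[str], object],
    schedule: Iterable[Tuple[str, int]],
) -> List[str]:
    """Train over (track, steps) segments, closing and reopening the env each time.

    Hypothetical helper: `model` is assumed to behave like an SB3 PPO instance,
    and `make_env(track)` is assumed to return a wrapped env with a `close()` method.
    Returns a log of the lifecycle events in the order they happened.
    """
    events: List[str] = []
    env = None
    for track, steps in schedule:
        if env is not None:
            # Exp 10 pattern: tear down the previous env before switching tracks.
            env.close()
            events.append("close")
        env = make_env(track)
        events.append(f"open:{track}")
        # Re-attach the new env, then continue the same timestep counter.
        model.set_env(env)
        model.learn(total_timesteps=steps, reset_num_timesteps=False)
        events.append(f"learn:{steps}")
    if env is not None:
        env.close()
        events.append("close")
    return events
```

The Exp 9 pattern corresponds to a schedule whose env is opened once and never closed between `learn()` calls, which is the difference the conclusion above points at.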
### Exp 10 vs Wave 4 Trial 9 — Why Did Trial 9 Work?
Wave 4 Trial 9 used nearly identical hyperparameters to Exp 10 and the same
`multitrack_runner.py` code, yet Trial 9 scored 1435 on mini_monaco (zero-shot)
while Exp 10 crashed on every track in under 180 steps.
**Wave 4 Trial 9 parameters:**
- lr=0.000725, steps_per_switch=6,851, total_timesteps=89,893
- Trained on generated_track + mountain_track (same as Exp 10)
- Used `multitrack_runner.py` via CLI subprocess (same close_and_switch logic)
**Exp 10 parameters:**
- lr=0.000725, steps_per_switch=6,000, total_timesteps=90,000
- Nearly identical to Trial 9: same lr, steps_per_switch lower by 851, total_timesteps higher by 107
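To make the two parameter sets concrete, the sketch below expands each into a round-robin segment schedule. It rests on assumptions the log does not confirm: that training starts on generated_track, and that any leftover steps run as a final shorter segment. The helper name `round_robin_schedule` is hypothetical.

```python
from itertools import cycle
from typing import List, Sequence, Tuple


def round_robin_schedule(
    tracks: Sequence[str], steps_per_switch: int, total_timesteps: int
) -> List[Tuple[str, int]]:
    """Split total_timesteps into (track, steps) segments, alternating tracks.

    Assumption: a remainder smaller than steps_per_switch becomes one last,
    shorter segment. The real runner may handle the remainder differently.
    """
    if steps_per_switch <= 0:
        raise ValueError("steps_per_switch must be positive")
    schedule: List[Tuple[str, int]] = []
    remaining = total_timesteps
    for track in cycle(tracks):
        if remaining <= 0:
            break
        steps = min(steps_per_switch, remaining)
        schedule.append((track, steps))
        remaining -= steps
    return schedule


TRACKS = ["generated_track", "mountain_track"]  # starting track is an assumption
exp10 = round_robin_schedule(TRACKS, 6_000, 90_000)
trial9 = round_robin_schedule(TRACKS, 6_851, 89_893)

print(len(exp10), exp10[-1])    # 15 segments, all 6,000 steps
print(len(trial9), trial9[-1])  # 14 segments, the last one only 830 steps
```

Under these assumptions Exp 10 switches tracks 14 times and Trial 9 switches 13 times, so the two runs go through a very similar number of close/reopen cycles.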
**But Wave 4 was mostly failures too:**
| Metric | Value |
|---|---|
| Total Wave 4 trials | 25 |
| Scores > 500 | 4 / 25 (16%) |
| Scores > 200 | 5 / 25 (20%) |
| Median score | 111.3 |
| Mean score | 343.8 |
| Std deviation | 566.2 |
The top 4 scores (1943, 1573, 1543, 1435) are massive outliers: 80% of trials
scored below 200. Trial 0 (score 1943) was later found to predate the exploit
patch, and Trial 14 (1573) and Trial 25 (1543) drove inconsistently when
re-tested (see STATE.md).
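The gap between the mean (343.8) and the median (111.3) follows directly from those four outliers. The short calculation below uses only the figures reported in the table and the paragraph above; because the mean is a rounded value, the derived numbers are approximate.

```python
n_trials = 25
mean_score = 343.8                  # reported mean, rounded
top4 = [1943, 1573, 1543, 1435]     # the four scores above 500

total = mean_score * n_trials                    # sum of all 25 scores
rest_total = total - sum(top4)                   # sum of the other 21 scores
rest_mean = rest_total / (n_trials - len(top4))  # their average
top4_share = sum(top4) / total                   # fraction of all points from the top 4

print(f"sum of all scores: {total:.0f}")            # 8595
print(f"sum of other 21:   {rest_total:.0f}")       # 2101
print(f"mean of other 21:  {rest_mean:.1f}")        # 100.0
print(f"top-4 share:       {top4_share:.1%}")       # 75.6%
```

So about three quarters of all points scored in Wave 4 come from four trials, and the remaining 21 trials average roughly 100, which is close to the reported median.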
**The real conclusion:** Trial 9's success was likely due to **lucky random
initialization of CNN weights**. With 80% of trials failing under the same
training methodology, the multitrack round-robin approach via close_and_switch
is fundamentally unreliable. The few successes are random seed lottery winners,
not evidence that the method works.
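The "lottery" framing can be put in numbers. The sketch below assumes that each training run is an independent draw with the Wave 4 success rate of 4/25, counting a score above 500 as a success. Both are assumptions: the threshold is arbitrary, and some of those four scores were later called into question, so the true rate may be lower. The function names are hypothetical.

```python
import math


def p_at_least_one_success(p_success: float, n_runs: int) -> float:
    """Probability that at least one of n independent runs succeeds."""
    return 1.0 - (1.0 - p_success) ** n_runs


def runs_needed(p_success: float, target: float) -> int:
    """Smallest number of independent runs giving >= target chance of one success."""
    if not (0.0 < p_success < 1.0 and 0.0 < target < 1.0):
        raise ValueError("p_success and target must be strictly between 0 and 1")
    return math.ceil(math.log(1.0 - target) / math.log(1.0 - p_success))


p = 4 / 25  # Wave 4: 4 of 25 trials scored above 500

print(f"single run fails with probability {1 - p:.0%}")  # 84%
print(f"runs for a 90% chance of one success: {runs_needed(p, 0.90)}")  # 14
```

Under these assumptions a single run such as Exp 10 fails 84% of the time, so its failure next to Trial 9's success is what the seed-lottery explanation would predict rather than evidence of a difference between the two setups.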
**Wave 5 reproduction attempt:** We trained on generated_track only
(single track, no switching, same lr=0.000725, 90k steps) to test whether
track switching was the problem. Results were poor and did not reproduce
Trial 9's quality. The result is stored in `models/wave5-gentrack-only/`.
**Open question:** Is there a reliable way to do multi-track training, or
should we focus on single-track training with domain randomization (lighting,
camera angle) to achieve generalization instead?