docs: Exp10 vs Exp9 vs Wave4 Trial 9 root cause analysis — random seed lottery

2026-04-19 10:29:16 -04:00 · 2026-04-19 10:29:16 -04:00 · db1274174f
parent 3d04b53a86
commit db1274174f
1 changed files with 62 additions and 0 deletions
--- a/docs/TEST_HISTORY.md
+++ b/docs/TEST_HISTORY.md
@@ -272,3 +272,65 @@ Goal: model that is reliable on both training tracks, then test generalisation t

 **Full log:** `agent/test-results/2026-04-19_10-15_exp10-two-tracks.log`

+### Exp 9 vs Exp 10 — Root Cause Analysis
+
+| Aspect | Exp 9 (worked ✅) | Exp 10 (failed ❌) |
+|---|---|---|
+| **Tracks** | mountain_track **only** | generated_track + mountain_track (round-robin) |
+| **Env setup** | `VecTransposeImage(DummyVecEnv([make_env]))` — created ONCE, never closed | `wrap_env(raw)` passed to PPO, which auto-wraps; **closed and reopened** every 6k steps |
+| **Track switching** | None — single env for entire 90k steps | `close_and_switch()` — close env, exit_scene, sleep, gym.make new track |
+| **PPO continuity** | Repeated `model.learn()` calls with `reset_num_timesteps=False`, all on the same env | `model.learn()` followed by `model.set_env(new_env)` after each switch |
+| **Eval between segments** | Direct `env.reset()` + predict loop on same env | Same, but env may be a different track than what was just trained |
+| **Best model selection** | Based on eval reward on mountain_track | Based on segment reward — could be from either track |
+
+**Conclusion:** Exp 9 kept a single persistent env connection for all 90k steps.
+Exp 10 closed and reopened the env every 6k steps and re-attached it with `model.set_env()`.
+This is a hypothesis, not a measurement: the switch may disrupt PPO's rollout buffer, value estimates, and observation normalization.
+Exp 9 also ran from a different, simpler script with no track switching, so the two runs differ in more than the env lifecycle.
+
+### Exp 10 vs Wave 4 Trial 9 — Why Did Trial 9 Work?
+
+Wave 4 Trial 9 used nearly identical hyperparameters to Exp 10 and the SAME
+`multitrack_runner.py` code, yet Trial 9 scored 1435 on mini_monaco (zero-shot)
+while Exp 10 crashed on every track in under 180 steps.
+
+**Wave 4 Trial 9 parameters:**
+- lr=0.000725, steps_per_switch=6,851, total_timesteps=89,893
+- Trained on generated_track + mountain_track (same as Exp 10)
+- Used `multitrack_runner.py` via CLI subprocess (same close_and_switch logic)
+
+**Exp 10 parameters:**
+- lr=0.000725, steps_per_switch=6,000, total_timesteps=90,000
+- Same lr as Trial 9; steps_per_switch is 851 lower and total_timesteps is 107 higher
+
+**But Wave 4 was mostly failures too:**
+
+| Metric | Value |
+|---|---|
+| Total Wave 4 trials | 25 |
+| Scores > 500 | 4 / 25 (16%) |
+| Scores > 200 | 5 / 25 (20%) |
+| Median score | 111.3 |
+| Mean score | 343.8 |
+| Std deviation | 566.2 |
+
+The top 4 scores (1943, 1573, 1543, 1435) are extreme outliers: 80% of trials
+scored below 200. Trial 0 (1943) was later found to predate the exploit patch,
+and Trials 14 (1573) and 25 (1543) drove inconsistently when
+re-tested (see STATE.md).
+
+**The real conclusion:** Trial 9's success was likely due to **lucky random
+initialization of CNN weights**. With 80% of trials failing under the same
+training methodology, the multitrack round-robin approach via close_and_switch
+is fundamentally unreliable. The few successes are random seed lottery winners,
+not evidence that the method works.
+
+**Wave 5 reproduction attempt:** We trained on generated_track only
+(single track, no switching, same lr=0.000725, 90k steps) to test whether
+track switching was the problem. Output is stored in `models/wave5-gentrack-only/`.
+Results were poor: the run could not reproduce Trial 9's quality.
+
+**Open question:** Is there a reliable way to do multi-track training, or
+should we focus on single-track training with domain randomization (lighting,
+camera angle) to achieve generalization instead?
+
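The Wave 4 table above reports 4 of 25 trials scoring over 500 and 5 of 25 over 200. As a rough check on how much those fractions can be trusted at this sample size, the sketch below computes a Wilson score interval for each. It is not part of the repository: the function name `wilson_interval` and the choice of a 95% interval (z = 1.96) are assumptions made for illustration, and only the counts come from the table.

```python
import math


def wilson_interval(successes: int, trials: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion.

    z = 1.96 corresponds to roughly 95% coverage (an assumed choice).
    """
    if trials <= 0:
        raise ValueError("trials must be positive")
    if not 0 <= successes <= trials:
        raise ValueError("successes must be between 0 and trials")

    p = successes / trials
    denom = 1.0 + z * z / trials
    centre = (p + z * z / (2 * trials)) / denom
    half_width = z * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials)) / denom
    return max(0.0, centre - half_width), min(1.0, centre + half_width)


# Counts taken from the Wave 4 summary table; nothing else is assumed about the data.
WAVE4_TRIALS = 25
for label, wins in (("score > 500", 4), ("score > 200", 5)):
    low, high = wilson_interval(wins, WAVE4_TRIALS)
    print(f"{label}: {wins}/{WAVE4_TRIALS} = {wins / WAVE4_TRIALS:.0%}, "
          f"95% interval {low:.1%} to {high:.1%}")
```

For 4 of 25, the interval runs from roughly 6% to 35%. So 25 trials bound the per-run success rate only loosely, and a rerun with fixed, recorded seeds would be needed to narrow it.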