From db1274174fb6cad21cd966daa4687e5e8013ce5a Mon Sep 17 00:00:00 2001
From: Paul Huliganga <paje0101@gmail.com>
Date: Sun, 19 Apr 2026 10:29:16 -0400
Subject: [PATCH] =?UTF-8?q?docs:=20Exp10=20vs=20Exp9=20vs=20Wave4=20Trial?=
 =?UTF-8?q?=209=20root=20cause=20analysis=20=E2=80=94=20random=20seed=20lo?=
 =?UTF-8?q?ttery?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

---
 docs/TEST_HISTORY.md | 62 ++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 62 insertions(+)

diff --git a/docs/TEST_HISTORY.md b/docs/TEST_HISTORY.md
index 0c30451..c63440e 100644
--- a/docs/TEST_HISTORY.md
+++ b/docs/TEST_HISTORY.md
@@ -272,3 +272,65 @@ Goal: model that is reliable on both training tracks, then test generalisation t
 
 **Full log:** `agent/test-results/2026-04-19_10-15_exp10-two-tracks.log`
 
+### Exp 9 vs Exp 10 — Root Cause Analysis
+
+| Aspect | Exp 9 (worked ✅) | Exp 10 (failed ❌) |
+|---|---|---|
+| **Tracks** | mountain_track **only** | generated_track + mountain_track (round-robin) |
+| **Env setup** | `VecTransposeImage(DummyVecEnv([make_env]))` — created ONCE, never closed | `wrap_env(raw)` passed to PPO, which auto-wraps; **closed and reopened** every 6k steps |
+| **Track switching** | None — single env for entire 90k steps | `close_and_switch()` — close env, exit_scene, sleep, gym.make new track |
+| **PPO continuity** | Repeated `model.learn()` calls with `reset_num_timesteps=False`, all on the same env | `model.learn()` + `model.set_env(new_env)` after each switch |
+| **Eval between segments** | Direct `env.reset()` + predict loop on same env | Same, but env may be a different track than what was just trained |
+| **Best model selection** | Based on eval reward on mountain_track | Based on segment reward — could be from either track |
+
+**Conclusion:** Exp 9 kept a single persistent env connection for all 90k steps;
+it was a completely different (simpler) script with no track switching at all.
+Exp 10 closed and reopened the env every 6k steps and re-attached it with `model.set_env()`.
+This likely disrupts PPO's rollout buffer, value estimates, and observation normalization.
+
+### Exp 10 vs Wave 4 Trial 9 — Why Did Trial 9 Work?
+
+Wave 4 Trial 9 used nearly identical hyperparameters to Exp 10 and the same
+`multitrack_runner.py` code, yet Trial 9 scored 1435 on mini_monaco (zero-shot)
+while Exp 10 crashed on every track in under 180 steps.
+
+**Wave 4 Trial 9 parameters:**
+- lr=0.000725, steps_per_switch=6,851, total_timesteps=89,893
+- Trained on generated_track + mountain_track (same as Exp 10)
+- Used `multitrack_runner.py` via CLI subprocess (same close_and_switch logic)
+
+**Exp 10 parameters:**
+- lr=0.000725, steps_per_switch=6,000, total_timesteps=90,000
+- Same lr as Trial 9; steps_per_switch is 851 lower and total_timesteps is 107 higher
+
+**But Wave 4 was mostly failures too:**
+
+| Metric | Value |
+|---|---|
+| Total Wave 4 trials | 25 |
+| Scores > 500 | 4 / 25 (16%) |
+| Scores > 200 | 5 / 25 (20%) |
+| Median score | 111.3 |
+| Mean score | 343.8 |
+| Std deviation | 566.2 |
+
+The top 4 scores (1943, 1573, 1543, 1435) are massive outliers: 80% of trials
+scored below 200. Trial 0 (score 1943) was later found to be pre-exploit-patch,
+and Trial 14 (1573) and Trial 25 (1543) showed inconsistent driving when
+re-tested (see STATE.md).
+
+**The real conclusion:** Trial 9's success was likely due to **lucky random
+initialization of CNN weights**. With 80% of trials failing under the same
+training methodology, the multitrack round-robin approach via close_and_switch
+is fundamentally unreliable. The few successes are random seed lottery winners,
+not evidence that the method works.
+
+**Wave 5 reproduction attempt:** We tried training on generated_track only
+(single track, no switching, same lr=0.000725, 90k steps) to test whether
+track switching was the problem. Results were poor and did not reproduce
+Trial 9's quality; the output is stored in `models/wave5-gentrack-only/`.
+
+**Open question:** Is there a reliable way to do multi-track training, or
+should we focus on single-track training with domain randomization (lighting,
+camera angle) to achieve generalization instead?
+