From db1274174fb6cad21cd966daa4687e5e8013ce5a Mon Sep 17 00:00:00 2001
From: Paul Huliganga
Date: Sun, 19 Apr 2026 10:29:16 -0400
Subject: [PATCH] =?UTF-8?q?docs:=20Exp10=20vs=20Exp9=20vs=20Wave4=20Trial?=
 =?UTF-8?q?=209=20root=20cause=20analysis=20=E2=80=94=20random=20seed=20lo?=
 =?UTF-8?q?ttery?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

---
 docs/TEST_HISTORY.md | 62 ++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 62 insertions(+)

diff --git a/docs/TEST_HISTORY.md b/docs/TEST_HISTORY.md
index 0c30451..c63440e 100644
--- a/docs/TEST_HISTORY.md
+++ b/docs/TEST_HISTORY.md
@@ -272,3 +272,65 @@ Goal: model that is reliable on both training tracks, then test generalisation t
 
 **Full log:** `agent/test-results/2026-04-19_10-15_exp10-two-tracks.log`
 
+### Exp 9 vs Exp 10: Root Cause Analysis
+
+| Aspect | Exp 9 (worked ✅) | Exp 10 (failed ❌) |
+|---|---|---|
+| **Tracks** | mountain_track **only** | generated_track + mountain_track (round-robin) |
+| **Env setup** | `VecTransposeImage(DummyVecEnv([make_env]))`, created ONCE, never closed | `wrap_env(raw)` passed to PPO, which auto-wraps; **closed and reopened** every 6k steps |
+| **Track switching** | None: single env for the entire 90k steps | `close_and_switch()`: close env, exit_scene, sleep, gym.make new track |
+| **PPO continuity** | Repeated `model.learn()` calls with `reset_num_timesteps=False`, same env | `model.learn()` + `model.set_env(new_env)` after each switch |
+| **Eval between segments** | Direct `env.reset()` + predict loop on same env | Same, but env may be a different track than what was just trained |
+| **Best model selection** | Based on eval reward on mountain_track | Based on segment reward, which could come from either track |
+
+**Conclusion:** Exp 9 kept a single persistent env connection for all 90k steps.
+Exp 10 closed and reopened the env every 6k steps, rebinding it with `model.set_env()`.
+This likely disrupts PPO's rollout buffer, value estimates, and observation normalization.
+Exp 9 was a completely different (simpler) script with no track switching at all.
+
+### Exp 10 vs Wave 4 Trial 9: Why Did Trial 9 Work?
+
+Wave 4 Trial 9 used nearly identical hyperparameters to Exp 10 and the SAME
+`multitrack_runner.py` code, yet Trial 9 scored 1435 on mini_monaco (zero-shot)
+while Exp 10 crashed on every track in under 180 steps.
+
+**Wave 4 Trial 9 parameters:**
+- lr=0.000725, steps_per_switch=6,851, total_timesteps=89,893
+- Trained on generated_track + mountain_track (same as Exp 10)
+- Used `multitrack_runner.py` via CLI subprocess (same `close_and_switch` logic)
+
+**Exp 10 parameters:**
+- lr=0.000725, steps_per_switch=6,000, total_timesteps=90,000
+- Nearly identical to Trial 9
+
+**But Wave 4 was mostly failures too:**
+
+| Metric | Value |
+|---|---|
+| Total Wave 4 trials | 25 |
+| Scores > 500 | 4 / 25 (16%) |
+| Scores > 200 | 5 / 25 (20%) |
+| Median score | 111.3 |
+| Mean score | 343.8 |
+| Std deviation | 566.2 |
+
+The top 4 scores (1943, 1573, 1543, 1435) are massive outliers: 80% of trials
+scored below 200. Trial 0 (score 1943) was later found to predate the exploit
+patch, and Trial 14 (1573) and Trial 25 (1543) showed inconsistent driving when
+re-tested (see STATE.md).
+
+**The real conclusion:** Trial 9's success was likely due to **lucky random
+initialization of CNN weights**. With 80% of trials failing under the same
+training methodology, the multitrack round-robin approach via `close_and_switch`
+is fundamentally unreliable. The few successes are random seed lottery winners,
+not evidence that the method works.
+
+**Wave 5 reproduction attempt:** We tried training on generated_track only
+(single track, no switching, same lr=0.000725, 90k steps) to test whether
+the track switching was the problem. Result stored in `models/wave5-gentrack-only/`.
+(Results were poor: we could not reproduce Trial 9's quality.)
+
+**Open question:** Is there a reliable way to do multi-track training, or
+should we focus on single-track training with domain randomization (lighting,
+camera angle) to achieve generalization instead?
+
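
The "random seed lottery" reading in this patch can be sanity-checked with a little arithmetic on the Wave 4 table. The sketch below is not part of the patch or the repository. It uses only the 4-of-25 count of scores above 500 from the table, the helper names `wilson_interval` and `p_at_least_one` are hypothetical, and it assumes every trial is an independent draw with the same success probability, which is a simplification.

```python
import math


def wilson_interval(successes, trials, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95%)."""
    p_hat = successes / trials
    denom = 1 + z**2 / trials
    centre = (p_hat + z**2 / (2 * trials)) / denom
    half_width = (
        z * math.sqrt(p_hat * (1 - p_hat) / trials + z**2 / (4 * trials**2)) / denom
    )
    return centre - half_width, centre + half_width


def p_at_least_one(p_success, runs):
    """Chance that at least one of `runs` independent trials succeeds."""
    return 1 - (1 - p_success) ** runs


# From the Wave 4 table: 4 of 25 trials scored above 500.
low, high = wilson_interval(4, 25)
print(f"per-trial success rate: 16% (95% interval roughly {low:.0%} to {high:.0%})")

# How likely is at least one "winner" for a given number of independent runs?
for runs in (1, 5, 10, 25):
    print(f"{runs:>2} runs -> P(at least one success) = {p_at_least_one(4 / 25, runs):.2f}")
```

Under these assumptions, a single new run at these settings has well under an even chance of landing among the high scorers, so one good or one bad run is weak evidence about the method either way. If the three re-tested trials are excluded, the success count passed to `wilson_interval` would be lower still.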