From 3d04b53a86126f316b27f62384f0e11d67703724 Mon Sep 17 00:00:00 2001
From: Paul Huliganga
Date: Sun, 19 Apr 2026 10:19:16 -0400
Subject: [PATCH] =?UTF-8?q?docs:=20Exp10=20eval=20results=20=E2=80=94=20to?=
 =?UTF-8?q?tal=20failure,=20crashes=20on=20all=20tracks=20(massive=20regre?=
 =?UTF-8?q?ssion=20from=20Exp9/W4T9)?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

---
 .../2026-04-19_10-15_exp10-two-tracks.log | 38 +++++++++++++++++++
 docs/TEST_HISTORY.md | 32 ++++++++++++++++
 2 files changed, 70 insertions(+)
 create mode 100644 agent/test-results/2026-04-19_10-15_exp10-two-tracks.log

diff --git a/agent/test-results/2026-04-19_10-15_exp10-two-tracks.log b/agent/test-results/2026-04-19_10-15_exp10-two-tracks.log
new file mode 100644
index 0000000..aea70f7
--- /dev/null
+++ b/agent/test-results/2026-04-19_10-15_exp10-two-tracks.log
@@ -0,0 +1,38 @@
+[10:15:15] Model: models/exp10-two-tracks/best_model.zip
+[10:15:15] Sets: 3
+[10:15:15] Max steps:2000
+[10:15:15] Log file: /home/paulh/projects/donkeycar-rl-autoresearch/agent/test-results/2026-04-19_10-15_exp10-two-tracks.log
+[10:15:15]
+==================================================
+[10:15:15] SET 1 of 3
+[10:15:15] ==================================================
+[10:15:32] Set1 mountain_track : 178 steps 12.1 reward ❌ crash@178
+[10:15:47] Set1 generated_track : 99 steps 7.2 reward ❌ crash@99
+[10:16:03] Set1 generated_road : 135 steps 11.1 reward ❌ crash@135
+[10:16:18] Set1 mini_monaco : 111 steps 5.2 reward ❌ crash@111
+[10:16:20]
+==================================================
+[10:16:20] SET 2 of 3
+[10:16:20] ==================================================
+[10:16:34] Set2 mountain_track : 179 steps 11.2 reward ❌ crash@179
+[10:16:49] Set2 generated_track : 82 steps 6.1 reward ❌ crash@82
+[10:17:06] Set2 generated_road : 223 steps 29.8 reward ❌ crash@223
+[10:17:22] Set2 mini_monaco : 133 steps 6.4 reward ❌ crash@133
+[10:17:24]
+==================================================
+[10:17:24] SET 3 of 3
+[10:17:24] ==================================================
+[10:17:38] Set3 mountain_track : 179 steps 11.9 reward ❌ crash@179
+[10:17:53] Set3 generated_track : 88 steps 5.6 reward ❌ crash@88
+[10:18:08] Set3 generated_road : 105 steps 7.0 reward ❌ crash@105
+[10:18:24] Set3 mini_monaco : 129 steps 5.9 reward ❌ crash@129
+[10:18:26]
+==================================================
+[10:18:26] SUMMARY (3 sets, max 2000 steps per run)
+[10:18:26] ==================================================
+[10:18:26] ❌ mountain_track : 178/179/179 mean=179
+[10:18:26] ❌ generated_track : 99/82/88 mean=90
+[10:18:26] ❌ generated_road : 135/223/105 mean=154
+[10:18:26] ❌ mini_monaco : 111/133/129 mean=124
+[10:18:26]
+Full log saved to: /home/paulh/projects/donkeycar-rl-autoresearch/agent/test-results/2026-04-19_10-15_exp10-two-tracks.log
diff --git a/docs/TEST_HISTORY.md b/docs/TEST_HISTORY.md
index 22f6f5c..0c30451 100644
--- a/docs/TEST_HISTORY.md
+++ b/docs/TEST_HISTORY.md
@@ -240,3 +240,35 @@ Goal: model that is reliable on both training tracks, then test generalisation t
   generated_road improved, mini_monaco TBD
 - **This is essentially Trial 9 repeated with:** v5 reward + throttle_min=0.2 + proper checkpointing + exploit fix
 
+### Exp 10 — Evaluation Results (3-set test, 2026-04-19)
+
+**Model tested:** `models/exp10-two-tracks/best_model.zip`
+**Result: TOTAL FAILURE. The model crashes on every track, in every set.**
+
+| Track | Set 1 | Set 2 | Set 3 | Mean | Verdict |
+|---|---|---|---|---|---|
+| mountain_track (trained) | 178 | 179 | 179 | **179** | ❌ Crashes at the same spot every time |
+| generated_track (trained) | 99 | 82 | 88 | **90** | ❌ Crashes almost immediately |
+| generated_road (zero-shot) | 135 | 223 | 105 | **154** | ❌ Crashes early |
+| mini_monaco (zero-shot) | 111 | 133 | 129 | **124** | ❌ Crashes early |
+
+**Comparison to previous best models:**
+- Exp 9 (mountain only): mountain_track was 2000/2000 every time → now 179. **91% regression.**
+- Wave 4 Trial 9 (generated_track + mountain_track via autoresearch): generated_track 2000/2000, mini_monaco 2000/2000 → now 90 and 124.
+
+**Analysis:**
+- The round-robin track switching every 6,000 steps via `multitrack_runner.train_multitrack()`
+  produced a model that learned NEITHER track. This looks like catastrophic interference.
+- Wave 4 Trial 9 used the same two tracks but via the autoresearch controller with different
+  hyperparameters (switch=6,851, lr=0.000725, 90k steps). The key difference is likely in
+  HOW the environment switching works: `multitrack_runner` closes and reopens envs,
+  potentially disrupting PPO's rollout buffer and value function estimates.
+- mountain_track crashes at step 178 or 179 in all 3 sets, which suggests the model has
+  learned a fixed degenerate policy (always turn one direction) rather than responding to vision.
+
+**Key question:** Why did Wave 4 Trial 9 succeed with similar parameters while Exp 10 failed?
+  Possible causes: (1) env close/reopen resets PPO internal state, (2) `best_model` selection
+  criteria differ, (3) multitrack_runner wrapping chain differs from autoresearch controller.
+
+**Full log:** `agent/test-results/2026-04-19_10-15_exp10-two-tracks.log`
+
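
Reviewer note (not part of the commit): the means in the table and the "91% regression" figure can be cross-checked with a few lines of Python. This is only a sketch. The step counts are copied from the SUMMARY block of the log above, and the names `steps`, `summarize`, and `summary` are made up for illustration. The drop is measured against the 2000-step cap, which the patch reports as the earlier result only for mountain_track (Exp 9) and for generated_track and mini_monaco (Wave 4 Trial 9); for generated_road it is just distance from the cap, not a regression.

```python
from statistics import mean

MAX_STEPS = 2000  # "Max steps" value from the eval log

# Steps survived in each of the 3 sets, copied from the log's SUMMARY block.
steps = {
    "mountain_track": [178, 179, 179],
    "generated_track": [99, 82, 88],
    "generated_road": [135, 223, 105],
    "mini_monaco": [111, 133, 129],
}


def summarize(runs, max_steps=MAX_STEPS):
    """Return (rounded mean steps, rounded percent shortfall vs. max_steps)."""
    avg = mean(runs)
    drop = 1 - avg / max_steps
    return round(avg), round(drop * 100)


summary = {track: summarize(runs) for track, runs in steps.items()}

for track, (avg, drop_pct) in summary.items():
    print(f"{track:16s} mean={avg:4d} shortfall_vs_cap={drop_pct}%")
```

The rounded means reproduce the table (179, 90, 154, 124), and the mountain_track shortfall comes out at 91%, matching the figure quoted against Exp 9.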