docs: Exp10 eval results — total failure, crashes on all tracks (massive regression from Exp9/W4T9)

2026-04-19 10:19:16 -04:00 · 2026-04-19 10:19:16 -04:00 · 3d04b53a86
parent 6e9546cd22
commit 3d04b53a86
2 changed files with 70 additions and 0 deletions
--- a/agent/test-results/2026-04-19_10-15_exp10-two-tracks.log
+++ b/agent/test-results/2026-04-19_10-15_exp10-two-tracks.log
@@ -0,0 +1,38 @@
+[10:15:15] Model:    models/exp10-two-tracks/best_model.zip
+[10:15:15] Sets:     3
+[10:15:15] Max steps:2000
+[10:15:15] Log file: /home/paulh/projects/donkeycar-rl-autoresearch/agent/test-results/2026-04-19_10-15_exp10-two-tracks.log
+[10:15:15] 
+==================================================
+[10:15:15] SET 1 of 3
+[10:15:15] ==================================================
+[10:15:32]   Set1 mountain_track      :  178 steps     12.1 reward  ❌ crash@178
+[10:15:47]   Set1 generated_track     :   99 steps      7.2 reward  ❌ crash@99
+[10:16:03]   Set1 generated_road      :  135 steps     11.1 reward  ❌ crash@135
+[10:16:18]   Set1 mini_monaco         :  111 steps      5.2 reward  ❌ crash@111
+[10:16:20] 
+==================================================
+[10:16:20] SET 2 of 3
+[10:16:20] ==================================================
+[10:16:34]   Set2 mountain_track      :  179 steps     11.2 reward  ❌ crash@179
+[10:16:49]   Set2 generated_track     :   82 steps      6.1 reward  ❌ crash@82
+[10:17:06]   Set2 generated_road      :  223 steps     29.8 reward  ❌ crash@223
+[10:17:22]   Set2 mini_monaco         :  133 steps      6.4 reward  ❌ crash@133
+[10:17:24] 
+==================================================
+[10:17:24] SET 3 of 3
+[10:17:24] ==================================================
+[10:17:38]   Set3 mountain_track      :  179 steps     11.9 reward  ❌ crash@179
+[10:17:53]   Set3 generated_track     :   88 steps      5.6 reward  ❌ crash@88
+[10:18:08]   Set3 generated_road      :  105 steps      7.0 reward  ❌ crash@105
+[10:18:24]   Set3 mini_monaco         :  129 steps      5.9 reward  ❌ crash@129
+[10:18:26] 
+==================================================
+[10:18:26] SUMMARY (3 sets, max 2000 steps per run)
+[10:18:26] ==================================================
+[10:18:26]   ❌ mountain_track      : 178/179/179  mean=179
+[10:18:26]   ❌ generated_track     : 99/82/88  mean=90
+[10:18:26]   ❌ generated_road      : 135/223/105  mean=154
+[10:18:26]   ❌ mini_monaco         : 111/133/129  mean=124
+[10:18:26] 
+Full log saved to: /home/paulh/projects/donkeycar-rl-autoresearch/agent/test-results/2026-04-19_10-15_exp10-two-tracks.log
--- a/docs/TEST_HISTORY.md
+++ b/docs/TEST_HISTORY.md
@@ -240,3 +240,35 @@ Goal: model that is reliable on both training tracks, then test generalisation t
  generated_road improved, mini_monaco TBD
 - **This is essentially Trial 9 repeated with:** v5 reward + throttle_min=0.2 + proper checkpointing + exploit fix

+### Exp 10 — Evaluation Results (3-set test, 2026-04-19)
+
+**Model tested:** `models/exp10-two-tracks/best_model.zip`
+**Result: TOTAL FAILURE — crashes on every track, every set.**
+
+| Track | Set 1 | Set 2 | Set 3 | Mean | Verdict |
+|---|---|---|---|---|---|
+| mountain_track (trained) | 178 | 179 | 179 | **179** | ❌ Crashes at same spot every time |
+| generated_track (trained) | 99 | 82 | 88 | **90** | ❌ Crashes almost immediately |
+| generated_road (zero-shot) | 135 | 223 | 105 | **154** | ❌ Crashes early |
+| mini_monaco (zero-shot) | 111 | 133 | 129 | **124** | ❌ Crashes early |
+
+**Comparison to previous best models:**
+- Exp 9 (mountain only): mountain_track was 2000/2000 every time → now a mean of 179 steps. **91% regression** in steps completed.
+- Wave 4 Trial 9 (generated_track + mountain_track via autoresearch): generated_track 2000/2000, mini_monaco 2000/2000 → now 90 and 124.
+
+**Analysis:**
+- The round-robin track switching every 6,000 steps via `multitrack_runner.train_multitrack()`
+  produced a model that learned NEITHER track, which is consistent with catastrophic interference.
+- Wave 4 Trial 9 used the same two tracks but via the autoresearch controller with different
+  hyperparameters (switch=6,851, lr=0.000725, 90k steps). The key difference is likely in
+  HOW the environment switching works — `multitrack_runner` closes and reopens envs,
+  potentially disrupting PPO's rollout buffer and value function estimates.
+- mountain_track crashes at step 178 or 179 in all 3 sets, which suggests the model has
+  learned a fixed degenerate policy (e.g. always turning one direction) rather than responding to vision.
+
+**Key question:** Why did Wave 4 Trial 9 succeed with similar parameters while Exp 10 failed?
+  Possible causes: (1) env close/reopen resets PPO internal state, (2) `best_model` selection
+  criteria differ, (3) the `multitrack_runner` wrapping chain differs from the autoresearch controller's.
+
+**Full log:** `agent/test-results/2026-04-19_10-15_exp10-two-tracks.log`
+
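
The means and the 91% figure in the table can be re-derived from the per-run log lines. The sketch below is not part of the repository: the regex, `steps_by_track`, and `summarize` are hypothetical helpers written for this note, and the sample text is copied from the log with timestamps and status markers trimmed. It also reports the spread of crash steps per track, a rough way to see which tracks crash at the same spot each time.

```python
import re
from collections import defaultdict
from statistics import mean

# Per-run lines copied from the evaluation log (timestamps and markers trimmed).
LOG = """\
Set1 mountain_track      :  178 steps     12.1 reward  crash@178
Set1 generated_track     :   99 steps      7.2 reward  crash@99
Set1 generated_road      :  135 steps     11.1 reward  crash@135
Set1 mini_monaco         :  111 steps      5.2 reward  crash@111
Set2 mountain_track      :  179 steps     11.2 reward  crash@179
Set2 generated_track     :   82 steps      6.1 reward  crash@82
Set2 generated_road      :  223 steps     29.8 reward  crash@223
Set2 mini_monaco         :  133 steps      6.4 reward  crash@133
Set3 mountain_track      :  179 steps     11.9 reward  crash@179
Set3 generated_track     :   88 steps      5.6 reward  crash@88
Set3 generated_road      :  105 steps      7.0 reward  crash@105
Set3 mini_monaco         :  129 steps      5.9 reward  crash@129
"""

# Assumed line shape: "Set<N> <track> : <steps> steps <reward> reward ..."
RUN = re.compile(r"Set(\d+)\s+(\w+)\s*:\s*(\d+) steps\s+([\d.]+) reward")
MAX_STEPS = 2000  # "Max steps" from the log header


def steps_by_track(log_text):
    """Group the step count of every run by track name, in log order."""
    tracks = defaultdict(list)
    for match in RUN.finditer(log_text):
        tracks[match.group(2)].append(int(match.group(3)))
    return dict(tracks)


def summarize(steps, max_steps=MAX_STEPS):
    """Return rounded mean, crash-step spread, and % drop versus a full run."""
    avg = mean(steps)
    return {
        "mean": round(avg),
        "spread": max(steps) - min(steps),
        "drop_pct": round(100 * (1 - avg / max_steps)),
    }


if __name__ == "__main__":
    for track, steps in steps_by_track(LOG).items():
        print(track, steps, summarize(steps))
```

On these numbers mountain_track has a spread of one step across the three sets, while generated_road varies by more than a hundred, so the "same spot every time" observation applies to mountain_track only.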