docs: Exp10 eval results — total failure, crashes on all tracks (massive regression from Exp9/W4T9)

Paul Huliganga 2026-04-19 10:19:16 -04:00
parent 6e9546cd22
commit 3d04b53a86
2 changed files with 70 additions and 0 deletions


@@ -0,0 +1,38 @@
[10:15:15] Model: models/exp10-two-tracks/best_model.zip
[10:15:15] Sets: 3
[10:15:15] Max steps:2000
[10:15:15] Log file: /home/paulh/projects/donkeycar-rl-autoresearch/agent/test-results/2026-04-19_10-15_exp10-two-tracks.log
[10:15:15]
==================================================
[10:15:15] SET 1 of 3
[10:15:15] ==================================================
[10:15:32] Set1 mountain_track : 178 steps 12.1 reward ❌ crash@178
[10:15:47] Set1 generated_track : 99 steps 7.2 reward ❌ crash@99
[10:16:03] Set1 generated_road : 135 steps 11.1 reward ❌ crash@135
[10:16:18] Set1 mini_monaco : 111 steps 5.2 reward ❌ crash@111
[10:16:20]
==================================================
[10:16:20] SET 2 of 3
[10:16:20] ==================================================
[10:16:34] Set2 mountain_track : 179 steps 11.2 reward ❌ crash@179
[10:16:49] Set2 generated_track : 82 steps 6.1 reward ❌ crash@82
[10:17:06] Set2 generated_road : 223 steps 29.8 reward ❌ crash@223
[10:17:22] Set2 mini_monaco : 133 steps 6.4 reward ❌ crash@133
[10:17:24]
==================================================
[10:17:24] SET 3 of 3
[10:17:24] ==================================================
[10:17:38] Set3 mountain_track : 179 steps 11.9 reward ❌ crash@179
[10:17:53] Set3 generated_track : 88 steps 5.6 reward ❌ crash@88
[10:18:08] Set3 generated_road : 105 steps 7.0 reward ❌ crash@105
[10:18:24] Set3 mini_monaco : 129 steps 5.9 reward ❌ crash@129
[10:18:26]
==================================================
[10:18:26] SUMMARY (3 sets, max 2000 steps per run)
[10:18:26] ==================================================
[10:18:26] ❌ mountain_track : 178/179/179 mean=179
[10:18:26] ❌ generated_track : 99/82/88 mean=90
[10:18:26] ❌ generated_road : 135/223/105 mean=154
[10:18:26] ❌ mini_monaco : 111/133/129 mean=124
[10:18:26]
Full log saved to: /home/paulh/projects/donkeycar-rl-autoresearch/agent/test-results/2026-04-19_10-15_exp10-two-tracks.log


@@ -240,3 +240,35 @@ Goal: model that is reliable on both training tracks, then test generalisation t
generated_road improved, mini_monaco TBD
- **This is essentially Trial 9 repeated with:** v5 reward + throttle_min=0.2 + proper checkpointing + exploit fix
### Exp 10 — Evaluation Results (3-set test, 2026-04-19)
**Model tested:** `models/exp10-two-tracks/best_model.zip`
**Result: TOTAL FAILURE — crashes on every track, every set.**
| Track | Set 1 | Set 2 | Set 3 | Mean | Verdict |
|---|---|---|---|---|---|
| mountain_track (trained) | 178 | 179 | 179 | **179** | ❌ Crashes at the same spot every time |
| generated_track (trained) | 99 | 82 | 88 | **90** | ❌ Crashes almost immediately |
| generated_road (zero-shot) | 135 | 223 | 105 | **154** | ❌ Crashes early |
| mini_monaco (zero-shot) | 111 | 133 | 129 | **124** | ❌ Crashes early |
**Comparison to previous best models:**
- Exp 9 (mountain only): mountain_track was 2000/2000 every time → now 179. **91% regression.**
- Wave 4 Trial 9 (generated_track + mountain_track via autoresearch): generated_track 2000/2000, mini_monaco 2000/2000 → now 90 and 124.
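
The means and the 91% figure can be reproduced from the per-set step counts. The sketch below is illustrative only: the step counts are copied from the table, the 2000-step baselines are the ones quoted for Exp 9 and Wave 4 Trial 9, and `regression_pct` is a hypothetical helper name. generated_road is left out of the comparison because no earlier step count is stated for it here.

```python
from statistics import mean

# Steps survived per set, copied from the Exp 10 results table.
exp10_steps = {
    "mountain_track": [178, 179, 179],
    "generated_track": [99, 82, 88],
    "generated_road": [135, 223, 105],
    "mini_monaco": [111, 133, 129],
}

# Baselines quoted in the comparison (Exp 9 and Wave 4 Trial 9 reached the 2000-step cap).
baseline_steps = {
    "mountain_track": 2000,
    "generated_track": 2000,
    "mini_monaco": 2000,
}


def regression_pct(baseline: float, current: float) -> float:
    """Percentage drop in steps survived relative to a baseline run."""
    return 100.0 * (baseline - current) / baseline


for track, steps in exp10_steps.items():
    m = mean(steps)
    line = f"{track:16s} mean={m:6.1f}"
    if track in baseline_steps:
        line += f"  drop={regression_pct(baseline_steps[track], m):.1f}%"
    print(line)
```
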
**Analysis:**
- The round-robin track switching every 6,000 steps via `multitrack_runner.train_multitrack()`
produced a model that learned NEITHER track, which is consistent with catastrophic interference.
- Wave 4 Trial 9 used the same two tracks but via the autoresearch controller with different
hyperparameters (switch=6,851, lr=0.000725, 90k steps). The key difference is likely in
HOW the environment switching works — `multitrack_runner` closes and reopens envs,
potentially disrupting PPO's rollout buffer and value function estimates.
- mountain_track crashes at step 178 or 179 in all 3 sets, which suggests the model has
learned a fixed degenerate policy (e.g. always turning one direction) rather than responding to vision.
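
One way to test the degenerate-policy hypothesis is to feed the policy a varied set of observations and measure how much the steering output moves. The following is a minimal sketch under stated assumptions: `predict` is a hypothetical callable returning an action whose first element is steering, and the toy policies and scalar "frames" stand in for the real model and camera images. If the model is loaded through Stable-Baselines3 (an assumption; the notes only mention PPO), `predict` could wrap `model.predict(obs, deterministic=True)[0]`.

```python
from statistics import pstdev
from typing import Callable, Iterable, Sequence


def steering_spread(predict: Callable[[object], Sequence[float]],
                    observations: Iterable[object]) -> float:
    """Population std-dev of the steering command over a set of observations.

    Assumes action[0] is steering. A spread near zero means the policy
    outputs the same steering regardless of what it sees.
    """
    steering = [float(predict(obs)[0]) for obs in observations]
    return pstdev(steering)


# Toy stand-ins for a real model (hypothetical, for illustration only).
def constant_policy(obs):
    return (0.8, 0.2)          # always steers the same way


def reactive_policy(obs):
    return (obs - 0.5, 0.2)    # steering depends on the observation


frames = [0.0, 0.25, 0.5, 0.75, 1.0]  # scalar placeholders for camera frames
print("constant:", steering_spread(constant_policy, frames))
print("reactive:", steering_spread(reactive_policy, frames))
```

A spread close to zero on frames taken from visibly different parts of the track would support the fixed-policy explanation; a healthy spread would point elsewhere.
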
**Key question:** Why did Wave 4 Trial 9 succeed with similar parameters when Exp 10 failed?
Possible causes: (1) env close/reopen resets PPO internal state, (2) the `best_model` selection
criteria differ, (3) the `multitrack_runner` wrapping chain differs from the autoresearch controller's.
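
To gauge how often cause (1) could come into play, it helps to enumerate the switch points, since each one is an env close/reopen. The sketch below is not the `multitrack_runner` implementation; `round_robin_schedule` is a hypothetical helper, and the track order is an assumption. It uses the Wave 4 Trial 9 numbers (switch=6,851, 90k steps) because the total step count for Exp 10 is not stated in these notes.

```python
from typing import Iterator, List, Tuple


def round_robin_schedule(tracks: List[str], total_steps: int,
                         switch_every: int) -> Iterator[Tuple[str, int]]:
    """Yield (track, steps) segments, cycling through tracks until total_steps is used."""
    done = 0
    i = 0
    while done < total_steps:
        steps = min(switch_every, total_steps - done)  # last segment may be short
        yield tracks[i % len(tracks)], steps
        done += steps
        i += 1


segments = list(round_robin_schedule(
    ["generated_track", "mountain_track"], total_steps=90_000, switch_every=6_851))

switches = len(segments) - 1  # one env close/reopen between consecutive segments
print(f"{len(segments)} segments, {switches} switches")
for track, steps in segments:
    print(f"  {track:16s} {steps:6d} steps")
```
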
**Full log:** `agent/test-results/2026-04-19_10-15_exp10-two-tracks.log`
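
For comparing this log against earlier runs, the per-run lines can be parsed into per-track step and reward lists. This is a sketch only: the line format is inferred from the lines in this commit's log, and `RUN_LINE` and `parse_runs` are hypothetical names, not part of the project.

```python
import re
from collections import defaultdict

# Matches lines such as:
# [10:15:32] Set1 mountain_track : 178 steps 12.1 reward ❌ crash@178
RUN_LINE = re.compile(
    r"\[(?P<time>\d{2}:\d{2}:\d{2})\]\s+Set(?P<set>\d+)\s+(?P<track>\w+)\s*:\s*"
    r"(?P<steps>\d+)\s+steps\s+(?P<reward>-?\d+(?:\.\d+)?)\s+reward"
)


def parse_runs(lines):
    """Return {track: [(steps, reward), ...]} for every per-run line found."""
    runs = defaultdict(list)
    for line in lines:
        m = RUN_LINE.search(line)
        if m:  # header, separator, and summary lines do not match and are skipped
            runs[m["track"]].append((int(m["steps"]), float(m["reward"])))
    return dict(runs)


sample = [
    "[10:15:15] SET 1 of 3",
    "[10:15:32] Set1 mountain_track : 178 steps 12.1 reward ❌ crash@178",
    "[10:16:34] Set2 mountain_track : 179 steps 11.2 reward ❌ crash@179",
    "[10:18:26] SUMMARY (3 sets, max 2000 steps per run)",
]
print(parse_runs(sample))
```
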