# Mountain candidate checkpoint evaluation — 2026-04-19
Deterministic eval on `mountain_track`, 9 episodes per model, max 2000 steps/episode.

| model | floor | success eps | full 2k eps | avg laps/ep | total laps | mean lap (all, s) | best lap (s) | avg steps | notes |
|---|---:|---:|---:|---:|---:|---:|---:|---:|---|
| exp14_base | 0.2 | 7/9 | 3/9 | 1.78 | 16 | 29.24 | 27.02 | 1332 | original champion |
| ft_006k | 0.4 | 1/9 | 0/9 | 0.11 | 1 | 21.36 | 21.36 | 335 | very fast when it works, extremely fragile |
| ft_024k | 0.4 | 4/9 | 0/9 | 0.56 | 5 | 21.58 | 20.53 | 575 | fast but fragile |
| ft_030k | 0.4 | 1/9 | 0/9 | 0.22 | 2 | 21.53 | 20.72 | 317 | very fast but extremely fragile |
| ft_036k | 0.2 | 9/9 | 6/9 | 2.78 | 25 | 27.93 | 26.16 | 1841 | best balance: fastest robust candidate |
| ft_042k | 0.2 | 8/9 | 4/9 | 1.89 | 17 | 29.25 | 27.09 | 1404 | decent, but worse than 36k |
| ft_048k | 0.2 | 6/9 | 3/9 | 1.44 | 13 | 31.15 | 28.31 | 1127 | degraded |

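
The `avg laps/ep` column looks like `total laps` divided by the 9 episodes. A minimal sketch that checks this assumption against the table values (the helper names are hypothetical and not part of the eval code):

```python
# Values copied from the table: model -> (total laps, reported avg laps/ep).
EPISODES = 9

TABLE = {
    "exp14_base": (16, 1.78),
    "ft_006k": (1, 0.11),
    "ft_024k": (5, 0.56),
    "ft_030k": (2, 0.22),
    "ft_036k": (25, 2.78),
    "ft_042k": (17, 1.89),
    "ft_048k": (13, 1.44),
}


def avg_laps_per_episode(total_laps, episodes=EPISODES):
    """Assumed definition: total laps divided by the number of episodes."""
    return total_laps / episodes


def mismatches(table=TABLE, tolerance=0.005):
    """Return models whose reported average differs by more than rounding."""
    return [
        model
        for model, (total, reported) in table.items()
        if abs(avg_laps_per_episode(total) - reported) > tolerance
    ]


if __name__ == "__main__":
    print(mismatches())
```

An empty list means every reported average agrees with the assumed definition to two decimal places.
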
## Recommendation
Best overall candidate:

- `models/exp14-mountain-v5-finetune/checkpoint_0036000.zip`

Reason:

- 9/9 successful episodes
- 25 total laps across 9 episodes
- mean lap 27.93 s
- best lap 26.16 s
- more robust than the original exp14 champion (7/9 successful episodes, 16 laps) and the later finetune checkpoints `ft_042k` (8/9, 17 laps) and `ft_048k` (6/9, 13 laps)

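
One way to make the selection criteria explicit is to rank candidates by successful episodes first, then total laps, then mean lap time. The ordering of the criteria is an assumption made for illustration, not a rule stated by the evaluation, and `rank` is a hypothetical helper:

```python
# Rows copied from the table: (model, successful episodes, total laps, mean lap in s).
CANDIDATES = [
    ("exp14_base", 7, 16, 29.24),
    ("ft_006k", 1, 1, 21.36),
    ("ft_024k", 4, 5, 21.58),
    ("ft_030k", 1, 2, 21.53),
    ("ft_036k", 9, 25, 27.93),
    ("ft_042k", 8, 17, 29.25),
    ("ft_048k", 6, 13, 31.15),
]


def rank(candidates):
    """Sort by robustness first, speed last (assumed priority order)."""

    def key(row):
        _model, successes, total_laps, mean_lap = row
        # More successes and more laps are better; a lower mean lap is better.
        return (successes, total_laps, -mean_lap)

    return [row[0] for row in sorted(candidates, key=key, reverse=True)]


if __name__ == "__main__":
    print(rank(CANDIDATES))
```

Under this ordering `ft_036k` ranks first, matching the recommendation; a speed-first ordering would instead favor the fragile early checkpoints.
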
## Raw result file
- `agent/outerloop-results/mountain_candidate_eval_2026-04-19.jsonl`
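
The schema of this file is not documented here. As a sketch only, assuming one JSON object per episode with hypothetical fields `model`, `lap_times` (a list of lap times in seconds), and `steps`, the lap and step columns of the table could be recomputed like this:

```python
import json
from collections import defaultdict
from statistics import mean


def summarize(lines):
    """Aggregate per-episode JSONL records into per-model summary rows.

    The field names "model", "lap_times", and "steps" are assumptions;
    adjust them to whatever the real result file contains.
    """
    episodes = defaultdict(list)
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        record = json.loads(line)
        episodes[record["model"]].append(record)

    summary = {}
    for model, eps in episodes.items():
        laps = [t for ep in eps for t in ep["lap_times"]]
        summary[model] = {
            "episodes": len(eps),
            "total_laps": len(laps),
            "avg_laps_per_ep": len(laps) / len(eps),
            "mean_lap": mean(laps) if laps else None,
            "best_lap": min(laps) if laps else None,
            "avg_steps": mean(ep["steps"] for ep in eps),
        }
    return summary
```

Passing an open file handle for the JSONL file to `summarize` would process it line by line. The `success eps` and `full 2k eps` columns are left out because their exact definitions are not given here.
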