# Mountain candidate checkpoint evaluation (2026-04-19)

Deterministic eval on `mountain_track`, 9 episodes per model, max 2000 steps/episode.

| model | floor | success eps | full 2k eps | avg laps/ep | total laps | mean lap (all, s) | best lap (s) | avg steps | notes |
|---|---:|---:|---:|---:|---:|---:|---:|---:|---|
| exp14_base | 0.2 | 7/9 | 3/9 | 1.78 | 16 | 29.24 | 27.02 | 1332 | original champion |
| ft_006k | 0.4 | 1/9 | 0/9 | 0.11 | 1 | 21.36 | 21.36 | 335 | very fast when it works, extremely fragile |
| ft_024k | 0.4 | 4/9 | 0/9 | 0.56 | 5 | 21.58 | 20.53 | 575 | fast but fragile |
| ft_030k | 0.4 | 1/9 | 0/9 | 0.22 | 2 | 21.53 | 20.72 | 317 | very fast but extremely fragile |
| ft_036k | 0.2 | 9/9 | 6/9 | 2.78 | 25 | 27.93 | 26.16 | 1841 | best balance: fastest robust candidate |
| ft_042k | 0.2 | 8/9 | 4/9 | 1.89 | 17 | 29.25 | 27.09 | 1404 | decent, but worse than 36k |
| ft_048k | 0.2 | 6/9 | 3/9 | 1.44 | 13 | 31.15 | 28.31 | 1127 | degraded |

## Recommendation

Best overall candidate:
- `models/exp14-mountain-v5-finetune/checkpoint_0036000.zip`

Reason:
- 9/9 successful episodes
- 25 total laps across 9 episodes
- mean lap 27.93s
- best lap 26.16s
- clearly more robust than the original exp14 champion and later finetune checkpoints

## Raw result file

- `agent/outerloop-results/mountain_candidate_eval_2026-04-19.jsonl`
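
As a sanity check on the recommendation, the ranking can be reproduced from the table values alone. The sketch below is illustrative, not the script that produced this report. The `Row` type, the `rank` and `avg_laps` helpers, and the sort order (successful episodes first, then total laps, then mean lap time) are assumptions chosen to mirror the "robust first, then fast" reasoning above. The numbers are copied from the table and nothing is read from the JSONL file, whose schema is not described here.

```python
from typing import List, NamedTuple

EPISODES = 9  # episodes per model, as stated in this report


class Row(NamedTuple):
    """One table row (hypothetical container, fields copied from the table)."""
    model: str
    success_eps: int   # successful episodes out of 9
    full_eps: int      # episodes that ran the full 2000 steps
    total_laps: int
    mean_lap_s: float  # mean lap time over all laps, in seconds


ROWS = [
    Row("exp14_base", 7, 3, 16, 29.24),
    Row("ft_006k", 1, 0, 1, 21.36),
    Row("ft_024k", 4, 0, 5, 21.58),
    Row("ft_030k", 1, 0, 2, 21.53),
    Row("ft_036k", 9, 6, 25, 27.93),
    Row("ft_042k", 8, 4, 17, 29.25),
    Row("ft_048k", 6, 3, 13, 31.15),
]


def avg_laps(row: Row) -> float:
    """Average laps per episode, rounded the same way as the table."""
    return round(row.total_laps / EPISODES, 2)


def rank(rows: List[Row]) -> List[Row]:
    """Assumed ordering: most successful episodes, then most laps, then fastest mean lap."""
    return sorted(rows, key=lambda r: (-r.success_eps, -r.total_laps, r.mean_lap_s))


if __name__ == "__main__":
    for r in rank(ROWS):
        print(f"{r.model:<12} success={r.success_eps}/{EPISODES} "
              f"laps={r.total_laps} avg_laps={avg_laps(r)} mean_lap={r.mean_lap_s}s")
```

Under this ordering `ft_036k` ranks first and `ft_042k` second. The fast but fragile `ft_006k`, `ft_024k`, and `ft_030k` checkpoints fall to the bottom despite their shorter lap times, because lap time only breaks ties after robustness.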