Mountain candidate checkpoint evaluation — 2026-04-19

Deterministic eval on mountain_track, 9 episodes per model, max 2000 steps/episode.

model	floor	success eps	full 2000-step eps	avg laps/ep	total laps	mean lap (all, s)	best lap (s)	avg steps	notes
exp14_base	0.2	7/9	3/9	1.78	16	29.24	27.02	1332	original champion
ft_006k	0.4	1/9	0/9	0.11	1	21.36	21.36	335	very fast when it works, extremely fragile
ft_024k	0.4	4/9	0/9	0.56	5	21.58	20.53	575	fast but fragile
ft_030k	0.4	1/9	0/9	0.22	2	21.53	20.72	317	very fast but extremely fragile
ft_036k	0.2	9/9	6/9	2.78	25	27.93	26.16	1841	best balance: fastest robust candidate
ft_042k	0.2	8/9	4/9	1.89	17	29.25	27.09	1404	decent, but worse than 36k
ft_048k	0.2	6/9	3/9	1.44	13	31.15	28.31	1127	degraded
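
The avg laps/ep column is total laps divided by the 9 episodes. The following is a minimal sketch that recomputes that column and ranks the candidates from the table values. The ranking rule (successful episodes first, then full-length episodes, then mean lap time) is an assumption made here for illustration, not a documented selection procedure, and the names `RESULTS`, `rank_key` and `best_candidate` are hypothetical.

```python
EPISODES = 9  # episodes per model, as stated in this report

# (model, successful episodes, full 2000-step episodes, total laps, mean lap in s)
# Values copied from the table above.
RESULTS = [
    ("exp14_base", 7, 3, 16, 29.24),
    ("ft_006k", 1, 0, 1, 21.36),
    ("ft_024k", 4, 0, 5, 21.58),
    ("ft_030k", 1, 0, 2, 21.53),
    ("ft_036k", 9, 6, 25, 27.93),
    ("ft_042k", 8, 4, 17, 29.25),
    ("ft_048k", 6, 3, 13, 31.15),
]


def avg_laps_per_episode(total_laps, episodes=EPISODES):
    """Total laps divided by episode count, rounded as in the table."""
    return round(total_laps / episodes, 2)


def rank_key(row):
    """Assumed ordering: more successes, then more full episodes, then lower mean lap."""
    _, success, full, _, mean_lap = row
    return (-success, -full, mean_lap)


def best_candidate(results=RESULTS):
    """Return the model name that sorts first under rank_key."""
    return min(results, key=rank_key)[0]


AVG_LAPS = {row[0]: avg_laps_per_episode(row[3]) for row in RESULTS}
BEST = best_candidate()
```

Ranking on robustness before speed keeps the fast but fragile checkpoints (ft_006k, ft_024k, ft_030k) from winning on lap time alone.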

Recommendation

Best overall candidate:

models/exp14-mountain-v5-finetune/checkpoint_0036000.zip

Reason:

9/9 successful episodes
25 total laps across 9 episodes
mean lap 27.93s
best lap 26.16s
more robust than the original exp14 champion (7/9 successful episodes, 16 laps) and the later finetune checkpoints ft_042k (8/9, 17 laps) and ft_048k (6/9, 13 laps)

Raw result file

agent/outerloop-results/mountain_candidate_eval_2026-04-19.jsonl
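
A sketch of how the per-model columns could be aggregated from a per-episode JSONL file. The schema of the raw result file is not described in this report, so the field names `model`, `steps` and `lap_times` are hypothetical. The definitions used here (a successful episode has at least one completed lap, a full episode reaches the 2000-step cap) are also assumptions.

```python
import json
from collections import defaultdict

MAX_STEPS = 2000  # per-episode step cap stated in this report


def summarize(lines, max_steps=MAX_STEPS):
    """Aggregate per-episode JSON records into per-model statistics.

    Assumes one JSON object per line with hypothetical fields:
    "model" (str), "steps" (int), "lap_times" (list of seconds).
    """
    per_model = defaultdict(list)
    for line in lines:
        line = line.strip()
        if line:  # skip blank lines
            record = json.loads(line)
            per_model[record["model"]].append(record)

    summary = {}
    for model, episodes in per_model.items():
        laps = [t for ep in episodes for t in ep["lap_times"]]
        summary[model] = {
            "episodes": len(episodes),
            # Assumed definition: at least one completed lap.
            "success_eps": sum(1 for ep in episodes if ep["lap_times"]),
            # Assumed definition: episode ran to the step cap.
            "full_eps": sum(1 for ep in episodes if ep["steps"] >= max_steps),
            "total_laps": len(laps),
            "mean_lap": sum(laps) / len(laps) if laps else None,
            "best_lap": min(laps) if laps else None,
            "avg_steps": sum(ep["steps"] for ep in episodes) / len(episodes),
        }
    return summary
```

If the file matches the assumed schema, passing an open file handle for the path above to `summarize` would yield one entry per model. Adjust the field names to whatever the file actually contains.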