Mountain candidate checkpoint evaluation — 2026-04-19

Deterministic eval on mountain_track, 9 episodes per model, max 2000 steps/episode.

model       floor  success eps  full 2k eps  avg laps/ep  total laps  mean lap, all (s)  best lap (s)  avg steps  notes
exp14_base  0.2    7/9          3/9          1.78         16          29.24              27.02         1332       original champion
ft_006k     0.4    1/9          0/9          0.11         1           21.36              21.36         335        very fast when it works, extremely fragile
ft_024k     0.4    4/9          0/9          0.56         5           21.58              20.53         575        fast but fragile
ft_030k     0.4    1/9          0/9          0.22         2           21.53              20.72         317        very fast but extremely fragile
ft_036k     0.2    9/9          6/9          2.78         25          27.93              26.16         1841       best balance: fastest robust candidate
ft_042k     0.2    8/9          4/9          1.89         17          29.25              27.09         1404       decent, but worse than 36k
ft_048k     0.2    6/9          3/9          1.44         13          31.15              28.31         1127       degraded
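
The relationship between the "total laps" and "avg laps/ep" columns can be checked with a short sketch. It assumes that avg laps/ep is total laps divided by all 9 episodes (including unsuccessful ones), which matches the reported numbers; the function name is illustrative and not taken from the evaluation code.

```python
# Values copied from the table above: (model, total laps, reported avg laps/ep).
EPISODES = 9  # episodes per model, as stated in the report

ROWS = [
    ("exp14_base", 16, 1.78),
    ("ft_006k", 1, 0.11),
    ("ft_024k", 5, 0.56),
    ("ft_030k", 2, 0.22),
    ("ft_036k", 25, 2.78),
    ("ft_042k", 17, 1.89),
    ("ft_048k", 13, 1.44),
]


def avg_laps_per_episode(total_laps: int, episodes: int = EPISODES) -> float:
    """Assumed definition: laps averaged over every episode, failed ones included."""
    return total_laps / episodes


# Each reported value should agree with the assumed definition to two decimals.
for model, total_laps, reported in ROWS:
    computed = avg_laps_per_episode(total_laps)
    assert abs(computed - reported) < 0.005, (model, computed, reported)
```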

Recommendation

Best overall candidate:

  • models/exp14-mountain-v5-finetune/checkpoint_0036000.zip (ft_036k in the table)

Reason:

  • 9/9 successful episodes
  • 25 total laps across 9 episodes
  • mean lap 27.93s
  • best lap 26.16s
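  • more robust than both the original exp14 champion (7/9 successful episodes) and the later finetune checkpoints ft_042k (8/9) and ft_048k (6/9), while also lapping faster on average than any of them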
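
One way to make the "robustness first, then speed" reasoning explicit is to sort the candidates by successful episodes, then total laps, then mean lap time. This is only a sketch of a possible ranking rule, not the procedure actually used for the recommendation; the `Candidate` type and the sort key are assumptions.

```python
from typing import NamedTuple


class Candidate(NamedTuple):
    model: str
    success_eps: int   # successful episodes out of 9
    total_laps: int
    mean_lap_s: float  # mean lap time in seconds


# Values copied from the results table.
CANDIDATES = [
    Candidate("exp14_base", 7, 16, 29.24),
    Candidate("ft_006k", 1, 1, 21.36),
    Candidate("ft_024k", 4, 5, 21.58),
    Candidate("ft_030k", 1, 2, 21.53),
    Candidate("ft_036k", 9, 25, 27.93),
    Candidate("ft_042k", 8, 17, 29.25),
    Candidate("ft_048k", 6, 13, 31.15),
]


def rank(candidates):
    """Hypothetical rule: most successes, then most laps, then lowest mean lap."""
    return sorted(
        candidates,
        key=lambda c: (-c.success_eps, -c.total_laps, c.mean_lap_s),
    )


ranked = rank(CANDIDATES)
for position, c in enumerate(ranked, start=1):
    print(position, c.model, f"{c.success_eps}/9", c.total_laps, c.mean_lap_s)
```

Under this rule ft_036k comes first, and the fast but fragile checkpoints (ft_006k, ft_024k, ft_030k) fall to the bottom despite their lower lap times.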

Raw result file

  • agent/outerloop-results/mountain_candidate_eval_2026-04-19.jsonl
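
The raw file can be re-aggregated with a few lines of standard-library Python. The sketch below assumes one JSON object per line with `model` and `laps` keys; the actual schema of the .jsonl file is not shown here, so both key names are hypothetical and would need to be adjusted.

```python
import json
from collections import defaultdict


def total_laps_by_model(lines):
    """Sum laps per model from an iterable of JSONL lines.

    The "model" and "laps" keys are assumed, not taken from the real file.
    """
    totals = defaultdict(int)
    for line in lines:
        line = line.strip()
        if not line:  # tolerate blank lines
            continue
        record = json.loads(line)
        totals[record["model"]] += record["laps"]
    return dict(totals)


# Illustrative records only; these are not rows from the real result file.
example = [
    '{"model": "ckpt_a", "laps": 2}',
    "",
    '{"model": "ckpt_a", "laps": 1}',
    '{"model": "ckpt_b", "laps": 0}',
]
print(total_laps_by_model(example))
```

To run it on the real file, pass an open file handle instead of the example list, for instance `with open(path) as f: total_laps_by_model(f)`.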