# Mountain candidate checkpoint evaluation — 2026-04-19

Deterministic evaluation on `mountain_track`: 9 episodes per model, capped at 2000 steps per episode. Lap times are in seconds.
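The protocol above can be sketched as a plain evaluation loop. This is only an illustration, not the harness that produced these numbers: the Gymnasium-style `reset`/`step` interface and the names `Env`, `run_episode`, and `evaluate` are assumptions, and lap counting is omitted because the report does not say how laps are detected.

```python
from typing import Any, Callable, Protocol, Tuple

MAX_STEPS = 2000   # per-episode step cap stated in the report
N_EPISODES = 9     # episodes per model stated in the report


class Env(Protocol):
    # Assumed Gymnasium-style interface; the real simulator wrapper may differ.
    def reset(self) -> Tuple[Any, dict]: ...
    def step(self, action: Any) -> Tuple[Any, float, bool, bool, dict]: ...


def run_episode(env: Env, act: Callable[[Any], Any], max_steps: int = MAX_STEPS) -> dict:
    """Run one episode and report how many steps it lasted."""
    obs, _info = env.reset()
    steps = 0
    while steps < max_steps:
        obs, _reward, terminated, truncated, _info = env.step(act(obs))
        steps += 1
        if terminated or truncated:
            break
    # An episode counts as full-length when it reaches the step cap.
    return {"steps": steps, "full_length": steps >= max_steps}


def evaluate(env: Env, act: Callable[[Any], Any],
             n_episodes: int = N_EPISODES, max_steps: int = MAX_STEPS) -> list:
    """Run a fixed number of episodes with the same policy."""
    return [run_episode(env, act, max_steps) for _ in range(n_episodes)]
```

If the checkpoints are Stable-Baselines3 models (the `.zip` extension suggests this, but the report does not confirm it), `act` could be `lambda obs: model.predict(obs, deterministic=True)[0]`, which makes the policy return its deterministic action instead of sampling one.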

| model | floor | successful eps | full 2000-step eps | avg laps/ep | total laps | mean lap, all laps (s) | best lap (s) | avg steps | notes |
|---|---:|---:|---:|---:|---:|---:|---:|---:|---|
| exp14_base | 0.2 | 7/9 | 3/9 | 1.78 | 16 | 29.24 | 27.02 | 1332 | original champion |
| ft_006k | 0.4 | 1/9 | 0/9 | 0.11 | 1 | 21.36 | 21.36 | 335 | very fast when it works, extremely fragile |
| ft_024k | 0.4 | 4/9 | 0/9 | 0.56 | 5 | 21.58 | 20.53 | 575 | fast but fragile |
| ft_030k | 0.4 | 1/9 | 0/9 | 0.22 | 2 | 21.53 | 20.72 | 317 | very fast but extremely fragile |
| ft_036k | 0.2 | 9/9 | 6/9 | 2.78 | 25 | 27.93 | 26.16 | 1841 | best balance: fastest robust candidate |
| ft_042k | 0.2 | 8/9 | 4/9 | 1.89 | 17 | 29.25 | 27.09 | 1404 | decent, but worse than 36k |
| ft_048k | 0.2 | 6/9 | 3/9 | 1.44 | 13 | 31.15 | 28.31 | 1127 | degraded |
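The `avg laps/ep` column appears to be `total laps` divided by the 9 episodes and rounded to two decimals. A short check, with the values copied from the table and the helper name `avg_laps` chosen only for this sketch:

```python
EPISODES = 9

# model -> (total laps, reported avg laps/ep), copied from the table above
ROWS = {
    "exp14_base": (16, 1.78),
    "ft_006k": (1, 0.11),
    "ft_024k": (5, 0.56),
    "ft_030k": (2, 0.22),
    "ft_036k": (25, 2.78),
    "ft_042k": (17, 1.89),
    "ft_048k": (13, 1.44),
}


def avg_laps(total_laps: int, episodes: int = EPISODES) -> float:
    """Average laps per episode, rounded as in the table."""
    return round(total_laps / episodes, 2)


# Collect rows whose reported average disagrees with the recomputed one.
mismatches = {
    model: (avg_laps(total), reported)
    for model, (total, reported) in ROWS.items()
    if avg_laps(total) != reported
}
print(mismatches or "all rows consistent")  # prints "all rows consistent"
```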

## Recommendation

Best overall candidate:
- `models/exp14-mountain-v5-finetune/checkpoint_0036000.zip`

Reason:
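- 9/9 successful episodes, the only model with no failed episode
- 25 total laps across 9 episodes, the highest of any model
- mean lap 27.93 s
- best lap 26.16 s
- more robust and faster on average than the original `exp14_base` champion (7/9 successful, mean lap 29.24 s) and the later finetune checkpoints `ft_042k` (8/9, 29.25 s) and `ft_048k` (6/9, 31.15 s)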
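One way to make "fastest robust candidate" explicit is to filter on a minimum number of successful episodes and then take the lowest mean lap. The threshold `MIN_SUCCESS = 8` and the helper `pick` are hypothetical and not taken from the evaluation code; the numbers are copied from the table.

```python
from typing import NamedTuple, Optional, Sequence


class Row(NamedTuple):
    model: str
    success: int        # successful episodes out of 9
    mean_lap_s: float   # mean lap time in seconds


ROWS = [
    Row("exp14_base", 7, 29.24),
    Row("ft_006k", 1, 21.36),
    Row("ft_024k", 4, 21.58),
    Row("ft_030k", 1, 21.53),
    Row("ft_036k", 9, 27.93),
    Row("ft_042k", 8, 29.25),
    Row("ft_048k", 6, 31.15),
]

MIN_SUCCESS = 8  # hypothetical robustness bar, not stated in the report


def pick(rows: Sequence[Row], min_success: int = MIN_SUCCESS) -> Optional[Row]:
    """Return the fastest model among those meeting the robustness bar."""
    robust = [r for r in rows if r.success >= min_success]
    return min(robust, key=lambda r: r.mean_lap_s) if robust else None


print(pick(ROWS).model)  # prints "ft_036k"
```

The result depends on the bar: with `min_success=4` the same rule would select `ft_024k`, which is faster but fails in 5 of 9 episodes, so the threshold should be chosen deliberately.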

## Raw result file

- `agent/outerloop-results/mountain_candidate_eval_2026-04-19.jsonl`
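A minimal sketch of how the table columns could be recomputed from the raw file, assuming one JSON object per episode. The field names `model`, `lap_times`, and `steps` are hypothetical, as is the rule that an episode is successful when it completes at least one lap; check both against the actual JSONL records before relying on this.

```python
import json
from collections import defaultdict
from typing import Iterable


def summarize(lines: Iterable[str]) -> dict:
    """Aggregate per-episode JSONL records into per-model summary rows."""
    per_model = defaultdict(list)
    for line in lines:
        line = line.strip()
        if line:  # skip blank lines
            rec = json.loads(line)
            per_model[rec["model"]].append(rec)  # hypothetical field name

    summary = {}
    for model, eps in per_model.items():
        # Pool lap times over all episodes of this model.
        lap_times = [t for ep in eps for t in ep["lap_times"]]
        summary[model] = {
            "episodes": len(eps),
            # Assumption: success means at least one completed lap.
            "success_eps": sum(1 for ep in eps if ep["lap_times"]),
            "total_laps": len(lap_times),
            "mean_lap": round(sum(lap_times) / len(lap_times), 2) if lap_times else None,
            "best_lap": min(lap_times) if lap_times else None,
            "avg_steps": round(sum(ep["steps"] for ep in eps) / len(eps)),
        }
    return summary
```

Because a file object iterates over its lines, usage would look like `with open(path) as f: rows = summarize(f)`.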