Commit Graph

1 Commit

Author SHA1 Message Date
Paul Huliganga 0615b22cb9 feat(eval): cross-model evaluation scripts for exp24/25/26 + gentrack→minimonaco
eval_best_models.py: evaluates exp24/25/26 best models across 10 fixed random
roads (regen_road with fixed seeds) for fair head-to-head comparison.
eval_gentrack_on_minimonaco.py: zero-shot evaluation of gentrack specialists
(exp13, wave5-gentrack-only, wave4-trial-0009) on mini-monaco.

Results: exp26 > exp25 > exp24 on random roads.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-14 15:32:21 -04:00
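
The fixed-seed comparison described in the commit message can be illustrated with a minimal sketch. This is not the code from eval_best_models.py, which is not shown here. The names `make_road`, `evaluate`, `compare`, `ROAD_SEEDS`, the toy policies, and the scoring rule are all hypothetical stand-ins, and the actual seeds and the signature of `regen_road` are unknown. The sketch only shows the idea that every model is scored on the same set of seeded roads, so that differences in score come from the models rather than from the roads.

```python
import random
from statistics import mean

# Hypothetical seeds, one per road. The commit does not list the real ones.
ROAD_SEEDS = list(range(10))


def make_road(seed):
    """Stand-in for seeded road generation: returns a list of curvature values."""
    rng = random.Random(seed)  # private RNG, so the global random state is untouched
    return [rng.uniform(-1.0, 1.0) for _ in range(50)]


def evaluate(policy, road):
    """Stand-in score: negative mean absolute steering error (higher is better)."""
    return -mean(abs(policy(c) - c) for c in road)


def compare(policies, seeds=ROAD_SEEDS):
    """Score every policy on the same seeded roads and return mean scores by name."""
    roads = [make_road(s) for s in seeds]  # built once, shared by all policies
    return {
        name: mean(evaluate(policy, road) for road in roads)
        for name, policy in policies.items()
    }


# Toy policies standing in for trained models (assumption, for illustration only).
policies = {
    "model_a": lambda c: 0.9 * c,
    "model_b": lambda c: 0.5 * c,
}

scores = compare(policies)
ranking = sorted(scores, key=scores.get, reverse=True)
print(" > ".join(ranking))
```

Because each road is built from its own seeded `random.Random` instance, rerunning `compare` reproduces the same roads and therefore the same scores, which is what makes a head-to-head ranking meaningful.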