eval_best_models.py: evaluates the exp24/25/26 best models on the same 10
random roads (regen_road with fixed seeds) for a fair head-to-head
comparison.
eval_gentrack_on_minimonaco.py: zero-shot evaluation of gentrack specialists
(exp13, wave5-gentrack-only, wave4-trial-0009) on mini-monaco.
Results: exp26 > exp25 > exp24 on random roads.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>