CRITICAL BUG FIX - Reward Hacking:
- Old formula: speed × (1 - cte/max_cte) could be maximized by oscillating at track boundary regardless of on-track behavior (trials 8+13 hit 1936+1139)
- New formula: original_reward × (1 + speed_scale × speed) ONLY when on_track
- Off-track (original_reward ≤ 0) → zero speed bonus → cannot be hacked
- Verified hack-proof: 9 new targeted tests including test_cannot_hack_by_going_fast_off_track

Reward Hacking Auto-Detection:
- check_for_reward_hacking() flags results with >3.0 reward/step as suspected hacking
- Flagged results are excluded from GP fitting (won't optimize toward hacking params)
- reward_hacking_suspected field added to JSONL result records

Research Documentation:
- docs/RESEARCH_LOG.md created: full chronological research history
  - Random policy bug discovery and impact
  - Throttle clamp fix
  - Reward hacking discovery with evidence table
  - Hack-proof design rationale
  - Lessons learned + future research questions
- Archived corrupted Phase 1 data: autoresearch_results_phase1_CORRUPTED_reward_hacking.jsonl
- Archived hacked models: models/ARCHIVED_reward_hacking/

Clean start: autoresearch_results_phase1.jsonl reset, models/champion reset

Agent: pi/claude-sonnet
Tests: 40/40 passing
Tests-Added: +9 (reward wrapper hack-proof tests)
TypeScript: N/A

| Name |
|---|
| ARCHIVED_reward_hacking/champion_hacked |
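
The new reward formula and the auto-detection rule from the commit message can be sketched roughly as follows. This is a minimal illustration, not the repository's actual code. The function signatures, the default `speed_scale` value, and the helper `results_for_gp_fit` are assumptions made for the example. Only the formula, the 3.0 reward/step threshold, the name `check_for_reward_hacking`, and the `reward_hacking_suspected` field come from the commit message.

```python
from typing import Any, Dict, List

# Assumed default: the commit message names speed_scale but not its value.
SPEED_SCALE = 0.1
# From the commit message: more than 3.0 reward per step is suspicious.
HACK_THRESHOLD = 3.0


def shaped_reward(original_reward: float, speed: float,
                  speed_scale: float = SPEED_SCALE) -> float:
    """Apply the speed bonus only while the car is on track.

    Off-track steps are identified by original_reward <= 0 and are
    returned unchanged, so speed alone cannot raise the reward.
    """
    if original_reward <= 0:
        return original_reward
    return original_reward * (1.0 + speed_scale * speed)


def check_for_reward_hacking(total_reward: float, n_steps: int,
                             threshold: float = HACK_THRESHOLD) -> bool:
    """Flag a result whose mean reward per step exceeds the threshold.

    The (total_reward, n_steps) signature is hypothetical.
    """
    if n_steps <= 0:
        return False  # nothing to judge on an empty episode
    return (total_reward / n_steps) > threshold


def results_for_gp_fit(results: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Hypothetical helper: drop flagged records before fitting the GP."""
    return [r for r in results
            if not r.get("reward_hacking_suspected", False)]
```

Returning the original reward unchanged for off-track steps is what makes the speed term unexploitable here: the bonus is a multiplier on a positive on-track reward, so it has nothing to scale once the base reward is zero or negative.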