donkeycar-rl-autoresearch/agent/models/ARCHIVED_reward_hacking/champion_hacked
Paul Huliganga 5e93dae316 fix: hack-proof reward shaping + reward hacking detection + research log
CRITICAL BUG FIX — Reward Hacking:
- Old formula: speed × (1 - cte/max_cte) could be maximized by oscillating
  at track boundary regardless of on-track behavior (trials 8+13 hit 1936+1139)
- New formula: original_reward × (1 + speed_scale × speed) ONLY when on_track
- Off-track (original_reward ≤ 0) → zero speed bonus → cannot be hacked
- Verified hack-proof: 9 new targeted tests including test_cannot_hack_by_going_fast_off_track

Reward Hacking Auto-Detection:
- check_for_reward_hacking() flags results with >3.0 reward/step as suspected hacking
- Flagged results are excluded from GP fitting (won't optimize toward hacking params)
- reward_hacking_suspected field added to JSONL result records

Research Documentation:
- docs/RESEARCH_LOG.md created: full chronological research history
  - Random policy bug discovery and impact
  - Throttle clamp fix
  - Reward hacking discovery with evidence table
  - Hack-proof design rationale
  - Lessons learned + future research questions
- Archived corrupted Phase 1 data: autoresearch_results_phase1_CORRUPTED_reward_hacking.jsonl
- Archived hacked models: models/ARCHIVED_reward_hacking/

Clean start: autoresearch_results_phase1.jsonl reset, models/champion reset

Agent: pi/claude-sonnet
Tests: 40/40 passing
Tests-Added: +9 (reward wrapper hack-proof tests)
TypeScript: N/A
2026-04-13 12:27:48 -04:00
..
manifest.json fix: hack-proof reward shaping + reward hacking detection + research log 2026-04-13 12:27:48 -04:00