Paul Huliganga
caf91c9fe6
autoresearch: phase1 trial 10 results
...
Agent: pi
Tests: N/A
Tests-Added: 0
TypeScript: N/A
2026-04-13 16:00:23 -04:00
Paul Huliganga
87cff0c9b7
autoresearch: phase1 trial 40 results
...
Agent: pi
Tests: N/A
Tests-Added: 0
TypeScript: N/A
2026-04-13 15:28:05 -04:00
Paul Huliganga
1734e1359e
autoresearch: phase1 trial 30 results
...
Agent: pi
Tests: N/A
Tests-Added: 0
TypeScript: N/A
2026-04-13 15:13:21 -04:00
Paul Huliganga
362c616457
autoresearch: phase1 trial 20 results
...
Agent: pi
Tests: N/A
Tests-Added: 0
TypeScript: N/A
2026-04-13 14:41:55 -04:00
Paul Huliganga
cdb7b80494
autoresearch: phase1 trial 10 results
...
Agent: pi
Tests: N/A
Tests-Added: 0
TypeScript: N/A
2026-04-13 14:07:58 -04:00
Paul Huliganga
fcb6ea1ac2
fix: path-efficiency reward (v3) defeats circular driving exploit
...
CRITICAL BUG FIX — Circular Driving:
- v2 reward still hackable: car circles at starting line with low CTE + positive speed
- Confirmed in data: trial 5 mean_reward=4582, cv=0.0% (physically impossible for genuine driving)
- Statistical signature: cv <1% with high reward = consistent exploit, not genuine driving
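A minimal sketch of the cv% check described above, assuming cv is the population standard deviation of per-episode rewards divided by their mean. The function names and the `high_reward` cutoff are hypothetical illustrations, not values taken from the repository.

```python
import statistics


def coefficient_of_variation_pct(rewards):
    """Return cv in percent, or None when it cannot be computed."""
    if len(rewards) < 2:
        return None
    mean = statistics.fmean(rewards)
    if mean == 0:
        return None
    return 100.0 * statistics.pstdev(rewards) / abs(mean)


def looks_like_consistent_exploit(rewards, cv_threshold_pct=1.0, high_reward=1000.0):
    """Flag runs whose rewards are both high and suspiciously uniform.

    cv_threshold_pct follows the "<1%" rule of thumb in the message;
    high_reward is an assumed cutoff for "high reward".
    """
    cv = coefficient_of_variation_pct(rewards)
    if cv is None:
        return False
    return cv < cv_threshold_pct and statistics.fmean(rewards) >= high_reward
```

Identical episode rewards such as three evaluations of 4582 give cv = 0.0% and are flagged, while widely varying rewards are not.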
ROOT CAUSE: Neither CTE nor raw speed can distinguish forward vs circular motion.
Both have: low CTE (on centerline) + positive speed (moving) = same reward.
Missing dimension: TRACK PROGRESS (net advance along track)
FIX — Path Efficiency Reward (v3):
efficiency = net_displacement / total_path_length (sliding window of 30 steps)
shaped = original × (1 + speed_scale × speed × efficiency)
- Forward driving: efficiency ≈ 1.0 → full speed bonus
- Circular driving: efficiency ≈ 0.0 → speed bonus disappears
- Resists the circling exploit: circling returns the car to the same positions, so net_displacement stays low
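The v3 formula above can be sketched as follows. This is an illustration under stated assumptions, not the repository's wrapper: positions are taken to be 2D (x, z) points, `PathEfficiencyTracker` and `shaped_reward_v3` are hypothetical names, and the default `speed_scale` is a placeholder. Note that efficiency only approaches zero when the window covers roughly one full loop; a window spanning half a circle would still score about 0.64.

```python
import math
from collections import deque


class PathEfficiencyTracker:
    """Tracks net displacement / path length over a sliding window of steps."""

    def __init__(self, window=30):
        # window steps are bounded by window + 1 positions
        self.positions = deque(maxlen=window + 1)

    def update(self, x, z):
        self.positions.append((x, z))
        return self.efficiency()

    def efficiency(self):
        pts = list(self.positions)
        if len(pts) < 2:
            return 1.0  # assumption: no penalty before any motion is observed
        path = sum(math.dist(a, b) for a, b in zip(pts, pts[1:]))
        if path < 1e-9:
            return 0.0  # car has not moved inside the window
        return math.dist(pts[0], pts[-1]) / path


def shaped_reward_v3(original, speed, efficiency, speed_scale=0.1):
    # shaped = original × (1 + speed_scale × speed × efficiency)
    return original * (1.0 + speed_scale * speed * efficiency)
```

Driving in a straight line keeps efficiency at 1.0 and preserves the full speed bonus, while a loop that closes within the window drives efficiency, and with it the bonus, toward zero.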
Tests:
- test_efficiency_near_zero_for_circular_driving: confirmed <0.2 efficiency for circles
- test_efficiency_near_one_for_straight_driving: confirmed >0.90 for straight line
- test_straight_driving_gets_higher_reward_than_circular: KEY guarantee
- test_speed_bonus_disappears_when_circling: bonus suppressed after window fills
Research documentation:
- Full analysis with data table added to docs/RESEARCH_LOG.md
- cv% identified as reward hacking indicator
- Archived circular data + models
Clean start: new autoresearch_results_phase1.jsonl, new champion dir
Agent: pi/claude-sonnet
Tests: 40/40 passing
Tests-Added: +6 (path efficiency, anti-circular)
TypeScript: N/A
2026-04-13 13:36:17 -04:00
Paul Huliganga
d25bc71008
autoresearch: phase1 trial 10 results
...
Agent: pi
Tests: N/A
Tests-Added: 0
TypeScript: N/A
2026-04-13 13:11:06 -04:00
Paul Huliganga
5e93dae316
fix: hack-proof reward shaping + reward hacking detection + research log
...
CRITICAL BUG FIX — Reward Hacking:
- Old formula: speed × (1 - cte/max_cte) could be maximized by oscillating
at the track boundary regardless of on-track behavior (trials 8 and 13 hit 1936 and 1139)
- New formula: original_reward × (1 + speed_scale × speed) ONLY when on_track
- Off-track (original_reward ≤ 0) → zero speed bonus → cannot be hacked
- Verified hack-proof: 9 new targeted tests including test_cannot_hack_by_going_fast_off_track
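A minimal sketch of the v2 rule above, assuming the simulator's original reward is positive on track and non-positive off track. The function name and the default `speed_scale` are hypothetical.

```python
def shaped_reward_v2(original_reward, speed, speed_scale=0.1):
    """Apply the speed bonus only while the car is on track."""
    if original_reward <= 0:
        # Off track: pass the reward through unchanged, with no speed bonus.
        return original_reward
    return original_reward * (1.0 + speed_scale * speed)
```

Because the bonus multiplies a positive on-track reward, going fast off track cannot raise the shaped reward above the unshaped one.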
Reward Hacking Auto-Detection:
- check_for_reward_hacking() flags results with >3.0 reward/step as suspected hacking
- Flagged results are excluded from GP fitting (won't optimize toward hacking params)
- reward_hacking_suspected field added to JSONL result records
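The detection and filtering step above might look roughly like this. Only the function name `check_for_reward_hacking`, the 3.0 reward/step threshold, and the `reward_hacking_suspected` field come from the message; the signature, the `mean_reward` and `mean_steps` record keys, and the `annotate_and_filter` helper are assumptions made for illustration.

```python
def check_for_reward_hacking(total_reward, steps, max_reward_per_step=3.0):
    """Return True when reward per step exceeds the plausibility threshold."""
    if steps <= 0:
        return False
    return total_reward / steps > max_reward_per_step


def annotate_and_filter(records):
    """Tag each record and return (all_records, records_usable_for_fitting)."""
    annotated, usable = [], []
    for rec in records:
        rec = dict(rec)  # copy so the caller's records are not mutated
        rec["reward_hacking_suspected"] = check_for_reward_hacking(
            rec["mean_reward"], rec["mean_steps"]  # hypothetical key names
        )
        annotated.append(rec)
        if not rec["reward_hacking_suspected"]:
            usable.append(rec)
    return annotated, usable
```

Every record keeps its flag when written back to JSONL, while only the unflagged ones would be passed on to the GP fit.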
Research Documentation:
- docs/RESEARCH_LOG.md created: full chronological research history
- Random policy bug discovery and impact
- Throttle clamp fix
- Reward hacking discovery with evidence table
- Hack-proof design rationale
- Lessons learned + future research questions
- Archived corrupted Phase 1 data: autoresearch_results_phase1_CORRUPTED_reward_hacking.jsonl
- Archived hacked models: models/ARCHIVED_reward_hacking/
Clean start: autoresearch_results_phase1.jsonl reset, models/champion reset
Agent: pi/claude-sonnet
Tests: 40/40 passing
Tests-Added: +9 (reward wrapper hack-proof tests)
TypeScript: N/A
2026-04-13 12:27:48 -04:00
Paul Huliganga
0c6263352b
autoresearch: phase1 trial 10 results
...
Agent: pi
Tests: N/A
Tests-Added: 0
TypeScript: N/A
2026-04-13 12:01:17 -04:00
Paul Huliganga
8c9fd76c68
fix: reduce timesteps to 1k-5k for Phase 1 CPU training; add sim health/stuck detection; fix PPO throttle clamp
...
Problems fixed:
- Timesteps 5k-30k caused all trials to time out (PPO+CNN+CPU needs ~0.1s/step)
- New range: 1000-5000 steps fits well within 480s timeout
- PPO's randomly initialized policy outputs throttle ≈ 0, so the car sits still; fixed with ThrottleClampWrapper (min 0.2)
- Sim stuck detection: if speed<0.02 for 100 consecutive steps, stop training and report error
- Sim frozen detection: if observation unchanged for 30 steps, stop training (connection lost)
- eval_episodes reduced to 3 to speed up evaluation phase
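A rough sketch of the throttle clamp and the stuck/frozen checks listed above, using the thresholds from the message (min throttle 0.2, speed < 0.02 for 100 steps, unchanged observation for 30 steps). `SimHealthMonitor` and `clamp_throttle` are hypothetical names, and the (steering, throttle) action layout is an assumption; the repository's ThrottleClampWrapper may differ.

```python
class SimHealthMonitor:
    """Reports 'stuck' or 'frozen' based on consecutive-step counters."""

    def __init__(self, min_speed=0.02, stuck_steps=100, frozen_steps=30):
        self.min_speed = min_speed
        self.stuck_steps = stuck_steps
        self.frozen_steps = frozen_steps
        self._slow_count = 0
        self._same_count = 0
        self._last_obs = None

    def update(self, speed, obs_key):
        """obs_key is any comparable snapshot, e.g. obs.tobytes() for an array."""
        self._slow_count = self._slow_count + 1 if speed < self.min_speed else 0
        if self._last_obs is not None and obs_key == self._last_obs:
            self._same_count += 1  # observation unchanged since previous step
        else:
            self._same_count = 0
        self._last_obs = obs_key
        if self._same_count >= self.frozen_steps:
            return "frozen"  # likely lost connection to the simulator
        if self._slow_count >= self.stuck_steps:
            return "stuck"
        return None


def clamp_throttle(action, min_throttle=0.2):
    """Raise throttle to at least min_throttle; action is (steering, throttle)."""
    steering, throttle = action
    return (steering, max(throttle, min_throttle))
```

A training loop would call `update` once per step and stop with an error as soon as it returns a non-None status.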
Agent: pi/claude-sonnet
Tests: 37/37 passing
Tests-Added: 0 (behaviour change only)
TypeScript: N/A
2026-04-13 11:17:08 -04:00