Commit Graph

13 Commits

Author SHA1 Message Date
Paul Huliganga 138c65270f feat(exp22): add solid-hit/wedge/high-CTE exploit fixes and generated-pair warm experiments
- reward_wrapper: detect barrier/wall/tree solid hits, terminate on head-on impact
  or 4 sustained solid-hit frames; prevents car wedging against invisible barriers
- reward_wrapper: add low-speed/wedge termination — kills episode when car is pinned
  motionless (below threshold, no displacement) after grace period
- reward_wrapper: high-CTE exploit fix — return -0.25 immediately when CTE >
  max_cte_terminate (not after patience), so PPO cannot collect positive speed
  rewards while driving the large outside-road circle
- tests: 23 passing unit tests covering all new termination paths
- exp20/21/22: add parallel DummyVecEnv experiments on generated_road+generated_track
  with warm-start from champion model; exp22 is current active run
- SESSION_HANDOFF.md: live handoff doc for next session continuity
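
A minimal sketch of the two termination rules listed above, assuming the wrapper receives a per-frame `solid_hit` flag, a `head_on` flag, and the current CTE. The names `SolidHitTracker` and `high_cte_check` are hypothetical, and the 4.0 default for `max_cte_terminate` is borrowed from an earlier commit rather than confirmed for this one.

```python
from dataclasses import dataclass


@dataclass
class SolidHitTracker:
    """Counts consecutive solid-hit frames (hypothetical helper, not the repo's class)."""
    max_frames: int = 4
    frames: int = 0

    def update(self, solid_hit: bool, head_on: bool = False) -> bool:
        """Return True when the episode should terminate."""
        # Any frame without a solid hit resets the sustained-hit counter.
        self.frames = self.frames + 1 if solid_hit else 0
        # A head-on impact ends the episode at once; otherwise wait for 4 frames.
        return (solid_hit and head_on) or self.frames >= self.max_frames


def high_cte_check(cte: float, max_cte_terminate: float = 4.0):
    """Return (reward_override, terminate); the override is None while CTE is acceptable."""
    if abs(cte) > max_cte_terminate:
        # Immediate penalty, with no patience window, so no speed reward is collected.
        return -0.25, True
    return None, False
```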

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-05 14:46:13 -04:00
Paul Huliganga 9ffe1c5d40 fix: efficiency gate now TERMINATES after 20 low-efficiency steps (was zero-reward only)
Previously circles ran 20+ seconds because the efficiency gate only returned
0 reward without terminating. After 20 consecutive steps of efficiency < 0.15
(~0.7 seconds at 27 steps/sec), episode now terminates with -1.0.

Also confirmed from telemetry diagnostic: CTE does report correctly when
car goes off-track (rises steadily to 6.2m before tree collision).
The grass exploit runs long only when the open grass area has no obstacles.
Efficiency gate termination is the most reliable catch for both circles
and open-grass driving. Straight-line driving on grass keeps efficiency
high and so passes the gate, but the active_node progress terminator
catches that case.
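
The termination rule described above could look roughly like this. It is a sketch under stated assumptions: `EfficiencyGate` and its `step` method are hypothetical names, and only the 0.15 threshold, the 20-step count, and the -1.0 penalty come from this message.

```python
class EfficiencyGate:
    """Zero reward while efficiency is low; terminate after `patience` low steps in a row."""

    def __init__(self, threshold=0.15, patience=20, penalty=-1.0):
        self.threshold = threshold
        self.patience = patience
        self.penalty = penalty
        self.low_steps = 0

    def step(self, efficiency, shaped_reward):
        """Return (reward, terminate) for one environment step."""
        if efficiency >= self.threshold:
            self.low_steps = 0          # the streak must be consecutive
            return shaped_reward, False
        self.low_steps += 1
        if self.low_steps >= self.patience:
            return self.penalty, True   # new behaviour: end the episode
        return 0.0, False               # old behaviour: zero reward only
```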
2026-04-19 17:26:38 -04:00
Paul Huliganga 813f888502 fix: reward v6.1 — active_node progress terminator kills circle/stuck exploits
User's insight: a circling car stays near the same track waypoints, so
active_node (sim's track progress indicator) never advances. Track the
maximum active_node reached this episode. If it hasn't increased in
progress_patience=60 steps (~3.3s), terminate.

This catches:
  - Circular driving (active_node oscillates, max never increases)
  - Stuck on cone/barrier (active_node frozen)
  - NOT triggered by: legitimate cornering, slow forward progress, lap resets

On lap completion, active_node wraps to 0 — reset max_node_seen and counter.
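
A sketch of the terminator described above, assuming the wrapper can read `active_node` each step and is told when a lap completes. The class name and the `lap_completed` argument are hypothetical; `progress_patience` and `max_node_seen` are the names used in this message.

```python
class ProgressTerminator:
    """Terminate when the highest active_node seen has not advanced for too long."""

    def __init__(self, progress_patience=60):
        self.progress_patience = progress_patience
        self.reset()

    def reset(self):
        self.max_node_seen = -1
        self.steps_without_progress = 0

    def update(self, active_node, lap_completed=False):
        """Return True when the episode should be terminated."""
        if lap_completed:
            # active_node wraps to 0 on a new lap, so start tracking afresh.
            self.reset()
        if active_node > self.max_node_seen:
            self.max_node_seen = active_node
            self.steps_without_progress = 0
            return False
        # Oscillating or frozen node index: the maximum does not move.
        self.steps_without_progress += 1
        return self.steps_without_progress >= self.progress_patience
```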

Also: Exp 12 — single track mountain training with lap-based stopping criterion.
Train until 3 consecutive laps in eval, not fixed step count.
2026-04-19 17:01:41 -04:00
Paul Huliganga e95c33c1bf fix: reward v6.1 — grass exploit only (CTE patience terminator)
Removed the progress_patience (active_node) terminator that was added
without sufficient evidence. Per ADR-020, mountain rollback is a learning
issue not a termination issue. Removed code should not be re-added without
specific evidence it is needed.

Only confirmed fix: CTE patience terminator catches grass exploit BEFORE
CTE exceeds 16m (the sim's determine_episode_over pass threshold).
  - max_cte_terminate=4.0m
  - cte_patience=20 steps
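
For illustration, the CTE patience terminator might be sketched as follows. The class name is hypothetical, and treating the 20 steps as consecutive is an assumption; the two parameter values are the ones listed above.

```python
class CtePatienceTerminator:
    """Terminate once |CTE| has stayed above the limit for `cte_patience` steps."""

    def __init__(self, max_cte_terminate=4.0, cte_patience=20):
        self.max_cte_terminate = max_cte_terminate
        self.cte_patience = cte_patience
        self.steps_over = 0

    def update(self, cte):
        """Return True when the episode should be terminated."""
        if abs(cte) > self.max_cte_terminate:
            self.steps_over += 1
        else:
            self.steps_over = 0  # assumption: returning to the road resets the count
        return self.steps_over >= self.cte_patience
```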
2026-04-19 16:15:39 -04:00
Paul Huliganga f730a2e0ba docs: ADR-020/021 + session log — throttle/hill history and grass exploit root cause
Critical facts documented permanently:
- throttle_min=0.5 bakes into action space (too fast for corners)
- throttle_min=0.2 + v5 reward CAN learn hill (proved Exp 9, mountain only 90k)
- Mountain failure in parallel is contamination from grass exploit, not throttle
- Grass exploit root cause: sim determine_episode_over() passes when CTE>16m
- DO NOT confuse mountain rollback with stuck issue
- DO NOT change throttle_min as first response to mountain failure
2026-04-19 16:14:28 -04:00
Paul Huliganga beb04f3ebe fix: reward v6 — efficiency gate prevents circular driving, stuck_steps 80→40
v5 dropped the efficiency term to get gradient signal on hills, but this
re-enabled circular driving (observed in Exp 11). v6 adds efficiency back
as a GATE (not multiplier): if efficiency < 0.15, reward = 0. Otherwise
reward = speed × CTE_quality (same as v5).

Gate vs multiplier: v4 used efficiency as a multiplier which killed gradient
on hills (all terms → 0 simultaneously). v6's gate passes when efficiency
is above threshold (car moving forward, even slowly on hill) and only
blocks when car is truly circling.
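
The gate-versus-multiplier distinction can be sketched in two small functions. Both names are hypothetical, and `base_reward` stands in for the speed × CTE_quality term; only the 0.15 threshold is taken from this message.

```python
def gated_reward(base_reward, efficiency, threshold=0.15):
    """v6-style gate: efficiency only decides whether the reward passes at all."""
    return base_reward if efficiency >= threshold else 0.0


def multiplied_reward(base_reward, efficiency):
    """v4-style multiplier: efficiency scales the reward continuously."""
    return base_reward * efficiency
```

With a moderate efficiency such as 0.3 (a slow hill climb, say), the gate returns the base reward unchanged while the multiplier cuts it to 30%; both return roughly zero when the car is circling.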

Also reduced stuck_steps from 80 to 40 (~2.5s vs ~5s) — user reported
car stuck against barriers for ~10s which is too long with DummyVecEnv.
2026-04-19 12:02:55 -04:00
Paul Huliganga 47d8e5b346 fix: short-lap exploit now TERMINATES the episode, not just penalises
The circle exploit persisted because the penalty alone (-100 per short
lap) was insufficient. The model stayed alive between laps accumulating
small positive rewards, making circling a viable strategy despite the
penalty.

Fix: _compute_reward_and_done() returns (reward, force_terminate).
When a short lap is detected, force_terminate=True is returned and
step() sets terminated=True immediately. The episode ends on the spot —
no more rewards possible. This makes the circle exploit strictly worse
than any forward driving behaviour.

Tests updated: _compute_reward → _compute_reward_and_done, short-lap
test now asserts force_terminate=True.

Agent: pi
Tests: 102 passed
Tests-Added: 0
TypeScript: N/A
2026-04-18 10:42:23 -04:00
Paul Huliganga b8a13dea81 feat: v5 reward — speed × CTE-quality, drop efficiency term
Problem with v4 on mountain_track: CTE × efficiency × speed all collapse
to zero simultaneously when the car slows on the hill, giving no gradient
signal for 'apply more throttle'.

v5: reward = (speed / 10) × (1 - |CTE| / max_cte)
- Directly rewards going fast while staying centred
- Hill: car slows → reward drops → clear gradient toward more throttle
- Circling protection now entirely handled by lap-time penalty +
  StuckTerminationWrapper (not by the reward formula)
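
The v5 formula above, written out as a function. This is a sketch only: the function name and the `max_cte=8.0` default are assumptions, not values taken from the repository.

```python
def reward_v5(speed, cte, max_cte=8.0):
    """reward = (speed / 10) * (1 - |CTE| / max_cte)."""
    cte_quality = 1.0 - abs(cte) / max_cte
    return (speed / 10.0) * cte_quality
```

At a fixed CTE the reward is linear in speed, so slowing on the hill lowers the reward in direct proportion, which is the gradient signal the message refers to.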

Tests updated to reflect v5 semantics (102 passing).

Agent: pi
Tests: 102 passed
Tests-Added: 0
TypeScript: N/A
2026-04-17 13:25:38 -04:00
Paul Huliganga 5d1227833d fix: close short-lap circle exploit and cap segment eval episode length
Two reward hacking behaviours observed during Wave 4 training:

1. Short-lap circle exploit (reported by user, echoes Toni's guardrail hack):
   Model circles at start/finish line completing laps in 1-2 sim-seconds,
   accumulating lap_count indefinitely with no genuine track progress.
   Fix: SpeedRewardWrapper detects lap_count increment; if last_lap_time
   < min_lap_time (5.0s), returns penalty = -10 × (min_lap_time / lap_time).
   A 1-second lap gives -50 penalty. Legitimate 12-second laps unaffected.
   Window size also increased from 30 → 60 to catch slower circles.

2. Non-terminating segment eval episodes:
   evaluate_policy on wide tracks (no barriers) could run indefinitely,
   inflating segment_reward to 200k+. Replaced with manual eval loop
   capped at MAX_EVAL_STEPS=3000 steps.
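
The short-lap penalty from item 1 can be checked with a small function. The function name, the `None` return for legitimate laps, and the guard against non-positive lap times are assumptions; the formula and the 5.0s threshold are from the text above.

```python
def short_lap_penalty(last_lap_time, min_lap_time=5.0):
    """Return the penalty for a suspiciously short lap, or None for a legitimate one."""
    if last_lap_time <= 0:
        # Assumption: a non-positive lap time is treated as invalid input.
        raise ValueError("last_lap_time must be positive")
    if last_lap_time >= min_lap_time:
        return None
    return -10.0 * (min_lap_time / last_lap_time)
```

A 1-second lap gives `short_lap_penalty(1.0) == -50.0`, matching the figure in the message, and a 12-second lap returns `None`.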

Phase 4 results cleared (trials 4-6 ran with exploitable reward).

Tests: 4 new reward wrapper tests, 100 total passing.

Agent: pi
Tests: 100 passed
Tests-Added: 4
TypeScript: N/A
2026-04-15 09:06:25 -04:00
Paul Huliganga c8a495dd22 fix: reward v4 — full sim bypass kills circular driving at root
ROOT CAUSE:
  donkey_sim.py calc_reward() uses forward_vel = dot(heading, velocity).
  A spinning car ALWAYS has forward_vel > 0 (always moving 'forward' relative
  to its own heading), so it earned positive reward indefinitely while circling.
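
To see why the dot product cannot detect circling, here is a small standalone check. It does not reproduce donkey_sim.py; it only assumes, as stated above, that forward_vel is the dot product of a unit heading vector and the velocity, and that a circling car's heading is tangent to its path.

```python
import math


def forward_vel(heading, velocity):
    """Dot product of the 2-D heading and velocity vectors."""
    return heading[0] * velocity[0] + heading[1] * velocity[1]


def circle_forward_vels(speed=2.0, steps=8):
    """forward_vel sampled at evenly spaced points around one full circle."""
    values = []
    for i in range(steps):
        angle = 2.0 * math.pi * i / steps
        # Heading is the unit tangent to the circle at this point.
        heading = (-math.sin(angle), math.cos(angle))
        velocity = (speed * heading[0], speed * heading[1])
        values.append(forward_vel(heading, velocity))
    return values
```

Every sample equals the car's speed, so the value stays positive all the way around the circle even though the car makes no net progress.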

v3 WAS INSUFFICIENT:
  v3 applied efficiency only to the speed BONUS: original × (1 + speed×eff×scale)
  But 'original' from sim was still exploitable: CTE≈0 while spinning → original=1.0/step
  Efficiency killed the speed bonus but not the base reward.
  47k-step run: spinning = 1.0/step × 47k = 47k reward (never crashes in circle)

v4 FIX — base × efficiency × speed:
  reward = (1 - abs(cte)/max_cte) × efficiency × (1 + speed_scale × speed)
  Completely ignores sim's bogus forward_vel reward.
  Spinning (eff≈0): reward ≈ 0 regardless of CTE or speed.
  ALL three terms must be high to earn reward — cannot be gamed.
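
A sketch of the v4 formula as a plain function. The function name and the default values for `max_cte` and `speed_scale` are assumptions chosen for illustration.

```python
def reward_v4(cte, efficiency, speed, max_cte=8.0, speed_scale=0.1):
    """reward = (1 - |cte| / max_cte) * efficiency * (1 + speed_scale * speed)."""
    base = 1.0 - abs(cte) / max_cte
    return base * efficiency * (1.0 + speed_scale * speed)
```

With efficiency at 0 the product is 0 whatever the CTE or speed, which is the spinning case the message describes.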

Key new test: test_circling_at_zero_cte_gives_near_zero_reward
  Worst-case exploit (CTE=0 spinning) → avg reward < 0.15 (was 1.0 in v3)
  forward_beats_circling_by_3x confirmed.

Also: update Phase 2 autoresearch timesteps test, research log updated.

Agent: pi/claude-sonnet
Tests: 40/40 passing
Tests-Added: +1 (core v4 circling guarantee)
TypeScript: N/A
2026-04-13 20:56:32 -04:00
Paul Huliganga fcb6ea1ac2 fix: path-efficiency reward (v3) defeats circular driving exploit
CRITICAL BUG FIX — Circular Driving:
- v2 reward still hackable: car circles at starting line with low CTE + positive speed
- Confirmed in data: trial 5 mean_reward=4582, cv=0.0% (physically impossible for genuine driving)
- Statistical signature: cv <1% with high reward = consistent exploit, not genuine driving

ROOT CAUSE: Neither CTE nor raw speed can distinguish forward vs circular motion.
Both have: low CTE (on centerline) + positive speed (moving) = same reward.
Missing dimension: TRACK PROGRESS (net advance along track)

FIX — Path Efficiency Reward (v3):
  efficiency = net_displacement / total_path_length  (sliding window of 30 steps)
  shaped = original × (1 + speed_scale × speed × efficiency)
  - Forward driving: efficiency ≈ 1.0 → full speed bonus
  - Circular driving: efficiency ≈ 0.0 → speed bonus disappears
  - Cannot be hacked: circling means returning to same positions (low net_displacement)
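
The efficiency term above could be computed along these lines. This is a sketch, not the repository's implementation: the class name is hypothetical, and returning 1.0 before two positions are available is an assumption.

```python
import math
from collections import deque


class PathEfficiency:
    """efficiency = net_displacement / total_path_length over a sliding window."""

    def __init__(self, window=30):
        self.positions = deque(maxlen=window)

    def update(self, x, y):
        """Record a position and return the current efficiency in [0, 1]."""
        self.positions.append((x, y))
        pts = list(self.positions)
        if len(pts) < 2:
            return 1.0  # assumption: no penalty until there is a path to measure
        path_length = sum(math.dist(a, b) for a, b in zip(pts, pts[1:]))
        if path_length == 0.0:
            return 0.0  # car has not moved within the window
        net_displacement = math.dist(pts[0], pts[-1])
        return net_displacement / path_length
```

Driving in a straight line keeps the ratio at 1.0, while a circle that closes within the window brings the first and last positions close together and pushes the ratio toward 0.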

Tests:
  - test_efficiency_near_zero_for_circular_driving: confirmed <0.2 efficiency for circles
  - test_efficiency_near_one_for_straight_driving: confirmed >0.90 for straight line
  - test_straight_driving_gets_higher_reward_than_circular: KEY guarantee
  - test_speed_bonus_disappears_when_circling: bonus suppressed after window fills

Research documentation:
  - Full analysis with data table added to docs/RESEARCH_LOG.md
  - cv% identified as reward hacking indicator
  - Archived circular data + models

Clean start: new autoresearch_results_phase1.jsonl, new champion dir

Agent: pi/claude-sonnet
Tests: 40/40 passing
Tests-Added: +6 (path efficiency, anti-circular)
TypeScript: N/A
2026-04-13 13:36:17 -04:00
Paul Huliganga 5e93dae316 fix: hack-proof reward shaping + reward hacking detection + research log
CRITICAL BUG FIX — Reward Hacking:
- Old formula: speed × (1 - cte/max_cte) could be maximized by oscillating
  at the track boundary regardless of on-track behavior (trials 8 and 13 hit 1936 and 1139)
- New formula: original_reward × (1 + speed_scale × speed) ONLY when on_track
- Off-track (original_reward ≤ 0) → zero speed bonus → cannot be hacked
- Verified hack-proof: 9 new targeted tests including test_cannot_hack_by_going_fast_off_track

Reward Hacking Auto-Detection:
- check_for_reward_hacking() flags results with >3.0 reward/step as suspected hacking
- Flagged results are excluded from GP fitting (won't optimize toward hacking params)
- reward_hacking_suspected field added to JSONL result records
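
The detection rule could be sketched as below. The real signature of check_for_reward_hacking() is not shown in this log, so the helper names and the `mean_reward` / `steps` record keys are hypothetical; the 3.0 reward/step threshold and the `reward_hacking_suspected` field are from the message.

```python
def looks_like_reward_hacking(mean_reward, steps, max_reward_per_step=3.0):
    """Flag a result whose reward per step exceeds the plausibility threshold."""
    if steps <= 0:
        return False  # assumption: an empty episode cannot be judged
    return mean_reward / steps > max_reward_per_step


def annotate_and_filter(results):
    """Set the flag on each record and return only those safe for GP fitting."""
    for record in results:
        record["reward_hacking_suspected"] = looks_like_reward_hacking(
            record["mean_reward"], record["steps"]
        )
    return [r for r in results if not r["reward_hacking_suspected"]]
```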

Research Documentation:
- docs/RESEARCH_LOG.md created: full chronological research history
  - Random policy bug discovery and impact
  - Throttle clamp fix
  - Reward hacking discovery with evidence table
  - Hack-proof design rationale
  - Lessons learned + future research questions
- Archived corrupted Phase 1 data: autoresearch_results_phase1_CORRUPTED_reward_hacking.jsonl
- Archived hacked models: models/ARCHIVED_reward_hacking/

Clean start: autoresearch_results_phase1.jsonl reset, models/champion reset

Agent: pi/claude-sonnet
Tests: 40/40 passing
Tests-Added: +9 (reward wrapper hack-proof tests)
TypeScript: N/A
2026-04-13 12:27:48 -04:00
Paul Huliganga c804189dd0 feat: Wave 1 complete — real PPO training, model save, GP+UCB autoresearch, 37 tests passing
- Rebuilt donkeycar_sb3_runner.py: real PPO/DQN model.learn() + evaluate_policy() + model.save()
- Added SpeedRewardWrapper: reward = speed * (1 - |cte|/max_cte)
- Added ChampionTracker: tracks best model across all trials, writes manifest.json
- Rebuilt autoresearch_controller.py: Phase 1 results separated from random-policy data
- Added timesteps to GP search space
- Added --push-every N for automatic git push
- Added 37 passing tests: discretize_action, reward_wrapper, autoresearch_controller, runner_integration
- Scaffolded project with agent harness (large mode): PROJECT-SPEC, DECISIONS, IMPLEMENTATION_PLAN, EXECUTION_MASTER
- Fixed: model.save() is no longer called before model is defined (this was the root cause of all prior NameError crashes)
- Fixed: random policy replaced with real trained policy evaluation

Agent: pi/claude-sonnet
Tests: 37/37 passing
Tests-Added: +37
TypeScript: N/A
2026-04-13 10:03:15 -04:00