Commit Graph

5 Commits

Author SHA1 Message Date
Paul Huliganga c10e56d894 fix: cap total_timesteps at 120k to prevent 2hr timeout
Trials 3+4 both proposed ~140k steps and hit the 2hr JOB_TIMEOUT,
wasting time and producing no GP data. At ~20 steps/sec, 120k steps
takes ~100 min, safely within the 2hr limit.
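The budget arithmetic above can be sanity-checked directly; a minimal sketch, using only the throughput and timeout figures from this message (the helper name is illustrative, not from the repo):

```python
# Sanity check for the total_timesteps cap (figures from the commit message).
STEPS_PER_SEC = 20     # approximate observed training throughput
JOB_TIMEOUT_MIN = 120  # 2hr job limit

def wall_clock_minutes(total_timesteps, steps_per_sec=STEPS_PER_SEC):
    """Rough wall-clock minutes a run of `total_timesteps` needs."""
    return total_timesteps / steps_per_sec / 60

print(wall_clock_minutes(120_000))  # 100.0 min -> fits with ~20 min margin
print(wall_clock_minutes(140_000))  # ~116.7 min -> almost no margin
```

Note that ~140k steps is only ~117 min at exactly 20 steps/sec; any dip in throughput pushes it past the 2hr limit, which matches the observed timeouts in Trials 3 and 4.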

Agent: pi
Tests: 102 passed
Tests-Added: 0
TypeScript: N/A
2026-04-15 16:30:07 -04:00
Paul Huliganga f9f6a09744 fix: StuckTerminationWrapper + deque import + 102 tests
StuckTerminationWrapper added to wrap_env stack (between ThrottleClamp
and SpeedReward):
- Terminates episode after stuck_steps=80 steps with <0.5m displacement
- Handles slow barrier contact that Unity hit detection misses
- Handles off-lap-line circles (efficiency→0 gave zero reward but no
  termination; now gives -1.0 after 80 steps = ~4s of non-progress)
- Wrapper stack: ThrottleClamp → StuckTermination → SpeedReward
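A minimal sketch of the stuck-termination behaviour described above, assuming a gymnasium-style step API and an (x, z) position exposed via info; the interface details and the "position" key are assumptions, not the repo's actual wrapper:

```python
import math

class StuckTerminationWrapper:
    """Sketch: end the episode after `stuck_steps` consecutive steps with
    less than `min_disp` metres of displacement, returning -1.0 reward
    (covers slow barrier contact and off-lap-line circling)."""

    def __init__(self, env, stuck_steps=80, min_disp=0.5):
        self.env = env
        self.stuck_steps = stuck_steps
        self.min_disp = min_disp
        self._anchor = None   # position when the current stuck window began
        self._count = 0

    def reset(self, **kwargs):
        obs, info = self.env.reset(**kwargs)
        self._anchor, self._count = info.get("position"), 0
        return obs, info

    def step(self, action):
        obs, reward, terminated, truncated, info = self.env.step(action)
        pos = info.get("position")  # assumed: (x, z) world coordinates
        if pos is not None and self._anchor is not None:
            if math.dist(pos, self._anchor) >= self.min_disp:
                # Real progress: restart the stuck window from here.
                self._anchor, self._count = pos, 0
            else:
                self._count += 1
                if self._count >= self.stuck_steps:
                    reward, terminated = -1.0, True
        return obs, reward, terminated, truncated, info
```

At 80 steps this fires after roughly 4 sim-seconds of non-progress, matching the figure in the bullet above.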

Also: a missing deque import in multitrack_runner.py caused a NameError.

Phase 4 results cleared again (Trial 1 ran without StuckTermination).

Tests: 2 new stuck-termination tests, 102 total.

Agent: pi
Tests: 102 passed
Tests-Added: 2
TypeScript: N/A
2026-04-15 09:17:27 -04:00
Paul Huliganga 5d1227833d fix: close short-lap circle exploit and cap segment eval episode length
Two reward hacking behaviours observed during Wave 4 training:

1. Short-lap circle exploit (reported by user, echoes Toni's guardrail hack):
   Model circles at start/finish line completing laps in 1-2 sim-seconds,
   accumulating lap_count indefinitely with no genuine track progress.
   Fix: SpeedRewardWrapper detects lap_count increment; if last_lap_time
   < min_lap_time (5.0s), returns penalty = -10 × (min_lap_time / lap_time).
   A 1-second lap gives -50 penalty. Legitimate 12-second laps unaffected.
   Window size also increased from 30 → 60 to catch slower circles.
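The penalty formula in fix 1 can be sketched in isolation; the constants come from this message, while the function name and the base_reward pass-through are illustrative (the real check sits inside SpeedRewardWrapper on a lap_count increment):

```python
MIN_LAP_TIME = 5.0  # seconds: anything faster is treated as a circle exploit

def lap_reward(last_lap_time, base_reward):
    """Sketch: penalise implausibly fast laps, pass legitimate ones through."""
    if last_lap_time < MIN_LAP_TIME:
        return -10.0 * (MIN_LAP_TIME / last_lap_time)
    return base_reward

print(lap_reward(1.0, 2.0))   # -50.0: 1-second circle exploit
print(lap_reward(12.0, 2.0))  # 2.0: legitimate lap, reward unchanged
```

The penalty scales inversely with lap time, so faster (more degenerate) circles are punished harder, while any lap at or above 5.0s is untouched.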

2. Non-terminating segment eval episodes:
   evaluate_policy on wide tracks (no barriers) could run indefinitely,
   inflating segment_reward to 200k+. Replaced with manual eval loop
   capped at MAX_EVAL_STEPS=3000 steps.
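A minimal sketch of the capped manual eval loop in fix 2, assuming an SB3-style model.predict and a gymnasium-style env.step; only MAX_EVAL_STEPS=3000 is from this message, the rest is illustrative:

```python
MAX_EVAL_STEPS = 3000

def eval_episode(model, env, max_steps=MAX_EVAL_STEPS):
    """Sketch: run one eval episode, truncating at max_steps so wide
    tracks with no barriers can no longer inflate segment_reward."""
    obs, _ = env.reset()
    total, steps, done = 0.0, 0, False
    while not done and steps < max_steps:
        action, _ = model.predict(obs, deterministic=True)
        obs, reward, terminated, truncated, _ = env.step(action)
        total += reward
        done = terminated or truncated
        steps += 1
    return total, steps
```

Unlike evaluate_policy, the loop itself owns the step count, so a never-terminating episode is cut off at 3000 steps instead of running indefinitely.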

Phase 4 results cleared (trials 4-6 ran with exploitable reward).

Tests: 4 new reward wrapper tests, 100 total passing.

Agent: pi
Tests: 100 passed
Tests-Added: 4
TypeScript: N/A
2026-04-15 09:06:25 -04:00
Paul Huliganga 1be95b7c82 wave3: autoresearch trial 5 results
Agent: pi
Tests: N/A
Tests-Added: 0
TypeScript: N/A
2026-04-15 07:15:57 -04:00
Paul Huliganga 860e3d6610 fix: fresh PPO verbose=0 suppressed all training output — set verbose=1
Without this, Wave 4 scratch-trained models produce no rollout stats in
the log, making it impossible to monitor training progress or spot
degenerate policies early.

Warm-start models in Wave 3 showed stats because verbose=1 was baked
into the Phase-2 saved model state; fresh models default to verbose=0.

Agent: pi
Tests: 96 passed
Tests-Added: 0
TypeScript: N/A
2026-04-14 22:44:22 -04:00