The circle exploit persisted because the penalty alone (-100 per short lap) was insufficient. The model stayed alive between laps accumulating small positive rewards, making circling a viable strategy despite the penalty.

Fix: _compute_reward_and_done() returns (reward, force_terminate). When a short lap is detected, force_terminate=True is returned and step() sets terminated=True immediately. The episode ends on the spot; no more rewards are possible. This makes the circle exploit strictly worse than any forward driving behaviour.

Tests updated: _compute_reward → _compute_reward_and_done, short-lap test now asserts force_terminate=True.

Agent: pi
Tests: 102 passed
Tests-Added: 0
TypeScript: N/A

| | | |
|---|---|---|
| .. | ||
| __init__.py | ||
| test_autoresearch_controller.py | ||
| test_behavioral_wrappers.py | ||
| test_discretize_action.py | ||
| test_end_to_end.py | ||
| test_reward_wrapper.py | ||
| test_runner_integration.py | ||
| test_wave3.py | ||
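
The force-terminate fix described in the commit message can be illustrated with a small, self-contained sketch. This is not the repository's actual wrapper. The names `StubEnv`, `ShortLapWrapper`, `run_episode`, `MIN_LAP_STEPS`, and the `info` keys `lap_completed` and `lap_steps` are hypothetical, and the sketch assumes the penalty replaces the step reward rather than being added to it. Only the -100 penalty, the `(reward, force_terminate)` return shape, and the `terminated=True` behaviour come from the commit message.

```python
from typing import Any, Dict, Tuple

SHORT_LAP_PENALTY = -100.0  # value taken from the commit message
MIN_LAP_STEPS = 50          # hypothetical threshold for what counts as a "short" lap


class StubEnv:
    """Toy environment (assumption): +1 reward per step, one lap every `lap_every` steps."""

    def __init__(self, lap_every: int) -> None:
        self.lap_every = lap_every
        self.t = 0

    def step(self, action: Any) -> Tuple[None, float, bool, bool, Dict[str, Any]]:
        self.t += 1
        lap_done = self.t % self.lap_every == 0
        info = {"lap_completed": lap_done, "lap_steps": self.lap_every if lap_done else 0}
        return None, 1.0, False, False, info


class ShortLapWrapper:
    """Ends the episode immediately when a short lap is detected."""

    def __init__(self, env: StubEnv) -> None:
        self.env = env

    def _compute_reward_and_done(self, reward: float, info: Dict[str, Any]) -> Tuple[float, bool]:
        # Returns (reward, force_terminate), mirroring the shape named in the commit.
        if info.get("lap_completed") and info.get("lap_steps", 0) < MIN_LAP_STEPS:
            return SHORT_LAP_PENALTY, True
        return reward, False

    def step(self, action: Any) -> Tuple[None, float, bool, bool, Dict[str, Any]]:
        obs, reward, terminated, truncated, info = self.env.step(action)
        reward, force_terminate = self._compute_reward_and_done(reward, info)
        if force_terminate:
            terminated = True  # no further rewards can be collected after this step
        return obs, reward, terminated, truncated, info


def run_episode(env: ShortLapWrapper, max_steps: int = 200) -> Tuple[float, int]:
    """Steps the environment until it ends; returns (total reward, steps taken)."""
    total = 0.0
    for step_idx in range(1, max_steps + 1):
        _, reward, terminated, truncated, _ = env.step(0)
        total += reward
        if terminated or truncated:
            return total, step_idx
    return total, max_steps


circling = run_episode(ShortLapWrapper(StubEnv(lap_every=10)))
forward = run_episode(ShortLapWrapper(StubEnv(lap_every=100)))
```

In this toy setup the circling agent collects nine +1 rewards, takes the penalty on its first short lap, and the episode stops there, while the forward-driving agent keeps earning for the full episode. Without the forced termination the circling agent would keep accumulating per-step rewards between penalties, which is the failure mode the commit describes.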