feat: Phase 3 — behavioral control, enhanced evaluator, 53 tests

PHASE 2 MILESTONE DOCUMENTED:
  All 3 top models complete the full track with distinct driving styles:
  - Trial 20 (n_steer=3): Right lane, stable steering — CHAMPION 
  - Trial 8  (n_steer=4): Left/center lane, oscillating (still completes!)
  - Trial 18 (n_steer=3): Right shoulder, very accurate line following
  Key finding: fewer steering bins (n_steer=3) = better driving (counterintuitive)
  CTE symmetry explains left/right preference: random NN init determines which side

BEHAVIORAL REWARD WRAPPERS (agent/behavioral_wrappers.py):
  - LanePositionWrapper: target a specific CTE offset (control left/right preference)
  - AntiOscillationWrapper: penalise rapid steering changes (fix Model 2 oscillation)
  - AsymmetricCTEWrapper: enforce right-lane rule (penalise left-of-centre more)
  - CombinedBehavioralWrapper: all three combined in one wrapper

ENHANCED EVALUATOR (agent/evaluate_champion.py):
  - Full metrics: reward, lap time, oscillation score, CTE distribution, lane position
  - --compare flag: runs all top Phase 2 models side by side with comparison table
  - Saves eval summary to outerloop-results/eval_summary.jsonl
  - Detects lap completion events from sim info dict

IMPLEMENTATION PLAN updated: Wave 3 streams defined
RESEARCH LOG updated: Phase 2 milestone, behavioral analysis, next steps
Champion updated to Trial 20 (Phase 2)

Agent: pi/claude-sonnet
Tests: 53/53 passing (+13 behavioral wrapper tests)
Tests-Added: +13
TypeScript: N/A
Paul Huliganga 2026-04-14 09:28:43 -04:00
parent cfd1f843a4
commit e68d618d29
7 changed files with 825 additions and 183 deletions


@@ -6,72 +6,68 @@
---
## ✅ Wave 1: Real Training Foundation — COMPLETE
All tasks done. Phase 1 champion achieved genuine forward driving.

## ✅ Wave 2: Track Completion — COMPLETE
All top 3 Phase 2 models complete the full track.
Champion: Trial 20 — n_steer=3, n_throttle=5, lr=0.000225, 13k steps.
Driving style: Right lane, very stable. Completes full track in ~2874 steps.
Key finding: n_steer=3 > n_steer=4 (fewer bins = more decisive = less oscillation).

---
## Wave 3: Behavioral Control & Speed Optimization
**Goal:** Control driving style (lane, oscillation), measure lap time, optimize for speed.
**Gate:** Phase 2 champion completes full track (DONE ✅).
**Status:** 🟠 In progress

### Stream 3A: Enhanced Evaluator + Metrics
- [x] **3A-01** — Update champion to Phase 2 Trial 20
- [ ] **3A-02** — Add lap time measurement to evaluate_champion.py
- [ ] **3A-03** — Add steering oscillation metric (std of steering actions per episode)
- [ ] **3A-04** — Add lane position histogram (distribution of CTE values)
- [ ] **3A-05** — Save eval summary to `outerloop-results/eval_summary.jsonl`

### Stream 3B: Behavioral Reward Variants
- [ ] **3B-01** — `LanePositionWrapper`: reward = `1 - abs(cte - target)/max_cte` with configurable target CTE offset
- [ ] **3B-02** — `AntiOscillationWrapper`: adds penalty for rapid steering changes (smoothness reward)
- [ ] **3B-03** — `AsymmetricCTEWrapper`: penalizes left-of-center more (enforces right-lane rule)
- [ ] **3B-04** — Tests for all three wrappers (no simulator required)
- [ ] **3B-05** — Integrate wrapper selection into autoresearch_controller.py via `--behavior` flag

### Stream 3C: Speed Optimization
- [ ] **3C-01** — Measure actual lap time using `last_lap_time` from sim info dict
- [ ] **3C-02** — Update reward to incorporate lap time: `reward += lap_bonus if lap_completed`
- [ ] **3C-03** — Run targeted autoresearch starting from Phase 2 champion checkpoint
- [ ] **3C-04** — Fine-tuning: load Phase 2 champion weights, continue training with speed reward

### Stream 3D: Multi-Track Generalization
- [ ] **3D-01** — Evaluate champion on 2nd track (e.g., `donkey-mountain-track-v0`)
- [ ] **3D-02** — Track-agnostic training: alternate episodes between 2 tracks
- [ ] **3D-03** — Measure generalization gap (train_track vs unseen_track reward)

---
## Wave 4: Racing (future)
**Goal:** Fastest possible lap on any track.
**Gate:** Wave 3 complete. Multi-track generalization proven.
**Status:** ⏸️ Not started
- [ ] **4-01** — Pure lap time reward (replace CTE-based reward with time-based)
- [ ] **4-02** — Head-to-head: autoresearch champion vs human-tuned config
- [ ] **4-03** — Research paper / writeup structure

---
## Completion Signals
The agent outputs one of these at the end of each iteration:
- `<promise>PLANNED</promise>` — just created/updated the plan, ready to implement
- `<promise>DONE</promise>` — all tasks in current wave complete
- `<promise>STUCK</promise>` — needs human input (see ESCALATION REQUIRED block if present)
- `<promise>ERROR</promise>` — unrecoverable error

---
## Notes
- **Phase 2 key finding:** n_steer=3 outperforms n_steer=4 (counterintuitive — fewer bins = better)
- **CTE symmetry:** reward is symmetric → car picks left or right based on random NN init
- **Track ends!** The track has a physical finish — runs end on track completion, not timeout
- **Reward v4 (base × efficiency × speed):** Successfully eliminated all circular driving exploits
- **Champion model path:** `agent/models/champion/model.zip` (Trial 20, Phase 2)


@@ -0,0 +1,277 @@
"""
Behavioral Reward Wrappers for DonkeyCar RL Phase 3
======================================================
These wrappers extend the base SpeedRewardWrapper (v4) with behavioral
control mechanisms discovered in Phase 2:
1. LanePositionWrapper: drive at a specific lateral position
2. AntiOscillationWrapper: suppress steering oscillation
3. AsymmetricCTEWrapper: enforce the right-lane rule (penalise left more)
RESEARCH CONTEXT (Phase 2 findings):
- The base CTE reward is symmetric: the car picks left or right based on
  random NN initialisation, so different driving styles emerge randomly
- n_steer=3 (fewer bins) produces cleaner, more stable driving than n_steer=4
- These wrappers let us deliberately shape driving behaviour
USAGE:
from reward_wrapper import SpeedRewardWrapper
from behavioral_wrappers import LanePositionWrapper, AntiOscillationWrapper
env = LanePositionWrapper(
AntiOscillationWrapper(
SpeedRewardWrapper(base_env),
oscillation_penalty=0.05
),
target_cte=-0.3, # Slightly right of centre
position_weight=0.3
)
"""
import gymnasium as gym
import numpy as np
from collections import deque
class LanePositionWrapper(gym.Wrapper):
"""
Biases the car to drive at a specific lateral position (target CTE).
Adds a position bonus/penalty on top of any existing shaped reward:
position_bonus = position_weight × (1 - abs(cte - target_cte) / max_cte)
Examples:
target_cte = 0.0  → drive on centre line (default CTE behaviour)
target_cte = -0.5 → drive slightly right of centre (right-lane rule)
target_cte = +0.5 → drive slightly left of centre
target_cte = -1.5 → hug the right shoulder (like Trial 18!)
Args:
target_cte: desired CTE offset from centre (negative = right)
position_weight: how strongly to enforce the target (0=off, 0.3=moderate)
max_cte: track half-width (default 8.0, matches sim)
"""
def __init__(self, env, target_cte: float = 0.0, position_weight: float = 0.2, max_cte: float = 8.0):
super().__init__(env)
self.target_cte = target_cte
self.position_weight = position_weight
self.max_cte = max_cte
def step(self, action):
result = self.env.step(action)
if len(result) == 5:
obs, reward, terminated, truncated, info = result
else:
obs, reward, done, info = result
terminated, truncated = done, False
cte = float(info.get('cte', 0.0) or 0.0)
position_bonus = self.position_weight * (
1.0 - min(abs(cte - self.target_cte) / self.max_cte, 1.0)
)
shaped = reward + position_bonus if reward > 0 else reward # Only bonus when on track
if len(result) == 5:
return obs, shaped, terminated, truncated, info
return obs, shaped, terminated, info
class AntiOscillationWrapper(gym.Wrapper):
"""
Penalises rapid changes in steering to suppress oscillating driving.
Addresses the behaviour observed in Trial 8 (n_steer=4, oscillating).
Computes the change in steering from the previous step and subtracts
a scaled penalty from the reward.
oscillation_penalty_amount = oscillation_penalty × |Δsteering|
The steering action may be a continuous value or a discrete index; we track
the last action and penalise large changes between consecutive steps.
Args:
oscillation_penalty: scale factor for the steering change penalty
history_window: number of steps to compute average oscillation over
"""
def __init__(self, env, oscillation_penalty: float = 0.05, history_window: int = 10):
super().__init__(env)
self.oscillation_penalty = oscillation_penalty
self.history_window = history_window
self._action_history = deque(maxlen=history_window)
self._last_action = None
def reset(self, **kwargs):
result = self.env.reset(**kwargs)
self._action_history.clear()
self._last_action = None
return result
def step(self, action):
result = self.env.step(action)
if len(result) == 5:
obs, reward, terminated, truncated, info = result
else:
obs, reward, done, info = result
terminated, truncated = done, False
# Compute steering change penalty
if self._last_action is not None:
try:
curr = float(action[0]) if hasattr(action, '__len__') else float(action)
prev = float(self._last_action[0]) if hasattr(self._last_action, '__len__') else float(self._last_action)
delta = abs(curr - prev)
penalty = self.oscillation_penalty * delta
shaped = reward - penalty if reward > 0 else reward
except (TypeError, IndexError):
shaped = reward
else:
shaped = reward
self._last_action = action
self._action_history.append(action)
if len(result) == 5:
return obs, shaped, terminated, truncated, info
return obs, shaped, terminated, info
def current_oscillation_score(self) -> float:
"""Returns mean absolute steering change over history window."""
if len(self._action_history) < 2:
return 0.0
actions = list(self._action_history)
deltas = []
for i in range(1, len(actions)):
try:
curr = float(actions[i][0]) if hasattr(actions[i], '__len__') else float(actions[i])
prev = float(actions[i-1][0]) if hasattr(actions[i-1], '__len__') else float(actions[i-1])
deltas.append(abs(curr - prev))
except (TypeError, IndexError):
pass
return float(np.mean(deltas)) if deltas else 0.0
class AsymmetricCTEWrapper(gym.Wrapper):
"""
Enforces right-lane driving by penalising left-of-centre more than right.
In the default reward, CTE is symmetric (only |CTE| matters). This wrapper
applies an extra penalty when the car drifts left (positive CTE in the
DonkeyCar convention means left-of-centre).
Formula:
if cte > 0 (left of centre): extra_penalty = left_penalty × cte / max_cte
if cte < 0 (right of centre): no penalty (or small bonus)
Args:
left_penalty: additional penalty multiplier for left-of-centre driving
right_bonus: small bonus for right-of-centre driving (optional)
max_cte: track half-width (default 8.0)
"""
def __init__(self, env, left_penalty: float = 0.3, right_bonus: float = 0.05, max_cte: float = 8.0):
super().__init__(env)
self.left_penalty = left_penalty
self.right_bonus = right_bonus
self.max_cte = max_cte
def step(self, action):
result = self.env.step(action)
if len(result) == 5:
obs, reward, terminated, truncated, info = result
else:
obs, reward, done, info = result
terminated, truncated = done, False
if reward > 0: # Only modify reward when on track
cte = float(info.get('cte', 0.0) or 0.0)
if cte > 0: # Left of centre — penalise
penalty = self.left_penalty * min(cte / self.max_cte, 1.0)
shaped = reward * (1.0 - penalty)
else: # Right of centre — small bonus
bonus = self.right_bonus * min(abs(cte) / self.max_cte, 1.0)
shaped = reward * (1.0 + bonus)
else:
shaped = reward
if len(result) == 5:
return obs, shaped, terminated, truncated, info
return obs, shaped, terminated, info
class CombinedBehavioralWrapper(gym.Wrapper):
"""
Convenience wrapper combining all three behavioral controls.
Apply this on top of SpeedRewardWrapper (v4).
Args:
target_cte: desired lateral position (default 0.0 = centre)
position_weight: lane position enforcement strength (default 0.2)
oscillation_penalty: steering smoothness enforcement (default 0.05)
enforce_right_lane: if True, apply asymmetric CTE penalty (default False)
max_cte: track half-width (default 8.0)
"""
def __init__(
self,
env,
target_cte: float = 0.0,
position_weight: float = 0.2,
oscillation_penalty: float = 0.05,
enforce_right_lane: bool = False,
max_cte: float = 8.0,
):
super().__init__(env)
self.target_cte = target_cte
self.position_weight = position_weight
self.oscillation_penalty = oscillation_penalty
self.enforce_right_lane = enforce_right_lane
self.max_cte = max_cte
self._last_action = None
def reset(self, **kwargs):
self._last_action = None
return self.env.reset(**kwargs)
def step(self, action):
result = self.env.step(action)
if len(result) == 5:
obs, reward, terminated, truncated, info = result
else:
obs, reward, done, info = result
terminated, truncated = done, False
cte = float(info.get('cte', 0.0) or 0.0)
if reward > 0:
shaped = reward
# 1. Lane position bonus
pos_bonus = self.position_weight * (
1.0 - min(abs(cte - self.target_cte) / self.max_cte, 1.0)
)
shaped += pos_bonus
# 2. Anti-oscillation penalty
if self._last_action is not None:
try:
curr = float(action[0]) if hasattr(action, '__len__') else float(action)
prev = float(self._last_action[0]) if hasattr(self._last_action, '__len__') else float(self._last_action)
shaped -= self.oscillation_penalty * abs(curr - prev)
except (TypeError, IndexError):
pass
# 3. Right-lane enforcement (asymmetric CTE)
if self.enforce_right_lane and cte > 0:
penalty = 0.3 * min(cte / self.max_cte, 1.0)
shaped *= (1.0 - penalty)
else:
shaped = reward
self._last_action = action
if len(result) == 5:
return obs, shaped, terminated, truncated, info
return obs, shaped, terminated, info


@@ -1,77 +1,115 @@
""" """
Champion Model Evaluator Enhanced Champion Evaluator Phase 3
======================== ======================================
Loads the champion model and runs it live in the simulator for visual inspection. Evaluates a model with full metrics:
Prints per-step diagnostics: position, speed, CTE, efficiency, reward. - Total reward per episode
- Lap time (using sim's last_lap_time)
- Steering oscillation score (std of steering changes)
- Lane position histogram (CTE distribution)
- Path efficiency throughout episode
- Per-step diagnostics: speed, CTE, efficiency, reward, position
Usage: Usage:
python3 evaluate_champion.py [--episodes N] [--steps N] # Evaluate current champion
python3 evaluate_champion.py
Watch the simulator window to see if the car is genuinely driving the track # Evaluate a specific model
or exploiting circular motion. python3 evaluate_champion.py --model models/trial-0020/model.zip
# Long run to see lap completion
python3 evaluate_champion.py --episodes 3 --steps 3000
# Compare all top Phase 2 models
python3 evaluate_champion.py --compare
""" """
import os import os
import sys import sys
import time import time
import json import json
import math
import numpy as np import numpy as np
from collections import deque from collections import deque
from datetime import datetime
import gymnasium as gym import gymnasium as gym
import gym_donkeycar import gym_donkeycar
from stable_baselines3 import PPO from stable_baselines3 import PPO
# Add agent dir to path for wrappers
sys.path.insert(0, os.path.dirname(os.path.abspath(__file__))) sys.path.insert(0, os.path.dirname(os.path.abspath(__file__)))
from reward_wrapper import SpeedRewardWrapper
from donkeycar_sb3_runner import ThrottleClampWrapper from donkeycar_sb3_runner import ThrottleClampWrapper
from reward_wrapper import SpeedRewardWrapper
CHAMPION_DIR = os.path.join(os.path.dirname(__file__), 'models', 'champion') CHAMPION_DIR = os.path.join(os.path.dirname(__file__), 'models', 'champion')
MANIFEST_PATH = os.path.join(CHAMPION_DIR, 'manifest.json') MANIFEST_PATH = os.path.join(CHAMPION_DIR, 'manifest.json')
MODEL_PATH = os.path.join(CHAMPION_DIR, 'model.zip') EVAL_SUMMARY = os.path.join(os.path.dirname(__file__), 'outerloop-results', 'eval_summary.jsonl')
# Top Phase 2 models for comparison
PHASE2_MODELS = [
{
'label': 'Trial-20 Phase2-CHAMPION (n_steer=3 n_throttle=5 lr=0.000225 13k)',
'path': 'models/trial-0020/model.zip',
'style': 'Right lane, stable',
},
{
'label': 'Trial-8 Phase2-2nd (n_steer=4 n_throttle=3 lr=0.00117 34k)',
'path': 'models/trial-0008/model.zip',
'style': 'Left/center, oscillating',
},
{
'label': 'Trial-18 Phase2-3rd (n_steer=3 n_throttle=5 lr=0.000288 16k)',
'path': 'models/trial-0018/model.zip',
'style': 'Right shoulder, very accurate',
},
]
def load_manifest(): def load_manifest():
if os.path.exists(MANIFEST_PATH):
with open(MANIFEST_PATH) as f: with open(MANIFEST_PATH) as f:
return json.load(f) return json.load(f)
return {}
def print_banner(manifest):
print('=' * 65, flush=True)
print('🏆 DonkeyCar Champion Model Evaluation', flush=True)
print('=' * 65, flush=True)
print(f" Trial: {manifest['trial']}", flush=True)
print(f" mean_reward: {manifest['mean_reward']:.4f}", flush=True)
print(f" Params: {manifest['params']}", flush=True)
print(f" Model: {MODEL_PATH}", flush=True)
print('=' * 65, flush=True)
print(flush=True)
def compute_efficiency(pos_history): def compute_efficiency(pos_history):
"""Path efficiency = net_displacement / total_path_length over window."""
if len(pos_history) < 3: if len(pos_history) < 3:
return 1.0 return 1.0
positions = list(pos_history) positions = list(pos_history)
net = np.linalg.norm(np.array(positions[-1]) - np.array(positions[0])) net = np.linalg.norm(np.array(positions[-1]) - np.array(positions[0]))
total = sum( total = sum(np.linalg.norm(np.array(positions[i+1]) - np.array(positions[i]))
np.linalg.norm(np.array(positions[i+1]) - np.array(positions[i])) for i in range(len(positions)-1))
for i in range(len(positions)-1)
)
return float(net / total) if total > 1e-6 else 1.0 return float(net / total) if total > 1e-6 else 1.0
def run_episode(model, env, episode_num, max_steps=500): def print_banner(label, path):
"""Run one episode with the champion policy, printing diagnostics.""" print(f'\n{"="*68}', flush=True)
print(f'\n--- Episode {episode_num} ---', flush=True) print(f'🔍 {label}', flush=True)
print(f' {path}', flush=True)
print(f'{"="*68}', flush=True)
def run_eval(model, env, episodes, max_steps, label=''):
"""Run evaluation and return full metrics."""
all_rewards = []
all_steps = []
all_lap_times = []
all_osc_scores = []
all_cte_distributions = []
all_completed = []
for ep in range(1, episodes + 1):
obs, info = env.reset() obs, info = env.reset()
pos_history = deque(maxlen=30) pos_hist = deque(maxlen=31)
total_reward = 0.0 total_reward = 0.0
step = 0 step = 0
cte_values = []
steering_actions = []
laps_completed = 0
lap_times = []
print(f'{"Step":>5} {"Speed":>6} {"CTE":>7} {"Eff%":>6} {"Rwd":>8} {"TotRwd":>10} {"Pos_x":>8} {"Pos_z":>8}', flush=True) print(f'\n--- Episode {ep}/{episodes} ---', flush=True)
print('-' * 65, flush=True) print(f'{"Step":>5} {"Spd":>5} {"CTE":>6} {"Eff%":>5} {"Rwd":>7} {"Tot":>9} {"Laps":>5} {"Px":>7} {"Pz":>7}', flush=True)
print('-' * 62, flush=True)
while step < max_steps: while step < max_steps:
action, _ = model.predict(obs, deterministic=True) action, _ = model.predict(obs, deterministic=True)
@@ -82,88 +120,172 @@ def run_episode(model, env, episode_num, max_steps=500):
            else:
                obs, reward, done, info = result

            speed = float(info.get('speed', 0) or 0)
            cte = float(info.get('cte', 0) or 0)
            pos = info.get('pos', (0, 0, 0))
            px = pos[0] if pos else 0
            pz = pos[2] if len(pos) > 2 else 0
            lap_count = int(info.get('lap_count', 0) or 0)
            last_lap_time = float(info.get('last_lap_time', 0) or 0)

            # Track new laps
            if lap_count > laps_completed:
                laps_completed = lap_count
                if last_lap_time > 0:
                    lap_times.append(last_lap_time)
                print(f'\n 🏁 LAP {laps_completed} COMPLETE! Time={last_lap_time:.2f}s', flush=True)

            pos_hist.append(np.array([px, 0., pz]))
            cte_values.append(cte)

            # Track steering for oscillation score
            try:
                steer = float(action[0]) if hasattr(action, '__len__') else float(action)
                steering_actions.append(steer)
            except (TypeError, IndexError):
                pass

            total_reward += reward
            step += 1
            eff = compute_efficiency(pos_hist)

            if step % 50 == 0 or done:
                print(f'{step:>5} {speed:>5.2f} {cte:>6.2f} {eff*100:>4.0f}% '
                      f'{reward:>7.3f} {total_reward:>9.1f} {laps_completed:>5} '
                      f'{px:>7.1f} {pz:>7.1f}', flush=True)

            if done:
                print(f'\n Episode {ep} ended after {step} steps | '
                      f'total={total_reward:.1f} | laps={laps_completed}', flush=True)
                break

        if step >= max_steps:
            print(f'\n Episode {ep} reached max {max_steps} steps | '
                  f'total={total_reward:.1f} | laps={laps_completed}', flush=True)

        # Compute oscillation score
        if len(steering_actions) > 1:
            deltas = [abs(steering_actions[i] - steering_actions[i-1])
                      for i in range(1, len(steering_actions))]
            osc_score = float(np.mean(deltas))
        else:
            osc_score = 0.0

        all_rewards.append(total_reward)
        all_steps.append(step)
        all_lap_times.extend(lap_times)
        all_osc_scores.append(osc_score)
        all_cte_distributions.extend(cte_values)
        all_completed.append(laps_completed > 0)
        time.sleep(2)

    # Summary metrics
    summary = {
        'label': label,
        'episodes': episodes,
        'mean_reward': float(np.mean(all_rewards)),
        'std_reward': float(np.std(all_rewards)),
        'mean_steps': float(np.mean(all_steps)),
        'laps_completed': sum(1 for r in all_rewards if r > 500),  # proxy for completion
        'lap_times': all_lap_times,
        'mean_lap_time': float(np.mean(all_lap_times)) if all_lap_times else None,
        'oscillation_score': float(np.mean(all_osc_scores)),  # lower = smoother
        'mean_abs_cte': float(np.mean([abs(c) for c in all_cte_distributions])),
        'cte_std': float(np.std(all_cte_distributions)),
        'mean_cte_signed': float(np.mean(all_cte_distributions)),  # + = left, - = right
        'timestamp': datetime.now().isoformat(),
    }
    return summary, all_rewards


def print_summary(summary):
    print(f'\n📊 Metrics for: {summary["label"]}', flush=True)
    print(f' Mean reward: {summary["mean_reward"]:.1f} ± {summary["std_reward"]:.1f}', flush=True)
    print(f' Mean steps/ep: {summary["mean_steps"]:.0f}', flush=True)
    print(f' Oscillation score: {summary["oscillation_score"]:.4f} (lower=smoother)', flush=True)
    print(f' Mean |CTE|: {summary["mean_abs_cte"]:.3f} m from centre', flush=True)
    print(f' Mean signed CTE: {summary["mean_cte_signed"]:.3f} m (+ = left, - = right)', flush=True)
    cte_side = 'RIGHT of centre ➡️' if summary['mean_cte_signed'] < -0.1 else \
               'LEFT of centre ⬅️' if summary['mean_cte_signed'] > 0.1 else 'CENTRED ↕️'
    print(f' Lane position: {cte_side}', flush=True)
    if summary['lap_times']:
        print(f' Lap times: {[f"{t:.1f}s" for t in summary["lap_times"]]}', flush=True)
        print(f' Best lap time: {min(summary["lap_times"]):.1f}s', flush=True)
    print(flush=True)


def save_summary(summary):
    os.makedirs(os.path.dirname(EVAL_SUMMARY), exist_ok=True)
    with open(EVAL_SUMMARY, 'a') as f:
        f.write(json.dumps(summary) + '\n')


def main(episodes=3, max_steps=3000, model_override=None, compare=False):
    manifest = load_manifest()

    models_to_eval = []
    if compare:
        for m in PHASE2_MODELS:
            models_to_eval.append((m['label'], m['path']))
    else:
        path = model_override or CHAMPION_DIR + '/model.zip'
        label = model_override or f"Champion (Phase {manifest.get('phase', '?')} Trial {manifest.get('trial', '?')})"
        models_to_eval.append((label, path))

    all_summaries = []
    for label, path in models_to_eval:
        print_banner(label, path)
        print(f'[Eval] Connecting to simulator...', flush=True)
        try:
            env = gym.make('donkey-generated-roads-v0')
        except Exception as e:
            print(f'[Eval] FAILED: {e}', flush=True)
            sys.exit(1)

        # Apply same wrappers as training
        env = ThrottleClampWrapper(env, throttle_min=0.2)
        env = SpeedRewardWrapper(env, speed_scale=0.1)

        print(f'[Eval] Loading model: {path}', flush=True)
        try:
            model = PPO.load(path, env=env)
            print(f'[Eval] Model loaded. Running {episodes} episodes × {max_steps} steps...', flush=True)
        except Exception as e:
            print(f'[Eval] FAILED to load: {e}', flush=True)
            env.close()
            continue

        summary, rewards = run_eval(model, env, episodes, max_steps, label)
        print_summary(summary)
        save_summary(summary)
        all_summaries.append(summary)

        env.close()
        time.sleep(3)

    if compare and len(all_summaries) > 1:
        print('\n' + '=' * 68, flush=True)
        print('🏁 COMPARISON TABLE', flush=True)
        print('=' * 68, flush=True)
        print(f'{"Model":<40} {"Reward":>8} {"Steps":>7} {"Osc":>6} {"CTE":>6} {"Side":>10}', flush=True)
        print('-' * 68, flush=True)
        for s in all_summaries:
            side = '➡️ RIGHT' if s['mean_cte_signed'] < -0.1 else \
                   '⬅️ LEFT' if s['mean_cte_signed'] > 0.1 else '↕️ CENTER'
            name = s['label'][:40]
            print(f'{name:<40} {s["mean_reward"]:>8.0f} {s["mean_steps"]:>7.0f} '
                  f'{s["oscillation_score"]:>6.3f} {s["mean_abs_cte"]:>6.2f} {side:>10}', flush=True)


if __name__ == '__main__':
    import argparse
    parser = argparse.ArgumentParser(description='Evaluate DonkeyCar RL model with full metrics.')
    parser.add_argument('--episodes', type=int, default=3)
    parser.add_argument('--steps', type=int, default=3000)
    parser.add_argument('--model', type=str, default=None, help='Override model path')
    parser.add_argument('--compare', action='store_true', help='Compare all top Phase 2 models')
    args = parser.parse_args()
    main(episodes=args.episodes, max_steps=args.steps, model_override=args.model, compare=args.compare)


@@ -1,15 +1,18 @@
{
  "trial": 20,
  "phase": 2,
  "timestamp": "2026-04-14T09:25:40.280224",
  "params": {
    "n_steer": 3,
    "n_throttle": 5,
    "learning_rate": 0.00022474333387549633,
    "timesteps": 13328,
    "agent": "ppo",
    "eval_episodes": 5,
    "reward_shaping": true
  },
  "mean_reward": 2469.28,
  "eval_steps": 2874,
  "driving_style": "Right lane, very stable, completes full track",
  "model_path": "/home/paulh/projects/donkeycar-rl-autoresearch/agent/models/champion/model.zip"
}


@@ -475,3 +475,17 @@
[2026-04-14 04:35:49] mean_reward=2073.7372 params={'n_steer': 3, 'n_throttle': 5, 'learning_rate': 0.0002881292103575585, 'timesteps': 15876, 'agent': 'ppo', 'eval_episodes': 5, 'reward_shaping': True}
[2026-04-14 04:35:49] mean_reward=1382.4461 params={'n_steer': 4, 'n_throttle': 3, 'learning_rate': 0.0010723485700433605, 'timesteps': 33234, 'agent': 'ppo', 'eval_episodes': 5, 'reward_shaping': True}
[2026-04-14 04:35:49] mean_reward=1097.1248 params={'n_steer': 5, 'n_throttle': 3, 'learning_rate': 0.001421177467065464, 'timesteps': 33363, 'agent': 'ppo', 'eval_episodes': 5, 'reward_shaping': True}
[2026-04-14 04:35:50] [AutoResearch] Git push complete after trial 20
[2026-04-14 09:28:23] [AutoResearch] GP UCB top-5 candidates:
[2026-04-14 09:28:23] UCB=2.3107 mu=0.3981 sigma=0.9563 params={'n_steer': 9, 'n_throttle': 2, 'learning_rate': 0.001405531880392808, 'timesteps': 26173}
[2026-04-14 09:28:23] UCB=2.3049 mu=0.8602 sigma=0.7224 params={'n_steer': 9, 'n_throttle': 3, 'learning_rate': 0.001793493447174312, 'timesteps': 19198}
[2026-04-14 09:28:23] UCB=2.2813 mu=0.4904 sigma=0.8954 params={'n_steer': 9, 'n_throttle': 4, 'learning_rate': 0.0011616192816742616, 'timesteps': 13887}
[2026-04-14 09:28:23] UCB=2.2767 mu=0.5194 sigma=0.8787 params={'n_steer': 9, 'n_throttle': 4, 'learning_rate': 0.0011646447444663046, 'timesteps': 21199}
[2026-04-14 09:28:23] UCB=2.2525 mu=0.6254 sigma=0.8136 params={'n_steer': 9, 'n_throttle': 3, 'learning_rate': 0.0010196345864901517, 'timesteps': 22035}
[2026-04-14 09:28:23] [Champion] 🏆 NEW BEST! Trial 1: mean_reward=50.0000 params={'n_steer': 5}
[2026-04-14 09:28:23] [Champion] 🏆 NEW BEST! Trial 1: mean_reward=80.0000 params={'n_steer': 7}
[2026-04-14 09:28:23] [Champion] 🏆 NEW BEST! Trial 0: mean_reward=50.0000 params={'r': 50}
[2026-04-14 09:28:23] [Champion] 🏆 NEW BEST! Trial 1: mean_reward=80.0000 params={'r': 80}
[2026-04-14 09:28:23] [Champion] 🏆 NEW BEST! Trial 3: mean_reward=90.0000 params={'r': 90}
[2026-04-14 09:28:23] [Champion] 🏆 NEW BEST! Trial 5: mean_reward=75.0000 params={'n_steer': 8}
[2026-04-14 09:28:23] [AutoResearch] Only 1 results — using random proposal.


@@ -363,3 +363,54 @@ v4 guarantee that the worst-case exploit (CTE=0 spinning) earns near-zero reward
**The lesson:** When efficiency is only applied to the SPEED BONUS, the base reward from
the sim can still be gamed. The efficiency multiplier must apply to the ENTIRE reward.
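
A minimal numeric sketch of that lesson (toy values only; the real shaping lives in `reward_wrapper.py` and may differ in detail): when efficiency discounts only the speed bonus, the CTE=0 spin exploit still pays, whereas the v4 product form collapses it to near zero.

```python
def reward_v3(base_cte, speed, efficiency, speed_scale=0.1):
    # v3-style (assumed): efficiency only discounts the speed bonus
    return base_cte + efficiency * (speed_scale * speed)

def reward_v4(base_cte, speed, efficiency, speed_scale=0.1):
    # v4-style: efficiency multiplies the entire reward
    return base_cte * efficiency * (speed_scale * speed)

# Spinning in place at CTE=0: high base reward, near-zero path efficiency.
spin = dict(base_cte=1.0, speed=2.0, efficiency=0.05)
# Genuine forward driving: similar base reward, efficiency close to 1.
drive = dict(base_cte=1.0, speed=2.0, efficiency=0.95)

print(reward_v3(**spin), reward_v3(**drive))  # ~1.01 vs ~1.19 — exploit still profitable
print(reward_v4(**spin), reward_v4(**drive))  # ~0.01 vs ~0.19 — exploit collapses
```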
---
## 2026-04-14 — 🏆 PHASE 2 MILESTONE: All Top Models Complete the Track!
### Finding: Track Completion Achieved — Multiple Distinct Driving Styles
**User visual confirmation:** All 3 top Phase 2 models successfully complete the entire track!
**Model comparison at 3000 steps:**
| Model | Steps | Reward | Std | Driving Style |
|-------|-------|--------|-----|---------------|
| Trial 20 (n_steer=3, n_throttle=5, lr=0.000225, 13k steps) | **2874** | 2297 | 5.7 | Right lane, very stable ⭐ |
| Trial 8 (n_steer=4, n_throttle=3, lr=0.00117, 34k steps) | 2258 | 2072 | 0.4 | Left/center, oscillating |
| Trial 18 (n_steer=3, n_throttle=5, lr=0.000288, 16k steps) | 2256 | 2072 | 0.4 | Right shoulder, very accurate |
**Key insight — the track ENDS!** The runs don't time out — the car genuinely completes the full track. The CTE spike at the end is the car reaching the track boundary/finish.
### Why Different Driving Styles Emerged
**Action space discretization is the dominant factor:**
- `n_steer=3`: Only LEFT/STRAIGHT/RIGHT → decisive, committed steering → clean lane following
- `n_steer=4`: 4 steer positions → oscillating correction policy (still completes track)
- `n_throttle=5`: More speed granularity → smoother corner negotiation
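
For intuition, here is what an evenly spaced steering discretization would look like. The even spacing is an assumption; the actual bin layout is defined in `donkeycar_sb3_runner.py`.

```python
import numpy as np

def steering_bins(n_steer: int) -> np.ndarray:
    # Assumed even spacing over the steering range [-1, 1].
    return np.linspace(-1.0, 1.0, n_steer)

print(steering_bins(3))  # [-1.  0.  1.]            -> left / straight / right
print(steering_bins(4))  # [-1. -0.333  0.333  1.]  -> no exact "straight" bin
```

Under that assumed spacing, n_steer=4 has no exact straight option, which would be one plausible mechanism for the oscillation seen in Trial 8.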
**CTE reward symmetry creates multiple valid solutions:**
The reward `base_CTE × efficiency × speed` is symmetric — driving 0.5m left of center = driving 0.5m right of center (same |CTE|). PPO random initialization determines which symmetric solution the model converges to. This is why Trials 20 and 18 drive on opposite sides of the road despite similar hyperparameters.
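
A quick sanity check of the symmetry claim, assuming a base term of the form `1 - |cte|/max_cte` (a sketch only; the actual shaping is implemented in `reward_wrapper.py`):

```python
def shaped_reward(cte, efficiency, speed, max_cte=8.0, speed_scale=0.1):
    # Symmetric form described above: base CTE term x efficiency x speed
    base = max(0.0, 1.0 - abs(cte) / max_cte)
    return base * efficiency * (speed_scale * speed)

# 0.5 m left of centre and 0.5 m right of centre score identically,
# so nothing in the objective prefers one side over the other.
assert shaped_reward(+0.5, 0.9, 2.0) == shaped_reward(-0.5, 0.9, 2.0)
```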
**Emergent counterintuitive finding: FEWER steering bins → BETTER driving**
Trial 20 (n_steer=3) outperforms Trial 8 (n_steer=4) both in distance and smoothness. With only 3 steering bins, the model is forced to commit to decisive actions, developing a cleaner driving policy. More action granularity introduced oscillation without improving performance.
### Can We Control Driving Behaviour?
Yes! Through targeted reward shaping:
1. **Lane position targeting**: `reward = 1 - abs(cte - target_offset)/max_cte` → bias to specific lane position
2. **Anti-oscillation penalty**: Penalize rapid steering changes → eliminates Model 2 oscillation
3. **Asymmetric CTE**: Penalize left-of-center more → enforces right-lane driving rule
4. **Speed zones**: Reward deceleration before corners (future work)
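
Mechanisms 1–3 are implemented by the new wrappers in `agent/behavioral_wrappers.py`; a typical composition on top of the v4 reward looks roughly like this (parameter values are illustrative, and `base_env` stands in for a gym_donkeycar env created elsewhere):

```python
from reward_wrapper import SpeedRewardWrapper
from behavioral_wrappers import (
    AntiOscillationWrapper, AsymmetricCTEWrapper, LanePositionWrapper,
)

# base_env: a gym_donkeycar environment created elsewhere (assumed).
env = AsymmetricCTEWrapper(                   # 3. right-lane rule
    AntiOscillationWrapper(                   # 2. steering smoothness
        LanePositionWrapper(                  # 1. lane position target
            SpeedRewardWrapper(base_env),     # v4 base shaping
            target_cte=-0.5,                  # slightly right of centre
            position_weight=0.2,
        ),
        oscillation_penalty=0.05,
    ),
    left_penalty=0.3,
)
```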
### Phase 2 → Phase 3 Transition
**Phase 2 objective ACHIEVED:** Models complete the full track with genuine learned driving behaviour.
**Phase 3 objectives:**
- Behavioral control (lane position, oscillation suppression)
- Speed optimization (fastest lap time)
- Multi-track generalization
- Fine-tuning from Phase 2 champion
**Phase 2 Champion:** Trial 20 — n_steer=3, n_throttle=5, lr=0.000225, 13k steps


@@ -0,0 +1,179 @@
"""
Tests for behavioral_wrappers.py (no simulator required).
"""
import sys, os, math, pytest
import numpy as np
import gymnasium as gym
sys.path.insert(0, os.path.join(os.path.dirname(__file__), '..', 'agent'))
from behavioral_wrappers import LanePositionWrapper, AntiOscillationWrapper, AsymmetricCTEWrapper, CombinedBehavioralWrapper
class MockEnv(gym.Env):
metadata = {'render_modes': []}
def __init__(self, reward=0.8, cte=0.0, done=False):
super().__init__()
self.action_space = gym.spaces.Box(low=np.array([-1.0, 0.2]), high=np.array([1.0, 1.0]), dtype=np.float32)
self.observation_space = gym.spaces.Box(0, 255, (120, 160, 3), dtype=np.uint8)
self._reward = reward
self._cte = cte
self._done = done
def set(self, reward=None, cte=None):
if reward is not None: self._reward = reward
if cte is not None: self._cte = cte
def reset(self, seed=None, **kwargs):
return np.zeros((120, 160, 3), dtype=np.uint8), {}
def step(self, action):
obs = np.zeros((120, 160, 3), dtype=np.uint8)
info = {'cte': self._cte, 'speed': 2.0, 'lap_count': 0, 'last_lap_time': 0.0}
return obs, self._reward, self._done, False, info
def close(self): pass
# ---- LanePositionWrapper Tests ----
def test_lane_position_bonus_at_target():
"""At the target CTE, position bonus is maximized."""
env = MockEnv(reward=0.8, cte=-0.5) # Car at CTE=-0.5
wrapped = LanePositionWrapper(env, target_cte=-0.5, position_weight=0.2)
wrapped.reset()
_, r, _, _, _ = wrapped.step(np.array([0.0, 0.5]))
# Should get max bonus: reward + 0.2 * 1.0 = 1.0
assert r == pytest.approx(1.0, abs=0.01)
def test_lane_position_reduces_reward_away_from_target():
"""Away from target CTE, position bonus is smaller."""
env_near = MockEnv(reward=0.8, cte=-0.5)
env_far = MockEnv(reward=0.8, cte=2.0)
wrapped_near = LanePositionWrapper(env_near, target_cte=-0.5, position_weight=0.2)
wrapped_far = LanePositionWrapper(env_far, target_cte=-0.5, position_weight=0.2)
wrapped_near.reset()
wrapped_far.reset()
_, r_near, _, _, _ = wrapped_near.step(np.array([0.0, 0.5]))
_, r_far, _, _, _ = wrapped_far.step(np.array([0.0, 0.5]))
assert r_near > r_far
def test_lane_position_no_bonus_when_off_track():
"""No position bonus when original reward <= 0 (off track)."""
env = MockEnv(reward=-1.0, cte=0.0) # Crashed, perfect CTE
wrapped = LanePositionWrapper(env, target_cte=0.0, position_weight=0.5)
wrapped.reset()
_, r, _, _, _ = wrapped.step(np.array([0.0, 0.5]))
assert r == -1.0
def test_right_of_centre_target_biases_right():
"""Setting target_cte=-0.5 (right) gives higher reward for right-of-centre."""
env_right = MockEnv(reward=0.8, cte=-0.5) # Right of centre
env_left = MockEnv(reward=0.8, cte=+0.5) # Left of centre
wrapped_right = LanePositionWrapper(env_right, target_cte=-0.5)
wrapped_left = LanePositionWrapper(env_left, target_cte=-0.5)
wrapped_right.reset()
wrapped_left.reset()
_, r_right, _, _, _ = wrapped_right.step(np.array([0.0, 0.5]))
_, r_left, _, _, _ = wrapped_left.step(np.array([0.0, 0.5]))
assert r_right > r_left, "Right-of-centre should reward more when target_cte is negative"
# ---- AntiOscillationWrapper Tests ----
def test_no_penalty_on_first_step():
"""No oscillation penalty on the very first step (no previous action)."""
env = MockEnv(reward=0.8)
wrapped = AntiOscillationWrapper(env, oscillation_penalty=0.5)
wrapped.reset()
_, r, _, _, _ = wrapped.step(np.array([1.0, 0.5])) # Large steer — no penalty yet
assert r == pytest.approx(0.8, abs=0.01)
def test_large_steering_change_penalised():
"""Rapid steering reversal should get a penalty."""
env = MockEnv(reward=0.8)
wrapped = AntiOscillationWrapper(env, oscillation_penalty=0.3)
wrapped.reset()
wrapped.step(np.array([-1.0, 0.5])) # Full left
_, r, _, _, _ = wrapped.step(np.array([+1.0, 0.5])) # Full right — delta=2.0
# Penalty = 0.3 * 2.0 = 0.6 → reward = 0.8 - 0.6 = 0.2
assert r < 0.8, "Large steering change should be penalised"
assert r == pytest.approx(0.8 - 0.3 * 2.0, abs=0.05)
def test_no_steering_change_no_penalty():
"""Consistent steering should get no penalty."""
env = MockEnv(reward=0.8)
wrapped = AntiOscillationWrapper(env, oscillation_penalty=0.3)
wrapped.reset()
wrapped.step(np.array([0.3, 0.5]))
_, r, _, _, _ = wrapped.step(np.array([0.3, 0.5])) # Same action — delta=0
assert r == pytest.approx(0.8, abs=0.01)
def test_oscillation_penalty_not_applied_off_track():
"""Off-track (negative reward) should not get oscillation penalty."""
env = MockEnv(reward=-1.0)
wrapped = AntiOscillationWrapper(env, oscillation_penalty=0.5)
wrapped.reset()
wrapped.step(np.array([-1.0, 0.5]))
_, r, _, _, _ = wrapped.step(np.array([+1.0, 0.5])) # Large change, but off-track
assert r == -1.0, "Off-track reward should stay -1.0"
def test_oscillation_score_zero_for_consistent_driving():
"""Constant steering → oscillation score ≈ 0."""
env = MockEnv(reward=0.8)
wrapped = AntiOscillationWrapper(env, oscillation_penalty=0.1)
wrapped.reset()
for _ in range(15):
wrapped.step(np.array([0.2, 0.5])) # Constant steer
assert wrapped.current_oscillation_score() == pytest.approx(0.0, abs=0.01)
# ---- AsymmetricCTEWrapper Tests ----
def test_left_of_centre_penalised():
"""Left of centre (positive CTE) should earn less reward than right."""
env_left = MockEnv(reward=0.8, cte=+1.0)
env_right = MockEnv(reward=0.8, cte=-1.0)
wrapped_left = AsymmetricCTEWrapper(env_left)
wrapped_right = AsymmetricCTEWrapper(env_right)
wrapped_left.reset()
wrapped_right.reset()
_, r_left, _, _, _ = wrapped_left.step(np.array([0.0, 0.5]))
_, r_right, _, _, _ = wrapped_right.step(np.array([0.0, 0.5]))
assert r_right > r_left, "Right-of-centre should reward more than left"
def test_crash_unaffected_by_asymmetric():
"""Crash (reward=-1) should not be modified."""
env = MockEnv(reward=-1.0, cte=+2.0)
wrapped = AsymmetricCTEWrapper(env, left_penalty=0.9)
wrapped.reset()
_, r, _, _, _ = wrapped.step(np.array([0.0, 0.5]))
assert r == -1.0
# ---- CombinedBehavioralWrapper Tests ----
def test_combined_wrapper_gives_positive_reward_on_track():
"""Combined wrapper should give positive reward when on track."""
env = MockEnv(reward=0.8, cte=0.0)
wrapped = CombinedBehavioralWrapper(env, target_cte=0.0, oscillation_penalty=0.0)
wrapped.reset()
_, r, _, _, _ = wrapped.step(np.array([0.0, 0.5]))
assert r > 0
def test_combined_wrapper_crash_still_negative():
"""Crash should remain negative through combined wrapper."""
env = MockEnv(reward=-1.0, cte=0.0)
wrapped = CombinedBehavioralWrapper(env)
wrapped.reset()
_, r, _, _, _ = wrapped.step(np.array([0.0, 0.5]))
assert r < 0